The actual translation needs to happen all at once, but that's fine as long
as I can work on the chunks incrementally and only send everything off to
the translation service once it's ready.  What I need to find, then, is a
good (and fast) partitioning algorithm that will give me a list of blocks
to translate. A CSS block is a good start, but I need something more
detailed than that, for these reasons:

- I can't skip invisible or display:none nodes, because websites have
navigation menus and similar elements that contain text and need to be
translated (I don't know the exact definition of the "CSS block" you
mention, so I can't tell whether it covers this or not)
- In direct opposition to the first point, I can't blindly consider all
nodes with text content (including invisible ones), because websites have
<script>, <style>, etc. tags in the body which should be skipped (see the
filter sketch after this list)
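
To make that concrete, here is a rough sketch of the kind of filter I have
in mind, written as TypeScript just for illustration -- the real walk would
live in C++ or chrome JS, and the tag list and function name are only my
guesses:

const SKIPPED_TAGS = new Set(["SCRIPT", "STYLE", "NOSCRIPT"]);

function collectCandidateElements(root: Element): Element[] {
  const doc = root.ownerDocument;
  if (!doc) {
    return [];
  }
  const walker = doc.createTreeWalker(root, NodeFilter.SHOW_ELEMENT, {
    acceptNode(node: Node): number {
      // Reject <script>, <style>, etc.; FILTER_REJECT skips the whole
      // subtree, so nothing inside them is ever visited.
      const tag = (node as Element).tagName.toUpperCase();
      if (SKIPPED_TAGS.has(tag)) {
        return NodeFilter.FILTER_REJECT;
      }
      // Deliberately no visibility check: display:none menus and the like
      // still need their text translated.
      return NodeFilter.FILTER_ACCEPT;
    },
  });

  const result: Element[] = [];
  for (let n = walker.nextNode(); n; n = walker.nextNode()) {
    result.push(n as Element);
  }
  return result;
}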


Also, some other properties that I'd like this algorithm to have:

- It would be nice if it could treat <ul> <li>a</li> <li>b</li> </ul> as
one individual block instead of one block per <li> (and other similar
constructs) -- [not a major requirement]

- It should only give me blocks that have useful text content to be
translated. For example, for sites with a lot of <div><div><div><div>
nesting (or worse, <table><tbody><tr><td><table>... ad infinitum), I'm only
interested in the blocks that have actual text content on them (which can
probably be defined as having at least one non-whitespace-only child text
node; a sketch of that check follows below).
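
For that last point, something along these lines is what I have in mind
(again only a sketch, and hasOwnTranslatableText is just a name I made up):

// A block is worth translating only if it has at least one direct child
// text node that isn't whitespace-only.
function hasOwnTranslatableText(element: Element): boolean {
  for (const child of Array.from(element.childNodes)) {
    if (child.nodeType === Node.TEXT_NODE && /\S/.test(child.nodeValue || "")) {
      return true;
    }
  }
  return false;
}

Deeply nested wrapper <div>s and <table> scaffolding fail this test, but
any descendant that does hold text would still show up as its own block.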


The more junk (useless nodes) this algorithm can skip, the better. I
imagine performance will be fine if the walk itself is implemented in C++,
with all other handling and further filtering done in JS, one chunk at a
time.
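
Roughly, the JS side could then look something like the sketch below
(entirely hypothetical: getTranslationBlocks stands in for the future C++
walker, translateBatch for the call to the translation service, and
hasOwnTranslatableText is from the sketch above):

// Hypothetical driver: pull blocks from the native walker, filter them
// incrementally, and only hit the translation service once at the end.
declare function getTranslationBlocks(root: Element): Element[];
declare function translateBatch(texts: string[]): Promise<string[]>;

async function translateDocument(root: Element): Promise<void> {
  const pending: { block: Element; text: string }[] = [];

  // Work through the blocks incrementally, one chunk at a time, doing
  // any further filtering on the JS side.
  for (const block of getTranslationBlocks(root)) {
    if (!hasOwnTranslatableText(block)) {
      continue;
    }
    pending.push({ block, text: block.textContent || "" });
  }

  // Only when everything is ready, send it all off to the service at once.
  const translated = await translateBatch(pending.map(p => p.text));
  translated.forEach((text, i) => {
    // Simplified: real code would need to preserve inline markup instead
    // of overwriting textContent wholesale.
    pending[i].block.textContent = text;
  });
}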


Felipe



On Tue, Mar 4, 2014 at 6:02 PM, Robert O'Callahan <rob...@ocallahan.org> wrote:

> On Wed, Mar 5, 2014 at 8:47 AM, Felipe G <fel...@gmail.com> wrote:
>
>> If I go with the clone route (to work on the snapshotted version of the
>> data), how can I later associate the cloned nodes with the original nodes
>> from the document?  One way I thought of is to set userdata on the DOM
>> nodes and then use the clone handler callback to associate each cloned
>> node with the original one (through weak refs or a WeakMap).  That would
>> mean iterating first through all nodes to add the handlers, but that's
>> probably fine (I don't need to analyze anything or visit text nodes).
>>
>> I think serializing and re-parsing everything in the worker is not the
>> ideal solution unless we can find a way to also keep accurate associations
>> with the original nodes from content. Anything that introduces a possibly
>> lossy aspect to the data will probably hurt translation, which is already
>> an inaccurate science.
>>
>
> Maybe you can do the translation incrementally, and just annotate the DOM
> with custom attributes (or userdata) to record the progress of the
> translation? Plus a reference to the last translated node (subtree) to
> speed up finding the next subtree to translate. I assume it would be OK to
> translate one CSS block at a time.
>
> Rob
>
