On Oct 10, 7:32 am, Akira <ak...@yayakoshi.net> wrote:
> Out of curiosity did you, in the early stages, create a single
> javascript object with all the words as properties? I.e. one object
> with approx 90,000 properties, one for each entry in JDIC. I tried
> this once in a firefox extension and got an awful amount of
> collisions. It seems the max may have been 64k properties and as you
> approach that the collision rate approaches 100%.

Hi Akira. This was my first approach: each word, in dictionary form,
as a property on an object literal. In Chrome, I didn't notice any
collisions. However, that might simply be because I don't know the
language, so my ability to notice wrong results is limited.

I changed approaches after my friend told me about conjugation, and
how many words, as they appear in running text, must be transformed
back into dictionary form before lookup.

As a single flat object, I have to fetch each candidate as a property
on the object as many times as the number of characters in the longest
word in the dictionary, or until I run out of text under the mouse
cursor. For example, copying and pasting from Google News, "島根県出雲市多伎町で出土し
た、...." even though the longest valid match starting at the first
character is only 3 characters, I have to first try 島, then 島根, then 島根
県, then 島根県出, etc. In the actual version of the extension, this was
done in reverse, so that I could stop after the first valid match,
since that would have been the longest (most specific). However, it
turns out it's useful to display more than one match, so I ended up
having it continue anyway, in order to collect all of the possible
matches.
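The flat-object lookup described above might look roughly like this (a sketch with my own names and a toy dictionary, not the extension's actual code):

```javascript
// Toy flat dictionary: each word, in dictionary form, is a property.
const dict = {
  "島": "island",
  "島根": "Shimane",
  "島根県": "Shimane Prefecture"
};

const MAX_WORD_LENGTH = 3; // length of the longest word in the dictionary

// Collect every valid match starting at `start`, growing the candidate
// one character at a time up to the longest word in the dictionary.
function lookupAll(text, start) {
  const matches = [];
  for (let len = 1; len <= MAX_WORD_LENGTH && start + len <= text.length; len++) {
    const candidate = text.slice(start, start + len);
    if (Object.prototype.hasOwnProperty.call(dict, candidate)) {
      matches.push([candidate, dict[candidate]]);
    }
  }
  return matches;
}
```

With `"島根県出雲市..."` and `start = 0`, this tries 島, then 島根, then 島根県, collecting all three.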

This is fast and works well. However, when you must also test for
conjugations along each step, and then for conjugations on top of each
previous conjugation, the search space grows dramatically.
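To illustrate why stacked conjugations blow up the search space, here is a toy deinflection sketch (the rules and names are mine for illustration, not real grammar data or the extension's code): each rule that matches produces a new candidate, which must itself be re-checked against every rule.

```javascript
// Toy deinflection rules; each maps an inflected ending back toward
// dictionary form. Real rule sets are far larger.
const rules = [
  { from: "なかった", to: "ない" }, // negative past -> negative
  { from: "くない", to: "い" }      // negative adjective -> plain
];

// Return the word plus every form reachable by stripping endings,
// recursing because conjugations can stack on one another.
function deinflect(word, depth = 2) {
  const forms = [word];
  if (depth === 0) return forms;
  for (const rule of rules) {
    if (word.endsWith(rule.from)) {
      const base = word.slice(0, -rule.from.length) + rule.to;
      forms.push(...deinflect(base, depth - 1));
    }
  }
  return forms;
}
```

Every form returned here would then need its own dictionary lookup, at every character position.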

The way it is now, a search tree is constructed. The root is a
JavaScript object literal, and it has the first character of every
word as a property. The value for each of these characters holds any
word definitions (or none, if no word is formed by the characters up
to this position), plus a nested object with all of the possible next
characters, and so on. As soon as there are no longer any matches, I
can stop searching. This tree is created for both the kanji version of
each word and for the hiragana reading, so that colloquial uses are
found. To avoid redundancy, the actual word data is not stored in the
tree. It's in a separate object instead, and the tree merely contains
index references into that object.
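A minimal sketch of that layout (structure and names are my assumption, not the extension's actual code): word data lives in one flat array, and trie nodes store only indices into it.

```javascript
const entries = []; // shared word data, stored once
const root = {};    // trie root: first character of every word

// Walk/create one node per character; store an index, not the data.
function addWord(word, data) {
  const index = entries.push(data) - 1;
  let node = root;
  for (const ch of word) {
    node = node[ch] || (node[ch] = {});
  }
  (node.ids || (node.ids = [])).push(index);
}

// Follow the trie character by character; stop as soon as there are
// no longer any matches.
function searchAll(text, start) {
  const matches = [];
  let node = root;
  for (let i = start; i < text.length; i++) {
    node = node[text[i]];
    if (!node) break;
    for (const id of node.ids || []) {
      matches.push(entries[id]);
    }
  }
  return matches;
}
```

Unlike the flat object, this walks the text only once per starting position, and the dead-end check replaces the fixed longest-word bound.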

Still, a rather large number of objects are created, leading to high
memory usage. I've minimized the number of objects by consolidating
redundant definition texts and munging the definition fields together
into a single string with a simple separator character, to avoid
needlessly creating objects in memory. This saves about 30 MB alone.
The definitions are split apart into objects only after being looked
up. There's probably room for further optimization.
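The packing idea can be sketched like this (the separator choice and field layout are my own illustration): one joined string per entry instead of an object plus several strings, unpacked only on a successful lookup.

```javascript
// A separator character that never appears in the dictionary data.
const SEP = "\u0001";

// Pack the definition fields into one string at load time.
function pack(reading, pos, gloss) {
  return [reading, pos, gloss].join(SEP);
}

// Split back into an object only after a successful lookup.
function unpack(packed) {
  const [reading, pos, gloss] = packed.split(SEP);
  return { reading, pos, gloss };
}
```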

On Oct 10, 5:14 pm, edvakf <taka.atsu...@googlemail.com> wrote:
> As a Japanese native speaker, I thought you can probably adjust it so
> that it won't look up one-letter Hiragana & Katakana characters. (but
> maybe the current approach is good for a learner, I don't know)

Hi edvakf. There are definitely a lot of refinements like that which
should be added, such as a minimum valid word size, etc.

> For the dictionary data, I think Web Database would be THE way to go,
> although it's not working properly on Mac yet (openDatabase returns
> null).

That sounds like a good idea, actually. In a long string of valid
characters, there tend to be around 20-30 property lookups in the
current version. The current "lookup rate" is 50ms as you pan the
cursor around, for easy reading (speed was more important than memory
usage). SQLite is very fast, but I've never used it for something like
this. Would it work? I'd love to be able to use relational data to
reduce redundancy, instead of the current methods. I could always just
decrease the lookup rate; there's a lot of room for it to slow down,
and a lot of memory that it would be nice to regain :)
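For reference, the 50ms lookup rate can be implemented as a simple throttle on mousemove (a sketch of the general technique, not the extension's actual code; the injectable clock is just to make it testable):

```javascript
// Wrap a lookup function so it runs at most once per `intervalMs`.
// `now` defaults to Date.now but can be injected for testing.
function makeThrottledLookup(lookup, intervalMs, now = Date.now) {
  let last = -Infinity;
  return function (...args) {
    if (now() - last < intervalMs) return null; // too soon, skip
    last = now();
    return lookup(...args);
  };
}

// Usage in an extension would look something like:
//   const throttled = makeThrottledLookup(doLookup, 50);
//   document.addEventListener("mousemove", e => throttled(e));
```

Raising `intervalMs` is the "decrease the lookup rate" knob mentioned above.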

Decreasing load time on the extension would be nice, too, as it takes
a few seconds for it to read the objects into memory when Chrome
starts or the extension is installed (though it happens once, in the
background, so it's hard to notice.)

Since I'm on Mac and the toolbar stuff doesn't seem to be working yet,
I haven't added any preferences and such just yet. I'll have to wait
for that to make its way here to try it, as well as Web Database.

Thanks!
--~--~---------~--~----~------------~-------~--~----~
You received this message because you are subscribed to the Google Groups 
"Chromium-extensions" group.
To post to this group, send email to chromium-extensions@googlegroups.com
To unsubscribe from this group, send email to 
chromium-extensions+unsubscr...@googlegroups.com
For more options, visit this group at 
http://groups.google.com/group/chromium-extensions?hl=en
-~----------~----~----~----~------~----~------~--~---