[algogeeks] Re: Looking for a suffix tree implementation with Unicode support

Fred Tue, 18 Aug 2009 18:36:13 -0700

Thank you for replying.

Here "Unicode support" means accepting non-ASCII characters, such as
Chinese or Japanese, in the input string.

In UTF-8, a Chinese character is typically encoded as 3 bytes, for
example 0xe8b685. This causes a problem: if a UTF-8 encoded Chinese
character is treated as an array of unsigned char, and the suffix tree
is built from two input strings that each contain one Chinese
character, say 0xe8b685 and 0xe8b686, then after construction we get
the following edges: 86$, 85$, b6, 85$, 86$, e8b6, 85$, 86$. The tree
then reports e8b6 as the longest common substring of the two
characters, which is obviously wrong, because e8b6 is not a valid
UTF-8 character.
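
To make the problem concrete, here is a minimal Python sketch. It uses
a naive longest-common-substring routine as a stand-in for the suffix
tree (an assumption made only to keep the example short; the helper
names are mine). Run on the raw bytes, it returns e8b6, which does not
decode as UTF-8.

```python
def longest_common_substring(a, b):
    """Naive O(len(a) * len(b)) longest common substring of two sequences.

    Stand-in for a suffix tree query; works on bytes, str, or lists.
    """
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1
                if cur[j] > best_len:
                    best_len, best_end = cur[j], i
        prev = cur
    return a[best_end - best_len:best_end]


def is_valid_utf8(data):
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# The two UTF-8 encoded characters from the example above.
first = bytes.fromhex("e8b685")
second = bytes.fromhex("e8b686")

# Byte-level match: the shared lead byte and first continuation byte.
common = longest_common_substring(first, second)
print(common.hex(), is_valid_utf8(common))
```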

One solution to the above problem is to use wchar_t instead of char
during suffix tree construction, and to modify the compare function
as well.
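
A rough sketch of that idea in Python rather than C: decode the bytes
to code points first, so each character is one symbol of the alphabet.
The helper names are hypothetical, and a common-prefix check stands in
for the tree's edge comparison.

```python
def to_code_points(utf8_bytes):
    """Decode UTF-8 bytes into a list of integer code points."""
    return [ord(ch) for ch in utf8_bytes.decode("utf-8")]


def common_prefix_length(a, b):
    """Number of leading symbols shared by two sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


first = bytes.fromhex("e8b685")
second = bytes.fromhex("e8b686")

# Over bytes, the two characters share a 2-byte prefix (e8 b6).
byte_prefix = common_prefix_length(first, second)

# Over code points, each character is a single, distinct symbol.
first_cp = to_code_points(first)    # [0x8D85]
second_cp = to_code_points(second)  # [0x8D86]
cp_prefix = common_prefix_length(first_cp, second_cp)

print(byte_prefix, cp_prefix)
```

With code points as symbols the alphabet is no longer 256 entries, so
child links would probably need a map keyed by integer instead of a
fixed-size array. Also note that wchar_t is 16 bits on some platforms
(Windows, for instance), where characters outside the Basic
Multilingual Plane are still split into two units.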

Any suggestions?

On Aug 19, 4:07 am, Miroslav Balaz <gpsla...@googlemail.com> wrote:
> What do you mean by Unicode support?
> I think the only problem is that characters that look the same may have
> different encodings.
> So it is enough to use, in each compare, a function that resolves the above
> problem.
>
> I made 3 suffix tree implementations, and it is easy to change the string
> type in them.
> But my implementations were not good; they were slow, although O(n). A suffix
> array was faster.
>
> 2009/8/18 Fred <hn.ft.p...@gmail.com>
>
>
>
> > Does anybody know, by chance, a suffix tree implementation with
> > Unicode support?
