Sorry Christophe,
I misinformed you. We did NOT subclass Document; we simply created an
HTMLDocument class with methods that return Lucene Documents with the
required fields added, and that is where the content encoding was set.
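
For anyone reading the archive, here is a rough sketch of the encoding part
only, using nothing but the JDK. The class and method names
(HtmlDocumentSketch, openReader, readAll) are made up for illustration and are
not our actual code; the real class goes on to build a Lucene Document from
the decoded text, which is left out here.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: decodes page bytes with an explicit encoding
// instead of relying on the platform default.
public class HtmlDocumentSketch {

    // Wrap the raw stream in a Reader that uses the page's own encoding.
    public static Reader openReader(InputStream in, String encoding) {
        return new InputStreamReader(in, Charset.forName(encoding));
    }

    // Drain a Reader into a String (stand-in for handing it to the indexer).
    public static String readAll(Reader reader) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[1024];
        int n;
        while ((n = reader.read(buf)) != -1) {
            sb.append(buf, 0, n);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Four Chinese characters, encoded as UTF-8 (3 bytes each).
        byte[] page = "\u4e2d\u6587\u7f51\u9875".getBytes(StandardCharsets.UTF_8);

        String right = readAll(openReader(new ByteArrayInputStream(page), "UTF-8"));
        String wrong = readAll(openReader(new ByteArrayInputStream(page), "ISO-8859-1"));

        System.out.println(right.length()); // 4: decoded as intended
        System.out.println(wrong.length()); // 12: one bogus char per byte
    }
}
```

The second call shows what goes wrong when the ISO-8859-1 default is used on
UTF-8 content: every byte becomes its own character, so nothing useful gets
indexed.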
Alex.
Alex BOURNE/IBEU/[EMAIL PROTECTED] on 27 May 2004 09:05
Please respond to "Lucene Users List" <[EMAIL PROTECTED]>
To:"Lucene Users List" <[EMAIL PROTECTED]>
cc:
bcc:
Subject:Re: Asian languages
Hi Christophe,
We're currently indexing Chinese pages with little difficulty. You can use
the standard analyzer to index the documents, and it will tokenize the
content into individual characters. If you want a list of 'stop'
words, you will need to create your own analyzer and supply it with a list
of Unicode characters to stop. We are indexing HTML pages using a spider to
traverse the site and have subclassed Document into HTML_Document. This
allows us to set the content encoding for the input stream reader (our
system default is ISO-8859-1, in common with most Western machines), which
enables it to correctly process the Unicode characters. You may need to do
this too.
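
To make the per-character tokenizing and the stop list a bit more concrete,
here is a small JDK-only sketch. It is not Lucene's analyzer and does not use
the Lucene API; CharTokenSketch and tokenize are hypothetical names, and the
sketch simply emits each ideographic character as its own token and drops the
ones found in a stop set. Non-ideographic text is ignored to keep it short.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration of one-character-per-token output with a
// character stop list. This only mimics the behaviour described above.
public class CharTokenSketch {

    public static List<String> tokenize(String text, Set<String> stopChars) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            // Only ideographic characters become tokens in this sketch.
            if (!Character.isIdeographic(cp)) {
                continue;
            }
            String token = new String(Character.toChars(cp));
            // Drop characters that appear in the stop list.
            if (!stopChars.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Stop list holding one very common character (U+7684).
        Set<String> stop = new HashSet<>(Arrays.asList("\u7684"));
        // Three-character input; the middle character is in the stop list.
        System.out.println(tokenize("\u6211\u7684\u4e66", stop));
    }
}
```

In a real setup the same idea would live inside a custom analyzer, with the
stop set loaded from wherever you keep your list.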
Hope this helps
Alex.
"Christophe Lombart" <[EMAIL PROTECTED]> on 26 May 2004 19:16
Please respond to "Lucene Users List" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
cc:
bcc:
Subject:Asian languages
Which Asian languages are supported by Lucene?
What about Korean, Japanese, Thai, ...?
If they are not yet supported, what do I need to do?
Thanks,
Christophe
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
**
This message originated from the Internet. Its originator may or
may not be who they claim to be and the information contained in
the message and any attachments may or may not be accurate.
**
_
This transmission has been issued by a member of the HSBC Group
("HSBC") for the information of the addressee only and should not be
reproduced and / or distributed to any other person. Each page
attached hereto must be read in conjunction with any disclaimer which
forms part of it. This transmission is neither an offer nor the solicitation
of an offer to sell or purchase any investment. Its contents are based
on information obtained from sources believed to be reliable but HSBC
makes no representation and accepts no responsibility or liability as to
its completeness or accuracy.