Memo: Re: Asian languages

2004-05-27 Thread alex . bourne




Sorry Christophe,

I misinformed you. We did NOT subclass Document; we simply created an
HTMLDocument class with methods that return Lucene Documents with the
required fields added, and that is where the content encoding was set.
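
A minimal sketch of that kind of helper, for illustration only. It is not
the code we run: the class name HtmlDocumentSketch, the field names "url"
and "contents", and the use of a plain Map as a stand-in for a Lucene
Document are all assumptions made so the example runs without Lucene on
the classpath. The point it shows is passing the page encoding explicitly
to the InputStreamReader instead of relying on the platform default.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper; names and fields are illustrative assumptions.
class HtmlDocumentSketch {

    // Reads the whole stream, decoding bytes with the given encoding
    // rather than the platform default charset.
    static String readAll(InputStream in, String encoding) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, encoding)) {
            char[] buf = new char[4096];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }

    // Stand-in for building a Lucene Document: maps field name to value.
    static Map<String, String> fromStream(String url, InputStream in, String encoding)
            throws IOException {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("url", url);
        fields.put("contents", readAll(in, encoding));
        return fields;
    }

    public static void main(String[] args) throws IOException {
        // Two Chinese characters, encoded as UTF-8 bytes (6 bytes in total).
        byte[] page = "\u4e2d\u6587".getBytes(StandardCharsets.UTF_8);
        Map<String, String> doc =
                fromStream("http://example.com/", new ByteArrayInputStream(page), "UTF-8");
        System.out.println(doc.get("contents").length());
    }
}
```

Decoding the same bytes as ISO-8859-1 would yield one character per byte,
which is the kind of corruption the explicit encoding avoids.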

Alex.





-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



**
 This message originated from the Internet. Its originator may or
 may not be who they claim to be and the information contained in
 the message and any attachments may or may not be accurate.
**








_

This transmission has been issued by a member of the HSBC Group
("HSBC") for the information of the addressee only and should not be
reproduced and / or distributed to any other person. Each page
attached hereto must be read in conjunction with any disclaimer which
forms part of it. This transmission is neither an offer nor the
solicitation
of an offer to sell or purchase any investment. Its contents are based
on information obtained from sources believed to be reliable but HSBC
makes no representation and accepts no responsibility or liability as to
its completeness or accuracy.





Memo: Re: Asian languages

2004-05-27 Thread alex . bourne




Hi Christophe,

We're currently indexing Chinese pages with little difficulty. You can use
the StandardAnalyzer to index the documents; it will tokenize the Chinese
content into individual characters. If you want a list of 'stop' words,
you will need to create your own analyzer and supply it with a list of
Unicode characters to stop. We are indexing HTML pages using a spider to
traverse the site and have subclassed Document into HTML_Document. This
allows us to set the content encoding for the input stream reader, which
enables it to process the Unicode characters correctly. Our system default
is ISO-8859-1, in common with most Western machines, so the default alone
would not decode the pages properly. You may need to do this too.
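
The per-character tokenization and the stop list described above can be
sketched in plain Java. This is an illustration under stated assumptions,
not Lucene's Analyzer or TokenStream API and not the analyzer we use: the
class name CharStopSketch, the tokenize method, and the choice to keep
only Han characters are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative only: plain Java, not a Lucene Analyzer.
class CharStopSketch {

    // Emits each Han character as its own token, skipping stop characters
    // and anything that is not a Han ideograph (punctuation, Latin, etc.).
    static List<String> tokenize(String text, Set<String> stopChars) {
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (Character.UnicodeScript.of(cp) != Character.UnicodeScript.HAN) {
                continue;
            }
            String token = new String(Character.toChars(cp));
            if (!stopChars.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Stop list with a single character, U+7684.
        Set<String> stop = new HashSet<>(Arrays.asList("\u7684"));
        // Input is three Han characters followed by an ideographic full stop.
        List<String> tokens = tokenize("\u6211\u7684\u4e66\u3002", stop);
        System.out.println(tokens.size()); // prints 2
    }
}
```

In a real analyzer the same filtering would sit in a token filter applied
after the tokenizer; the stop set here plays the role of the list of
Unicode characters to stop.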

Hope this helps

Alex.




Re: Asian languages

2004-05-26 Thread Chandan Tamrakar
CJKAnalyzer supports Chinese, Japanese, and Korean; I'm not sure
about Thai.
I got CJKAnalyzer from the Lucene sandbox.






Asian languages

2004-05-26 Thread Christophe Lombart
Which Asian languages are supported by Lucene?
What about Korean, Japanese, Thai, ...?
If they are not yet supported, what do I need to do?
Thanks,
Christophe