CJK Analyzer indexing japanese word document

2004-03-16 Thread Chandan Tamrakar

I am using the CJKAnalyzer from the Apache sandbox, and I have set the Java
file.encoding property to SJIS.
I am able to index and search a Japanese HTML page, and the index dumps look
as I expected. However, when I index a Word document containing Japanese
characters, it is not indexed as expected. Do I need to change anything in
the CJKTokenizer and CJKAnalyzer classes?
I have been able to index a Word document with StandardAnalyzer.

thanks in advance
chandan



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: UNIX command-line indexing script?

2004-03-16 Thread Linto Joseph Mathew

I have written one that will index PDF, DOC, XLS, XML, HTML, TXT and text/plain files. I
wrote it based on the demo application, using other
open source components: POI by Apache (for DOC and XLS) and PDFBox. I also modified the client
interface; it now looks like Google. I still have a couple of things to do:
  1) At present I'm using the UNIX 'file' command to check whether a file is plain text.
 This spawns a process and takes more time. The advantage is that on UNIX-based
machines the file extension is not important (it uses magic numbers).
  2) The information such as index location, directory, URL, etc. should be kept
in an XML file, so that it can be dynamic.
  3) Categories.


Since the Apache guys provided a good framework, everything was easy. Thanks, guys!


Linto
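The 'file'-command overhead mentioned in point (1) can be avoided in pure Java by reading the magic numbers directly. A minimal sketch (hypothetical helper, not part of the script above; only two signatures shown):

```java
import java.util.Arrays;

public class MagicSniffer {
    // "%PDF-" opens a PDF; D0 CF 11 E0 opens an OLE2 compound file
    // (legacy DOC/XLS).
    static final byte[] PDF_MAGIC  = {'%', 'P', 'D', 'F', '-'};
    static final byte[] OLE2_MAGIC = {(byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0};

    static boolean startsWith(byte[] head, byte[] magic) {
        return head.length >= magic.length
            && Arrays.equals(Arrays.copyOfRange(head, 0, magic.length), magic);
    }

    static boolean isPdf(byte[] head)     { return startsWith(head, PDF_MAGIC); }
    static boolean isOle2Doc(byte[] head) { return startsWith(head, OLE2_MAGIC); }

    public static void main(String[] args) {
        byte[] head = {'%', 'P', 'D', 'F', '-', '1', '.', '4'};
        System.out.println(isPdf(head));   // prints "true"
    }
}
```

No process is spawned; only the first few bytes of each file need to be read.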




On Sat, 13 Mar 2004 Charlie Smith wrote :
Anyone written a simple UNIX command-line indexing script which will read a
bunch of different kinds of docs and index them?  I'd like to make a cron job
out of this, so as to be able to come back and search it later.

A Perl or Java script would be fine.




Search in all fields

2004-03-16 Thread Rosen Marinov
In the QueryParser.parse method I must say which field is the default.

Does this mean that non-addressed queries are executed only over
this field?

The main question is:
How can I search in all fields of all documents in the index?
Note that I don't know the field names; there can be thousands of field
names across all the documents.

Thanks in advance.



Re: Search in all fields

2004-03-16 Thread Grant Ingersoll

You can use the MultiFieldQueryParser, which will generate a query against all of the 
fields you specify, or you could index all of your documents into one or two common 
fields and search against them.  Since you have a lot of fields, I would guess the 
latter is the better choice.  
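Grant's two options can be sketched roughly as follows (a sketch against the Lucene 1.3-era API; the field names are made up for illustration):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.MultiFieldQueryParser;
import org.apache.lucene.search.Query;

// Option 1: expand the user's query across an explicit list of fields.
String[] fields = {"title", "body", "keywords"};   // hypothetical names
Query q = MultiFieldQueryParser.parse("some query", fields,
                                      new StandardAnalyzer());

// Option 2: at index time, concatenate every field's text into one
// catch-all field (say, "contents") and parse against that field only.
```

Option 2 also avoids rewriting the query when the set of field names is unknown or very large, which matches the situation described above.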





order of Field objects within Document

2004-03-16 Thread Sam Hough
Can anybody confirm that no guarantee is given that Fields retain
their order within a Document?

Version 1.3 seems to retain it (although it reverses the order
on occasion).

It doesn't seem likely, but it would be really useful for my current application ;)
I'm just asking for clarification, not a change of spec, although a comment
in the JavaDoc for Document making it explicit might be handy.

Thanks

Sam




Re: CJK Analyzer indexing japanese word document

2004-03-16 Thread Che Dong
Some Korean friends tell me they use it successfully for Korean, so I think it also
works for Japanese. Mostly the problem is locale settings.

Please check the weblucene project for XML indexing samples:
http://sourceforge.net/projects/weblucene/ 

Che Dong
- Original Message - 
From: Chandan Tamrakar [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Tuesday, March 16, 2004 4:31 PM
Subject: CJK Analyzer indexing japanese word document


 
 

FAQ 3.41 (modify index while searching)

2004-03-16 Thread Brandon Lee
Hi.  In the Lucene FAQ, 3.41, it's stated:

http://lucene.sourceforge.net/cgi-bin/faq/faqmanager.cgi?file=chapter.search&toc=faq#q41

  41. Can I modify the index while performing ongoing searches?

  Yes and no. At the time of writing this FAQ (June 2001), Lucene is not
  thread safe in this regard. Here is a quote from Doug Cutting, the
  creator of Lucene:

    "The problems are only when you add documents or optimize an
    index, and then search with an IndexReader that was constructed
    before those changes to the index were made."

  A possible workaround is to perform the index updates in a parallel
  and separate index and switch to the new index when its updating is
  ...

Is this still true?  Looking on the mailing list, it seems like people
are doing this with the caveat that any Readers created will not include
documents added by any subsequent Writers.

Also, the FAQ does not state the consequences - will Lucene
crash if using a Reader while a subsequent Writer adds documents?  How
about in the case of optimize?

Thanks for any help.




Re: FAQ 3.41 (modify index while searching)

2004-03-16 Thread Brandon Lee
Oops, sorry - I found this on the mailing list referring to 3.41 (maybe
someone should update the FAQ item?):


  From: Doug Cutting [EMAIL PROTECTED]
  Subject: Re: Searching while optimizing
  Date: Wed, 20 Aug 2003 12:25:41 -0700

  "That is an old FAQ item.  Lucene has been thread safe for a while now."





Re: FAQ 3.41 (modify index while searching)

2004-03-16 Thread Otis Gospodnetic
From a quick scan of the entry, I would say it is still true.
This is not really an issue of being thread safe or not.

The jGuru Lucene FAQ is more up to date anyway, so I suggest you check
that one.

Otis

--- Brandon Lee [EMAIL PROTECTED] wrote:



RE: CJK Analyzer indexing japanese word document

2004-03-16 Thread Scott Smith
I have used this analyzer with Japanese and it works fine.  In fact, I'm
currently doing English, several western European languages, and traditional
and simplified Chinese and Japanese.  I throw them all into the same index
and have had no problems, other than that my users wanted the search limited by
language.  I solved that by simply adding a keyword field to the
Document which holds the two-letter language code.  I then automatically add
the term indicating the language as an additional constraint when the
user specifies the search.
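In Lucene 1.3-era terms, the language constraint described above might look like this (a sketch; the field name and language code are illustrative, not from the actual code):

```java
// At index time: tag each Document with an untokenized language keyword.
doc.add(Field.Keyword("lang", "ja"));   // hypothetical field name

// At search time: require the language term alongside the user's query.
BooleanQuery combined = new BooleanQuery();
combined.add(userQuery, true, false);                              // required
combined.add(new TermQuery(new Term("lang", "ja")), true, false);  // required
```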

You do need to be sure that the Shift-JIS gets converted to Unicode
before you put it in the Document (and pass it to the analyzer).
Internally, I believe Lucene wants everything in Unicode (as any good
Java program would). Originally, I had problems with Asian languages and
eventually determined my XML parser wasn't translating my Shift-JIS,
Big5, etc. to Unicode.  Once I fixed that, life was good.
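The conversion itself needs no Lucene at all; plain JDK I/O handles it, as long as the charset is named explicitly rather than left to the file.encoding default. A minimal sketch, assuming the input bytes are known to be Shift-JIS:

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;

public class SjisReader {
    static final Charset SJIS = Charset.forName("Shift_JIS");

    // InputStreamReader is the byte-stream to character-stream bridge:
    // it decodes Shift-JIS bytes into Java's internal Unicode strings.
    static String readAll(InputStream in) {
        try (Reader r = new BufferedReader(new InputStreamReader(in, SJIS))) {
            StringBuilder sb = new StringBuilder();
            int c;
            while ((c = r.read()) != -1) sb.append((char) c);
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Round trip: encode Japanese text to Shift-JIS bytes, decode back.
        byte[] sjisBytes = "\u65e5\u672c\u8a9e".getBytes(SJIS);
        System.out.println(readAll(new ByteArrayInputStream(sjisBytes)));
    }
}
```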

-Original Message-
From: Che Dong [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, March 16, 2004 8:31 AM
To: Lucene Users List
Subject: Re: CJK Analyzer indexing japanese word document




RE: Can lucene index both Big5 and GB2312 encoding character?

2004-03-16 Thread Scott Smith
I believe that Lucene only indexes Unicode (or, as Henry Ford might say,
you can search any encoding as long as it's Unicode).  Therefore, you
have to translate Big5 and GB2312 to Unicode before you put them in the
index.  Your code that reads in the HTML needs to be smart enough to
notice how the HTML is encoded and make the translation before it
creates the Document.  Likewise, your search terms need to be Unicode. 

Btw - you will also need to use something like Che Dong's CJKAnalyzer.
I don't believe the standard analyzers will handle Asian languages in
any kind of reasonable fashion.

Disclaimer: I am new to Lucene (about 6 months) and to processing Asian
languages.  In my defense, I do have a number of happy users using
Lucene to search Japanese and Chinese XML files for things of interest.

-Original Message-
From: Tuan Jean Tee [mailto:[EMAIL PROTECTED] 
Sent: Monday, March 15, 2004 11:07 PM
To: [EMAIL PROTECTED]
Subject: Can lucene index both Big5 and GB2312 encoding character?

If I have both Big5- and GB2312-encoded HTML files in two separate
directories, when I build the index is Lucene able to distinguish the
character sets, or does Lucene only work with a single encoding?

Thank you.





Re: RE: CJK Analyzer indexing japanese word document

2004-03-16 Thread xx28
My experience tells me that CJKAnalyzer needs to be improved somehow.

For example, a single-word wildcard search X* works perfectly; however, a multi-word
wildcard search XX* never works.

- Original Message -
From: Scott Smith [EMAIL PROTECTED]
Date: Tuesday, March 16, 2004 5:42 pm
Subject: RE: CJK Analyzer indexing japanese word document




phrases

2004-03-16 Thread Supun Edirisinghe
I have a field called businessname, and it contains values like
"Georgian House", "Georgian", "The Georgian House Hotel", and "Georgian blah
blee bloo Hotel", along with tens of thousands of other documents that have the
word 'Hotel' somewhere in the businessname field. 

When I do a phrase query on "Georgian Hotel" I get only the one document
back. I would like to get that one back as the top result, but also the
other documents that contain "Georgian" and "Hotel". Also, I'd like
"Georgian House Hotel" to show up before "Georgian blah blee bloo Hotel".

Right now I do an OR'd boolean query with
each of the words in the search string as a Term in businessname,
as well as
the entire search string as an exact PhraseQuery, boosted by 3.

But this doesn't allow me to ensure that "The Georgian House Hotel" will
come before "Georgian blah blee bloo Hotel" (there are other fields
queried besides businessname, and in my instance of the index
"Georgian blah blee bloo Hotel" comes out with a better score because of
other fields). I would like the closeness of the phrase to be taken
into account. Any ideas on constructing a good query for this situation?


thanks





Re: CJK Analyzer indexing japanese word document

2004-03-16 Thread Erik Hatcher
On Mar 16, 2004, at 8:39 PM, [EMAIL PROTECTED] wrote:
My experience tells me that CJKAnalyzer needs to be improved 
somehow

For example, single word X* search works perfectly, however, 
multiple words wildcard XX* never works.
Well, in this case it is QueryParser, not the analyzer, that is the culprit.
QueryParser does not analyze wildcard expressions - that is just the
nature of the beast.

You could override this behavior by subclassing QueryParser and overriding
getPrefixQuery (or perhaps getWildcardQuery too, though a single trailing
asterisk is a prefix query).

	Erik
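The subclassing described above might be sketched like this (a sketch, assuming the protected getPrefixQuery hook of the 1.3-era QueryParser; the normalization shown is purely illustrative):

```java
public class CustomQueryParser extends QueryParser {
    public CustomQueryParser(String field, Analyzer analyzer) {
        super(field, analyzer);
    }

    // Called for trailing-asterisk queries like XX*; the default builds a
    // PrefixQuery from the raw, unanalyzed text. Override to pre-process it.
    protected Query getPrefixQuery(String field, String termStr)
            throws ParseException {
        String normalized = termStr.toLowerCase();  // illustrative only
        return super.getPrefixQuery(field, normalized);
    }
}
```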



Re: phrases

2004-03-16 Thread Erik Hatcher
Try setting the slop factor on your phrase query.  This should 
accomplish what you want.  Set it to something like 10 and see what you 
get.

	Erik
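In code, the sloppy phrase suggested above might look like this (a sketch; the field name, terms, and slop value are illustrative):

```java
// Terms may sit up to `slop` positions apart, and closer matches
// score higher - so "Georgian House Hotel" outranks
// "Georgian blah blee bloo Hotel" for the phrase "Georgian Hotel".
PhraseQuery phrase = new PhraseQuery();
phrase.add(new Term("businessname", "georgian"));
phrase.add(new Term("businessname", "hotel"));
phrase.setSlop(10);
phrase.setBoost(3.0f);
```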

On Mar 16, 2004, at 8:55 PM, Supun Edirisinghe wrote:



Re: CJK Analyzer indexing japanese word document

2004-03-16 Thread Che Dong
Yes - store data as Unicode internally and present it localized externally.

Chinese users can read my documents on Java Unicode processing:
http://www.chedong.com/tech/hello_unicode.html
http://www.chedong.com/tech/unicode_java.html

Che Dong
http://www.chedong.com/

- Original Message - 
From: Scott Smith [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Wednesday, March 17, 2004 6:42 AM
Subject: RE: CJK Analyzer indexing japanese word document





Re: Search in all fields

2004-03-16 Thread Kelvin Tan


On Tue, 16 Mar 2004 08:11:34 -0500, Grant Ingersoll said:

 You can use the MultiFieldQueryParser, which will generate a query against all
of the fields you
 specify, or you could index all of your documents into one or two common
fields and search
 against them.  Since you have a lot of fields, I would guess the latter is the
better choice.


Don't you mean add one or two common fields to all your documents? Or am I
mistaken?

Anyway, I believe adding this common field constitutes a best practice, since
you definitely need such fields if one wishes to perform a date-range-only
search.

It would probably be a good idea to start a best-practices page in the Wiki.

K





Re: CJK Analyzer indexing japanese word document

2004-03-16 Thread Chandan Tamrakar
Thanks, Smith.  How do I convert SJIS-encoded text into Unicode?
As far as I know, Java converts ASCII and Latin-1 into Unicode by default.
Which XML parser are you using to translate to Unicode?

- Original Message -
From: Scott Smith [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Wednesday, March 17, 2004 4:27 AM
Subject: RE: CJK Analyzer indexing japanese word document





Re: CJK Analyzer indexing japanese word document

2004-03-16 Thread Che Dong
Please check Java I/O's byte-stream to character-stream bridge (e.g. InputStreamReader with an explicit charset).

Che Dong
- Original Message - 
From: Chandan Tamrakar [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Wednesday, March 17, 2004 12:37 PM
Subject: Re: CJK Analyzer indexing japanese word document

