CJK Analyzer indexing japanese word document

2004-03-16 Thread Chandan Tamrakar
I am using the CJKAnalyzer from the Apache sandbox. I have set the Java file.encoding setting to SJIS, and I am able to index and search Japanese HTML pages. The index dumps look as I expected. However, when I index a Word document containing Japanese characters, it is not indexed as expected

Re: UNIX command-line indexing script?

2004-03-16 Thread Linto Joseph Mathew
I have written one that will index PDF, DOC, XLS, XML, HTML, TXT and plain/text files. I based it on the demo application and used other open source components: POI by Apache (for DOC and Excel) and PDFBox. I also modified the client interface, so it now looks like Google. I still have to do a couple of

Search in all fields

2004-03-16 Thread Rosen Marinov
In the QueryParser.parse method I must specify the default field. Does this mean that queries that do not name a field are executed only against this field? The main question is: how can I search all fields in all documents in the index? Note that I don't know the field names; there can be thousands of field

Re: Search in all fields

2004-03-16 Thread Grant Ingersoll
You can use the MultiFieldQueryParser, which will generate a query against all of the fields you specify, or you could index all of your documents into one or two common fields and search against them. Since you have a lot of fields, I would guess the latter is the better choice. [EMAIL

order of Field objects within Document

2004-03-16 Thread Sam Hough
Can anybody confirm that no guarantee is given that Fields retain their order within a Document? Version 1.3 seems to retain it (although it reverses the order on occasion). It doesn't seem likely, but it would be really useful for my current application ;) I'm just asking for clarification, not a change of spec

Re: CJK Analyzer indexing japanese word document

2004-03-16 Thread Che Dong
Some Korean friends tell me they use it successfully for Korean, so I think it also works for Japanese. Mostly the problem is locale settings. Please check the WebLucene project for XML indexing samples: http://sourceforge.net/projects/weblucene/ Che Dong - Original Message - From: Chandan

FAQ 3.41 (modify index while searching)

2004-03-16 Thread Brandon Lee
Hi. In the Lucene FAQ, 3.41; it's stated: http://lucene.sourceforge.net/cgi-bin/faq/faqmanager.cgi?file=chapter.searchtoc=faq#q41 41. Can I modify the index while performing ongoing searches ? Yes and no. At the time of writing this FAQ (June 2001), Lucene is not thread safe in this

Re: FAQ 3.41 (modify index while searching)

2004-03-16 Thread Brandon Lee
Ooops, sorry, found this on the mailing list referring to 3.41: (maybe someone should update the FAQ item?) From: Doug Cutting [EMAIL PROTECTED] Subject: Re: Searching while optimizing Date: Wed, 20 Aug 2003 12:25:41 -0700 That is an old FAQ item. Lucene has been thread safe for a

Re: FAQ 3.41 (modify index while searching)

2004-03-16 Thread Otis Gospodnetic
From the quick scan of the entry, I would say the entry is still true. This is not an issue of being thread safe or not, really. The jGuru Lucene FAQ is more up to date anyway, so I suggest you check that one. Otis --- Brandon Lee [EMAIL PROTECTED] wrote: Ooops, sorry, found this on the

RE: CJK Analyzer indexing japanese word document

2004-03-16 Thread Scott Smith
I have used this analyzer with Japanese and it works fine. In fact, I'm currently doing English, several western European languages, traditional and simplified Chinese and Japanese. I throw them all in the same index and have had no problem other than my users wanted the search limited by

RE: Can lucene index both Big5 and GB2312 encoding character?

2004-03-16 Thread Scott Smith
I believe that Lucene only indexes Unicode (or, as Henry Ford might say, "you can search any encoding as long as it's Unicode"). Therefore, you have to translate Big5 and GB2312 to Unicode before you put them in the index. Your code that reads in the HTML needs to be smart enough to notice how the
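
A minimal sketch of the translation step described above, assuming the raw bytes of a page and the name of its charset are already known. The class and method names (CharsetDecodeSketch, decode) are hypothetical and not part of Lucene; detecting the charset from the HTML itself is not shown. It also assumes the JDK in use ships the Big5 and GB2312 charsets.

```java
import java.nio.charset.Charset;

class CharsetDecodeSketch {
    // Decodes raw bytes in the named legacy charset into a Java String,
    // which is Unicode (UTF-16) internally.
    static String decode(byte[] raw, String charsetName) {
        return new String(raw, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // Two Chinese characters, written as Unicode escapes so the source
        // file compiles the same way under any compiler encoding.
        String text = "\u4e2d\u6587";

        // Simulate the same text arriving in two different encodings.
        byte[] big5Bytes = text.getBytes(Charset.forName("Big5"));
        byte[] gbBytes = text.getBytes(Charset.forName("GB2312"));

        // The byte sequences differ, but the decoded strings are identical.
        String fromBig5 = decode(big5Bytes, "Big5");
        String fromGb = decode(gbBytes, "GB2312");
        System.out.println(fromBig5.equals(fromGb));
    }
}
```

Once both inputs are decoded this way, the indexing code only ever sees Unicode strings, regardless of how the source pages were encoded.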

Re: RE: CJK Analyzer indexing japanese word document

2004-03-16 Thread xx28
My experience tells me that CJKAnalyzer needs to be improved somehow. For example, a single-word wildcard search (X*) works perfectly; however, a multiple-word wildcard search (XX*) never works. - Original Message - From: Scott Smith [EMAIL PROTECTED] Date: Tuesday, March 16, 2004 5:42 pm Subject: RE: CJK

phrases

2004-03-16 Thread Supun Edirisinghe
I have a field called buisnessname and this field contains keywords like Georgian House Georgian The Georgian House Hotel Georgian blah blee bloo Hotel along with 10,000s of other documents that have the word 'Hotel' somewhere in the businessname field. When I do a phrase query on Georgian Hotel

Re: CJK Analyzer indexing japanese word document

2004-03-16 Thread Erik Hatcher
On Mar 16, 2004, at 8:39 PM, [EMAIL PROTECTED] wrote: My experience tells me that CJKAnalyzer needs to be improved somehow For example, single word X* search works perfectly, however, multiple words wildcard XX* never works. Well, in this case it is QueryParser, not the analyzer, as the

Re: phrases

2004-03-16 Thread Erik Hatcher
Try setting the slop factor on your phrase query. This should accomplish what you want. Set it to something like 10 and see what you get. Erik On Mar 16, 2004, at 8:55 PM, Supun Edirisinghe wrote: I have a field called buisnessname and this field contains keywords like Georgian House
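
To illustrate what a slop factor permits, here is a simplified stand-alone model: it only checks whether two terms appear in order with at most a given number of other tokens between them. This is an assumption-laden sketch, not Lucene's actual matching or scoring algorithm, and the names SlopSketch and withinSlop are hypothetical. In Lucene itself the setting is applied with PhraseQuery.setSlop(int).

```java
import java.util.Arrays;
import java.util.List;

class SlopSketch {
    // Returns true if `second` occurs after `first` with at most `slop`
    // other tokens between them. Matching is exact and case-sensitive.
    static boolean withinSlop(List<String> tokens, String first, String second, int slop) {
        for (int i = 0; i < tokens.size(); i++) {
            if (!tokens.get(i).equals(first)) {
                continue;
            }
            // Only look as far ahead as the slop allows.
            for (int j = i + 1; j < tokens.size() && j - i - 1 <= slop; j++) {
                if (tokens.get(j).equals(second)) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> name = Arrays.asList("Georgian", "blah", "blee", "bloo", "Hotel");
        // Three tokens sit between the two terms, so a slop of 3 is needed.
        System.out.println(withinSlop(name, "Georgian", "Hotel", 0));
        System.out.println(withinSlop(name, "Georgian", "Hotel", 3));
    }
}
```

Under this model an exact phrase (slop 0) misses the name with intervening words, while a generous value such as the suggested 10 would accept it.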

Re: CJK Analyzer indexing japanese word document

2004-03-16 Thread Che Dong
Yes: store data as Unicode internally and convert it to the local encoding for presentation. Chinese users can read my documents on Unicode processing in Java: http://www.chedong.com/tech/hello_unicode.html http://www.chedong.com/tech/unicode_java.html Che Dong http://www.chedong.com/ - Original Message - From:

Re: Search in all fields

2004-03-16 Thread Kelvin Tan
On Tue, 16 Mar 2004 08:11:34 -0500, Grant Ingersoll said: You can use the MultiFieldQueryParser, which will generate a query against all of the fields you specify, or you could index all of your documents into one or two common fields and search against them. Since you have a lot of fields,

Re: CJK Analyzer indexing japanese word document

2004-03-16 Thread Chandan Tamrakar
Thanks, Smith. How do I convert SJIS-encoded text into Unicode? As far as I know, Java converts ASCII and Latin-1 into Unicode by default. Which XML parsers are you using to translate to Unicode? - Original Message - From: Scott Smith [EMAIL PROTECTED] To: Lucene Users List

Re: CJK Analyzer indexing japanese word document

2004-03-16 Thread Che Dong
Please check how Java I/O converts a byte stream into a character stream. Che Dong - Original Message - From: Chandan Tamrakar [EMAIL PROTECTED] To: Lucene Users List [EMAIL PROTECTED] Sent: Wednesday, March 17, 2004 12:37 PM Subject: Re: CJK Analyzer indexing japanese word document thanks smith .
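
The byte-stream to character-stream conversion mentioned here can be sketched with java.io.InputStreamReader, which decodes bytes in a named charset into Java chars (Unicode). This is a minimal sketch under assumptions: the class and method names (SjisReaderSketch, openSjis, readAll) are hypothetical, the input is simulated in memory rather than read from a file, and the JDK in use is assumed to ship the Shift_JIS charset.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;

class SjisReaderSketch {
    // Wraps a byte stream in a Reader that decodes Shift_JIS into Java chars.
    static Reader openSjis(InputStream in) {
        return new InputStreamReader(in, Charset.forName("Shift_JIS"));
    }

    // Drains a Reader into a String.
    static String readAll(Reader reader) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[1024];
        int n;
        while ((n = reader.read(buf)) != -1) {
            sb.append(buf, 0, n);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Three Japanese characters as Unicode escapes, encoded to SJIS bytes
        // to stand in for the contents of an SJIS text or HTML file.
        String original = "\u65e5\u672c\u8a9e";
        byte[] sjisBytes = original.getBytes(Charset.forName("Shift_JIS"));

        try (Reader reader = openSjis(new ByteArrayInputStream(sjisBytes))) {
            String decoded = readAll(reader);
            System.out.println(decoded.equals(original));
        }
    }
}
```

Naming the charset explicitly on the Reader avoids depending on the JVM-wide file.encoding setting. Note that this applies to text and HTML files saved as SJIS; a Word document is a binary format, so its text would need to be extracted first.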