I am using the CJKAnalyzer from the Apache sandbox. I have set the Java
file.encoding setting to SJIS,
and I am able to index and search Japanese HTML pages. The
index dumps look as I expected. However, when I index a Word document containing
Japanese characters, it is not indexed as expected
I have written one that will index PDF, DOC, XLS, XML, HTML, TXT and plain/text files. I
wrote it based on the demo application, using other
open source components: POI by Apache (for DOC and Excel) and PDFBox. I also modified the client
interface; it now looks like Google. I still have to do a couple of
In the QueryParser.parse method I must specify the default field.
Does this mean that queries without an explicit field are executed only over
this field?
The main question is:
How can I search in all fields in all documents in the index?
Note that I don't know the field names; there can be thousands of fields
You can use the MultiFieldQueryParser, which will generate a query against all of the
fields you specify, or you could index all of your documents into one or two common
fields and search against them. Since you have a lot of fields, I would guess the
latter is the better choice.
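The second option (one or two common fields) can be sketched without any Lucene calls. The snippet below is only an illustration under assumptions: the class name `CatchAllField`, the field name `contents`, and the sample field names are hypothetical and not from this thread. It shows how every field value of a document could be joined into one string, which would then be indexed as a single extra field so that the default field of QueryParser covers all of the text.

```java
import java.util.Map;
import java.util.StringJoiner;

final class CatchAllField {
    // Hypothetical name for the single common field that receives all text.
    static final String CATCH_ALL = "contents";

    // Joins every non-empty field value into one space-separated string.
    // The field names themselves are never needed, so this also works when
    // a document has thousands of fields with unknown names.
    static String combine(Map<String, String> fields) {
        StringJoiner joiner = new StringJoiner(" ");
        for (String value : fields.values()) {
            if (value != null && !value.isEmpty()) {
                joiner.add(value);
            }
        }
        return joiner.toString();
    }
}
```

At indexing time the result of `combine` would be added to the document as one more field named `contents`, and searches would use that field as the default. The trade-off is a larger index, since the text is indexed twice if the original fields are kept as well.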
Can anybody confirm that no guarantee is given that Fields retain
their order within a Document?
Version 1.3 seems to (although reversing the order
on occasion).
It doesn't seem likely, but it would be really useful for my current application ;)
I'm just asking for clarification, not a change of spec
Some Korean friends tell me they use it successfully for Korean, so I think it also
works for Japanese. Mostly the problem is locale settings.
Please check the WebLucene project for XML indexing samples:
http://sourceforge.net/projects/weblucene/
Che Dong
- Original Message -
From: Chandan
Hi. In the Lucene FAQ, item 3.41, it's stated:
http://lucene.sourceforge.net/cgi-bin/faq/faqmanager.cgi?file=chapter.searchtoc=faq#q41
41. Can I modify the index while performing ongoing searches?
Yes and no. At the time of writing this FAQ (June 2001), Lucene is not
thread safe in this
Ooops, sorry, found this on the mailing list referring to 3.41: (maybe
someone should update the FAQ item?)
From: Doug Cutting [EMAIL PROTECTED]
Subject: Re: Searching while optimizing
Date: Wed, 20 Aug 2003 12:25:41 -0700
That is an old FAQ item. Lucene has been thread safe for a
From the quick scan of the entry, I would say the entry is still true.
This is not an issue of being thread safe or not, really.
The jGuru Lucene FAQ is more up to date anyway, so I suggest you check
that one.
Otis
--- Brandon Lee [EMAIL PROTECTED] wrote:
Ooops, sorry, found this on the
I have used this analyzer with Japanese and it works fine. In fact, I'm
currently doing English, several western European languages, traditional
and simplified Chinese and Japanese. I throw them all in the same index
and have had no problem other than my users wanted the search limited by
I believe that Lucene only indexes Unicode (or, as Henry Ford might say,
you can search any encoding as long as it's Unicode). Therefore, you
have to translate Big5 and GB2312 to Unicode before you put them in the
index. Your code that reads in the html needs to be smart enough to
notice how the
My experience tells me that CJKAnalyzer needs to be improved somehow.
For example, a single-word X* search works perfectly; however, a multiple-word wildcard
XX* search never works.
- Original Message -
From: Scott Smith [EMAIL PROTECTED]
Date: Tuesday, March 16, 2004 5:42 pm
Subject: RE: CJK
I have a field called businessname and this field contains keywords like
Georgian House Georgian The Georgian House Hotel Georgian blah
blee bloo Hotel along with 10,000s of other documents that have the
word 'Hotel' somewhere in the businessname field.
When I do a phrase query on Georgian Hotel
On Mar 16, 2004, at 8:39 PM, [EMAIL PROTECTED] wrote:
My experience tells me that CJKAnalyzer needs to be improved
somehow.
For example, a single-word X* search works perfectly; however,
a multiple-word wildcard XX* search never works.
Well, in this case it is QueryParser, not the analyzer, as the
Try setting the slop factor on your phrase query. This should
accomplish what you want. Set it to something like 10 and see what you
get.
Erik
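To make the slop idea concrete, here is a simplified sketch for a two-word phrase whose words appear in query order. It is not Lucene's implementation (the class `SlopSketch` and both method names are made up for illustration, and reversed word order is deliberately ignored); it only counts how many other tokens sit between the two words and compares that count with the slop.

```java
import java.util.List;

final class SlopSketch {
    // Returns the smallest number of tokens lying between an occurrence of
    // `first` and a later occurrence of `second`, or -1 if `second` never
    // follows `first`. Simplification: only in-order matches are considered.
    static int minGap(List<String> tokens, String first, String second) {
        int best = -1;
        int lastFirst = -1; // position of the most recent `first` seen so far
        for (int i = 0; i < tokens.size(); i++) {
            String token = tokens.get(i);
            if (token.equals(first)) {
                lastFirst = i;
            } else if (token.equals(second) && lastFirst >= 0) {
                int gap = i - lastFirst - 1;
                if (best < 0 || gap < best) {
                    best = gap;
                }
            }
        }
        return best;
    }

    // An exact phrase corresponds to slop 0; a larger slop tolerates that
    // many intervening tokens between the two words.
    static boolean matchesWithSlop(List<String> tokens, String first,
                                   String second, int slop) {
        int gap = minGap(tokens, first, second);
        return gap >= 0 && gap <= slop;
    }
}
```

Under this simplified model, the tokens `the georgian house hotel` have one token between `georgian` and `hotel`, so they fail with slop 0 and match with slop 1 or more, while `georgian blah blee bloo hotel` needs a slop of at least 3.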
On Mar 16, 2004, at 8:55 PM, Supun Edirisinghe wrote:
I have a field called businessname and this field contains keywords
like
Georgian House
Yes: store data in Unicode internally and convert to the local encoding for presentation.
Chinese readers can see my documents on Unicode processing in Java:
http://www.chedong.com/tech/hello_unicode.html
http://www.chedong.com/tech/unicode_java.html
Che Dong
http://www.chedong.com/
- Original Message -
From:
On Tue, 16 Mar 2004 08:11:34 -0500, Grant Ingersoll said:
You can use the MultiFieldQueryParser, which will generate a query against all
of the fields you
specify, or you could index all of your documents into one or two common
fields and search
against them. Since you have a lot of fields,
Thanks, Smith. How do I convert SJIS-encoded text into Unicode?
As far as I know, Java converts ASCII and Latin-1 into Unicode by default.
Which XML parser are you using to translate to Unicode?
- Original Message -
From: Scott Smith [EMAIL PROTECTED]
To: Lucene Users List
Please check Java I/O's byte stream to character stream conversion (ByteStream ==> CharacterStream).
Che Dong
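As a minimal sketch of that byte-stream to character-stream step, the standard `InputStreamReader` can be given an explicit charset instead of relying on the `file.encoding` default. The class name `SjisToUnicode` and the method `decode` are hypothetical, and the sketch assumes the JDK in use ships the `Shift_JIS` charset (full JDK distributions normally do). Text extracted from a Word document by a library such as POI may already be Unicode, in which case no such conversion applies.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;

final class SjisToUnicode {
    // Reads all bytes from `in`, decodes them with the named charset, and
    // returns the text as a Java String (which is Unicode internally).
    static String decode(InputStream in, String charsetName) {
        // Charset.forName throws if the charset is not supported by this JDK.
        Charset charset = Charset.forName(charsetName);
        StringBuilder text = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, charset)) {
            char[] buffer = new char[4096];
            int count;
            while ((count = reader.read(buffer)) != -1) {
                text.append(buffer, 0, count);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }
}
```

For a file on disk, something like `SjisToUnicode.decode(new FileInputStream(path), "Shift_JIS")` would yield a String that can be handed to the analyzer. Passing "Big5" or "GB2312" instead handles the Chinese encodings mentioned earlier in the same way.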
- Original Message -
From: Chandan Tamrakar [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Wednesday, March 17, 2004 12:37 PM
Subject: Re: CJK Analyzer indexing japanese word document
Thanks, Smith.