Re: CJK Analyzer indexing japanese word document
please check the java i/o's ByteStream ==> CharacterStream conversion

Che Dong

- Original Message -
From: "Chandan Tamrakar" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Wednesday, March 17, 2004 12:37 PM
Subject: Re: CJK Analyzer indexing japanese word document

> thanks Smith. How do I convert SJIS-encoded text into Unicode? As far
> as I know, Java converts ASCII and Latin-1 into Unicode by default.
> Which XML parser are you using to translate to Unicode?
Re: CJK Analyzer indexing japanese word document
thanks Smith. How do I convert SJIS-encoded text into Unicode? As far as I
know, Java converts ASCII and Latin-1 into Unicode by default. Which XML
parser are you using to translate to Unicode?

- Original Message -
From: "Scott Smith" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Wednesday, March 17, 2004 4:27 AM
Subject: RE: CJK Analyzer indexing japanese word document

> I have used this analyzer with Japanese and it works fine.
> [...]
> You do need to be sure that the Shift-JIS gets converted to Unicode
> before you put it in the Document (and pass it to the analyzer).
> Internally, I believe Lucene wants everything in Unicode (as any good
> Java program would).
Re: Search in all fields
On Tue, 16 Mar 2004 08:11:34 -0500, Grant Ingersoll said:

> You can use the MultiFieldQueryParser, which will generate a query
> against all of the fields you specify, or you could index all of your
> documents into one or two common fields and search against them. Since
> you have a lot of fields, I would guess the latter is the better choice.

Don't you mean "add one or two common fields to all your documents"? Or am
I mistaken? Anyway, I believe adding such a common field constitutes a
best practice, since you definitely need these fields if one wishes to
perform a date-range-only search. It would probably be a good idea to
start a best-practices page in the Wiki.

K
Re: CJK Analyzer indexing japanese word document
Yes: store data as Unicode internally and present it localized externally.
Chinese users can read my documents on Java Unicode processing:
http://www.chedong.com/tech/hello_unicode.html
http://www.chedong.com/tech/unicode_java.html

Che Dong
http://www.chedong.com/

- Original Message -
From: "Scott Smith" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Wednesday, March 17, 2004 6:42 AM
Subject: RE: CJK Analyzer indexing japanese word document

> I have used this analyzer with Japanese and it works fine.
> [...]
> You do need to be sure that the Shift-JIS gets converted to Unicode
> before you put it in the Document (and pass it to the analyzer).
Re: phrases
Try setting the slop factor on your phrase query. This should accomplish
what you want. Set it to something like 10 and see what you get.

	Erik

On Mar 16, 2004, at 8:55 PM, Supun Edirisinghe wrote:

> I have a field called businessname [...] I would like the closeness of
> the phrase to be taken into account. Any ideas on constructing a good
> query for this situation?
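For readers finding this thread later: a sketch of what Erik's suggestion looks like against the Lucene 1.x API of this era, combined with Supun's existing OR'd term queries. Field and term values are illustrative, and this assumes the 1.x `BooleanQuery.add(query, required, prohibited)` signature:

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.TermQuery;

public class SloppyPhraseExample {
    public static BooleanQuery build() {
        // OR the individual terms so partial matches still surface.
        BooleanQuery query = new BooleanQuery();
        query.add(new TermQuery(new Term("businessname", "georgian")), false, false);
        query.add(new TermQuery(new Term("businessname", "hotel")), false, false);

        // Add a sloppy, boosted phrase: documents where the terms occur
        // within ~10 positions of each other get a bonus, and closer
        // matches score higher than distant ones - so "Georgian House
        // Hotel" should outrank "Georgian blah blee bloo Hotel".
        PhraseQuery phrase = new PhraseQuery();
        phrase.add(new Term("businessname", "georgian"));
        phrase.add(new Term("businessname", "hotel"));
        phrase.setSlop(10);
        phrase.setBoost(3.0f);
        query.add(phrase, false, false);
        return query;
    }
}
```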
Re: CJK Analyzer indexing japanese word document
On Mar 16, 2004, at 8:39 PM, [EMAIL PROTECTED] wrote:

> My experience tells me that CJKAnalyzer needs to be improved somehow.
> For example, a single-word wildcard search "X*" works perfectly, but a
> multi-word wildcard "XX*" never works.

Well, in this case the culprit is QueryParser, not the analyzer.
QueryParser does not analyze wildcard expressions - that is just the
nature of the beast. You could override this behavior by subclassing and
overriding getPrefixQuery (or getWildcardQuery too, perhaps - though a
single trailing asterisk is a prefix query).

	Erik
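A sketch of the subclass Erik describes, against the Lucene 1.x QueryParser API (method names as in the 1.4-era sources; check your version). The idea is to run the prefix text through the analyzer before building the query, so multi-character CJK input is tokenized the same way it was at index time:

```java
import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;

public class AnalyzingPrefixQueryParser extends QueryParser {

    private final Analyzer analyzer;

    public AnalyzingPrefixQueryParser(String field, Analyzer analyzer) {
        super(field, analyzer);
        this.analyzer = analyzer;
    }

    protected Query getPrefixQuery(String field, String termStr)
            throws ParseException {
        try {
            // Analyze the prefix text and keep the first token's text,
            // instead of using the raw, unanalyzed string.
            TokenStream ts = analyzer.tokenStream(field, new StringReader(termStr));
            Token token = ts.next();
            ts.close();
            if (token != null) {
                termStr = token.termText();
            }
        } catch (IOException e) {
            // Fall through and use the raw text.
        }
        return super.getPrefixQuery(field, termStr);
    }
}
```

This is a sketch, not a drop-in fix: for CJK bigram analyzers a multi-character prefix may produce several tokens, and deciding what to do with the rest of them is the real design question.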
phrases
I have a field called businessname, and this field contains keywords like
"Georgian House", "Georgian", "The Georgian House Hotel", and "Georgian
blah blee bloo Hotel", along with 10,000s of other documents that have the
word "Hotel" somewhere in the businessname field.

When I do a phrase query on "Georgian Hotel" I get only the one document
back. I would like to get that one back as the top result, but also the
other stuff that has "Georgian" and "Hotel" too. Also, I'd like to have
"Georgian House Hotel" show up before "Georgian blah blee bloo Hotel".

Right now I do an OR'd boolean query with each of the words in the search
string as a Term in businessname, as well as the entire search string as
an exact PhraseQuery, and boost that by 3. But this doesn't allow me to
ensure that "The Georgian House Hotel" will come before "Georgian blah
blee bloo Hotel" (there are other fields queried besides businessname, and
in my instance of the index "Georgian blah blee bloo Hotel" comes out with
a better score because of other fields). I would like the closeness of the
phrase to be taken into account.

Any ideas on constructing a good query for this situation?

thanks
Re: RE: CJK Analyzer indexing japanese word document
My experience tells me that CJKAnalyzer needs to be improved somehow. For
example, a single-word wildcard search "X*" works perfectly, but a
multi-word wildcard "XX*" never works.

- Original Message -
From: Scott Smith <[EMAIL PROTECTED]>
Date: Tuesday, March 16, 2004 5:42 pm
Subject: RE: CJK Analyzer indexing japanese word document

> I have used this analyzer with Japanese and it works fine.
> [...]
> You do need to be sure that the Shift-JIS gets converted to Unicode
> before you put it in the Document (and pass it to the analyzer).
RE: Can lucene index both Big5 and GB2312 encoding character?
I believe that Lucene only indexes Unicode (or, as Henry Ford might say,
"you can search any encoding as long as it's Unicode"). Therefore, you
have to translate Big5 and GB2312 to Unicode before you put them in the
index. Your code that reads in the HTML needs to be smart enough to notice
how the HTML is encoded and make the translation before it creates the
Document. Likewise, your search terms need to be Unicode.

BTW - you will also need to use something like Che Dong's CJKAnalyzer. I
don't believe the standard analyzers will handle Asian languages in any
kind of reasonable fashion.

Disclaimer: I am new to Lucene (about 6 months) and to processing Asian
languages. In my defense, I do have a number of happy users using Lucene
to search Japanese and Chinese XML files for things of interest.

-Original Message-
From: Tuan Jean Tee [mailto:[EMAIL PROTECTED]]
Sent: Monday, March 15, 2004 11:07 PM
To: [EMAIL PROTECTED]
Subject: Can lucene index both Big5 and GB2312 encoding character?

Can I find out: if I have both Big5- and GB2312-encoded HTML files in two
separate directories, is Lucene able to distinguish the character sets
when I build the index, or does Lucene only work with a single encoding?

Thank you.
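The "smart enough to notice how the HTML is encoded" step usually means reading the charset from the page's meta tag (or an HTTP header) and then decoding with that charset rather than the platform default. A toy sketch of the decoding side, using only the standard library; the regex-based sniffer is deliberately simplistic and real pages may declare the charset elsewhere:

```java
import java.nio.charset.Charset;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HtmlDecoder {

    private static final Pattern META_CHARSET =
            Pattern.compile("charset=([\\w-]+)", Pattern.CASE_INSENSITIVE);

    // Guess the charset from a charset= declaration in the page bytes;
    // fall back to the supplied default. Decoding as ISO-8859-1 is safe
    // here because the declaration itself is ASCII in both Big5 and
    // GB2312 (both are ASCII-compatible).
    public static String sniffCharset(byte[] htmlBytes, String fallback) {
        String ascii = new String(htmlBytes, Charset.forName("ISO-8859-1"));
        Matcher m = META_CHARSET.matcher(ascii);
        return m.find() ? m.group(1) : fallback;
    }

    // Decode the page to a Unicode String using the detected charset.
    public static String decode(byte[] htmlBytes, String fallback) {
        return new String(htmlBytes, Charset.forName(sniffCharset(htmlBytes, fallback)));
    }
}
```

With this in place, files from the Big5 directory and the GB2312 directory both arrive in the index as Unicode, and Lucene never needs to distinguish them.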
RE: CJK Analyzer indexing japanese word document
I have used this analyzer with Japanese and it works fine. In fact, I'm
currently doing English, several western European languages, traditional
and simplified Chinese, and Japanese. I throw them all in the same index
and have had no problem, other than my users wanting the search limited by
language. I solved that problem by simply adding a keyword field to the
Document which holds the 2-letter language code. I then automatically add
the term indicating the language as an additional constraint when the user
specifies the search.

You do need to be sure that the Shift-JIS gets converted to Unicode before
you put it in the Document (and pass it to the analyzer). Internally, I
believe Lucene wants everything in Unicode (as any good Java program
would). Originally, I had problems with Asian languages and eventually
determined my XML parser wasn't translating my Shift-JIS, Big5, etc. to
Unicode. Once I fixed that, life was good.

-Original Message-
From: Che Dong [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, March 16, 2004 8:31 AM
To: Lucene Users List
Subject: Re: CJK Analyzer indexing japanese word document

> Some Korean friends tell me they use it successfully for Korean, so I
> think it also works for Japanese. Mostly the problem is locale
> settings.
>
> Please check the weblucene project for XML indexing samples:
> http://sourceforge.net/projects/weblucene/
>
> Che Dong
Re: FAQ 3.41 (modify index while searching)
From the quick scan of the entry, I would say the entry is still true.
This is not an issue of being thread safe or not, really.

The jGuru Lucene FAQ is more up to date anyway, so I suggest you check
that one.

Otis

--- Brandon Lee <[EMAIL PROTECTED]> wrote:

> Ooops, sorry, found this on the mailing list referring to 3.41 (maybe
> someone should update the FAQ item?):
>
> From: Doug Cutting <[EMAIL PROTECTED]>
> Subject: Re: Searching while optimizing
> Date: Wed, 20 Aug 2003 12:25:41 -0700
>
> That is an old FAQ item. Lucene has been thread safe for a while now.
Re: FAQ 3.41 (modify index while searching)
Ooops, sorry, found this on the mailing list referring to 3.41 (maybe
someone should update the FAQ item?):

From: Doug Cutting <[EMAIL PROTECTED]>
Subject: Re: Searching while optimizing
Date: Wed, 20 Aug 2003 12:25:41 -0700

That is an old FAQ item. Lucene has been thread safe for a while now.

* Brandon Lee <[EMAIL PROTECTED]> [2004-03-16 12:51]:
| Hi. In the Lucene FAQ, 3.41, it's stated:
| http://lucene.sourceforge.net/cgi-bin/faq/faqmanager.cgi?file=chapter.search&toc=faq#q41
| [...]
| Is this still true? Looking on the mailing list, it seems like people
| are doing this with the caveat that any Readers created will not
| include documents added by any subsequent Writers.
FAQ 3.41 (modify index while searching)
Hi. In the Lucene FAQ, 3.41, it's stated:
http://lucene.sourceforge.net/cgi-bin/faq/faqmanager.cgi?file=chapter.search&toc=faq#q41

    41. Can I modify the index while performing ongoing searches?

    Yes and no. At the time of writing this FAQ (June 2001), Lucene is
    not thread safe in this regard. Here is a quote from Doug Cutting,
    the creator of Lucene:

        The problems are only when you add documents or optimize an
        index, and then search with an IndexReader that was constructed
        before those changes to the index were made.

    A possible workaround is to perform the index updates in a parallel
    and separate index and switch to the new index when its updating is
    ...

Is this still true? Looking on the mailing list, it seems like people are
doing this with the caveat that any Readers created will not include
documents added by any subsequent Writers.

Also, this FAQ does not state the consequences of this - will Lucene crash
if using a Reader where a subsequent Writer adds documents? How about in
the case of optimize?

Thanks for any help.
Re: CJK Analyzer indexing japanese word document
Some Korean friends tell me they use it successfully for Korean, so I
think it also works for Japanese. Mostly the problem is locale settings.

Please check the weblucene project for XML indexing samples:
http://sourceforge.net/projects/weblucene/

Che Dong

- Original Message -
From: "Chandan Tamrakar" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Tuesday, March 16, 2004 4:31 PM
Subject: CJK Analyzer indexing japanese word document

> I am using the CJKAnalyzer from the Apache sandbox, and I have set the
> Java file.encoding setting to SJIS. I am able to index and search the
> Japanese HTML page, and I can see the index dumps as I expected.
> However, when I index a Word document containing Japanese characters
> it is not indexed as expected. Do I need to change anything in the
> CJKTokenizer and CJKAnalyzer classes? I have been able to index a Word
> document with StandardAnalyzer.
order of Field objects within Document
Can anybody confirm that no guarantee is given that Fields retain their
order within a Document? Version 1.3 seems to preserve it (although it
reverses the order on occasion). It doesn't seem likely, but it would be
really useful for my current application ;)

I'm just asking for clarification, not a change of spec, although a
comment in the JavaDoc for Document making it explicit might be handy.

Thanks
Sam
Re: Search in all fields
You can use the MultiFieldQueryParser, which will generate a query against
all of the fields you specify, or you could index all of your documents
into one or two common fields and search against them. Since you have a
lot of fields, I would guess the latter is the better choice.

>>> [EMAIL PROTECTED] 03/16/04 07:56AM >>>

> In the QueryParser.parse method I must give a default field. Does this
> mean that non-addressed queries are executed only over this field?
>
> The main question is: how can I search in all fields in all documents
> in the index? Note that I don't know the field names; there can be
> thousands of field names across all documents.
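For reference, the first option looks roughly like this against the Lucene 1.x API; the field names are illustrative. Note it only works when you can enumerate the fields up front, which is exactly why the catch-all common field wins when the field names are open-ended:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.MultiFieldQueryParser;
import org.apache.lucene.search.Query;

public class AllFieldsSearch {
    // Expand the user's query across an explicit list of fields.
    public static Query parse(String userQuery) throws Exception {
        String[] fields = {"title", "body", "keywords"};
        return MultiFieldQueryParser.parse(userQuery, fields, new StandardAnalyzer());
    }
}
```

The alternative is to concatenate all field values into one "contents" field at index time and make that the QueryParser default field; then unqualified queries hit everything regardless of how many distinct field names exist.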
Search in all fields
In the QueryParser.parse method I must give a default field. Does this
mean that non-addressed queries are executed only over this field?

The main question is: how can I search in all fields in all documents in
the index? Note that I don't know the field names; there can be thousands
of field names across all documents.

Thanks in advance.
Re: UNIX command-line indexing script?
I have written one that will index PDF, DOC, XLS, XML, HTML, and
plain-text files. I wrote it based on the demo application, using other
open source components: POI from Apache (for DOC and Excel) and PDFBox. I
modified the client interface also; now it looks like Google. I still have
a couple of things to do:

1) At present I'm using the UNIX 'file' command to check whether a file is
plain text. This spawns a process and takes more time. The advantage is
that on Unix-based machines the file extension is not important (it uses
magic numbers).

2) The information such as index location, directory, URL, etc. should be
kept in an XML file, so that it can be dynamic.

3) Category support.

Since the Apache guys provided a good framework, everything was made easy.
Thanks guys!

Linto

On Sat, 13 Mar 2004 Charlie Smith wrote:

> Anyone written a simple UNIX command-line indexing script which will
> read a bunch of different kinds of docs and index them? I'd like to
> make a cron job out of this so as to be able to come back and read it
> later during a search.
>
> A PERL or Java script would be fine.
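The magic-number check in point 1 can be done in-process instead of spawning `file`, which removes the per-document fork overhead. A small sketch covering the formats mentioned above; the signatures are the standard ones (`%PDF` for PDF, the OLE2 compound-document header for DOC/XLS, `<?xml` for XML), and anything unrecognized falls back to plain text:

```java
import java.util.Arrays;

public class FileTypeSniffer {

    // Well-known file signatures, checked against the first bytes.
    private static final byte[] PDF  = {'%', 'P', 'D', 'F'};
    private static final byte[] OLE2 = {(byte) 0xD0, (byte) 0xCF, (byte) 0x11,
                                        (byte) 0xE0, (byte) 0xA1, (byte) 0xB1,
                                        (byte) 0x1A, (byte) 0xE1};
    private static final byte[] XML  = {'<', '?', 'x', 'm', 'l'};

    // Classify a file from the first bytes of its content.
    public static String sniff(byte[] head) {
        if (startsWith(head, PDF))  return "pdf";
        if (startsWith(head, OLE2)) return "ole2"; // DOC or XLS container
        if (startsWith(head, XML))  return "xml";
        return "text";                             // fallback heuristic
    }

    private static boolean startsWith(byte[] data, byte[] magic) {
        return data.length >= magic.length
                && Arrays.equals(Arrays.copyOf(data, magic.length), magic);
    }
}
```

HTML would need a looser check (a case-insensitive scan for `<html` or a doctype), since it has no fixed leading signature; this sketch leaves it in the text bucket.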
CJK Analyzer indexing japanese word document
I am using the CJKAnalyzer from the Apache sandbox, and I have set the
Java file.encoding setting to SJIS. I am able to index and search the
Japanese HTML page, and I can see the index dumps as I expected. However,
when I index a Word document containing Japanese characters it is not
indexed as expected. Do I need to change anything in the CJKTokenizer and
CJKAnalyzer classes? I have been able to index a Word document with
StandardAnalyzer.

thanks in advance
chandan
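For context on what "indexed as expected" means here: CJKTokenizer emits overlapping two-character (bigram) tokens for CJK runs, which is why its index dumps look different from StandardAnalyzer's single-character terms. A toy standalone version of that bigram scheme, assuming the CJK run has already been extracted from the surrounding text:

```java
import java.util.ArrayList;
import java.util.List;

public class CjkBigrams {

    // Emit overlapping two-character tokens the way a CJK bigram
    // tokenizer does: "ABCD" -> AB, BC, CD. A lone character is emitted
    // as-is. (Toy version: assumes the input is a single CJK run.)
    public static List<String> bigrams(String cjkRun) {
        List<String> tokens = new ArrayList<String>();
        if (cjkRun.length() == 1) {
            tokens.add(cjkRun);
            return tokens;
        }
        for (int i = 0; i + 2 <= cjkRun.length(); i++) {
            tokens.add(cjkRun.substring(i, i + 2));
        }
        return tokens;
    }
}
```

If the Word extraction step hands the analyzer mis-decoded bytes rather than proper Unicode, these bigrams are built from garbage characters, which matches the symptom described above: the HTML path (correctly decoded) works while the Word path does not.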