Re: StandardAnalyzer unit tests?
€ 0.02: When indexing code, "++" is a stop term, and it might appear in English text as well. 'C' is a not very descriptive but perfectly valid variable name. '#' is used in some old Morse transcripts, I think. I am not going to die or get fired over it, but I'd suggest not including those tokens in a standard anything.

Erik Hatcher wrote:

> I personally don't have a problem with that change, however I don't
> like changing such things as they can lead to unexpected and confusing
> issues later. Suppose someone upgrades their version of Lucene without
> re-indexing and now queries that used to work no longer work? (sure, I
> agree it is wise to re-index if you upgrade Lucene).
>
> Perhaps others could chime in on whether this change would adversely
> affect them or if this is a desirable change?
>
> Erik
>
> On Jan 17, 2005, at 4:51 AM, Chris Lamprecht wrote:
>
>> Erik, Paul, Daniel,
>>
>> I submitted a testcase --
>> http://issues.apache.org/bugzilla/show_bug.cgi?id=33134
>>
>> On a related note, what do you all think about updating the
>> StandardAnalyzer grammar to treat "C#" and "C++" as tokens? It's a
>> small modification to the grammar -- NutchAnalysis.jj has it.
>>
>> -Chris
>>
>> On Mon, 17 Jan 2005 03:23:41 -0500, Erik Hatcher
>> <[EMAIL PROTECTED]> wrote:
>>
>>> I don't see any tests of StandardAnalyzer either. Your contribution
>>> would be most welcome. There are tests that use StandardAnalyzer, but
>>> not to test it directly.
>>
>> -
>> To unsubscribe, e-mail: [EMAIL PROTECTED]
>> For additional commands, e-mail: [EMAIL PROTECTED]

--
The information contained in this communication and any attachments is confidential and may be privileged, and is for the sole use of the intended recipient(s). Any unauthorized review, use, disclosure or distribution is prohibited.
If you are not the intended recipient, please notify the sender immediately by replying to this message and destroy all copies of this message and any attachments. ASML is neither liable for the proper and complete transmission of the information contained in this communication, nor for any delay in its receipt.
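To make the trade-off in this exchange concrete, here is a rough sketch of what treating "C++" and "C#" as tokens does to input text. It is not the StandardAnalyzer JavaCC grammar and not the NutchAnalysis.jj rule; the class name, the regular expression, and the lowercasing step are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; this is not Lucene's or Nutch's tokenizer.
final class LanguageNameTokenizerSketch {

    // Alternation order matters: the two special cases are tried before the
    // generic "run of letters and digits" rule.
    private static final Pattern TOKEN =
            Pattern.compile("[Cc]\\+\\+|[Cc]#|[\\p{L}\\p{N}]+");

    // Returns lowercased tokens; punctuation that matches no rule is dropped.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group().toLowerCase(Locale.ROOT));
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("I know C++ and C# but not C"));
        // Source code containing an increment of a variable named c:
        System.out.println(tokenize("c++;"));
    }
}
```

With a rule like this, an increment of a variable named c in indexed source code produces the same token as the language name, which is the kind of collision the reply above warns about.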
Re: IndexWriter failure leaves lock in place
Joseph (and others),

I'm not an expert on Lucene either. Your mail just rang a bell, and I thought I'd contribute the ring for any expert to use.

I have found stale locks on a system running on Solaris/iPlanet with the FSDirectory. The same code does not pose a problem in a Windows/Apache/Tomcat environment. I cannot reproduce the problem yet, and I'm not sure if it is new to version 1.4 (the system ran on Lucene 1.2 before).

Joseph Ottinger wrote:

>I'm still working through making my own directory, based on JDBC (and yes,
>I know, there are some out there already, unsuitable for this reason or
>that reason.)
>
>One thing I've noticed is that the Lock procedure in IndexWriter is a
>little off, I think.
>
>My normal process on application startup is to get an IndexWriter, just to
>make sure an index is there. If I get an exception (FileNotFoundException
>for the FSDirectory, for example), I assume the index isn't created
>properly, so then I create a new IndexWriter set to create the index.
>
>With a file-based directory, that works well enough - and I realise there
>might be a better way to do it (but I don't know it yet.)
>
>However, the SQL-based directory leaves the lock. I think what's happening
>is that the IndexWriter constructor (IndexWriter.java:216 from 1.4.3's
>source distribution) is obtaining the lock, but then the synchronized block
>(starting at line 227) gets an IOException from
>segmentInfos.read(directory), which throws an IOException - but the
>writeLock is never explicitly removed once it's obtained.
>
>I would think that a try/finally (or something even more predictable,
>like a try/catch that rethrows the IOException after cleanup) would be
>appropriate to clear the lock *provided it's obtained* in the IndexWriter
>construction, and it'd make the code that I typically use work regardless
>of the specific directory I rely on.
>
>Now, to be sure, I'm VERY FAR from a Lucene expert; am I missing
>something? (I can contribute a patch if you'd like.)
>
>---
>Joseph B. Ottinger http://enigmastation.com
>IT Consultant [EMAIL PROTECTED]
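The try/finally cleanup proposed in the quoted message could look roughly like the sketch below. It does not reproduce the IndexWriter source; SimpleLock, SegmentReader, and openWriter are hypothetical stand-ins, used only to show a lock being released when the step after obtaining it throws.

```java
import java.io.IOException;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch; none of these types are Lucene classes.
final class LockCleanupSketch {

    // Stand-in for a directory write lock.
    static final class SimpleLock {
        private final AtomicBoolean held = new AtomicBoolean(false);

        boolean obtain() {
            return held.compareAndSet(false, true);
        }

        void release() {
            held.set(false);
        }

        boolean isLocked() {
            return held.get();
        }
    }

    // Stand-in for the step that reads existing segment information.
    interface SegmentReader {
        void read() throws IOException;
    }

    // Obtains the lock, then runs the read step. If the read step throws,
    // the lock is released before the exception propagates to the caller.
    static void openWriter(SimpleLock writeLock, SegmentReader reader) throws IOException {
        if (!writeLock.obtain()) {
            throw new IOException("Index locked for write");
        }
        boolean success = false;
        try {
            reader.read();
            success = true;
        } finally {
            if (!success) {
                writeLock.release(); // only reached when we hold the lock
            }
        }
    }

    public static void main(String[] args) {
        SimpleLock lock = new SimpleLock();
        try {
            openWriter(lock, () -> {
                throw new IOException("no segments file");
            });
        } catch (IOException e) {
            System.out.println("open failed, lock still held: " + lock.isLocked());
        }
    }
}
```

The success flag keeps the lock held when opening succeeds, so a caller that falls back to creating a new index after a failure is not blocked by a stale lock from the failed attempt.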
Re: searching using the CJKAnalyzer
Che Dong wrote:

> CJKAnalyser does not support a single-byte stream; the front-end
> interface and the back-end indexing process need to transform the source
> into a double-byte character stream properly before search/index.
>
> Please let me know the output of
> http://www.chedong.com/tech/HelloUnicode.java
> with javac -encoding gb2312 and javac -encoding iso-8859-1

Here's the output. I can see what's wrong in it. Should I put an extra field in the index containing the encoding?

> Regards
>
> Che Dong
>
> Daan Hoogland wrote:
>
>> Jon Schuster wrote:
>>
>>> I didn't need to make any changes to Entities to get Japanese
>>> searches working. Are you using the CJKAnalyzer when you perform the
>>> search, not only when building the index?
>>
>> Yes, I use CJKAnalyzer all around. When searching I translate
>> character entities in order to find anything. When displaying search
>> results, I don't see anything that looks like part of an East Asian
>> character set. Instead I see accented Latin characters and mathematical
>> symbols.
>>
>> When I don't pass entities, by the way, things get really nasty:
>> query passed: >Î??Âââ<
>> char(Î, LATIN_1_SUPPLEMENT) char(?, LATIN_1_SUPPLEMENT) token found
>> : >Î< length: 1
>> char(?, LATIN_1_SUPPLEMENT) char(Â, LATIN_1_SUPPLEMENT) char(â,
>> LATIN_1_SUPPLEMENT) token found : >Â< length: 1
>> char(â, LATIN_1_SUPPLEMENT) searching contents:"Î Â"
>>
>> This was a query for two Japanese characters.
>>
>>> -----Original Message-----
>>> From: Daan Hoogland [mailto:[EMAIL PROTECTED]
>>> Sent: Sunday, October 10, 2004 10:48 PM
>>> To: Lucene Users List
>>> Subject: Re: searching using the CJKAnalyzer
>>> Importance: Low
>>>
>>> Che Dong wrote:
>>>
>>>> This seems to be not an Analyzer problem but an HTML parser charset
>>>> detection error.
>>>>
>>>> Could you show me the detail of the problem?
>>>
>>> Thanks Che,
>>> I got it working by making the decode() from the Entities in demo
>>> public. I wrote a scanner to translate any entities in the query.
>>> I want to translate back to entities in the results, but I'm not
>>> sure what the criteria should be. It seems to be just binary data.
>>> How do I conclude that Â0Å4?Â0â3ÂÂ?Â0â4 means ÃÃÃÂ?
>>>
>>>> Thanks
>>>>
>>>> Che Dong
>>>>
>>>> Daan Hoogland wrote:
>>>>
>>>>> LS,
>>>>> in
>>>>> http://issues.apache.org/eyebrowse/ReadMsg?listId=30&msgNo=8980
>>>>> Jon Schuster explains how to get a Japanese search system working.
>>>>> I followed his advice and got an index that "luke" shows as what I
>>>>> expected it to be.
>>>>> I don't know how to enter a search so that it gets passed to the
>>>>> engine properly. It works in luke but not in weblucene or in my
>>>>> own app.

>>>>testing1: write hello world to files<<<<
[test 1-1]: with system default encoding=Cp1252
string=Hello world length=16
char[0]='H' byte=72 \u48short=72 \u48 BASIC_LATIN
char[1
Re: searching using the CJKAnalyzer
Jon Schuster wrote:

>I didn't need to make any changes to Entities to get Japanese searches working. Are
>you using the CJKAnalyzer when you perform the search, not only when building the
>index?

Yes, I use CJKAnalyzer all around. When searching I translate character entities in order to find anything. When displaying search results, I don't see anything that looks like part of an East Asian character set. Instead I see accented Latin characters and mathematical symbols.

When I don't pass entities, by the way, things get really nasty:

query passed: >Î??Âââ<
char(Î, LATIN_1_SUPPLEMENT) char(?, LATIN_1_SUPPLEMENT) token found : >Î< length: 1
char(?, LATIN_1_SUPPLEMENT) char(Â, LATIN_1_SUPPLEMENT) char(â, LATIN_1_SUPPLEMENT) token found : >Â< length: 1
char(â, LATIN_1_SUPPLEMENT) searching contents:"Î Â"

This was a query for two Japanese characters.

>-----Original Message-----
>From: Daan Hoogland [mailto:[EMAIL PROTECTED]
>Sent: Sunday, October 10, 2004 10:48 PM
>To: Lucene Users List
>Subject: Re: searching using the CJKAnalyzer
>Importance: Low
>
>Che Dong wrote:
>
>>This seems to be not an Analyzer problem but an HTML parser charset
>>detection error.
>>
>>Could you show me the detail of the problem?
>
>Thanks Che,
>I got it working by making the decode() from the Entities in demo
>public. I wrote a scanner to translate any entities in the query.
>I want to translate back to entities in the results, but I'm not sure
>what the criteria should be. It seems to be just binary data.
>How do I conclude that Â0Å4?Â0â3ÂÂ?Â0â4 means ÃÃÃÂ?
>
>>Thanks
>>
>>Che Dong
>>
>>Daan Hoogland wrote:
>>
>>>LS,
>>>in
>>>http://issues.apache.org/eyebrowse/ReadMsg?listId=30&msgNo=8980
>>>Jon Schuster explains how to get a Japanese search system working. I
>>>followed his advice and got an index that "luke" shows as what I
>>>expected it to be.
>>>I don't know how to enter a search so that it gets passed to the
>>>engine properly. It works in luke but not in weblucene or in my own app.
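The LATIN_1_SUPPLEMENT characters in the log above are what one would expect when multi-byte text is decoded with a single-byte charset. The sketch below reproduces that effect under one assumption: UTF-8 bytes read as ISO-8859-1. The thread does not say which encodings were actually involved, and the class and method names are made up for the illustration.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration of a charset mismatch; the encodings are assumed.
final class CharsetMismatchSketch {

    // Encodes the text as UTF-8, then decodes the bytes as ISO-8859-1.
    static String misdecode(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    // Reverses misdecode(): ISO-8859-1 maps every byte to exactly one char,
    // so the original bytes can be recovered and decoded correctly.
    static String repair(String garbled) {
        byte[] bytes = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Lists each char with its code point and Unicode block.
    static String describe(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            sb.append(String.format("U+%04X %s%n", (int) c, Character.UnicodeBlock.of(c)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String japanese = "\u5165\u793e"; // two CJK characters
        String garbled = misdecode(japanese);
        System.out.print(describe(japanese));
        System.out.print(describe(garbled));
        System.out.println(repair(garbled).equals(japanese));
    }
}
```

Each CJK character becomes three chars from the Latin-1 Supplement block, so an analyzer downstream never sees a CJK code point. The repair step only works while the wrongly decoded chars are still intact; once a "?" replacement has crept in, the original bytes are gone.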
Re: searching using the CJKAnalyzer
Che Dong wrote:

> This seems to be not an Analyzer problem but an HTML parser charset
> detection error.
>
> Could you show me the detail of the problem?

Thanks Che,
I got it working by making the decode() from the Entities in demo public. I wrote a scanner to translate any entities in the query.
I want to translate back to entities in the results, but I'm not sure what the criteria should be. It seems to be just binary data.
How do I conclude that 04?03¨¦?04 means ÓÐÒ°?

> Thanks
>
> Che Dong
>
> Daan Hoogland wrote:
>
>> LS,
>> in
>> http://issues.apache.org/eyebrowse/ReadMsg?listId=30&msgNo=8980
>> Jon Schuster explains how to get a Japanese search system working. I
>> followed his advice and got an index that "luke" shows as what I
>> expected it to be.
>> I don't know how to enter a search so that it gets passed to the
>> engine properly. It works in luke but not in weblucene or in my own app.
searching using the CJKAnalyzer
LS,

in http://issues.apache.org/eyebrowse/ReadMsg?listId=30&msgNo=8980 Jon Schuster explains how to get a Japanese search system working. I followed his advice and got an index that "luke" shows as what I expected it to be.

I don't know how to enter a search so that it gets passed to the engine properly. It works in luke but not in weblucene or in my own app.
Re: indexing numeric entities?
maybe inline?

http://www.w3.org/2001/XMLSchema-instance";> japan フィールドサービスエンジニア

Indexing the above document using the HTMLParser demo and the CJKAnalyzer, only the term "japan" is found in the content. This is not correct, is it? Should I convert the entities by hand?

Sorry for the mess I sent before.
Re: indexing numeric entities?
I guess something went wrong;

Daan Hoogland wrote:

> Daan Hoogland wrote:
>
>> Daan Hoogland wrote:
>>
>>> Hello,
>>>
>>> Does anyone do indexing of numeric entities for Japanese characters? I
>>> have (non-x)html containing those entities and need to index and search
>>> them.
>>
>> Can the CJKAnalyzer index a string like "●入社"? It seems to be ignored
>> completely when used with the demo. There was talk on this list of fixes
>> for the demo HTMLParser; do these address this issue? When I look at the
>> code it seems that the entities should have been interpreted before
>> indexing. What am I missing?
>>
>> Any comment please?
>> Or a pointer to a howto for dumm^H^H^H^H^H westerners?
>
> Indexing the attached document using the HTMLParser demo and the
> CJKAnalyzer, only the term "japan" is found in the content. This is not
> correct, is it? Should I convert the entities by hand?
>
>> thanks,
Re: indexing numeric entities?
Daan Hoogland wrote:

>Daan Hoogland wrote:
>
>>Hello,
>>
>>Does anyone do indexing of numeric entities for Japanese characters? I
>>have (non-x)html containing those entities and need to index and search
>>them.
>
>Can the CJKAnalyzer index a string like "●入社"? It
>seems to be ignored completely when used with the demo. There was talk
>on this list of fixes for the demo HTMLParser; do these address this
>issue? When I look at the code it seems that the entities should have
>been interpreted before indexing. What am I missing?
>
>Any comment please?
>Or a pointer to a howto for dumm^H^H^H^H^H westerners?

Indexing the attached document using the HTMLParser demo and the CJKAnalyzer, only the term "japan" is found in the content. This is not correct, is it? Should I convert the entities by hand?

>thanks,
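Converting the entities by hand, as asked above, could be sketched as follows. This is not the demo's Entities.decode(); the class name and regular expression are assumptions, only numeric character references are handled, and the test string assumes the example in the question was originally written as &#9679;&#20837;&#31038;.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; it is not part of Lucene or its demo.
final class NumericEntityDecoder {

    // Matches decimal (&#12345;) and hexadecimal (&#x30D5;) references.
    private static final Pattern ENTITY =
            Pattern.compile("&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));");

    // Replaces each numeric character reference with the character it names.
    // References that overflow an int or name no valid code point are kept as-is.
    static String decode(String input) {
        Matcher m = ENTITY.matcher(input);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(input, last, m.start());
            try {
                int codePoint = m.group(1) != null
                        ? Integer.parseInt(m.group(1), 16)
                        : Integer.parseInt(m.group(2), 10);
                if (Character.isValidCodePoint(codePoint)) {
                    out.appendCodePoint(codePoint);
                } else {
                    out.append(m.group());
                }
            } catch (NumberFormatException e) {
                out.append(m.group());
            }
            last = m.end();
        }
        out.append(input, last, input.length());
        return out.toString();
    }

    public static void main(String[] args) {
        String decoded = decode("japan &#9679;&#20837;&#31038;");
        System.out.println(decoded.length());
    }
}
```

Running text through a step like this before it reaches the analyzer means the analyzer receives real CJK code points rather than ASCII digits and punctuation.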
Re: indexing numeric entities?
Daan Hoogland wrote:

>Hello,
>
>Does anyone do indexing of numeric entities for Japanese characters? I
>have (non-x)html containing those entities and need to index and search
>them.

Can the CJKAnalyzer index a string like "●入社"? It seems to be ignored completely when used with the demo. There was talk on this list of fixes for the demo HTMLParser; do these address this issue? When I look at the code it seems that the entities should have been interpreted before indexing. What am I missing?

Any comment please?
Or a pointer to a howto for dumm^H^H^H^H^H westerners?

thanks,
indexing numeric entities?
Hello,

Does anyone do indexing of numeric entities for Japanese characters? I have (non-x)html containing those entities and need to index and search them.
different analyzer all produce the same index?
Hi all,

I am trying to create different indices using different Analyzer classes. I tried standard, German, Russian, and CJK. They all produce exactly the same index file (md5-wise). There are over 280 pages, so I expected at least some differences.

Any ideas, anyone?