Re: StandardAnalyzer unit tests?

2005-01-17 Thread Daan Hoogland
€ 0.02: When indexing code, "++" is effectively a stop term, and it might
appear in English text as well. 'C' is a not very descriptive but perfectly
valid variable name. '#' is used in some old Morse transcripts, I think. I am
not going to die or get fired over this, but I'd suggest not including those
tokens in a standard anything.
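
To make the concern concrete, here is a minimal sketch in Java. It is not the
StandardAnalyzer JavaCC grammar or the rule in NutchAnalysis.jj; the class
SymbolTokenSketch, its two regular expressions, and the tokenize method are
assumptions made up for illustration. It only shows what happens once a
trailing "++" or "#" is glued to the preceding word.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not the StandardAnalyzer or Nutch grammar.
final class SymbolTokenSketch {
    // Plain word tokens: runs of ASCII letters and digits.
    private static final Pattern WORDS =
            Pattern.compile("[A-Za-z0-9]+");
    // Same, but a trailing "++" or "#" is kept as part of the token.
    private static final Pattern WORDS_WITH_SYMBOLS =
            Pattern.compile("[A-Za-z0-9]+(?:\\+\\+|#)?");

    // Returns the lower-cased tokens found in the text.
    static List<String> tokenize(String text, boolean keepSymbols) {
        Pattern pattern = keepSymbols ? WORDS_WITH_SYMBOLS : WORDS;
        List<String> tokens = new ArrayList<>();
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            tokens.add(m.group().toLowerCase(Locale.ROOT));
        }
        return tokens;
    }
}
```

With keepSymbols set, "C++ and C#" keeps c++ and c# as tokens, but source
text such as "count++;" yields the token count++ rather than count, so a
query for count would miss it.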

Erik Hatcher wrote:

> I personally don't have a problem with that change, however I don't 
> like changing such things as they can lead to unexpected and confusing 
> issues later. Suppose someone upgrades their version of Lucene without 
> re-indexing and now queries that used to work no longer work? (sure, I 
> agree it is wise to re-index if you upgrade Lucene).
>
> Perhaps others could chime in on whether this change would adversely
> affect them or if this is a desirable change?
>
> Erik
>
>
>
> On Jan 17, 2005, at 4:51 AM, Chris Lamprecht wrote:
>
>> Erik, Paul, Daniel,
>>
>> I submitted a testcase --
>> http://issues.apache.org/bugzilla/show_bug.cgi?id=33134
>>
>> On a related note, what do you all think about updating the
>> StandardAnalyzer grammar to treat "C#" and "C++" as tokens? It's a
>> small modification to the grammar -- NutchAnalysis.jj has it.
>>
>> -Chris
>>
>> On Mon, 17 Jan 2005 03:23:41 -0500, Erik Hatcher
>> <[EMAIL PROTECTED]> wrote:
>>
>>> I don't see any tests of StandardAnalyzer either. Your contribution
>>> would be most welcome. There are tests that use StandardAnalyzer, but
>>> not to test it directly.
>>>
>>
>> -
>> To unsubscribe, e-mail: [EMAIL PROTECTED]
>> For additional commands, e-mail: [EMAIL PROTECTED]



-- 
The information contained in this communication and any attachments is 
confidential and may be privileged, and is for the sole use of the intended 
recipient(s). Any unauthorized review, use, disclosure or distribution is 
prohibited. If you are not the intended recipient, please notify the sender 
immediately by replying to this message and destroy all copies of this message 
and any attachments. ASML is neither liable for the proper and complete 
transmission of the information contained in this communication, nor for any 
delay in its receipt.


Re: IndexWriter failure leaves lock in place

2005-01-10 Thread Daan Hoogland
Joseph (and others),

I'm not an expert on Lucene either. Your mail just rang a bell, and I
thought I'd pass the ring on for any expert to use. I have found
stale locks on a system running on Solaris/iPlanet with the FSDirectory.
The same code does not pose a problem in a Windows/Apache/Tomcat
environment. I cannot reproduce the problem yet, and I'm not sure whether it
is new in version 1.4 (the system previously ran Lucene 1.2).

Joseph Ottinger wrote:

>I'm still working through making my own directory, based on JDBC (and yes,
>I know, there are some out there already, unsuitable for this reason or
>that reason.)
>
>One thing I've noticed is that the Lock procedure in IndexWriter is a
>little off, I think.
>
>My normal process on application startup is to get an IndexWriter, just to
>make sure an index is there. If I get an exception (FileNotFoundException
>for the FSDirectory, for example), I assume the index isn't created
>properly, so then I create a new IndexWriter set to create the index.
>
>With a file-based directory, that works well enough - and I realise there
>might be a better way to do it (but I don't know it yet.)
>
>However, the SQL-based directory leaves the lock. I think what's happening
>is that the IndexWriter constructor (IndexWriter.java:216 from 1.4.3's
>source distribution) is obtaining the lock, but then the synchronized block
>(starting at line 227) gets an IOException from
>segmentInfos.read(directory), which throws an IOException - but the
>writeLock is never explicitly removed once it's obtained.
>
>I would think that a try/finally (or something even more predictable,
>like a try/catch that rethrows the IOException after cleanup) would be
>appropriate to clear the lock *provided it's obtained* in the IndexWriter
>construction, and it'd make the code that I typically use work regardless
>of the specific directory I rely on.
>
>Now, to be sure, I'm VERY FAR from a Lucene expert; am I missing
>something? (I can contribute a patch if you'd like.)
>
>---
>Joseph B. Ottinger http://enigmastation.com
>IT Consultant[EMAIL PROTECTED]
>
>
>
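
The cleanup Joseph describes could look roughly like the sketch below. It is
not the Lucene 1.4.3 IndexWriter code: SimpleLock, SegmentReader, MemoryLock,
and open are hypothetical stand-ins, and the only point is the pattern of
releasing a lock that was obtained when the rest of the initialization throws.

```java
import java.io.IOException;

// Hypothetical stand-ins; these are not Lucene's Lock or IndexWriter classes.
final class LockCleanupSketch {
    interface SimpleLock {
        boolean obtain() throws IOException;
        void release();
    }

    interface SegmentReader {
        void read() throws IOException;
    }

    // Trivial in-memory lock, used only to make the sketch runnable.
    static final class MemoryLock implements SimpleLock {
        boolean held;

        public boolean obtain() {
            if (held) {
                return false;
            }
            held = true;
            return true;
        }

        public void release() {
            held = false;
        }
    }

    // Obtains the write lock, then reads the segments. If reading fails,
    // the lock is released before the exception propagates to the caller.
    static void open(SimpleLock writeLock, SegmentReader segments) throws IOException {
        if (!writeLock.obtain()) {
            throw new IOException("Index locked for write");
        }
        boolean success = false;
        try {
            segments.read(); // may throw, for example when no index exists yet
            success = true;
        } finally {
            if (!success) {
                writeLock.release(); // only reached when the lock was obtained
            }
        }
    }
}
```

On success the lock stays held for the lifetime of the writer; on failure a
second attempt (for example with create set to true) can obtain it again.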






Re: searching using the CJKAnalyzer

2004-10-12 Thread Daan Hoogland
Che Dong wrote:

> CJKAnalyzer does not support single-byte streams; the front-end interface and
> the back-end indexing process need to transform the source into a proper
> double-byte character stream before searching or indexing.
>
> Please let me know the output of
> http://www.chedong.com/tech/HelloUnicode.java
> with javac -encoding gb2312 and javac -encoding iso-8859-1

Here's the output. I can see what's wrong in it. Should I put an extra 
field in the index containing encoding?

>
> Regards
>
> Che Dong
>
>
> Daan Hoogland wrote:
>
>> Jon Schuster wrote:
>>
>>
>>> I didn't need to make any changes to Entities to get Japanese 
>>> searches working. Are you using the CJKAnalyzer when you perform the 
>>> search, not only when building the index?
>>>
>>>
>>
>> Yes, I use CJKAnalyzer all around. When searching I translate
>> character entities in order to find anything. When displaying search
>> results, I don't see anything that looks like part of an Eastern
>> character set. Instead I see accented Latin and mathematical symbols.
>>
>> By the way, when I don't pass entities, things get really nasty:
>> query passed: >Î??Âââ<
>>  char(Î, LATIN_1_SUPPLEMENT)  char(?, LATIN_1_SUPPLEMENT) token found 
>> :  >Î< length: 1
>>  char(?, LATIN_1_SUPPLEMENT)  char(Â, LATIN_1_SUPPLEMENT)  char(â, 
>> LATIN_1_SUPPLEMENT) token found : >Â< length: 1
>>  char(â, LATIN_1_SUPPLEMENT) searching contents:"Î Â"
>>
>> This was a query for two Japanese characters.
>>
>>
>>> -Original Message-
>>> From: Daan Hoogland [mailto:[EMAIL PROTECTED]
>>> Sent: Sunday, October 10, 2004 10:48 PM
>>> To: Lucene Users List
>>> Subject: Re: searching using the CJKAnalyzer
>>> Importance: Low
>>>
>>>
>>> Che Dong wrote:
>>>
>>>
>>>
>>>
>>>> This seems to be not an analyzer problem but an HTML parser charset detection error.
>>>>
>>>> Could you show me the detail of the problem?
>>>>  
>>>
>>>
>>> Thanks Che,
>>> I got it working by making the decode() method of the Entities class in
>>> the demo public. I wrote a scanner to translate any entities in the query.
>>> I want to translate back to entities in the results, but I'm not
>>> sure what the criteria should be. It seems to be just binary data.
>>> How to conclude that Â0Å4?Â0â3ÂÂ?Â0â4 means ÃÃÃÂ?
>>>
>>>
>>>
>>>
>>>> Thanks
>>>>
>>>> Che Dong
>>>>
>>>> Daan Hoogland wrote:
>>>>
>>>>  
>>>>
>>>>> LS,
>>>>> in
>>>>> http://issues.apache.org/eyebrowse/ReadMsg?listId=30&msgNo=8980
>>>>> Jon Schuster explains how to get a Japanese search system working.
>>>>> I followed his advice and got an index that "luke" shows to be what I
>>>>> expected.
>>>>> I don't know how to enter a search so that it gets passed to the 
>>>>> engine properly. It works in luke but not in weblucene or in my 
>>>>> own app.
>>>>>



>>>>testing1: write hello world to files<<<<
[test 1-1]: with system default encoding=Cp1252
string=Hello world  length=16
char[0]='H' byte=72 \u48short=72 \u48   BASIC_LATIN
char[1

Re: searching using the CJKAnalyzer

2004-10-12 Thread Daan Hoogland
Jon Schuster wrote:

>I didn't need to make any changes to Entities to get Japanese searches working. Are 
>you using the CJKAnalyzer when you perform the search, not only when building the 
>index?
>  
>
Yes, I use CJKAnalyzer all around. When searching I translate
character entities in order to find anything. When displaying search
results, I don't see anything that looks like part of an Eastern
character set. Instead I see accented Latin and mathematical symbols.

By the way, when I don't pass entities, things get really nasty:
query passed: >Î??Âââ<
 char(Î, LATIN_1_SUPPLEMENT)  char(?, LATIN_1_SUPPLEMENT) token found : 
 >Î< length: 1
 char(?, LATIN_1_SUPPLEMENT)  char(Â, LATIN_1_SUPPLEMENT)  char(â, 
LATIN_1_SUPPLEMENT) token found : >Â< length: 1
 char(â, LATIN_1_SUPPLEMENT) searching contents:"Î Â"

This was a query for two Japanese characters.
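
One possible explanation for output like this, offered as an assumption
rather than a diagnosis, is that multi-byte text (UTF-8 in this sketch) is
being decoded as ISO-8859-1 somewhere between the browser and the analyzer.
In that case every byte becomes one character from the Latin-1 range, which
is the block reported above. MojibakeSketch and its two methods are
hypothetical names used only to demonstrate the effect.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demonstration of a charset mismatch; not code from the app.
final class MojibakeSketch {
    // Encodes the text as UTF-8, then decodes the bytes as ISO-8859-1,
    // so each byte turns into one character in the range U+0080..U+00FF.
    static String misdecode(String original) {
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    // Reverses the damage. This works only if no byte was dropped or
    // replaced (for example by '?') along the way.
    static String repair(String garbled) {
        byte[] raw = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String query = "\u5165\u793e"; // two CJK characters
        String garbled = misdecode(query);
        for (char c : garbled.toCharArray()) {
            System.out.println(Integer.toHexString(c) + " " + Character.UnicodeBlock.of(c));
        }
        System.out.println(repair(garbled).equals(query));
    }
}
```

If the bytes really arrive in another encoding (Shift_JIS or EUC-JP, say),
the same idea applies with that charset in place of UTF-8. Either way the
query string has to be decoded with the charset the form was submitted in
before it reaches the CJKAnalyzer.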

>-Original Message-
>From: Daan Hoogland [mailto:[EMAIL PROTECTED] 
>Sent: Sunday, October 10, 2004 10:48 PM
>To: Lucene Users List
>Subject: Re: searching using the CJKAnalyzer
>Importance: Low
>
>
>Che Dong wrote:
>
>  
>
>>This seems to be not an analyzer problem but an HTML parser charset detection error.
>>
>>Could you show me the detail of the problem?
>>
>>
>
>Thanks Che,
>I got it working by making the decode() method of the Entities class in
>the demo public. I wrote a scanner to translate any entities in the query.
>I want to translate back to entities in the results, but I'm not sure
>what the criteria should be. It seems to be just binary data.
>How to conclude that Â0Å4?Â0â3ÂÂ?Â0â4 means ÃÃÃÂ?
>
>  
>
>>Thanks
>>
>>Che Dong
>>
>>Daan Hoogland wrote:
>>
>>
>>
>>>LS,
>>>in
>>>http://issues.apache.org/eyebrowse/ReadMsg?listId=30&msgNo=8980
>>>Jon Schuster explains how to get a Japanese search system working. I
>>>followed his advice and got an index that "luke" shows to be what I
>>>expected.
>>>I don't know how to enter a search so that it gets passed to the 
>>>engine properly. It works in luke but not in weblucene or in my own app.





Re: searching using the CJKAnalyzer

2004-10-10 Thread Daan Hoogland
Che Dong wrote:

> This seems to be not an analyzer problem but an HTML parser charset detection error.
>
> Could you show me the detail of the problem?

Thanks Che,
I got it working by making the decode() method of the Entities class in
the demo public. I wrote a scanner to translate any entities in the query.
I want to translate back to entities in the results, but I'm not sure
what the criteria should be. It seems to be just binary data.
How to conclude that 0Š4?0†3¨¦?0„4 means ÓÐÒ°?

>
> Thanks
>
> Che Dong
>
> Daan Hoogland wrote:
>
>> LS,
>> in
>> http://issues.apache.org/eyebrowse/ReadMsg?listId=30&msgNo=8980
>> Jon Schuster explains how to get a Japanese search system working. I
>> followed his advice and got an index that "luke" shows to be what I
>> expected.
>> I don't know how to enter a search so that it gets passed to the 
>> engine properly. It works in luke but not in weblucene or in my own app.





searching using the CJKAnalyzer

2004-10-08 Thread Daan Hoogland
LS,
in
http://issues.apache.org/eyebrowse/ReadMsg?listId=30&msgNo=8980
Jon Schuster explains how to get a Japanese search system working. I
followed his advice and got an index that "luke" shows to be what I
expected.
I don't know how to enter a search so that it gets passed to the engine 
properly. It works in luke but not in weblucene or in my own app.





Re: indexing numeric entities?

2004-10-07 Thread Daan Hoogland
maybe inline?

http://www.w3.org/2001/XMLSchema-instance";>
 
  japan
 
 
  

フィールドサービスエンジニア

  



Indexing the above document using the HTMLParser demo and the 
CJKAnalyzer, only the term "japan" is found in the content. This is not 
correct, is it?
Should I convert the entities by hand?
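
Converting the entities by hand before analysis could be sketched as below.
This is an assumption about one workable approach, not the demo's HTMLParser
or Entities code; NumericEntitySketch and decode are hypothetical names. It
rewrites decimal and hexadecimal numeric character references into the
characters they stand for and leaves everything else alone.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; not part of the Lucene demo.
final class NumericEntitySketch {
    // Matches decimal (&#20837;) and hexadecimal (&#x5165;) references.
    private static final Pattern ENTITY =
            Pattern.compile("&#(?:[xX]([0-9A-Fa-f]+)|([0-9]+));");

    static String decode(String html) {
        Matcher m = ENTITY.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            int codePoint;
            try {
                codePoint = m.group(1) != null
                        ? Integer.parseInt(m.group(1), 16)
                        : Integer.parseInt(m.group(2));
            } catch (NumberFormatException e) {
                codePoint = -1; // too many digits to be a code point
            }
            // Leave the reference untouched when it is not a valid code point.
            String replacement = Character.isValidCodePoint(codePoint)
                    ? new String(Character.toChars(codePoint))
                    : m.group();
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

For example, decode("&#20837;&#31038;") returns the two characters U+5165
and U+793E, which can then be handed to the CJKAnalyzer as ordinary text.
Named entities such as &amp; are deliberately not handled here.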


Sorry for the mess I sent before.





Re: indexing numeric entities?

2004-10-07 Thread Daan Hoogland




I guess something went wrong;

Daan Hoogland wrote:

>Daan Hoogland wrote:
>
>>Daan Hoogland wrote:
>>
>>>Hello,
>>>
>>>Does anyone do indexing of numeric entities for Japanese characters? I
>>>have (non-x)html containing those entities and need to index and search
>>>them.
>>>
>>Can the CJKAnalyzer index a string like "●入社"? It
>>seems to be ignored completely when used with the demo. There was talk
>>on this list of fixes for the demo HTMLParser; do these address this
>>issue? When I look at the code it seems that the entities should have
>>been interpreted before indexing. What am I missing?
>>
>>Any comment please?
>>Or a pointer to a howto for dumm^H^H^H^H^H westerners?
>>
>Indexing the attached document using the HTMLParser demo and the
>CJKAnalyzer, only the term "japan" is found in the content. This is not
>correct, is it?
>Should I convert the entities by hand?
>
>>thanks,

Re: indexing numeric entities?

2004-10-07 Thread Daan Hoogland
Daan Hoogland wrote:

>Daan Hoogland wrote:
>
>  
>
>>Hello,
>>
>>Does anyone do indexing of numeric entities for Japanese characters? I
>>have (non-x)html containing those entities and need to index and search
>>them.
>>
>>
>> 
>>
>>
>>
>Can the CJKAnalyzer index a string like "●入社"? It
>seems to be ignored completely when used with the demo. There was talk
>on this list of fixes for the demo HTMLParser; do these address this
>issue? When I look at the code it seems that the entities should have
>been interpreted before indexing. What am I missing?
>
>Any comment please?
>Or a pointer to a howto for dumm^H^H^H^H^H westerners?
>  
>
Indexing the attached document using the HTMLParser demo and the 
CJKAnalyzer, only the term "japan" is found in the content. This is not 
correct, is it?
Should I convert the entities by hand?

>
>thanks,
>
>
>  
>




Re: indexing numeric entities?

2004-10-07 Thread Daan Hoogland
Daan Hoogland wrote:

>Hello,
>
>Does anyone do indexing of numeric entities for Japanese characters? I
>have (non-x)html containing those entities and need to index and search
>them.
>
>
>  
>
Can the CJKAnalyzer index a string like "●入社"? It
seems to be ignored completely when used with the demo. There was talk
on this list of fixes for the demo HTMLParser; do these address this
issue? When I look at the code it seems that the entities should have
been interpreted before indexing. What am I missing?

Any comment please?
Or a pointer to a howto for dumm^H^H^H^H^H westerners?


thanks,





indexing numeric entities?

2004-10-06 Thread Daan Hoogland
Hello,

Does anyone do indexing of numeric entities for Japanese characters? I
have (non-x)html containing those entities and need to index and search
them.





different analyzer all produce the same index?

2004-10-04 Thread Daan Hoogland
Hi all,

I am trying to create different indices using different Analyzer classes. I
tried the standard, German, Russian, and CJK analyzers. They all produce
exactly the same index file (md5-wise). There are over 280 pages, so I
expected at least some differences.

Any ideas anyone?

