Hudson build is back to normal: Lucene-trunk #428

2008-04-06 Thread Apache Hudson Server
See http://hudson.zones.apache.org/hudson/job/Lucene-trunk/428/changes



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: shingles and punctuations

2008-04-06 Thread Grant Ingersoll
For now, it's up to your app to keep track of which flags are in use,  
unfortunately :-(  I think the WikipediaTokenizer is currently the only  
one in Lucene using flags.
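
One way an application could avoid the collision asked about below is to treat the flags value as a bit set and give every meaning its own bit. The following is a minimal sketch in plain Java with no Lucene dependency; the class name TokenFlagBits and the constants SENTENCE_START and PUNCTUATION are hypothetical, and nothing in Lucene enforces such a convention, so the application has to keep the assignments in one place itself.

```java
// Hypothetical application-wide registry of flag bits (not part of Lucene).
final class TokenFlagBits {
    // One distinct bit per meaning, so two filters never overwrite each other.
    static final int SENTENCE_START = 1;      // bit 0
    static final int PUNCTUATION    = 1 << 1; // bit 1

    private TokenFlagBits() {}

    // Turn a bit on without touching bits owned by other filters.
    static int set(int flags, int bit) {
        return flags | bit;
    }

    // Turn a bit off, leaving every other bit as it was.
    static int clear(int flags, int bit) {
        return flags & ~bit;
    }

    // True when the given bit is present in the flags value.
    static boolean isSet(int flags, int bit) {
        return (flags & bit) != 0;
    }
}
```

A filter would then read the existing flags value, OR in its own bit, and write the result back, rather than assigning a bare 1 and overwriting whatever another filter stored.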



On Apr 6, 2008, at 10:43 PM, Mathieu Lecarme wrote:

I'll use Token flags to mark the first token in a sentence, but how  
does it work? How are flag collisions avoided? To keep it simple, I'll  
take 1 as the flag, but what happens if another filter uses the same  
flag?


M.




Re: shingles and punctuations

2008-04-06 Thread Mathieu Lecarme
I'll use Token flags to mark the first token in a sentence, but how  
does it work? How are flag collisions avoided? To keep it simple, I'll  
take 1 as the flag, but what happens if another filter uses the same flag?


M.

On Apr 6, 2008, at 8:13 PM, Grant Ingersoll wrote:
I think you need sentence detection to take place further upstream.   
Then you could use the Token type or Token flags to indicate  
punctuation, sentences, whatever, and we could patch the shingle  
filter to ignore these things, or break and move on to the next one.


-Grant




Re: shingles and punctuations

2008-04-06 Thread Grant Ingersoll
I think you need sentence detection to take place further upstream.   
Then you could use the Token type or Token flags to indicate  
punctuation, sentences, whatever, and we could patch the shingle filter  
to ignore these things, or break and move on to the next one.


-Grant
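
The "break and move on to the next one" behaviour can be illustrated without Lucene. The following is only a sketch in plain Java: it assumes an upstream sentence detector has already inserted a marker token between sentences, and both the class name SentenceShingles and the marker string "<S>" are made up for the example. It is not the ShingleFilter patch itself.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: build word n-grams (shingles) that never span a sentence boundary.
final class SentenceShingles {
    // Hypothetical marker emitted upstream between two sentences.
    static final String BOUNDARY = "<S>";

    private SentenceShingles() {}

    // Returns all shingles of exactly `size` words, restarting at each boundary.
    static List<String> shingles(List<String> tokens, int size) {
        List<String> out = new ArrayList<>();
        List<String> sentence = new ArrayList<>();
        for (String token : tokens) {
            if (BOUNDARY.equals(token)) {
                // Flush the finished sentence and start a fresh window.
                emit(sentence, size, out);
                sentence.clear();
            } else {
                sentence.add(token);
            }
        }
        emit(sentence, size, out); // last sentence has no trailing marker
        return out;
    }

    // Slide a window of `size` words over one sentence.
    private static void emit(List<String> sentence, int size, List<String> out) {
        for (int i = 0; i + size <= sentence.size(); i++) {
            out.add(String.join(" ", sentence.subList(i, i + size)));
        }
    }
}
```

For the tokens a, b, c, <S>, d, e and a size of 2, this yields "a b", "b c" and "d e"; the cross-sentence shingle "c d" is never produced.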

On Apr 6, 2008, at 7:23 PM, Mathieu Lecarme wrote:

The new ShingleFilter is very helpful for fetching groups of words, but  
it doesn't handle punctuation or any other separator.
If you feed it multiple sentences, you will get shingles that  
start in one sentence and end in the next.
To avoid that, you can look at token positions: if there is more than  
one character between a token and the previous one, the gap should be  
punctuation (or a typo).

Any suggestions for producing only shingles that stay within a single sentence?

M.




shingles and punctuations

2008-04-06 Thread Mathieu Lecarme
The new ShingleFilter is very helpful for fetching groups of words, but  
it doesn't handle punctuation or any other separator.
If you feed it multiple sentences, you will get shingles that  
start in one sentence and end in the next.
To avoid that, you can look at token positions: if there is more than  
one character between a token and the previous one, the gap should be  
punctuation (or a typo).

Any suggestions for producing only shingles that stay within a single sentence?

M.
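
The gap heuristic described above could look roughly like the following. This is a sketch in plain Java, assuming each token carries the character offsets of its first character and of the position just past its last character; the classes Tok and GapDetector are hypothetical stand-ins, not Lucene types.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a token: text plus character offsets in the input.
final class Tok {
    final String text;
    final int start; // offset of the first character
    final int end;   // offset one past the last character

    Tok(String text, int start, int end) {
        this.text = text;
        this.start = start;
        this.end = end;
    }
}

final class GapDetector {
    private GapDetector() {}

    // Returns the indexes of tokens that follow a gap wider than maxGap
    // characters, i.e. the places where a shingle should not continue.
    static List<Integer> breaksBefore(List<Tok> tokens, int maxGap) {
        List<Integer> breaks = new ArrayList<>();
        for (int i = 1; i < tokens.size(); i++) {
            int gap = tokens.get(i).start - tokens.get(i - 1).end;
            if (gap > maxGap) {
                breaks.add(i);
            }
        }
        return breaks;
    }
}
```

For the text "Hello world. Next one", the tokens "world" (offsets 6 to 11) and "Next" (offsets 13 to 17) are two characters apart, so with maxGap set to 1 a break is reported before "Next". As noted above, a doubled space or a typo triggers the same break, so this stays a heuristic.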
