[jira] Commented: (LUCENE-2197) StopFilter should not create a new CharArraySet if the given set is already an instance of CharArraySet

2010-01-08 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12797952#action_12797952
 ] 

Simon Willnauer commented on LUCENE-2197:
-

bq. Here's a patch that reverts to the previous behavior of using the set 
provided. 
Judging by the average size of your comments, discussing this with the 
performance police doesn't seem to lead anywhere. :)
This was actually meant to be a "pattern" for analyzer subclasses, so I won't be 
the "immutability" police here. Yonik, will you take this issue and commit?

bq. We should really avoid this type of nannyism in Lucene.
Oh well, this seems to me like a "void * is / isn't evil" discussion - never mind.

> StopFilter should not create a new CharArraySet if the given set is already 
> an instance of CharArraySet
> ---
>
> Key: LUCENE-2197
> URL: https://issues.apache.org/jira/browse/LUCENE-2197
> Project: Lucene - Java
> Issue Type: Bug
> Components: Analysis
> Affects Versions: 3.1
> Reporter: Simon Willnauer
> Priority: Critical
> Fix For: 3.1
>
> Attachments: LUCENE-2197.patch, LUCENE-2197.patch
>
>
> Since LUCENE-2094, a new CharArraySet is created no matter what type of set is 
> passed to StopFilter. This does not behave as documented and could introduce 
> serious performance problems. According to the javadoc, a CharArraySet 
> instance should be passed to CharArraySet.copy (which is very fast for 
> CharArraySet instances) instead of being "copied" via "new CharArraySet()".

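A minimal sketch of the behavior the issue asks for, written in plain Java rather than against Lucene. `FastSet` and `asFastSet` are hypothetical stand-ins (they are not the CharArraySet or StopFilter API); the sketch only illustrates the "reuse the instance if it already has the right type, copy otherwise" check.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical stand-in for a specialized set type such as CharArraySet. */
class FastSet extends HashSet<String> {
  FastSet(Collection<String> source) {
    super(source); // copying visits every element of the source
  }
}

class StopSetReuse {
  /**
   * Returns the given set unchanged if it is already a FastSet,
   * and copies it into a new FastSet otherwise.
   */
  static FastSet asFastSet(Set<String> stopWords) {
    if (stopWords instanceof FastSet) {
      return (FastSet) stopWords; // no allocation, no copy
    }
    return new FastSet(stopWords);
  }

  public static void main(String[] args) {
    FastSet shared = new FastSet(Arrays.asList("a", "the"));
    // A FastSet is handed back as the same instance.
    System.out.println(asFastSet(shared) == shared);
    // A plain HashSet is copied into a new FastSet.
    Set<String> plain = new HashSet<>(shared);
    System.out.println(asFastSet(plain) == plain);
  }
}
```

In the reuse case the set is shared, so a caller that modifies it afterwards also changes what the consumer sees. That is the trade-off debated in the comments on this issue.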
-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Hudson build is back to normal: Lucene-trunk #1055

2010-01-08 Thread Apache Hudson Server
See 






[jira] Commented: (LUCENE-2197) StopFilter should not create a new CharArraySet if the given set is already an instance of CharArraySet

2010-01-08 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12798040#action_12798040
 ] 

Yonik Seeley commented on LUCENE-2197:
--

Sorry Simon... I think I just got fed up with stuff like this in the JDK over 
the years (that forces people to write their own implementations for best 
performance), and you happened to be the closest person at the time :-)

Related: I'm the one who added this to BooleanQuery some time ago:
{code}
  /** Returns the list of clauses in this query. */
  public List clauses() { return clauses; }
{code}
Yes, the comment should probably also say something like "Don't modify - it may 
change the query".
To the software pedant, that's not safe and would probably be called bad design 
- but I strongly believe that our API should be for adults, and one should be 
able to introspect objects like Queries w/o suffering object allocations.  We 
should also continue to develop Lucene for *ourselves*, not for some mythic 
stupid user... I've seen too many bad design decisions based on "this will 
confuse users" arguments rather than "this is confusing".
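
The allocation trade-off described above can be sketched with three accessor styles. `ClauseHolder` and its method names are hypothetical and are not Lucene's BooleanQuery; the sketch assumes clauses are plain strings.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical query-like class, used only to compare accessor styles. */
class ClauseHolder {
  private final List<String> clauses = new ArrayList<>();

  void add(String clause) {
    clauses.add(clause);
  }

  /** Returns the internal list itself: no allocation, but callers must not modify it. */
  List<String> clauses() {
    return clauses;
  }

  /** Returns a read-only view: one small wrapper object is allocated per call. */
  List<String> clausesView() {
    return Collections.unmodifiableList(clauses);
  }

  /** Returns an independent copy: allocates and copies all elements on every call. */
  List<String> clausesCopy() {
    return new ArrayList<>(clauses);
  }

  public static void main(String[] args) {
    ClauseHolder q = new ClauseHolder();
    q.add("title:lucene");
    // The direct accessor hands out the same list object on every call.
    System.out.println(q.clauses() == q.clauses());
  }
}
```

The first style is the one quoted from BooleanQuery: cheap to introspect, with safety left to the caller.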

Sometimes it comes down to people trying to solve a class of problems that 
others aren't even having issues with - I don't ever recall someone 
accidentally modifying the set after they passed it to the StopFilter, or 
someone accidentally modifying clauses() from BooleanQuery.

I also disagree with checking all input parameters in many cases (things that 
could be in someone's inner loop and will throw an exception anyway).

Say we have this piece of code:
{code}
boolean checkLength(String str) {
  return str.length() < MY_MAX_LENGTH;
}
{code}

I think it's silly to add an explicit null check like so (but you see plenty of 
code like that):
{code}
boolean checkLength(String str) {
  if (str == null) {
    throw new RuntimeException("Can't pass checkLength a null string");
  }
  return str.length() < MY_MAX_LENGTH;
}
{code}
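
To illustrate the point, the sketch below calls the unchecked version with a null argument. `CheckLengthDemo` and the value `MY_MAX_LENGTH = 10` are assumptions for the example; the JVM raises a NullPointerException at the dereference without any explicit check.

```java
class CheckLengthDemo {
  static final int MY_MAX_LENGTH = 10; // assumed value, for illustration only

  /** Unchecked version: a null argument fails at str.length(). */
  static boolean checkLength(String str) {
    return str.length() < MY_MAX_LENGTH;
  }

  public static void main(String[] args) {
    System.out.println(checkLength("lucene"));
    try {
      checkLength(null);
    } catch (NullPointerException e) {
      // The caller still gets an exception; only the message differs
      // from the explicitly checked version.
      System.out.println("null input was rejected by the JVM");
    }
  }
}
```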


There.  Is that reply long enough for you ;-)



[jira] Commented: (LUCENE-1967) make it easier to access default stopwords for language analyzers

2010-01-08 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12798087#action_12798087
 ] 

Robert Muir commented on LUCENE-1967:
-

Simon, can I close this? I think you have fixed it with LUCENE-2034.

> make it easier to access default stopwords for language analyzers
> -
>
> Key: LUCENE-1967
> URL: https://issues.apache.org/jira/browse/LUCENE-1967
> Project: Lucene - Java
> Issue Type: Improvement
> Components: contrib/analyzers
> Reporter: Robert Muir
> Assignee: Simon Willnauer
> Priority: Minor
>
> DM Smith made the following comment: (sometimes it is hard to dig out the 
> stop set from the analyzers)
> Looking around, some of these analyzers have very different ways of storing 
> the default list.
> One idea is to consider generalizing something like what Simon did with 
> LUCENE-1965, LUCENE-1962,
> and having all stopwords lists stored as .txt files in resources folder.
> {code}
>   /**
>    * Returns an unmodifiable instance of the default stop-words set.
>    * @return an unmodifiable instance of the default stop-words set.
>    */
>   public static Set getDefaultStopSet()
> {code}

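One way such an accessor could look, as a sketch only. `ExampleAnalyzer` and `DefaultSetHolder` are hypothetical names, and the word list is inlined here for brevity, whereas the issue proposes loading it from a .txt resource file.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical analyzer-like class exposing its default stop words. */
class ExampleAnalyzer {
  /** Holder idiom: the set is built once, when the holder class is first used. */
  private static final class DefaultSetHolder {
    static final Set<String> DEFAULT_STOP_SET =
        Collections.unmodifiableSet(new HashSet<>(Arrays.asList("a", "an", "the")));
  }

  /**
   * Returns an unmodifiable instance of the default stop-words set.
   * @return an unmodifiable instance of the default stop-words set.
   */
  public static Set<String> getDefaultStopSet() {
    return DefaultSetHolder.DEFAULT_STOP_SET;
  }

  public static void main(String[] args) {
    System.out.println(getDefaultStopSet().contains("the"));
  }
}
```

Every call returns the same shared instance, and attempts to modify it throw UnsupportedOperationException.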



[jira] Created: (LUCENE-2198) support protected words in Stemming TokenFilters

2010-01-08 Thread Robert Muir (JIRA)
support protected words in Stemming TokenFilters


Key: LUCENE-2198
URL: https://issues.apache.org/jira/browse/LUCENE-2198
Project: Lucene - Java
Issue Type: Improvement
Components: Analysis
Affects Versions: 3.0
Reporter: Robert Muir
Priority: Minor


This is from LUCENE-1515

I propose that all stemming TokenFilters have an 'exclusion set' that bypasses 
any stemming for words in this set.
Some stemming tokenfilters have this, some do not.

This would be one way for Karl to implement his new Swedish stemmer (as a text 
file of ignore words).
Additionally, it would remove duplication between Lucene and Solr, as they 
reimplement SnowballFilter since it does not have this functionality.
Finally, I think this is a pretty common use case, where people want to ignore 
things like proper nouns in the stemming.

As an alternative design I considered a case where we generalized this to 
CharArrayMap (and ignoring words would mean mapping them to themselves), which 
would also provide a mechanism to override the stemming algorithm. But I think 
this is too expert, could be its own filter, and the only example of this I can 
find is in the Dutch stemmer.

So I think we should just provide ignore with CharArraySet, but if you feel 
otherwise please comment.
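
A rough sketch of the proposed exclusion set, in plain Java. `ProtectedWordStemmer` and `naiveStem` are hypothetical; the toy rule (strip a trailing "s") merely stands in for a real stemming algorithm, and a java.util.Set is used where the proposal suggests CharArraySet.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical stemmer wrapper that bypasses stemming for protected words. */
class ProtectedWordStemmer {
  private final Set<String> exclusionSet;

  ProtectedWordStemmer(Set<String> exclusionSet) {
    this.exclusionSet = exclusionSet;
  }

  /** Toy stemming rule: strips one trailing "s". Not a real stemmer. */
  static String naiveStem(String term) {
    if (term.length() > 1 && term.endsWith("s")) {
      return term.substring(0, term.length() - 1);
    }
    return term;
  }

  /** Returns the term unchanged if it is protected, otherwise the stemmed form. */
  String stem(String term) {
    if (exclusionSet.contains(term)) {
      return term;
    }
    return naiveStem(term);
  }

  public static void main(String[] args) {
    ProtectedWordStemmer stemmer =
        new ProtectedWordStemmer(new HashSet<>(Arrays.asList("Williams")));
    System.out.println(stemmer.stem("Williams"));
    System.out.println(stemmer.stem("filters"));
  }
}
```

The CharArrayMap alternative mentioned above would replace the `contains` check with a lookup that returns the override form.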





[jira] Closed: (LUCENE-1967) make it easier to access default stopwords for language analyzers

2010-01-08 Thread Simon Willnauer (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Simon Willnauer closed LUCENE-1967.
---

Resolution: Fixed

incorporated in LUCENE-2034



[jira] Created: (LUCENE-2199) ShingleFilter skips over trie-shingles if outputUnigram is set to false

2010-01-08 Thread Simon Willnauer (JIRA)
ShingleFilter skips over trie-shingles if outputUnigram is set to false
---

Key: LUCENE-2199
URL: https://issues.apache.org/jira/browse/LUCENE-2199
Project: Lucene - Java
Issue Type: Bug
Components: contrib/analyzers
Affects Versions: 3.0, 2.9.1, 2.9, 2.4.1, 2.4
Reporter: Simon Willnauer
Fix For: 3.1


Spinoff from http://lucene.markmail.org/message/uq4xdjk26yduvnpa

{quote}
I noticed that if I set outputUnigrams to false it gives me the same output for
maxShingleSize=2 and maxShingleSize=3.

please divide divide this this sentence

when i set maxShingleSize to 4 output is:

please divide please divide this sentence divide this this sentence

I was expecting the output as follows with maxShingleSize=3 and
outputUnigrams=false :

please divide this divide this sentence 
{quote}
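
For reference, a plain-Java sketch of shingle generation. This is not the ShingleFilter code: `ShingleSketch` is a hypothetical helper, and it assumes that with outputUnigrams=false every shingle size from 2 up to maxShingleSize is emitted at each position. Under that assumption the output for maxShingleSize=3 contains trigrams and therefore differs from the output for maxShingleSize=2, unlike the behavior in the report.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper that builds word n-grams ("shingles") from a token list. */
class ShingleSketch {
  static List<String> shingles(List<String> tokens, int maxShingleSize, boolean outputUnigrams) {
    List<String> out = new ArrayList<>();
    int minSize = outputUnigrams ? 1 : 2;
    for (int start = 0; start < tokens.size(); start++) {
      // Emit every shingle size that still fits before the end of the input.
      for (int size = minSize; size <= maxShingleSize && start + size <= tokens.size(); size++) {
        out.add(String.join(" ", tokens.subList(start, start + size)));
      }
    }
    return out;
  }

  public static void main(String[] args) {
    List<String> tokens = Arrays.asList("please", "divide", "this", "sentence");
    System.out.println(shingles(tokens, 2, false));
    System.out.println(shingles(tokens, 3, false));
  }
}
```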







[jira] Updated: (LUCENE-2199) ShingleFilter skips over trie-shingles if outputUnigram is set to false

2010-01-08 Thread Simon Willnauer (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Simon Willnauer updated LUCENE-2199:


Attachment: LUCENE-2199.patch

This patch adds tests for trigrams and fourgrams with and without outputUnigram. 
All tests pass.

> ShingleFilter skips over trie-shingles if outputUnigram is set to false
> ---
>
> Key: LUCENE-2199
> URL: https://issues.apache.org/jira/browse/LUCENE-2199
> Project: Lucene - Java
> Issue Type: Bug
> Components: contrib/analyzers
> Affects Versions: 2.4, 2.4.1, 2.9, 2.9.1, 3.0
> Reporter: Simon Willnauer
> Fix For: 3.1
>
> Attachments: LUCENE-2199.patch
>
>
> Spinoff from http://lucene.markmail.org/message/uq4xdjk26yduvnpa
> {quote}
> I noticed that if I set outputUnigrams to false it gives me the same output 
> for
> maxShingleSize=2 and maxShingleSize=3.
> please divide divide this this sentence
> when i set maxShingleSize to 4 output is:
> please divide please divide this sentence divide this this sentence
> I was expecting the output as follows with maxShingleSize=3 and
> outputUnigrams=false :
> please divide this divide this sentence 
> {quote}




[jira] Commented: (LUCENE-2199) ShingleFilter skips over trie-shingles if outputUnigram is set to false

2010-01-08 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12798162#action_12798162
 ] 

Simon Willnauer commented on LUCENE-2199:
-

We should likely backport this to 2.9 / 3.0 too



[jira] Updated: (LUCENE-2199) ShingleFilter skips over trie-shingles if outputUnigram is set to false

2010-01-08 Thread Simon Willnauer (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Simon Willnauer updated LUCENE-2199:


Attachment: LUCENE-2199.patch

The last patch messed up the posInc - fixed it in this one.

> ShingleFilter skips over trie-shingles if outputUnigram is set to false
> ---
>
> Key: LUCENE-2199
> URL: https://issues.apache.org/jira/browse/LUCENE-2199
> Project: Lucene - Java
> Issue Type: Bug
> Components: contrib/analyzers
> Affects Versions: 2.4, 2.4.1, 2.9, 2.9.1, 3.0
> Reporter: Simon Willnauer
> Fix For: 3.1
>
> Attachments: LUCENE-2199.patch, LUCENE-2199.patch
>
>
> Spinoff from http://lucene.markmail.org/message/uq4xdjk26yduvnpa
> {quote}
> I noticed that if I set outputUnigrams to false it gives me the same output 
> for
> maxShingleSize=2 and maxShingleSize=3.
> please divide divide this this sentence
> when i set maxShingleSize to 4 output is:
> please divide please divide this sentence divide this this sentence
> I was expecting the output as follows with maxShingleSize=3 and
> outputUnigrams=false :
> please divide this divide this sentence 
> {quote}




[jira] Assigned: (LUCENE-2199) ShingleFilter skips over trie-shingles if outputUnigram is set to false

2010-01-08 Thread Simon Willnauer (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Simon Willnauer reassigned LUCENE-2199:
---

Assignee: Simon Willnauer

> ShingleFilter skips over trie-shingles if outputUnigram is set to false
> ---
>
> Key: LUCENE-2199
> URL: https://issues.apache.org/jira/browse/LUCENE-2199
> Project: Lucene - Java
> Issue Type: Bug
> Components: contrib/analyzers
> Affects Versions: 2.4, 2.4.1, 2.9, 2.9.1, 3.0
> Reporter: Simon Willnauer
> Assignee: Simon Willnauer
> Fix For: 3.1
>
> Attachments: LUCENE-2199.patch, LUCENE-2199.patch
>
>
> Spinoff from http://lucene.markmail.org/message/uq4xdjk26yduvnpa
> {quote}
> I noticed that if I set outputUnigrams to false it gives me the same output 
> for
> maxShingleSize=2 and maxShingleSize=3.
> please divide divide this this sentence
> when i set maxShingleSize to 4 output is:
> please divide please divide this sentence divide this this sentence
> I was expecting the output as follows with maxShingleSize=3 and
> outputUnigrams=false :
> please divide this divide this sentence 
> {quote}




[jira] Commented: (LUCENE-2197) StopFilter should not create a new CharArraySet if the given set is already an instance of CharArraySet

2010-01-08 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12798189#action_12798189
 ] 

Simon Willnauer commented on LUCENE-2197:
-

bq. Sorry Simon... I think I just got fed up with stuff like this in the JDK 
over the years (that forces people to write their own implementations for best 
performance), and you happened to be the closest person at the time 
:) no worries, thanks for the reply!

bq. To the software pedant, that's not safe and would probably be called bad 
design - ...
I understand and can totally see your point. I was a bit thrown off by the 
rather short "rants" (don't get me wrong). I agree with you that we should 
not do that in a filter, as this constructor could be called very 
frequently, especially if an analyzer does not implement reusableTokenStream. I 
would still argue that for an analyzer this is a different story, and I would 
want to keep the code in analyzers copying the set. Classes instantiated as 
frequently as filters should not introduce possible bottlenecks, while analyzers 
are usually shared, so copying there won't be much of a hassle - any performance 
police issues with this? :)