[jira] [Commented] (LUCENE-2100) Make contrib analyzers final

2011-05-16 Thread Esmond Pitt (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034555#comment-13034555
 ] 

Esmond Pitt commented on LUCENE-2100:
-

Many thanks. 



> Make contrib analyzers final
> 
>
> Key: LUCENE-2100
> URL: https://issues.apache.org/jira/browse/LUCENE-2100
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: modules/analysis
>Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.4, 2.4.1, 
> 2.9, 2.9.1, 3.0
>Reporter: Simon Willnauer
>Assignee: Simon Willnauer
>Priority: Minor
> Fix For: 4.0
>
> Attachments: LUCENE-2100.patch, LUCENE-2100.patch
>
>
> The analyzers in contrib/analyzers should all be marked final. None of the 
> Analyzers should ever be subclassed - users should build their own analyzers 
> if a different combination of filters and Tokenizers is desired.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-2100) Make contrib analyzers final

2011-05-16 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034549#comment-13034549
 ] 

Robert Muir commented on LUCENE-2100:
-

Esmond: hi, what you are doing here is exactly the reason why we made it final.

By subclassing StandardAnalyzer in this way, the indexer is no longer able to 
reuse tokenstreams, making analysis very slow and inefficient.

The easiest way to get your PorterStemAnalyzer is to use EnglishAnalyzer, 
which does exactly this.

Otherwise if you really want to do it yourself, do it like this:
{noformat}
Analyzer analyzer = new ReusableAnalyzerBase() {
  protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
    Tokenizer tokenizer = new StandardTokenizer(...);
    TokenStream filteredStream = new StandardFilter(tokenizer, ...);
    filteredStream = new LowerCaseFilter(filteredStream, ...);
    filteredStream = new StopFilter(filteredStream, ...);
    filteredStream = new PorterStemFilter(filteredStream);
    return new TokenStreamComponents(tokenizer, filteredStream);
  }
};
{noformat}

Please see LUCENE-3055 for more examples and a more thorough explanation.

The good news is that if you implement your analyzer like this, you will see 
performance improvements!
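
For completeness, a minimal sketch (not from this thread) of wiring such an 
analyzer into indexing, assuming Lucene 3.1-era APIs; the index path is 
hypothetical:
{noformat}
// Pass the reusable analyzer to the writer via IndexWriterConfig (3.1 API).
Directory dir = FSDirectory.open(new File("/path/to/index"));
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_31, analyzer);
IndexWriter writer = new IndexWriter(dir, config);
{noformat}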


> Make contrib analyzers final
> 
>
> Key: LUCENE-2100
> URL: https://issues.apache.org/jira/browse/LUCENE-2100
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: modules/analysis
>Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.4, 2.4.1, 
> 2.9, 2.9.1, 3.0
>Reporter: Simon Willnauer
>Assignee: Simon Willnauer
>Priority: Minor
> Fix For: 4.0
>
> Attachments: LUCENE-2100.patch, LUCENE-2100.patch
>
>
> The analyzers in contrib/analyzers should all be marked final. None of the 
> Analyzers should ever be subclassed - users should build their own analyzers 
> if a different combination of filters and Tokenizers is desired.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-2100) Make contrib analyzers final

2011-05-16 Thread Esmond Pitt (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034544#comment-13034544
 ] 

Esmond Pitt commented on LUCENE-2100:
-

Steve

Thanks. Maybe you could have a look at this. How do you suggest I recode it?
I wrote this 7 years ago and cannot now remember anything about it. Quite
possibly the entire thing is now obsolete, but I've been carting it around
since before Lucene was even at Apache. All I've ever done is adjust the
version number.

==
public class PorterStemAnalyzer extends StandardAnalyzer
{
    /**
     * Construct a new instance of PorterStemAnalyzer.
     */
    public PorterStemAnalyzer()
    {
        super(Version.LUCENE_30);
    }

    @Override
    public final TokenStream tokenStream(String fieldName, Reader reader)
    {
        return new PorterStemFilter(super.tokenStream(fieldName, reader));
    }
}


EJP



> Make contrib analyzers final
> 
>
> Key: LUCENE-2100
> URL: https://issues.apache.org/jira/browse/LUCENE-2100
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: modules/analysis
>Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.4, 2.4.1, 
> 2.9, 2.9.1, 3.0
>Reporter: Simon Willnauer
>Assignee: Simon Willnauer
>Priority: Minor
> Fix For: 4.0
>
> Attachments: LUCENE-2100.patch, LUCENE-2100.patch
>
>
> The analyzers in contrib/analyzers should all be marked final. None of the 
> Analyzers should ever be subclassed - users should build their own analyzers 
> if a different combination of filters and Tokenizers is desired.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-2100) Make contrib analyzers final

2011-05-16 Thread Steven Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034540#comment-13034540
 ] 

Steven Rowe commented on LUCENE-2100:
-

Hi Esmond,

Take a look at [the source code for 
StandardAnalyzer|http://svn.apache.org/viewvc/lucene/dev/tags/lucene_solr_3_1/lucene/src/java/org/apache/lucene/analysis/standard/StandardAnalyzer.java?view=markup].
  Fewer than 50 lines of code there, if you take out the comments.  Copy/paste 
suddenly seems doable.  Lucene's Analyzers are best thought of as examples.

Steve

> Make contrib analyzers final
> 
>
> Key: LUCENE-2100
> URL: https://issues.apache.org/jira/browse/LUCENE-2100
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: modules/analysis
>Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.4, 2.4.1, 
> 2.9, 2.9.1, 3.0
>Reporter: Simon Willnauer
>Assignee: Simon Willnauer
>Priority: Minor
> Fix For: 4.0
>
> Attachments: LUCENE-2100.patch, LUCENE-2100.patch
>
>
> The analyzers in contrib/analyzers should all be marked final. None of the 
> Analyzers should ever be subclassed - users should build their own analyzers 
> if a different combination of filters and Tokenizers is desired.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Resolved] (LUCENE-3107) Binary compatibility broken b/w 3.0.3 and 3.1.0

2011-05-16 Thread Steven Rowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steven Rowe resolved LUCENE-3107.
-

Resolution: Invalid

From [item #8 in the "Changes in backward compatibility policy" section in the 
3.1.0 
CHANGES.txt|http://lucene.apache.org/java/3_1_0/changes/Changes.html#3.1.0.changes_in_backwards_compatibility_policy]:

{quote}
LUCENE-2372, LUCENE-2389: StandardAnalyzer, KeywordAnalyzer, 
PerFieldAnalyzerWrapper, WhitespaceTokenizer are now final. Also removed the 
now obsolete and deprecated Analyzer.setOverridesTokenStreamMethod(). Analyzer 
and TokenStream base classes now have an assertion in their ctor, that check 
subclasses to be final or at least have final implementations of 
incrementToken(), tokenStream(), and reusableTokenStream().
{quote}

> Binary compatibility broken b/w 3.0.3 and 3.1.0
> --
>
> Key: LUCENE-3107
> URL: https://issues.apache.org/jira/browse/LUCENE-3107
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: core/index, core/other
>Affects Versions: 3.1
> Environment: Windows Vista Microsoft Windows [Version 6.1.7600]
> java version "1.6.0_24"
> Java(TM) SE Runtime Environment (build 1.6.0_24-b07)
> Java HotSpot(TM) Client VM (build 19.1-b02, mixed mode, sharing)
>Reporter: Esmond Pitt
>Priority: Blocker
>
> StandardAnalyzer became final between 3.0.3 and 3.1.0. Unacceptable binary 
> incompatibility. See my comment in LUCENE-2100.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-2736) Wrong implementation of DocIdSetIterator.advance

2011-05-16 Thread Shai Erera (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shai Erera updated LUCENE-2736:
---

Attachment: LUCENE-2736.patch

Patch with Javadocs fixes. I will commit it later today.

> Wrong implementation of DocIdSetIterator.advance 
> -
>
> Key: LUCENE-2736
> URL: https://issues.apache.org/jira/browse/LUCENE-2736
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: core/search
>Affects Versions: 3.2, 4.0
>Reporter: Hardy Ferentschik
>Assignee: Shai Erera
> Attachments: LUCENE-2736.patch
>
>
> Implementations of {{DocIdSetIterator}} behave differently when advance is 
> called. Taking the following test for {{OpenBitSet}}, {{DocIdBitSet}} and 
> {{SortedVIntList}}, only {{SortedVIntList}} passes the test:
> {code:title=org.apache.lucene.search.TestDocIdSet.java|borderStyle=solid}
> ...
>   public void testAdvanceWithOpenBitSet() throws IOException {
>   DocIdSet idSet = new OpenBitSet( new long[] { 1121 }, 1 );  // 
> bits 0, 5, 6, 10
>   assertAdvance( idSet );
>   }
>   public void testAdvanceDocIdBitSet() throws IOException {
>   BitSet bitSet = new BitSet();
>   bitSet.set( 0 );
>   bitSet.set( 5 );
>   bitSet.set( 6 );
>   bitSet.set( 10 );
>   DocIdSet idSet = new DocIdBitSet(bitSet);
>   assertAdvance( idSet );
>   }
>   public void testAdvanceWithSortedVIntList() throws IOException {
>   DocIdSet idSet = new SortedVIntList( 0, 5, 6, 10 );
>   assertAdvance( idSet );
>   }   
>   private void assertAdvance(DocIdSet idSet) throws IOException {
>   DocIdSetIterator iter = idSet.iterator();
>   int docId = iter.nextDoc();
>   assertEquals( "First doc id should be 0", 0, docId );
>   docId = iter.nextDoc();
>   assertEquals( "Second doc id should be 5", 5, docId );
>   docId = iter.advance( 5 );
>   assertEquals( "Advancing iterator should return the next doc 
> id", 6, docId );
>   }
> {code}
> The javadoc for {{advance}} says:
> {quote}
> Advances to the first *beyond* the current whose document number is greater 
> than or equal to _target_.
> {quote}
> This seems to indicate that {{SortedVIntList}} behaves correctly, whereas the 
> other two don't. 
> Just looking at the {{DocIdBitSet}} implementation advance is implemented as:
> {code}
> bitSet.nextSetBit(target);
> {code}
> where the docs of {{nextSetBit}} say:
> {quote}
> Returns the index of the first bit that is set to true that occurs *on or 
> after* the specified starting index
> {quote}

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Created] (LUCENE-3107) Binary compatibility broken b/w 3.0.3 and 3.1.0

2011-05-16 Thread Esmond Pitt (JIRA)
Binary compatibility broken b/w 3.0.3 and 3.1.0
--

 Key: LUCENE-3107
 URL: https://issues.apache.org/jira/browse/LUCENE-3107
 Project: Lucene - Java
  Issue Type: Bug
  Components: core/index, core/other
Affects Versions: 3.1
 Environment: Windows Vista Microsoft Windows [Version 6.1.7600]
java version "1.6.0_24"
Java(TM) SE Runtime Environment (build 1.6.0_24-b07)
Java HotSpot(TM) Client VM (build 19.1-b02, mixed mode, sharing)
Reporter: Esmond Pitt
Priority: Blocker


StandardAnalyzer became final between 3.0.3 and 3.1.0. Unacceptable binary 
incompatibility. See my comment in LUCENE-2100.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Bulk changing issues in JIRA

2011-05-16 Thread Shai Erera
Hi

If you ever wondered how to bulk change issues in JIRA, here's the
procedure:

* View a list of issues, e.g. by query/filter

* At the top right you'll find the "Tools" menu (screenshot omitted)

* Click on "Tools" and select the bulk change operation (screenshot omitted)

* The screen changes so that next to each issue there's a check box.

* Mark all the issues you want to change and click "Next"

* Select the operation (e.g. Edit)

* The next screen (after choosing the "Edit" operation) lets you edit the
issues. Note the send-notification checkbox at the bottom (screenshot omitted):
deselect it if you don't want to spam the list :).

FYI,
Shai


[jira] [Commented] (LUCENE-2100) Make contrib analyzers final

2011-05-16 Thread Esmond Pitt (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034530#comment-13034530
 ] 

Esmond Pitt commented on LUCENE-2100:
-

Did somebody implement this for 3.1.0? StandardAnalyzer became final between 
3.0.3 and 3.1.0. This is *not acceptable.* Binary compatibility must be 
preserved and to be frank I do not give a good goddam how ugly the code inside 
looks compared to this requirement.

> Make contrib analyzers final
> 
>
> Key: LUCENE-2100
> URL: https://issues.apache.org/jira/browse/LUCENE-2100
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: modules/analysis
>Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.4, 2.4.1, 
> 2.9, 2.9.1, 3.0
>Reporter: Simon Willnauer
>Assignee: Simon Willnauer
>Priority: Minor
> Fix For: 4.0
>
> Attachments: LUCENE-2100.patch, LUCENE-2100.patch
>
>
> The analyzers in contrib/analyzers should all be marked final. None of the 
> Analyzers should ever be subclassed - users should build their own analyzers 
> if a different combination of filters and Tokenizers is desired.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-2736) Wrong implementation of DocIdSetIterator.advance

2011-05-16 Thread Shai Erera (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shai Erera updated LUCENE-2736:
---

  Component/s: (was: core/other)
   core/search
Affects Version/s: (was: 3.0.2)
   3.2
 Assignee: Shai Erera

> Wrong implementation of DocIdSetIterator.advance 
> -
>
> Key: LUCENE-2736
> URL: https://issues.apache.org/jira/browse/LUCENE-2736
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: core/search
>Affects Versions: 3.2, 4.0
>Reporter: Hardy Ferentschik
>Assignee: Shai Erera
>
> Implementations of {{DocIdSetIterator}} behave differently when advance is 
> called. Taking the following test for {{OpenBitSet}}, {{DocIdBitSet}} and 
> {{SortedVIntList}}, only {{SortedVIntList}} passes the test:
> {code:title=org.apache.lucene.search.TestDocIdSet.java|borderStyle=solid}
> ...
>   public void testAdvanceWithOpenBitSet() throws IOException {
>   DocIdSet idSet = new OpenBitSet( new long[] { 1121 }, 1 );  // 
> bits 0, 5, 6, 10
>   assertAdvance( idSet );
>   }
>   public void testAdvanceDocIdBitSet() throws IOException {
>   BitSet bitSet = new BitSet();
>   bitSet.set( 0 );
>   bitSet.set( 5 );
>   bitSet.set( 6 );
>   bitSet.set( 10 );
>   DocIdSet idSet = new DocIdBitSet(bitSet);
>   assertAdvance( idSet );
>   }
>   public void testAdvanceWithSortedVIntList() throws IOException {
>   DocIdSet idSet = new SortedVIntList( 0, 5, 6, 10 );
>   assertAdvance( idSet );
>   }   
>   private void assertAdvance(DocIdSet idSet) throws IOException {
>   DocIdSetIterator iter = idSet.iterator();
>   int docId = iter.nextDoc();
>   assertEquals( "First doc id should be 0", 0, docId );
>   docId = iter.nextDoc();
>   assertEquals( "Second doc id should be 5", 5, docId );
>   docId = iter.advance( 5 );
>   assertEquals( "Advancing iterator should return the next doc 
> id", 6, docId );
>   }
> {code}
> The javadoc for {{advance}} says:
> {quote}
> Advances to the first *beyond* the current whose document number is greater 
> than or equal to _target_.
> {quote}
> This seems to indicate that {{SortedVIntList}} behaves correctly, whereas the 
> other two don't. 
> Just looking at the {{DocIdBitSet}} implementation advance is implemented as:
> {code}
> bitSet.nextSetBit(target);
> {code}
> where the docs of {{nextSetBit}} say:
> {quote}
> Returns the index of the first bit that is set to true that occurs *on or 
> after* the specified starting index
> {quote}

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-2736) Wrong implementation of DocIdSetIterator.advance

2011-05-16 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034527#comment-13034527
 ] 

Shai Erera commented on LUCENE-2736:


Thanks Hardy for reporting that.

But I think this works exactly as documented? Note that the javadocs of 
advance() state "*beyond* the current whose document number is *greater than or 
equal* to target". Also, there's a note in the javadocs:

{noformat}
   * NOTE: when target ≤ current, implementations may opt
   * not to advance beyond their current {@link #docID()}.
{noformat}

I think that the word 'beyond' is confusing here. Perhaps we can modify the 
javadocs to:

"Advances to the first document whose number is greater than or equal to target"

If there are no objections, or better wording, I'll commit this later today, 
but only to 3.2/4.0 and not 3.0.2.
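
As a hedged illustration (not from the issue) of staying within the documented 
contract, a consumer that only ever advances past its current docID avoids the 
ambiguous target ≤ current case entirely:
{noformat}
DocIdSetIterator iter = idSet.iterator();
int doc = iter.nextDoc();
while (doc != DocIdSetIterator.NO_MORE_DOCS) {
  // process doc ...
  doc = iter.advance(doc + 1); // target > current docID, so semantics are unambiguous
}
{noformat}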

> Wrong implementation of DocIdSetIterator.advance 
> -
>
> Key: LUCENE-2736
> URL: https://issues.apache.org/jira/browse/LUCENE-2736
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: core/other
>Affects Versions: 3.0.2, 4.0
>Reporter: Hardy Ferentschik
>
> Implementations of {{DocIdSetIterator}} behave differently when advance is 
> called. Taking the following test for {{OpenBitSet}}, {{DocIdBitSet}} and 
> {{SortedVIntList}}, only {{SortedVIntList}} passes the test:
> {code:title=org.apache.lucene.search.TestDocIdSet.java|borderStyle=solid}
> ...
>   public void testAdvanceWithOpenBitSet() throws IOException {
>   DocIdSet idSet = new OpenBitSet( new long[] { 1121 }, 1 );  // 
> bits 0, 5, 6, 10
>   assertAdvance( idSet );
>   }
>   public void testAdvanceDocIdBitSet() throws IOException {
>   BitSet bitSet = new BitSet();
>   bitSet.set( 0 );
>   bitSet.set( 5 );
>   bitSet.set( 6 );
>   bitSet.set( 10 );
>   DocIdSet idSet = new DocIdBitSet(bitSet);
>   assertAdvance( idSet );
>   }
>   public void testAdvanceWithSortedVIntList() throws IOException {
>   DocIdSet idSet = new SortedVIntList( 0, 5, 6, 10 );
>   assertAdvance( idSet );
>   }   
>   private void assertAdvance(DocIdSet idSet) throws IOException {
>   DocIdSetIterator iter = idSet.iterator();
>   int docId = iter.nextDoc();
>   assertEquals( "First doc id should be 0", 0, docId );
>   docId = iter.nextDoc();
>   assertEquals( "Second doc id should be 5", 5, docId );
>   docId = iter.advance( 5 );
>   assertEquals( "Advancing iterator should return the next doc 
> id", 6, docId );
>   }
> {code}
> The javadoc for {{advance}} says:
> {quote}
> Advances to the first *beyond* the current whose document number is greater 
> than or equal to _target_.
> {quote}
> This seems to indicate that {{SortedVIntList}} behaves correctly, whereas the 
> other two don't. 
> Just looking at the {{DocIdBitSet}} implementation advance is implemented as:
> {code}
> bitSet.nextSetBit(target);
> {code}
> where the docs of {{nextSetBit}} say:
> {quote}
> Returns the index of the first bit that is set to true that occurs *on or 
> after* the specified starting index
> {quote}

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-3106) commongrams filter calls incrementToken() after it returns false

2011-05-16 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-3106:


Attachment: LUCENE-3106.patch

Here's the obvious solution, but there might be a cleaner way to rewrite its 
loop...

> commongrams filter calls incrementToken() after it returns false
> 
>
> Key: LUCENE-3106
> URL: https://issues.apache.org/jira/browse/LUCENE-3106
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: modules/analysis
>Reporter: Robert Muir
> Fix For: 3.2, 4.0
>
> Attachments: LUCENE-3106.patch, LUCENE-3106_test.patch
>
>
> In LUCENE-3064, we beefed up MockTokenizer with assertions, and I started 
> cutting over some analysis tests to use MockTokenizer for better coverage.
> The commongrams tests fail, because they call incrementToken() after it 
> already returns false. 
> In general it's my understanding that consumers should not do this (and I know 
> of a few tokenizers that will actually throw exceptions if you do, just like 
> Java iterators and such).

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-3106) commongrams filter calls incrementToken() after it returns false

2011-05-16 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-3106:


Component/s: modules/analysis

> commongrams filter calls incrementToken() after it returns false
> 
>
> Key: LUCENE-3106
> URL: https://issues.apache.org/jira/browse/LUCENE-3106
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: modules/analysis
>Reporter: Robert Muir
> Fix For: 3.2, 4.0
>
> Attachments: LUCENE-3106_test.patch
>
>
> In LUCENE-3064, we beefed up MockTokenizer with assertions, and I started 
> cutting over some analysis tests to use MockTokenizer for better coverage.
> The commongrams tests fail, because they call incrementToken() after it 
> already returns false. 
> In general it's my understanding that consumers should not do this (and I know 
> of a few tokenizers that will actually throw exceptions if you do, just like 
> Java iterators and such).

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Resolved] (SOLR-2520) JSONResponseWriter w/json.wrf can produce invalid javascript depending on unicode chars in response data

2011-05-16 Thread Yonik Seeley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yonik Seeley resolved SOLR-2520.


   Resolution: Fixed
Fix Version/s: 3.2

Committed to trunk and 3x.
Thanks for bringing this to our attention Benson!
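
A hedged sketch (not the committed patch) of the kind of escaping involved; 
escapeForJsonp is a hypothetical helper name:
{code}
// U+2029 (and its sibling U+2028) are legal inside JSON strings but not inside
// JavaScript source, so a JSONP response must escape them.
static String escapeForJsonp(String json) {
  return json.replace("\u2028", "\\u2028").replace("\u2029", "\\u2029");
}
{code}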

> JSONResponseWriter w/json.wrf can produce invalid javascript depending on 
> unicode chars in response data
> 
>
> Key: SOLR-2520
> URL: https://issues.apache.org/jira/browse/SOLR-2520
> Project: Solr
>  Issue Type: Bug
>Affects Versions: 4.0
>Reporter: Benson Margulies
> Fix For: 3.2
>
> Attachments: SOLR-2520.patch
>
>
> Please see http://timelessrepo.com/json-isnt-a-javascript-subset.
> If a stored field contains Unicode characters that are valid in Json but not 
> valid in Javascript, and you use the query option to ask for JSONP 
> (json.wrf), solr does *not* escape them, resulting in content that explodes 
> on contact with browsers. That is, there are certain Unicode characters that 
> are valid JSON but invalid in Javascript source, and a JSONP response is 
> javascript source, to be incorporated in an HTML script tag. Further 
> investigation suggests that only one character is a problem here: U+2029 
> must be represented as \u2029 instead of being left 'as-is'.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Created] (SOLR-2522) Change max() and min() to work on multiValued fields

2011-05-16 Thread Bill Bell (JIRA)
Change max() and min() to work on multiValued fields 
-

 Key: SOLR-2522
 URL: https://issues.apache.org/jira/browse/SOLR-2522
 Project: Solr
  Issue Type: Improvement
Reporter: Bill Bell


Switch the max() and min() functions to work on multiValued fields, so that we 
can do sort=min(fieldname) asc and have the sort work on multiValued fields...



--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-3106) commongrams filter calls incrementToken() after it returns false

2011-05-16 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-3106:


Attachment: LUCENE-3106_test.patch

patch with the test modifications to produce the failure.

> commongrams filter calls incrementToken() after it returns false
> 
>
> Key: LUCENE-3106
> URL: https://issues.apache.org/jira/browse/LUCENE-3106
> Project: Lucene - Java
>  Issue Type: Bug
>Reporter: Robert Muir
> Fix For: 3.2, 4.0
>
> Attachments: LUCENE-3106_test.patch
>
>
> In LUCENE-3064, we beefed up MockTokenizer with assertions, and I started 
> cutting over some analysis tests to use MockTokenizer for better coverage.
> The commongrams tests fail, because they call incrementToken() after it 
> already returns false. 
> In general it's my understanding that consumers should not do this (and I know 
> of a few tokenizers that will actually throw exceptions if you do, just like 
> Java iterators and such).

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Created] (LUCENE-3106) commongrams filter calls incrementToken() after it returns false

2011-05-16 Thread Robert Muir (JIRA)
commongrams filter calls incrementToken() after it returns false


 Key: LUCENE-3106
 URL: https://issues.apache.org/jira/browse/LUCENE-3106
 Project: Lucene - Java
  Issue Type: Bug
Reporter: Robert Muir
 Fix For: 3.2, 4.0


In LUCENE-3064, we beefed up MockTokenizer with assertions, and I started 
cutting over some analysis tests to use MockTokenizer for better coverage.

The commongrams tests fail, because they call incrementToken() after it already 
returns false. 

In general it's my understanding that consumers should not do this (and I know 
of a few tokenizers that will actually throw exceptions if you do, just like 
Java iterators and such).
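
For reference, a sketch (not from the issue) of the consumer pattern this 
implies, assuming the 3.x TokenStream API:
{noformat}
TokenStream ts = analyzer.tokenStream("field", reader);
ts.reset();
while (ts.incrementToken()) {
  // read attributes for the current token ...
}
ts.end();   // finalize offset state
ts.close();
// never call incrementToken() again once it has returned false
{noformat}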

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-2339) No error reported when sorting on a multiValued field

2011-05-16 Thread Bill Bell (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034492#comment-13034492
 ] 

Bill Bell commented on SOLR-2339:
-

Guys,

How are we going to support sorting on multiValued fields?

Would a function work for this?

> No error reported when sorting on a multiValued field
> -
>
> Key: SOLR-2339
> URL: https://issues.apache.org/jira/browse/SOLR-2339
> Project: Solr
>  Issue Type: Bug
>  Components: search
>Reporter: Hoss Man
>Assignee: Hoss Man
> Fix For: 3.1, 4.0
>
> Attachments: SOLR-2339.patch, SOLR-2339.patch
>
>
> In the past, Solr has relied on the underlying FieldCache to throw an error 
> in situations where sorting on a field was not possible.  however LUCENE-2142 
> has changed this, so that FieldCache never throws an error.
> In order to maintain the functionality of past Solr releases (ie: error when 
> users attempt to sort on a field that we known will produce meaningless 
> results) we should add some sort of check at the Solr level.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-152) [PATCH] KStem for Lucene

2011-05-16 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034478#comment-13034478
 ] 

Mark Miller commented on LUCENE-152:


bq. More specifically: compile time dependencies on compiled BSD libraries are 
fine, but actually incorporating and releasing code that is under a BSD license 
is something we aren't supposed to do (last time I checked)

Code is fine too, AFAIK:
http://www.apache.org/legal/3party.html

> [PATCH] KStem for Lucene
> 
>
> Key: LUCENE-152
> URL: https://issues.apache.org/jira/browse/LUCENE-152
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: modules/analysis
>Affects Versions: unspecified
> Environment: Operating System: other
> Platform: Other
>Reporter: Otis Gospodnetic
>Priority: Minor
>
> September 10th 2003 contribution from "Sergio Guzman-Lara" 
> 
> Original email:
> Hi all,
>   I have ported the kstem stemmer to Java and incorporated it to 
> Lucene. You can get the source code (Kstem.jar) from the following website:
> http://ciir.cs.umass.edu/downloads/
>   Just click on "KStem Java Implementation" (you will need to register 
> your e-mail, for free of course, with the CIIR --Center for Intelligent 
> Information Retrieval, UMass -- and get an access code).
> Content of Kstem.jar:
> java/org/apache/lucene/analysis/KStemData1.java
> java/org/apache/lucene/analysis/KStemData2.java
> java/org/apache/lucene/analysis/KStemData3.java
> java/org/apache/lucene/analysis/KStemData4.java
> java/org/apache/lucene/analysis/KStemData5.java
> java/org/apache/lucene/analysis/KStemData6.java
> java/org/apache/lucene/analysis/KStemData7.java
> java/org/apache/lucene/analysis/KStemData8.java
> java/org/apache/lucene/analysis/KStemFilter.java
> java/org/apache/lucene/analysis/KStemmer.java
> KStemData1.java, ..., KStemData8.java   Contain several lists of words 
> used by Kstem
> KStemmer.java  Implements the Kstem algorithm 
> KStemFilter.java Extends TokenFilter applying Kstem
> To compile
> unjar the file Kstem.jar to Lucene's "src" directory, and compile it 
> there. 
> What is Kstem?
>   A stemmer designed by Bob Krovetz (for more information see 
> http://ciir.cs.umass.edu/pubfiles/ir-35.pdf). 
> Copyright issues
>   This is open source. The actual license agreement is included at the 
> top of every source file.
>  Any comments/questions/suggestions are welcome,
>   Sergio Guzman-Lara
>   Senior Research Fellow
>   CIIR UMass

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (SOLR-2424) extracted text from tika has no spaces

2011-05-16 Thread Liam O'Boyle (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liam O'Boyle updated SOLR-2424:
---

Attachment: ET2000 Service Manual.pdf

This file has problems which trigger this bug.

> extracted text from tika has no spaces
> --
>
> Key: SOLR-2424
> URL: https://issues.apache.org/jira/browse/SOLR-2424
> Project: Solr
>  Issue Type: Bug
>  Components: contrib - Solr Cell (Tika extraction)
>Affects Versions: 3.1
>Reporter: Yonik Seeley
> Attachments: ET2000 Service Manual.pdf
>
>
> Try this:
> curl "http://localhost:8983/solr/update/extract?extractOnly=true&wt=json&indent=true" \
>   -F "tutorial=@tutorial.pdf"
> And you get text output w/o spaces: 
> "ThisdocumentcoversthebasicsofrunningSolru"...

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-152) [PATCH] KStem for Lucene

2011-05-16 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034456#comment-13034456
 ] 

Mark Miller commented on LUCENE-152:


To extract a bit for clarity:

{quote}
This form is not for new projects. This is for projects and PMCs that have 
already been created and are receiving a code donation into an existing 
codebase. Any code that was developed outside of the ASF SVN repository and our 
public mailing lists must be processed like this, even if the external 
developer is already an ASF committer.
{quote}



> [PATCH] KStem for Lucene
> 
>
> Key: LUCENE-152
> URL: https://issues.apache.org/jira/browse/LUCENE-152
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: modules/analysis
>Affects Versions: unspecified
> Environment: Operating System: other
> Platform: Other
>Reporter: Otis Gospodnetic
>Priority: Minor
>
> September 10th 2003 contribution from "Sergio Guzman-Lara" 
> 
> Original email:
> Hi all,
>   I have ported the kstem stemmer to Java and incorporated it to 
> Lucene. You can get the source code (Kstem.jar) from the following website:
> http://ciir.cs.umass.edu/downloads/
>   Just click on "KStem Java Implementation" (you will need to register 
> your e-mail, for free of course, with the CIIR --Center for Intelligent 
> Information Retrieval, UMass -- and get an access code).
> Content of Kstem.jar:
> java/org/apache/lucene/analysis/KStemData1.java
> java/org/apache/lucene/analysis/KStemData2.java
> java/org/apache/lucene/analysis/KStemData3.java
> java/org/apache/lucene/analysis/KStemData4.java
> java/org/apache/lucene/analysis/KStemData5.java
> java/org/apache/lucene/analysis/KStemData6.java
> java/org/apache/lucene/analysis/KStemData7.java
> java/org/apache/lucene/analysis/KStemData8.java
> java/org/apache/lucene/analysis/KStemFilter.java
> java/org/apache/lucene/analysis/KStemmer.java
> KStemData1.java, ..., KStemData8.java   Contain several lists of words 
> used by Kstem
> KStemmer.java  Implements the Kstem algorithm 
> KStemFilter.java Extends TokenFilter applying Kstem
> To compile
> unjar the file Kstem.jar to Lucene's "src" directory, and compile it 
> there. 
> What is Kstem?
>   A stemmer designed by Bob Krovetz (for more information see 
> http://ciir.cs.umass.edu/pubfiles/ir-35.pdf). 
> Copyright issues
>   This is open source. The actual license agreement is included at the 
> top of every source file.
>  Any comments/questions/suggestions are welcome,
>   Sergio Guzman-Lara
>   Senior Research Fellow
>   CIIR UMass

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-152) [PATCH] KStem for Lucene

2011-05-16 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034454#comment-13034454
 ] 

Mark Miller commented on LUCENE-152:


bq. Uh... that may be a stretch.

It's what the incubator seems to recommend, and the side we've erred on in the 
past.

http://incubator.apache.org/ip-clearance/index.html

If it was developed outside of Apache, we don't really know its IP history, 
and that's something we want to take seriously.

> [PATCH] KStem for Lucene
> 
>
> Key: LUCENE-152
> URL: https://issues.apache.org/jira/browse/LUCENE-152
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: modules/analysis
>Affects Versions: unspecified
> Environment: Operating System: other
> Platform: Other
>Reporter: Otis Gospodnetic
>Priority: Minor
>
> September 10th 2003 contribution from "Sergio Guzman-Lara" 
> 
> Original email:
> Hi all,
>   I have ported the kstem stemmer to Java and incorporated it to 
> Lucene. You can get the source code (Kstem.jar) from the following website:
> http://ciir.cs.umass.edu/downloads/
>   Just click on "KStem Java Implementation" (you will need to register 
> your e-mail, for free of course, with the CIIR --Center for Intelligent 
> Information Retrieval, UMass -- and get an access code).
> Content of Kstem.jar:
> java/org/apache/lucene/analysis/KStemData1.java
> java/org/apache/lucene/analysis/KStemData2.java
> java/org/apache/lucene/analysis/KStemData3.java
> java/org/apache/lucene/analysis/KStemData4.java
> java/org/apache/lucene/analysis/KStemData5.java
> java/org/apache/lucene/analysis/KStemData6.java
> java/org/apache/lucene/analysis/KStemData7.java
> java/org/apache/lucene/analysis/KStemData8.java
> java/org/apache/lucene/analysis/KStemFilter.java
> java/org/apache/lucene/analysis/KStemmer.java
> KStemData1.java, ..., KStemData8.java   Contain several lists of words 
> used by Kstem
> KStemmer.java  Implements the Kstem algorithm 
> KStemFilter.java Extends TokenFilter applying Kstem
> To compile
> unjar the file Kstem.jar to Lucene's "src" directory, and compile it 
> there. 
> What is Kstem?
>   A stemmer designed by Bob Krovetz (for more information see 
> http://ciir.cs.umass.edu/pubfiles/ir-35.pdf). 
> Copyright issues
>   This is open source. The actual license agreement is included at the 
> top of every source file.
>  Any comments/questions/suggestions are welcome,
>   Sergio Guzman-Lara
>   Senior Research Fellow
>   CIIR UMass

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: 3.2.0 (or 3.1.1)

2011-05-16 Thread Chris Hostetter

: > I don't disagree, but the devil's advocate argument is "given the relative
: > size of the change sets, testing a 3.1.1 release is likely to be easier
: > than testing a 3.2 release, and the patches committed to the 3.1.x branch
: > are less likely to have introduced new bugs (because they only contain bug
: > fixes and not new features)"

: that's true, but 3.2 also has better test coverage than 3.1.1 (a couple
: TestIdeas were worked off the list), and it's in Hudson's rotation
: every half hour.

+1 ... no argument.


-Hoss

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-152) [PATCH] KStem for Lucene

2011-05-16 Thread Hoss Man (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034450#comment-13034450
 ] 

Hoss Man commented on LUCENE-152:
-

bq. even if its Apache 2 licensed code. 

Uh... that may be a stretch.

More specifically: compile time dependencies on compiled BSD libraries are 
fine, but actually incorporating and *releasing* code that is under a BSD 
license is something we aren't supposed to do (last time I checked)

> [PATCH] KStem for Lucene
> 
>
> Key: LUCENE-152
> URL: https://issues.apache.org/jira/browse/LUCENE-152
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: modules/analysis
>Affects Versions: unspecified
> Environment: Operating System: other
> Platform: Other
>Reporter: Otis Gospodnetic
>Priority: Minor
>
> September 10th 2003 contribution from "Sergio Guzman-Lara" 
> 
> Original email:
> Hi all,
>   I have ported the kstem stemmer to Java and incorporated it to 
> Lucene. You can get the source code (Kstem.jar) from the following website:
> http://ciir.cs.umass.edu/downloads/
>   Just click on "KStem Java Implementation" (you will need to register 
> your e-mail, for free of course, with the CIIR --Center for Intelligent 
> Information Retrieval, UMass -- and get an access code).
> Content of Kstem.jar:
> java/org/apache/lucene/analysis/KStemData1.java
> java/org/apache/lucene/analysis/KStemData2.java
> java/org/apache/lucene/analysis/KStemData3.java
> java/org/apache/lucene/analysis/KStemData4.java
> java/org/apache/lucene/analysis/KStemData5.java
> java/org/apache/lucene/analysis/KStemData6.java
> java/org/apache/lucene/analysis/KStemData7.java
> java/org/apache/lucene/analysis/KStemData8.java
> java/org/apache/lucene/analysis/KStemFilter.java
> java/org/apache/lucene/analysis/KStemmer.java
> KStemData1.java, ..., KStemData8.java   Contain several lists of words 
> used by Kstem
> KStemmer.java  Implements the Kstem algorithm 
> KStemFilter.java Extends TokenFilter applying Kstem
> To compile
> unjar the file Kstem.jar to Lucene's "src" directory, and compile it 
> there. 
> What is Kstem?
>   A stemmer designed by Bob Krovetz (for more information see 
> http://ciir.cs.umass.edu/pubfiles/ir-35.pdf). 
> Copyright issues
>   This is open source. The actual license agreement is included at the 
> top of every source file.
>  Any comments/questions/suggestions are welcome,
>   Sergio Guzman-Lara
>   Senior Research Fellow
>   CIIR UMass

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-152) [PATCH] KStem for Lucene

2011-05-16 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034439#comment-13034439
 ] 

Mark Miller commented on LUCENE-152:


The general rule is that if it's a fair amount of code, and it was developed 
outside of the Apache system, we want a software grant - even if it's Apache 2 
licensed code. 

> [PATCH] KStem for Lucene
> 
>
> Key: LUCENE-152
> URL: https://issues.apache.org/jira/browse/LUCENE-152
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: modules/analysis
>Affects Versions: unspecified
> Environment: Operating System: other
> Platform: Other
>Reporter: Otis Gospodnetic
>Priority: Minor
>
> September 10th 2003 contribution from "Sergio Guzman-Lara" 
> 
> Original email:
> Hi all,
>   I have ported the kstem stemmer to Java and incorporated it to 
> Lucene. You can get the source code (Kstem.jar) from the following website:
> http://ciir.cs.umass.edu/downloads/
>   Just click on "KStem Java Implementation" (you will need to register 
> your e-mail, for free of course, with the CIIR --Center for Intelligent 
> Information Retrieval, UMass -- and get an access code).
> Content of Kstem.jar:
> java/org/apache/lucene/analysis/KStemData1.java
> java/org/apache/lucene/analysis/KStemData2.java
> java/org/apache/lucene/analysis/KStemData3.java
> java/org/apache/lucene/analysis/KStemData4.java
> java/org/apache/lucene/analysis/KStemData5.java
> java/org/apache/lucene/analysis/KStemData6.java
> java/org/apache/lucene/analysis/KStemData7.java
> java/org/apache/lucene/analysis/KStemData8.java
> java/org/apache/lucene/analysis/KStemFilter.java
> java/org/apache/lucene/analysis/KStemmer.java
> KStemData1.java, ..., KStemData8.java   Contain several lists of words 
> used by Kstem
> KStemmer.java  Implements the Kstem algorithm 
> KStemFilter.java Extends TokenFilter applying Kstem
> To compile
> unjar the file Kstem.jar to Lucene's "src" directory, and compile it 
> there. 
> What is Kstem?
>   A stemmer designed by Bob Krovetz (for more information see 
> http://ciir.cs.umass.edu/pubfiles/ir-35.pdf). 
> Copyright issues
>   This is open source. The actual license agreement is included at the 
> top of every source file.
>  Any comments/questions/suggestions are welcome,
>   Sergio Guzman-Lara
>   Senior Research Fellow
>   CIIR UMass

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: 3.2.0 (or 3.1.1)

2011-05-16 Thread Robert Muir
On Mon, May 16, 2011 at 7:41 PM, Chris Hostetter
 wrote:

> I don't disagree, but the devil's advocate argument is "given the relative
> size of the change sets, testing a 3.1.1 release is likely to be easier
> than testing a 3.2 release, and the patches committed to the 3.1.x branch
> are less likely to have introduced new bugs (because they only contain bug
> fixes and not new features)"
>

that's true, but 3.2 also has better test coverage than 3.1.1 (a couple
TestIdeas were worked off the list), and it's in Hudson's rotation
every half hour.

Additionally there are at least one or two test coverage things we can
backport from trunk to 3.2 just because... which seems more productive
than backporting things from branch_3x to a bugfix 3.1.1 branch that
isn't even being tested by Hudson.

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: 3.2.0 (or 3.1.1)

2011-05-16 Thread Chris Hostetter

: My vote would be to just spend our time on 3.2. People get bugfixes,
: better test coverage, and a couple of new features and optimizations,
: too.
: Is it really going to be harder to release 3.2 than to release 3.1.1?

I don't disagree, but the devil's advocate argument is "given the relative 
size of the change sets, testing a 3.1.1 release is likely to be easier 
than testing a 3.2 release, and the patches committed to the 3.1.x branch 
are less likely to have introduced new bugs (because they only contain bug 
fixes and not new features)"



-Hoss

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: 3.2.0 (or 3.1.1)

2011-05-16 Thread Chris Hostetter

: And also, we should adopt that approach going forward (no more bug fix
: releases for the stable branch, except for the last release before 4.0
: is out). That means updating the release TODO with e.g., not creating
: a branch for 3.2.x, only tag it. When 4.0 is out, we branch 3.x.y out
: of the last 3.x tag.

I don't know that we need to box ourselves in ... if someone discovers a 
massively critical bug the day after 3.2 is released, it's totally 
reasonable/sensible to do a quick 3.2.1 release.

That said: i don't know that we have to create the 3.2.x branch when we 
create the 3.2 tag ... we can certainly do a lazy instantiation as needed.

Bottom line: 3.x.0 releases are still "feature releases on the stable api 
branch", and as long as we can maintain inertia on relatively rapid turnaround 
of feature releases then great -- but that doesn't mean we should completely 
rule out having 3.x.y "bug fix" releases.


-Hoss

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (SOLR-2520) JSONResponseWriter w/json.wrf can produce invalid javascript depending on unicode chars in response data

2011-05-16 Thread Benson Margulies (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benson Margulies updated SOLR-2520:
---

Description: 
Please see http://timelessrepo.com/json-isnt-a-javascript-subset.

If a stored field contains Unicode characters that are valid in Json but not 
valid in Javascript, and you use the query option to ask for JSONP (json.wrf), 
solr does *not* escape them, resulting in content that explodes on contact with 
browsers. That is, there are certain Unicode characters that are valid JSON but 
invalid in Javascript source, and a JSONP response is javascript source, to be 
incorporated in an HTML script tag. Further investigation suggests that only 
one character is a problem here: U+2029 must be represented as \u2029 instead 
of being left 'as-is'.


  was:
Please see http://timelessrepo.com/json-isnt-a-javascript-subset.

If a stored field contains Unicode characters that are valid in Json but not 
valid in Javascript, and you use the query option to ask for jsonp (json.wrt), 
solr does *not* escape them characters, resulting in content that explodes on 
contact with browsers. That is, there are certain Unicode characters that are 
valid JSON but invalid in Javascript source, and a JSONP response is javascript 
source, to be incorporated in an HTML script tag. 



> JSONResponseWriter w/json.wrf can produce invalid javascript depending on 
> unicode chars in response data
> 
>
> Key: SOLR-2520
> URL: https://issues.apache.org/jira/browse/SOLR-2520
> Project: Solr
>  Issue Type: Bug
>Affects Versions: 4.0
>Reporter: Benson Margulies
> Attachments: SOLR-2520.patch
>
>
> Please see http://timelessrepo.com/json-isnt-a-javascript-subset.
> If a stored field contains Unicode characters that are valid in Json but not 
> valid in Javascript, and you use the query option to ask for JSONP 
> (json.wrf), solr does *not* escape them, resulting in content that explodes 
> on contact with browsers. That is, there are certain Unicode characters that 
> are valid JSON but invalid in Javascript source, and a JSONP response is 
> javascript source, to be incorporated in an HTML script tag. Further 
> investigation suggests that only one character is a problem here: U+2029 
> must be represented as \u2029 instead of being left 'as-is'.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (SOLR-2520) JSONResponseWriter w/json.wrf can produce invalid javascript depending on unicode chars in response data

2011-05-16 Thread Benson Margulies (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benson Margulies updated SOLR-2520:
---

Description: 
Please see http://timelessrepo.com/json-isnt-a-javascript-subset.

If a stored field contains Unicode characters that are valid in Json but not 
valid in Javascript, and you use the query option to ask for jsonp (json.wrt), 
solr does *not* escape them characters, resulting in content that explodes on 
contact with browsers. That is, there are certain Unicode characters that are 
valid JSON but invalid in Javascript source, and a JSONP response is javascript 
source, to be incorporated in an HTML script tag. 


  was:
Please see http://timelessrepo.com/json-isnt-a-javascript-subset.

If a stored field contains invalid Javascript characters, and you use the query 
option to ask for jsonp, solr does *not* escape some invalid Unicode 
characters, resulting in strings that explode on contact with browsers.



> JSONResponseWriter w/json.wrf can produce invalid javascript depending on 
> unicode chars in response data
> 
>
> Key: SOLR-2520
> URL: https://issues.apache.org/jira/browse/SOLR-2520
> Project: Solr
>  Issue Type: Bug
>Affects Versions: 4.0
>Reporter: Benson Margulies
> Attachments: SOLR-2520.patch
>
>
> Please see http://timelessrepo.com/json-isnt-a-javascript-subset.
> If a stored field contains Unicode characters that are valid in Json but not 
> valid in Javascript, and you use the query option to ask for jsonp 
> (json.wrt), solr does *not* escape them characters, resulting in content that 
> explodes on contact with browsers. That is, there are certain Unicode 
> characters that are valid JSON but invalid in Javascript source, and a JSONP 
> response is javascript source, to be incorporated in an HTML script tag. 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (SOLR-2520) JSONResponseWriter w/json.wrf can produce invalid javascript depending on unicode chars in response data

2011-05-16 Thread Yonik Seeley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yonik Seeley updated SOLR-2520:
---

Attachment: SOLR-2520.patch

Here's a patch w/ simple test.

> JSONResponseWriter w/json.wrf can produce invalid javascript depending on 
> unicode chars in response data
> 
>
> Key: SOLR-2520
> URL: https://issues.apache.org/jira/browse/SOLR-2520
> Project: Solr
>  Issue Type: Bug
>Affects Versions: 4.0
>Reporter: Benson Margulies
> Attachments: SOLR-2520.patch
>
>
> Please see http://timelessrepo.com/json-isnt-a-javascript-subset.
> If a stored field contains invalid Javascript characters, and you use the 
> query option to ask for jsonp, solr does *not* escape some invalid Unicode 
> characters, resulting in strings that explode on contact with browsers.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-2548) Remove all interning of field names from flex API

2011-05-16 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034393#comment-13034393
 ] 

Robert Muir commented on LUCENE-2548:
-

After seeing LUCENE-3105, I think we should take steps to remove this interning.

It looks like this can probably be done safely: according to 
http://www.cs.umd.edu/~jfoster/papers/issre04.pdf, FindBugs, PMD, and JLint 
all support looking for string equality with == or !=, so we should be able to 
review all occurrences.
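
The class of bug those tools would flag, sketched with hypothetical field names:
{code}
// Correct only while field names are interned; breaks silently once interning is removed:
if (term.field() == other.field()) { /* ... */ }

// Safe regardless of interning:
if (term.field().equals(other.field())) { /* ... */ }
{code}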

> Remove all interning of field names from flex API
> -
>
> Key: LUCENE-2548
> URL: https://issues.apache.org/jira/browse/LUCENE-2548
> Project: Lucene - Java
>  Issue Type: Improvement
>Reporter: Uwe Schindler
> Fix For: 4.0
>
>
> In previous versions of Lucene, interning of fields was important to minimize 
> string comparison cost when iterating TermEnums, to detect changes in field 
> name. As we separated field names from terms in flex, no query compares field 
> names anymore, so the whole performance problematic interning can be removed. 
> I will start with doing this, but we need to carefully review some places 
> e.g. in preflex codec.
> Maybe before this issue we should remove the Term class completely. :-) 
> Robert?

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-2445) unknown handler: standard

2011-05-16 Thread Koji Sekiguchi (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034394#comment-13034394
 ] 

Koji Sekiguchi commented on SOLR-2445:
--

Any objections to applying this trivial patch to 3.1.1?

> unknown handler: standard
> -
>
> Key: SOLR-2445
> URL: https://issues.apache.org/jira/browse/SOLR-2445
> Project: Solr
>  Issue Type: Bug
>Affects Versions: 1.4.1, 3.1, 3.2, 4.0
>Reporter: Koji Sekiguchi
>Assignee: Koji Sekiguchi
>Priority: Minor
> Fix For: 3.2, 4.0
>
> Attachments: SOLR-2445.patch, qt-form-jsp.patch
>
>
> To reproduce the problem using the example config, go to form.jsp, use 
> standard for qt (it is the default), then click Search.




[jira] [Updated] (LUCENE-3105) String.intern() calls slow down IndexWriter.close() and IndexReader.open() for index with large number of unique field names

2011-05-16 Thread Mark Kristensson (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Kristensson updated LUCENE-3105:
-

Attachment: LUCENE-3105.patch

Patch file to eliminate String.intern() calls while opening IndexReaders and 
closing IndexWriters.

> String.intern() calls slow down IndexWriter.close() and IndexReader.open() 
> for index with large number of unique field names
> 
>
> Key: LUCENE-3105
> URL: https://issues.apache.org/jira/browse/LUCENE-3105
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: core/index
>Affects Versions: 3.1
>Reporter: Mark Kristensson
> Attachments: LUCENE-3105.patch
>
>
> We have one index with several hundred thousand unique field names (we're 
> optimistic that Lucene 4.0 is flexible enough to allow us to change our index 
> design...) and found that opening an index writer and closing an index reader 
> results in horribly slow performance on that one index. I have isolated the 
> problem to the calls to String.intern() that are used to allow for quick 
> string comparisons of field names throughout Lucene. These String.intern() 
> calls are unnecessary and can be replaced with a hashmap lookup. In fact, 
> StringHelper.java has its own hashmap implementation that it uses in 
> conjunction with String.intern(). Rather than using a one-off hashmap, I've 
> elected to use a ConcurrentHashMap in this patch.




[jira] [Created] (LUCENE-3105) String.intern() calls slow down IndexWriter.close() and IndexReader.open() for index with large number of unique field names

2011-05-16 Thread Mark Kristensson (JIRA)
String.intern() calls slow down IndexWriter.close() and IndexReader.open() for 
index with large number of unique field names


 Key: LUCENE-3105
 URL: https://issues.apache.org/jira/browse/LUCENE-3105
 Project: Lucene - Java
  Issue Type: Bug
  Components: core/index
Affects Versions: 3.1
Reporter: Mark Kristensson


We have one index with several hundred thousand unique field names (we're 
optimistic that Lucene 4.0 is flexible enough to allow us to change our index 
design...) and found that opening an index writer and closing an index reader 
results in horribly slow performance on that one index. I have isolated the 
problem to the calls to String.intern() that are used to allow for quick 
string comparisons of field names throughout Lucene. These String.intern() 
calls are unnecessary and can be replaced with a hashmap lookup. In fact, 
StringHelper.java has its own hashmap implementation that it uses in 
conjunction with String.intern(). Rather than using a one-off hashmap, I've 
elected to use a ConcurrentHashMap in this patch.
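
A minimal sketch of the map-based replacement described above (illustrative; this 
is not the attached patch, and the class and method names are made up):

{noformat}
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for the StringHelper change: canonicalize field
// names via a ConcurrentHashMap instead of the JVM-wide intern pool.
final class FieldNameCanonicalizer {
  private static final ConcurrentHashMap<String, String> CACHE =
      new ConcurrentHashMap<String, String>();

  static String canonicalize(String name) {
    String existing = CACHE.putIfAbsent(name, name);
    return existing == null ? name : existing;
  }
}
{noformat}

Unlike String.intern(), such a map lives on the normal heap and can be sized, 
cleared, or made per-index rather than JVM-global.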




[jira] [Commented] (LUCENE-152) [PATCH] KStem for Lucene

2011-05-16 Thread Steven Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034368#comment-13034368
 ] 

Steven Rowe commented on LUCENE-152:


If the original sources are BSD-licensed, is a software grant required to 
incorporate them into the Lucene/Solr source tree?

> [PATCH] KStem for Lucene
> 
>
> Key: LUCENE-152
> URL: https://issues.apache.org/jira/browse/LUCENE-152
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: modules/analysis
>Affects Versions: unspecified
> Environment: Operating System: other
> Platform: Other
>Reporter: Otis Gospodnetic
>Priority: Minor
>
> September 10th 2003 contribution from "Sergio Guzman-Lara" 
> 
> Original email:
> Hi all,
>   I have ported the kstem stemmer to Java and incorporated it into 
> Lucene. You can get the source code (Kstem.jar) from the following website:
> http://ciir.cs.umass.edu/downloads/
>   Just click on "KStem Java Implementation" (you will need to register 
> your e-mail, for free of course, with the CIIR --Center for Intelligent 
> Information Retrieval, UMass -- and get an access code).
> Content of Kstem.jar:
> java/org/apache/lucene/analysis/KStemData1.java
> java/org/apache/lucene/analysis/KStemData2.java
> java/org/apache/lucene/analysis/KStemData3.java
> java/org/apache/lucene/analysis/KStemData4.java
> java/org/apache/lucene/analysis/KStemData5.java
> java/org/apache/lucene/analysis/KStemData6.java
> java/org/apache/lucene/analysis/KStemData7.java
> java/org/apache/lucene/analysis/KStemData8.java
> java/org/apache/lucene/analysis/KStemFilter.java
> java/org/apache/lucene/analysis/KStemmer.java
> KStemData1.java, ..., KStemData8.java   Contain several lists of words 
> used by Kstem
> KStemmer.java  Implements the Kstem algorithm 
> KStemFilter.java Extends TokenFilter applying Kstem
> To compile
> unjar the file Kstem.jar into Lucene's "src" directory, and compile it 
> there. 
> What is Kstem?
>   A stemmer designed by Bob Krovetz (for more information see 
> http://ciir.cs.umass.edu/pubfiles/ir-35.pdf). 
> Copyright issues
>   This is open source. The actual license agreement is included at the 
> top of every source file.
>  Any comments/questions/suggestions are welcome,
>   Sergio Guzman-Lara
>   Senior Research Fellow
>   CIIR UMass




[jira] [Commented] (SOLR-2520) JSONResponseWriter w/json.wrf can produce invalid javascript depending on unicode chars in response data

2011-05-16 Thread Benson Margulies (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034352#comment-13034352
 ] 

Benson Margulies commented on SOLR-2520:


Yes, that looks like it.

> JSONResponseWriter w/json.wrf can produce invalid javascript depending on 
> unicode chars in response data
> 
>
> Key: SOLR-2520
> URL: https://issues.apache.org/jira/browse/SOLR-2520
> Project: Solr
>  Issue Type: Bug
>Affects Versions: 4.0
>Reporter: Benson Margulies
>
> Please see http://timelessrepo.com/json-isnt-a-javascript-subset.
> If a stored field contains invalid Javascript characters, and you use the 
> query option to ask for jsonp, solr does *not* escape some invalid Unicode 
> characters, resulting in strings that explode on contact with browsers.




[jira] [Created] (LUCENE-3104) Hook up Automated Patch Checking for Lucene/Solr

2011-05-16 Thread Grant Ingersoll (JIRA)
Hook up Automated Patch Checking for Lucene/Solr


 Key: LUCENE-3104
 URL: https://issues.apache.org/jira/browse/LUCENE-3104
 Project: Lucene - Java
  Issue Type: Task
Reporter: Grant Ingersoll


It would be really great if we could get feedback to contributors sooner on 
many things that are basic (tests exist, patch applies cleanly, etc.)

From Nigel Daley on builds@a.o
{quote}

I revamped the precommit testing in the fall so that it doesn't use Jira email 
anymore to trigger a build.  The process is controlled by
https://builds.apache.org/hudson/job/PreCommit-Admin/
which has some documentation up at the top of the job.  You can look at the 
config of the job (do you have access?) to see what it's doing.  Any project 
could use this same admin job -- you just need to ask me to add the project to 
the Jira filter used by the admin job 
(https://issues.apache.org/jira/sr/jira.issueviews:searchrequest-xml/12313474/SearchRequest-12313474.xml?tempMax=100
 ) once you have the downstream job(s) setup for your specific project.  For 
Hadoop we have 3 downstream builds configured which also have some 
documentation:
https://builds.apache.org/hudson/job/PreCommit-HADOOP-Build/
https://builds.apache.org/hudson/job/PreCommit-MAPREDUCE-Build/
https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/
{quote}





[jira] [Commented] (LUCENE-3097) Post grouping faceting

2011-05-16 Thread Martijn van Groningen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034312#comment-13034312
 ] 

Martijn van Groningen commented on LUCENE-3097:
---

bq. Ie, we just have to insure, at indexing time, that docs within the same 
"group" are adjacent, if you want to be able to count by unique group values.
This means that docs in the same group also need to be in the same segment, 
right? Or, if we use this mechanism for faceting, documents with the same facet 
value need to be in the same segment? If that is true, it would make the 
collectors easier: the SentinelIntSet we use in the collectors would not be 
necessary, because we can look up the norm from the DocIndexTerms and we would 
never find the same group in a different segment. On the other hand, with 
scalability in mind, it would make things complex, since documents in the same 
group would need to be in the same segment, which makes indexing complex.


> Post grouping faceting
> --
>
> Key: LUCENE-3097
> URL: https://issues.apache.org/jira/browse/LUCENE-3097
> Project: Lucene - Java
>  Issue Type: New Feature
>Reporter: Martijn van Groningen
>Priority: Minor
> Fix For: 3.2, 4.0
>
>
> This issue focuses on implementing post-grouping faceting.
> * How to handle multivalued fields. What field value to show with the facet.
> * What the facet counts should be based on:
> ** Facet counts can be based on the normal documents. Ungrouped counts. 
> ** Facet counts can be based on the groups. Grouped counts.
> ** Facet counts can be based on the combination of group value and facet 
> value. Matrix counts.   
> And probably more implementation options.
> The first two methods are implemented in the SOLR-236 patch. For the first 
> option it calculates a DocSet based on the individual documents from the 
> query result. For the second option it calculates a DocSet for all the most 
> relevant documents of a group. Once the DocSet is computed, the FacetComponent 
> and StatsComponent use the DocSet to create facets and statistics.  
> The last one is a bit more complex. I think it is best explained with an 
> example. Let's say we search on travel offers:
> ||hotel||departure_airport||duration||
> |Hotel a|AMS|5|
> |Hotel a|DUS|10|
> |Hotel b|AMS|5|
> |Hotel b|AMS|10|
> If we group by hotel and have a facet for airport, most end users expect 
> (in my experience, of course) the following airport facet:
> AMS: 2
> DUS: 1
> The above result can't be achieved by the first two methods. You either get 
> counts AMS:3 and DUS:1, or 1 for both airports.
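
To make the "matrix counts" idea concrete, here is a tiny self-contained sketch 
(illustrative only; the hard-coded data mirrors the table above and this is not 
code from any patch): each distinct (group, facet) pair is counted once, which 
yields AMS=2, DUS=1.

{noformat}
import java.util.*;

// Illustrative only: matrix counts over the hotel example above.
class MatrixCountsDemo {
  public static void main(String[] args) {
    String[][] offers = { // {hotel, departure_airport}
      {"Hotel a", "AMS"}, {"Hotel a", "DUS"}, {"Hotel b", "AMS"}, {"Hotel b", "AMS"}
    };
    Set<String> seenPairs = new HashSet<String>();
    Map<String, Integer> counts = new LinkedHashMap<String, Integer>();
    for (String[] o : offers) {
      if (seenPairs.add(o[0] + "|" + o[1])) { // count each (group, facet) pair once
        Integer c = counts.get(o[1]);
        counts.put(o[1], c == null ? 1 : c + 1);
      }
    }
    System.out.println(counts); // prints {AMS=2, DUS=1}
  }
}
{noformat}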




[jira] [Issue Comment Edited] (SOLR-2445) unknown handler: standard

2011-05-16 Thread Gabriele Kahlout (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1303#comment-1303
 ] 

Gabriele Kahlout edited comment on SOLR-2445 at 5/16/11 8:48 PM:
-

trivial patch to form.jsp that leaves qt empty (useful for setup scripts and 
those that need to stick to a 3.1.0 revision).

  was (Author: simpatico):
trivial patch to form.jsp that leaves qt empty (useful for setup scripts 
and those that need to stick to an 3.1.0 revision).
  
> unknown handler: standard
> -
>
> Key: SOLR-2445
> URL: https://issues.apache.org/jira/browse/SOLR-2445
> Project: Solr
>  Issue Type: Bug
>Affects Versions: 1.4.1, 3.1, 3.2, 4.0
>Reporter: Koji Sekiguchi
>Assignee: Koji Sekiguchi
>Priority: Minor
> Fix For: 3.2, 4.0
>
> Attachments: SOLR-2445.patch, qt-form-jsp.patch
>
>
> To reproduce the problem using the example config, go to form.jsp, use 
> standard for qt (it is the default), then click Search.




[jira] [Commented] (LUCENE-3096) MultiSearcher does not work correctly with Not on NumericRange

2011-05-16 Thread hao yan (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034289#comment-13034289
 ] 

hao yan commented on LUCENE-3096:
-

Thanks! Uwe!



> MultiSearcher does not work correctly with Not on NumericRange
> --
>
> Key: LUCENE-3096
> URL: https://issues.apache.org/jira/browse/LUCENE-3096
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: core/search
>Affects Versions: 3.0.2
>Reporter: John Wang
> Fix For: 3.1
>
>
> Hi, Keith
> My colleague xiaoyang and I just confirmed that this is actually due to a 
> Lucene bug in MultiSearcher. In particular, if we search with NOT on a 
> NumericRange and we use MultiSearcher, we get wrong search results (however, 
> if we use IndexSearcher, the result is correct). Basically, the NOT of the 
> NumericRange has no effect under MultiSearcher. We suspect it is because of 
> the createWeight() function in MultiSearcher, and hope you can help us fix 
> this bug in Lucene. I attached the code to reproduce this case. Please check 
> it out.
> In the attached code, I have two separate functions :
> (1) testNumericRangeSingleSearcher(Query query)
> where I create 6 documents, with a field called "id"= 1,2,3,4,5,6
> respectively . Then I search by the query which is
> +MatchAllDocs -NumericRange(3,3). The expected result then should
> be 5 hits since the document 3 is MUST_NOT.
> (2) testNumericRangeMultiSearcher(Query query)
> where i create 2 RamDirectory(), each of which has 3 documents,
> 1,2,3; and 4,5,6. Then I search by the same query as above using
> multiSearcher. The expected result should also be 5 hits.
> However, from (1), we get 5 hits = expected results, while in (2) we
> get 6 hits != expected results.
> We also experimented with this using our zoie/bobo open source tools and got
> the same results, because our multi-bobo-browser is built on
> MultiSearcher in Lucene.
> I already emailed the lucene community group. Hopefully we can get some 
> feedback soon.
> If you have any further concern, pls let me know! 
> Thank you very much!
> Code:  (based on lucene 3.0.x)
> import java.io.IOException;
> import java.io.PrintStream;
> import java.text.DecimalFormat;
> import org.apache.lucene.analysis.WhitespaceAnalyzer;
> import org.apache.lucene.document.Document;
> import org.apache.lucene.document.Field;
> import org.apache.lucene.document.NumericField;
> import org.apache.lucene.index.CorruptIndexException;
> import org.apache.lucene.index.IndexWriter;
> import org.apache.lucene.index.Term;
> import org.apache.lucene.search.BooleanQuery;
> import org.apache.lucene.search.FieldCache;
> import org.apache.lucene.search.IndexSearcher;
> import org.apache.lucene.search.MatchAllDocsQuery;
> import org.apache.lucene.search.MultiSearcher;
> import org.apache.lucene.search.NumericRangeQuery;
> import org.apache.lucene.search.Query;
> import org.apache.lucene.search.ScoreDoc;
> import org.apache.lucene.search.Searchable;
> import org.apache.lucene.search.Sort;
> import org.apache.lucene.search.SortField;
> import org.apache.lucene.search.TermQuery;
> import org.apache.lucene.search.TopDocs;
> import org.apache.lucene.search.BooleanClause.Occur;
> import org.apache.lucene.store.Directory;
> import org.apache.lucene.store.LockObtainFailedException;
> import org.apache.lucene.store.RAMDirectory;
> import com.convertlucene.ConvertFrom2To3;
> public class TestNumericRange
> {
>  public final static void main(String[] args)
>  {
>try
>{
>  BooleanQuery query = new  BooleanQuery();
>  query.add(NumericRangeQuery.newIntRange("numId", 3, 3, true,
> true), Occur.MUST_NOT);
>  query.add(new MatchAllDocsQuery(), Occur.MUST);
>  testNumericRangeSingleSearcher(query);
>  testNumericRangeMultiSearcher(query);
>}
>catch(Exception e)
>{
>  e.printStackTrace();
>}
>  }
>  public static void testNumericRangeSingleSearcher(Query query)
> throws CorruptIndexException, LockObtainFailedException, IOException
>  {
> String[] ids = {"1", "2", "3", "4", "5", "6"};
>Directory directory = new RAMDirectory();
>IndexWriter writer = new IndexWriter(directory, new
> WhitespaceAnalyzer(),  IndexWriter.MaxFieldLength.UNLIMITED);
>for (int i = 0; i < ids.length; i++)
>{
>  Document doc = new Document();
>  doc.add(new Field("id", ids[i],
>Field.Store.YES,
>Field.Index.NOT_ANALYZED));
>  doc.add(new NumericField("numId").setIntValue(Integer.valueOf(ids[i])));
>  writer.addDocument(doc);
>}
>writer.close();
>IndexSearcher searcher = new IndexSearcher(directory);
>TopDocs docs = searcher.search(query, 10);
>System.out.println("SingleSearcher: testNumericRange: hitNum: " +
> docs.totalHits);
>f

[jira] [Updated] (LUCENE-3102) Few issues with CachingCollector

2011-05-16 Thread Shai Erera (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shai Erera updated LUCENE-3102:
---

Component/s: (was: modules/grouping)
 core/search

> Few issues with CachingCollector
> 
>
> Key: LUCENE-3102
> URL: https://issues.apache.org/jira/browse/LUCENE-3102
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: core/search
>Reporter: Shai Erera
>Assignee: Shai Erera
>Priority: Minor
> Fix For: 3.2, 4.0
>
> Attachments: LUCENE-3102.patch, LUCENE-3102.patch
>
>
> CachingCollector (introduced in LUCENE-1421) has a few issues:
> # Since the wrapped Collector may support out-of-order collection, the 
> document IDs cached may be out-of-order (depends on the Query) and thus 
> replay(Collector) will forward document IDs out-of-order to a Collector that 
> may not support it.
> # It does not clear cachedScores + cachedSegs upon exceeding RAM limits
> # I think that instead of comparing curScores to null, in order to determine 
> if scores are requested, we should have a specific boolean - for clarity
> # This check "if (base + nextLength > maxDocsToCache)" (line 168) can be 
> relaxed? E.g., what if nextLength is, say, 512K, and I cannot satisfy the 
> maxDocsToCache constraint, but if it was 10K I would? Wouldn't we still want 
> to try and cache them?
> Also:
> * The TODO in line 64 (having Collector specify needsScores()) -- why do we 
> need that if CachingCollector ctor already takes a boolean "cacheScores"? I 
> think it's better defined explicitly than implicitly?
> * Let's introduce a factory method for creating a specialized version if 
> scoring is requested / not (i.e., impl the TODO in line 189)
> * I think it's a useful collector, which stands on its own and not specific 
> to grouping. Can we move it to core?
> * How about using OpenBitSet instead of int[] for doc IDs?
> ** If the number of hits is big, we'd gain some RAM back, and be able to 
> cache more entries
> ** NOTE: OpenBitSet can be used for in-order collection only, so we can 
> use it if the wrapped Collector does not support out-of-order
> * Do you think we can modify this Collector to not necessarily wrap another 
> Collector? We have such Collector which stores (in-memory) all matching doc 
> IDs + scores (if required). Those are later fed into several processes that 
> operate on them (e.g. fetch more info from the index etc.). I am thinking, we 
> can make CachingCollector *optionally* wrap another Collector and then 
> someone can reuse it by setting RAM limit to unlimited (we should have a 
> constant for that) in order to simply collect all matching docs + scores.
> * I think a set of dedicated unit tests for this class alone would be good.
> That's it so far. Perhaps, if we do all of the above, more things will pop up.




[jira] [Assigned] (LUCENE-3102) Few issues with CachingCollector

2011-05-16 Thread Shai Erera (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shai Erera reassigned LUCENE-3102:
--

Assignee: Shai Erera

> Few issues with CachingCollector
> 
>
> Key: LUCENE-3102
> URL: https://issues.apache.org/jira/browse/LUCENE-3102
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: core/search
>Reporter: Shai Erera
>Assignee: Shai Erera
>Priority: Minor
> Fix For: 3.2, 4.0
>
> Attachments: LUCENE-3102.patch, LUCENE-3102.patch
>
>
> CachingCollector (introduced in LUCENE-1421) has a few issues:
> # Since the wrapped Collector may support out-of-order collection, the 
> document IDs cached may be out-of-order (depends on the Query) and thus 
> replay(Collector) will forward document IDs out-of-order to a Collector that 
> may not support it.
> # It does not clear cachedScores + cachedSegs upon exceeding RAM limits
> # I think that instead of comparing curScores to null, in order to determine 
> if scores are requested, we should have a specific boolean - for clarity
> # This check "if (base + nextLength > maxDocsToCache)" (line 168) can be 
> relaxed? E.g., what if nextLength is, say, 512K, and I cannot satisfy the 
> maxDocsToCache constraint, but if it was 10K I would? Wouldn't we still want 
> to try and cache them?
> Also:
> * The TODO in line 64 (having Collector specify needsScores()) -- why do we 
> need that if CachingCollector ctor already takes a boolean "cacheScores"? I 
> think it's better defined explicitly than implicitly?
> * Let's introduce a factory method for creating a specialized version if 
> scoring is requested / not (i.e., impl the TODO in line 189)
> * I think it's a useful collector, which stands on its own and not specific 
> to grouping. Can we move it to core?
> * How about using OpenBitSet instead of int[] for doc IDs?
> ** If the number of hits is big, we'd gain some RAM back, and be able to 
> cache more entries
> ** NOTE: OpenBitSet can be used for in-order collection only, so we can 
> use it if the wrapped Collector does not support out-of-order
> * Do you think we can modify this Collector to not necessarily wrap another 
> Collector? We have such Collector which stores (in-memory) all matching doc 
> IDs + scores (if required). Those are later fed into several processes that 
> operate on them (e.g. fetch more info from the index etc.). I am thinking, we 
> can make CachingCollector *optionally* wrap another Collector and then 
> someone can reuse it by setting RAM limit to unlimited (we should have a 
> constant for that) in order to simply collect all matching docs + scores.
> * I think a set of dedicated unit tests for this class alone would be good.
> That's it so far. Perhaps, if we do all of the above, more things will pop up.




[jira] [Commented] (LUCENE-3102) Few issues with CachingCollector

2011-05-16 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034283#comment-13034283
 ] 

Shai Erera commented on LUCENE-3102:


Committed revision 1103870 (3x).
Committed revision 1103872 (trunk).

What's committed:
* Move CachingCollector to core
* Fix bugs
* Add TestCachingCollector
* Some refactoring

Moving on to next proposed changes.

> Few issues with CachingCollector
> 
>
> Key: LUCENE-3102
> URL: https://issues.apache.org/jira/browse/LUCENE-3102
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: modules/grouping
>Reporter: Shai Erera
>Priority: Minor
> Fix For: 3.2, 4.0
>
> Attachments: LUCENE-3102.patch, LUCENE-3102.patch
>
>
> CachingCollector (introduced in LUCENE-1421) has a few issues:
> # Since the wrapped Collector may support out-of-order collection, the 
> document IDs cached may be out-of-order (depends on the Query) and thus 
> replay(Collector) will forward document IDs out-of-order to a Collector that 
> may not support it.
> # It does not clear cachedScores + cachedSegs upon exceeding RAM limits
> # I think that instead of comparing curScores to null, in order to determine 
> if scores are requested, we should have a specific boolean - for clarity
> # This check "if (base + nextLength > maxDocsToCache)" (line 168) can be 
> relaxed? E.g., what if nextLength is, say, 512K, and I cannot satisfy the 
> maxDocsToCache constraint, but if it was 10K I would? Wouldn't we still want 
> to try and cache them?
> Also:
> * The TODO in line 64 (having Collector specify needsScores()) -- why do we 
> need that if CachingCollector ctor already takes a boolean "cacheScores"? I 
> think it's better defined explicitly than implicitly?
> * Let's introduce a factory method for creating a specialized version if 
> scoring is requested / not (i.e., impl the TODO in line 189)
> * I think it's a useful collector, which stands on its own and not specific 
> to grouping. Can we move it to core?
> * How about using OpenBitSet instead of int[] for doc IDs?
> ** If the number of hits is big, we'd gain some RAM back, and be able to 
> cache more entries
> ** NOTE: OpenBitSet can be used for in-order collection only, so we can 
> use it if the wrapped Collector does not support out-of-order
> * Do you think we can modify this Collector to not necessarily wrap another 
> Collector? We have such Collector which stores (in-memory) all matching doc 
> IDs + scores (if required). Those are later fed into several processes that 
> operate on them (e.g. fetch more info from the index etc.). I am thinking, we 
> can make CachingCollector *optionally* wrap another Collector and then 
> someone can reuse it by setting RAM limit to unlimited (we should have a 
> constant for that) in order to simply collect all matching docs + scores.
> * I think a set of dedicated unit tests for this class alone would be good.
> That's it so far. Perhaps, if we do all of the above, more things will pop up.
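
As a concrete illustration of the proposed factory method from the list above, 
here is a hedged sketch; the subclass names are made up and this is not the 
committed code:

{noformat}
// Hypothetical shape of the proposed factory (illustrative only): pick a
// score-caching or non-caching specialization up front, instead of testing
// curScores against null on every collect() call.
public static CachingCollector create(Collector other,
                                      boolean cacheScores,
                                      double maxRAMMB) {
  return cacheScores
      ? new ScoreCachingCollector(other, maxRAMMB)    // made-up subclass name
      : new NoScoreCachingCollector(other, maxRAMMB); // made-up subclass name
}
{noformat}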




[jira] [Commented] (LUCENE-3098) Grouped total count

2011-05-16 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034247#comment-13034247
 ] 

Michael McCandless commented on LUCENE-3098:


Thanks Martijn!!  But, in general, you don't have to do the 3.x backport ;)  I 
can do it too...

We want to minimize the effort for people to contribute to Lucene/Solr!

But thank you for backporting!

> Grouped total count
> ---
>
> Key: LUCENE-3098
> URL: https://issues.apache.org/jira/browse/LUCENE-3098
> Project: Lucene - Java
>  Issue Type: New Feature
>Reporter: Martijn van Groningen
>Assignee: Michael McCandless
> Fix For: 3.2, 4.0
>
> Attachments: LUCENE-3098-3x.patch, LUCENE-3098-3x.patch, 
> LUCENE-3098.patch, LUCENE-3098.patch, LUCENE-3098.patch, LUCENE-3098.patch, 
> LUCENE-3098.patch
>
>
> When grouping, currently you can get two counts:
> * Total hit count, which counts all documents that matched the query.
> * Total grouped hit count, which counts all documents that have been grouped 
> in the top N groups.
> Since the end user gets groups in his search result instead of plain 
> documents, the total number of groups as the total count makes more 
> sense in many situations. 




[jira] [Commented] (LUCENE-3092) NRTCachingDirectory, to buffer small segments in a RAMDir

2011-05-16 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034242#comment-13034242
 ] 

Simon Willnauer commented on LUCENE-3092:
-

Mike, I attached a patch to LUCENE-3100 and tested with the latest patch on this 
issue. The test randomly fails (after I close the IW in the test!). Here is a 
trace:

{noformat}

junit-sequential:
[junit] Testsuite: org.apache.lucene.store.TestNRTCachingDirectory
[junit] Tests run: 1, Failures: 1, Errors: 0, Time elapsed: 5.16 sec
[junit] 
[junit] - Standard Error -
[junit] NOTE: reproduce with: ant test -Dtestcase=TestNRTCachingDirectory 
-Dtestmethod=testNRTAndCommit 
-Dtests.seed=-753565914717395747:-1817581638532977526
[junit] NOTE: test params are: codec=RandomCodecProvider: 
{docid=SimpleText, body=MockFixedIntBlock(blockSize=1993), 
title=Pulsing(freqCutoff=3), titleTokenized=MockSep, date=SimpleText}, 
locale=ar_AE, timezone=America/Santa_Isabel
[junit] NOTE: all tests run in this JVM:
[junit] [TestNRTCachingDirectory]
[junit] NOTE: Mac OS X 10.6.7 x86_64/Apple Inc. 1.6.0_24 
(64-bit)/cpus=2,threads=1,free=46213552,total=85000192
[junit] -  ---
[junit] Testcase: 
testNRTAndCommit(org.apache.lucene.store.TestNRTCachingDirectory):FAILED
[junit] limit=12 actual=16
[junit] junit.framework.AssertionFailedError: limit=12 actual=16
[junit] at 
org.apache.lucene.index.RandomIndexWriter.doRandomOptimize(RandomIndexWriter.java:165)
[junit] at 
org.apache.lucene.index.RandomIndexWriter.close(RandomIndexWriter.java:199)
[junit] at 
org.apache.lucene.store.TestNRTCachingDirectory.testNRTAndCommit(TestNRTCachingDirectory.java:179)
[junit] at 
org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1282)
[junit] at 
org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1211)
[junit] 
[junit] 
[junit] Test org.apache.lucene.store.TestNRTCachingDirectory FAILED
{noformat}

> NRTCachingDirectory, to buffer small segments in a RAMDir
> -
>
> Key: LUCENE-3092
> URL: https://issues.apache.org/jira/browse/LUCENE-3092
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: core/store
>Reporter: Michael McCandless
>Priority: Minor
> Fix For: 3.2, 4.0
>
> Attachments: LUCENE-3092-listener.patch, LUCENE-3092.patch, 
> LUCENE-3092.patch, LUCENE-3092.patch
>
>
> I created this simple Directory impl, whose goal is to reduce IO
> contention in a frequent-reopen NRT use case.
> The idea is, when reopening quickly but not indexing that much
> content, you wind up with many small files created over time, which can
> stress the IO system, e.g. if merges and searching are also
> fighting for IO.
> So NRTCachingDirectory puts these newly created files into a RAMDir,
> and only when they are merged into a too-large segment does it then
> write through to the real (delegate) directory.
> This lets you spend some RAM to reduce IO.
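
For orientation, a minimal usage sketch (illustrative; the constructor arguments 
are assumptions based on the description above, not necessarily the attached 
patch's exact API):

{noformat}
// Illustrative only: wrap the real directory; small new files live in RAM
// until they are merged into a segment too large to cache.
Directory fsDir = FSDirectory.open(new File("/path/to/index"));
Directory dir = new NRTCachingDirectory(fsDir, 5.0, 60.0); // assumed: maxMergeSizeMB, maxCachedMB
IndexWriter writer = new IndexWriter(dir,
    new IndexWriterConfig(Version.LUCENE_40, new WhitespaceAnalyzer(Version.LUCENE_40)));
{noformat}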




[jira] [Updated] (LUCENE-3100) IW.commit() writes but fails to fsync the N.fnx file

2011-05-16 Thread Simon Willnauer (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Simon Willnauer updated LUCENE-3100:


Attachment: LUCENE-3100.patch

Here is a patch fsync'ing the file after a successful write during prepareCommit.
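
The idea, as a hedged fragment (illustrative, not the attached patch; the 
variable holding the N.fnx file name is made up):

{noformat}
// Illustrative only: make sure the newly written N.fnx file is part of the
// set of files that get fsync'ed before the commit is declared durable.
Collection<String> toSync = new HashSet<String>(segmentInfos.files(directory, true));
toSync.add(globalFieldNumbersFileName); // made-up variable for the N.fnx name
for (String fileName : toSync) {
  directory.sync(fileName); // force the bytes onto stable storage
}
{noformat}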

> IW.commit() writes but fails to fsync the N.fnx file
> 
>
> Key: LUCENE-3100
> URL: https://issues.apache.org/jira/browse/LUCENE-3100
> Project: Lucene - Java
>  Issue Type: Bug
>Reporter: Michael McCandless
>Assignee: Simon Willnauer
> Fix For: 4.0
>
> Attachments: LUCENE-3100.patch
>
>
> In making a unit test for NRTCachingDir (LUCENE-3092) I hit this surprising 
> bug!
> Because the new N.fnx file is written at the "last minute" along with the 
> segments file, it's not included in the sis.files() that IW uses to figure 
> out which files to sync.
> This bug means one could call IW.commit(), successfully, return, and then the 
> machine could crash and when it comes back up your index could be corrupted.
> We should hopefully first fix TestCrash so that it hits this bug (maybe it 
> needs more/better randomization?), then fix the bug.




[jira] [Updated] (LUCENE-3098) Grouped total count

2011-05-16 Thread Martijn van Groningen (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Martijn van Groningen updated LUCENE-3098:
--

Attachment: LUCENE-3098-3x.patch

Great! Attached the 3x backport.

> Grouped total count
> ---
>
> Key: LUCENE-3098
> URL: https://issues.apache.org/jira/browse/LUCENE-3098
> Project: Lucene - Java
>  Issue Type: New Feature
>Reporter: Martijn van Groningen
>Assignee: Michael McCandless
> Fix For: 3.2, 4.0
>
> Attachments: LUCENE-3098-3x.patch, LUCENE-3098-3x.patch, 
> LUCENE-3098.patch, LUCENE-3098.patch, LUCENE-3098.patch, LUCENE-3098.patch, 
> LUCENE-3098.patch
>
>
> When grouping, currently you can get two counts:
> * Total hit count, which counts all documents that matched the query.
> * Total grouped hit count, which counts all documents that have been grouped 
> in the top N groups.
> Since the end user gets groups in his search result instead of plain 
> documents, the total number of groups as the total count makes more 
> sense in many situations. 




[jira] [Commented] (LUCENE-3103) create a simple test that indexes and searches byte[] terms

2011-05-16 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034224#comment-13034224
 ] 

Michael McCandless commented on LUCENE-3103:


+1 -- this is a great test to add, now that we support arbitrary binary terms.


> create a simple test that indexes and searches byte[] terms
> ---
>
> Key: LUCENE-3103
> URL: https://issues.apache.org/jira/browse/LUCENE-3103
> Project: Lucene - Java
>  Issue Type: Test
>  Components: general/test
>Reporter: Robert Muir
> Fix For: 4.0
>
> Attachments: LUCENE-3103.patch
>
>
> Currently, the only good test that does this is Test2BTerms (disabled by 
> default)
> I think we should test this capability, and also have a simpler example for 
> how to do this.




[jira] [Commented] (LUCENE-3103) create a simple test that indexes and searches byte[] terms

2011-05-16 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034220#comment-13034220
 ] 

Uwe Schindler commented on LUCENE-3103:
---

Reflection should work correctly. No need to change anything.

> create a simple test that indexes and searches byte[] terms
> ---
>
> Key: LUCENE-3103
> URL: https://issues.apache.org/jira/browse/LUCENE-3103
> Project: Lucene - Java
>  Issue Type: Test
>  Components: general/test
>Reporter: Robert Muir
> Fix For: 4.0
>
> Attachments: LUCENE-3103.patch
>
>
> Currently, the only good test that does this is Test2BTerms (disabled by 
> default)
> I think we should test this capability, and also have a simpler example for 
> how to do this.




[jira] [Commented] (LUCENE-3103) create a simple test that indexes and searches byte[] terms

2011-05-16 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034217#comment-13034217
 ] 

Robert Muir commented on LUCENE-3103:
-

One thing I did previously (seemed overkill, but maybe good to do) was to call 
clearAttributes() and setBytesRef() on each incrementToken(), more like a normal 
tokenizer. We could still change it to work like this; in this case clear() set 
the BytesRef to null.

Another thing to inspect is the reflection API, so that toString() prints the 
bytes... I didn't check this.
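
For reference, a minimal sketch of that pattern (illustrative; the attribute 
field and setBytesRef() here are assumptions, not necessarily the patch's API):

{noformat}
// Illustrative only, in the style of a normal tokenizer: clear all
// attributes, then hand the raw byte[] term to the (assumed) attribute.
public boolean incrementToken() {
  if (upto == terms.length) {
    return false;
  }
  clearAttributes();                   // like a normal tokenizer would
  bytesAtt.setBytesRef(terms[upto++]); // assumed attribute/method names
  return true;
}
{noformat}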


> create a simple test that indexes and searches byte[] terms
> ---
>
> Key: LUCENE-3103
> URL: https://issues.apache.org/jira/browse/LUCENE-3103
> Project: Lucene - Java
>  Issue Type: Test
>  Components: general/test
>Reporter: Robert Muir
> Fix For: 4.0
>
> Attachments: LUCENE-3103.patch
>
>
> Currently, the only good test that does this is Test2BTerms (disabled by 
> default)
> I think we should test this capability, and also have a simpler example for 
> how to do this.




[jira] [Assigned] (LUCENE-3098) Grouped total count

2011-05-16 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless reassigned LUCENE-3098:
--

Assignee: Michael McCandless

> Grouped total count
> ---
>
> Key: LUCENE-3098
> URL: https://issues.apache.org/jira/browse/LUCENE-3098
> Project: Lucene - Java
>  Issue Type: New Feature
>Reporter: Martijn van Groningen
>Assignee: Michael McCandless
> Fix For: 3.2, 4.0
>
> Attachments: LUCENE-3098-3x.patch, LUCENE-3098.patch, 
> LUCENE-3098.patch, LUCENE-3098.patch, LUCENE-3098.patch, LUCENE-3098.patch
>
>
> When grouping, currently you can get two counts:
> * Total hit count, which counts all documents that matched the query.
> * Total grouped hit count, which counts all documents that have been grouped 
> in the top N groups.
> Since the end user gets groups in his search result instead of plain 
> documents, the total number of groups as the total count makes more 
> sense in many situations. 




[jira] [Commented] (LUCENE-3098) Grouped total count

2011-05-16 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034214#comment-13034214
 ] 

Michael McCandless commented on LUCENE-3098:


Looks great Martijn!

I'll commit in a day or two if nobody objects...

> Grouped total count
> ---
>
> Key: LUCENE-3098
> URL: https://issues.apache.org/jira/browse/LUCENE-3098
> Project: Lucene - Java
>  Issue Type: New Feature
>Reporter: Martijn van Groningen
> Fix For: 3.2, 4.0
>
> Attachments: LUCENE-3098-3x.patch, LUCENE-3098.patch, 
> LUCENE-3098.patch, LUCENE-3098.patch, LUCENE-3098.patch, LUCENE-3098.patch
>
>
> When grouping, currently you can get two counts:
> * Total hit count, which counts all documents that matched the query.
> * Total grouped hit count, which counts all documents that have been grouped 
> in the top N groups.
> Since the end user gets groups in his search result instead of plain 
> documents, the total number of groups as the total count makes more 
> sense in many situations. 




[jira] [Commented] (SOLR-2519) Improve the defaults for the "text" field type in default schema.xml

2011-05-16 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034203#comment-13034203
 ] 

Robert Muir commented on SOLR-2519:
---

As someone frustrated by this (but who would ultimately like to move past it 
and try to help with Solr's intl), I just wanted to say +1 to Hoss Man's 
proposal.

My only suggestion is that I would greatly prefer text_en over text_western or 
whatever, for these reasons:
1. The stemming and stopwords and crap here are English.
2. For other western languages, even if you swap these out for, say, French or 
Italian (which is the seemingly obvious way to cut over), the whole 
WDF+autophrase combination is still a huge trap (see 
http://www.hathitrust.org/blogs/large-scale-search/tuning-search-performance 
for an example); in that case the ElisionFilter can be used to avoid it.

> Improve the defaults for the "text" field type in default schema.xml
> 
>
> Key: SOLR-2519
> URL: https://issues.apache.org/jira/browse/SOLR-2519
> Project: Solr
>  Issue Type: Bug
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Fix For: 3.2, 4.0
>
> Attachments: SOLR-2519.patch
>
>
> Spinoff from: http://lucene.markmail.org/thread/ww6mhfi3rfpngmc5
> The text fieldType in schema.xml is unusable for non-whitespace
> languages, because it has the dangerous auto-phrase feature (of
> Lucene's QP -- see LUCENE-2458) enabled.
> Lucene leaves this off by default, as does ElasticSearch 
> (http://www.elasticsearch.org/).
> Furthermore, the "text" fieldType uses WhitespaceTokenizer when
> StandardTokenizer is a better cross-language default.
> Until we have language specific field types, I think we should fix
> the "text" fieldType to work well for all languages, by:
>   * Switching from WhitespaceTokenizer to StandardTokenizer
>   * Turning off auto-phrase




[jira] [Updated] (LUCENE-3098) Grouped total count

2011-05-16 Thread Martijn van Groningen (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Martijn van Groningen updated LUCENE-3098:
--

Attachment: LUCENE-3098.patch

Attached a new patch.

* Renamed TotalGroupCountCollector to AllGroupsCollector. This rename better 
reflects what the collector actually does.
* Group values are now collected in an ArrayList instead of a LinkedList. The 
initialSize is now also used for the ArrayList.

> Grouped total count
> ---
>
> Key: LUCENE-3098
> URL: https://issues.apache.org/jira/browse/LUCENE-3098
> Project: Lucene - Java
>  Issue Type: New Feature
>Reporter: Martijn van Groningen
> Fix For: 3.2, 4.0
>
> Attachments: LUCENE-3098-3x.patch, LUCENE-3098.patch, 
> LUCENE-3098.patch, LUCENE-3098.patch, LUCENE-3098.patch, LUCENE-3098.patch
>
>
> When grouping, currently you can get two counts:
> * Total hit count, which counts all documents that matched the query.
> * Total grouped hit count, which counts all documents that have been grouped 
> in the top N groups.
> Since the end user gets groups in his search result instead of plain 
> documents, the total number of groups as the total count makes more 
> sense in many situations. 




[jira] [Commented] (SOLR-2520) JSONResponseWriter w/json.wrf can produce invalid javascript depending on unicode chars in response data

2011-05-16 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034197#comment-13034197
 ] 

Yonik Seeley commented on SOLR-2520:


It looks like we already escape \u2028 (see SOLR-1936), so we should just do 
the same for \u2029?
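
For illustration, a hedged sketch of that escaping (not the actual patch; note 
that U+2028/U+2029 cannot appear as literal unicode escapes in Java source, 
since the compiler pre-processes them into real line terminators):

{noformat}
// Illustrative only: escape U+2028 (LINE SEPARATOR) and U+2029 (PARAGRAPH
// SEPARATOR), which are legal inside JSON strings but illegal in JavaScript
// source, so a json.wrf-wrapped response stays valid JavaScript.
static String escapeJsLineTerminators(String s) {
  StringBuilder out = new StringBuilder(s.length());
  for (int i = 0; i < s.length(); i++) {
    char c = s.charAt(i);
    if (c == 0x2028)      out.append("\\u2028");
    else if (c == 0x2029) out.append("\\u2029");
    else                  out.append(c);
  }
  return out.toString();
}
{noformat}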

> JSONResponseWriter w/json.wrf can produce invalid javascript depending on 
> unicode chars in response data
> 
>
> Key: SOLR-2520
> URL: https://issues.apache.org/jira/browse/SOLR-2520
> Project: Solr
>  Issue Type: Bug
>Affects Versions: 4.0
>Reporter: Benson Margulies
>
> Please see http://timelessrepo.com/json-isnt-a-javascript-subset.
> If a stored field contains invalid Javascript characters, and you use the 
> query option to ask for jsonp, solr does *not* escape some invalid Unicode 
> characters, resulting in strings that explode on contact with browsers.




[jira] [Updated] (LUCENE-3103) create a simple test that indexes and searches byte[] terms

2011-05-16 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-3103:


Attachment: LUCENE-3103.patch

Attached is a first patch... maybe Uwe won't be able to resist rewriting it to 
make it simpler :)

> create a simple test that indexes and searches byte[] terms
> ---
>
> Key: LUCENE-3103
> URL: https://issues.apache.org/jira/browse/LUCENE-3103
> Project: Lucene - Java
>  Issue Type: Test
>  Components: general/test
>Reporter: Robert Muir
> Fix For: 4.0
>
> Attachments: LUCENE-3103.patch
>
>
> Currently, the only good test that does this is Test2BTerms (disabled by 
> default)
> I think we should test this capability, and also have a simpler example for 
> how to do this.




[jira] [Created] (LUCENE-3103) create a simple test that indexes and searches byte[] terms

2011-05-16 Thread Robert Muir (JIRA)
create a simple test that indexes and searches byte[] terms
---

 Key: LUCENE-3103
 URL: https://issues.apache.org/jira/browse/LUCENE-3103
 Project: Lucene - Java
  Issue Type: Test
  Components: general/test
Reporter: Robert Muir
 Fix For: 4.0


Currently, the only good test that does this is Test2BTerms (disabled by 
default)

I think we should test this capability, and also have a simpler example for how 
to do this.





Re: Reorganizing JIRA components

2011-05-16 Thread Shai Erera
I renamed all current components, plus deleted two (contrib/analyzers and
contrib/wikipedia).

core/codecs
core/index
core/other
core/query/scoring
core/queryparser
core/search
core/store
core/termvectors
general/build
general/javadocs
general/test
general/website
modules/analysis
modules/benchmark
modules/examples
modules/grouping
modules/highlighter
modules/other
modules/queryparser
modules/spatial
modules/spellchecker

Shai

On Mon, May 16, 2011 at 7:07 AM, Mark Miller  wrote:

>
> On May 15, 2011, at 10:42 PM, Shai Erera wrote:
>
> > I was aiming at avoiding that scenario. I think every issue should be
> assigned to a specific component, and if there isn't one available, we
> should create it.
>
>
> Based on history and how these things normally go, unless you are planning
> on spending a *lot* of time curating JIRA for the foreseeable future, this is
> an unlikely outcome. Better categories will hopefully mean more compliance,
> but I'd bet the standard hodgepodge of JIRA submissions and curation is
> going to remain fairly similar to what we have seen. Version is a much more
> important field - and even it is not curated even close to this 'ideal'
> world level.
>
> I think every issue should be fully filled out, correctly filled out, cross
> linked with all relevant issues, etc, etc.
>
> But I don't plan on it being the normal scenario ;)
>
> FWIW: I fill out component sometimes, and other times I'm just not worried
> about it. Someone can always come along after us types and random users and
> clean up after them, but I surmise that won't last long.
>
> - Mark Miller
> lucidimagination.com
>
> Lucene/Solr User Conference
> May 25-26, San Francisco
> www.lucenerevolution.org
>
>
>
>
>
>
>
>


[jira] [Commented] (SOLR-2520) JSONResponseWriter w/json.wrf can produce invalid javascript depending on unicode chars in response data

2011-05-16 Thread Benson Margulies (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034187#comment-13034187
 ] 

Benson Margulies commented on SOLR-2520:


I'd vote for the latter. I assume that there is some large inventory of people 
who are currently using json.wrf=foo and who would benefit from the change. 
However, I have limited context here, so if anyone else knows more about how 
users are using this stuff I hope they will speak up. Sorry not to have been 
fully clear on the first attempt.


> JSONResponseWriter w/json.wrf can produce invalid javascript depending on 
> unicode chars in response data
> 
>
> Key: SOLR-2520
> URL: https://issues.apache.org/jira/browse/SOLR-2520
> Project: Solr
>  Issue Type: Bug
>Affects Versions: 4.0
>Reporter: Benson Margulies
>
> Please see http://timelessrepo.com/json-isnt-a-javascript-subset.
> If a stored field contains invalid Javascript characters, and you use the 
> query option to ask for jsonp, solr does *not* escape some invalid Unicode 
> characters, resulting in strings that explode on contact with browsers.




[jira] [Commented] (SOLR-2519) Improve the defaults for the "text" field type in default schema.xml

2011-05-16 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034185#comment-13034185
 ] 

Michael McCandless commented on SOLR-2519:
--

bq. Bottom line: it's less confusing to remove field types and add new ones 
with new names than to make radical changes to existing ones.

Ahh, this makes great sense!

I really like your proposal Hoss, and that's a great point about emails to the 
mailing lists.

So we'd have no more text fieldType.  Just text_en (what text now is) and 
text_general (basically just StandardAnalyzer, but maybe move/absorb "textgen" 
over).

Over time we can add in more language-specific text_XX fieldTypes...
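
A hedged sketch of what such a text_general fieldType might look like in 
schema.xml (illustrative only; the attribute and factory names follow Solr's 
existing conventions, but this is not the committed schema):

{noformat}
<!-- Illustrative only: StandardTokenizer, no auto-phrase, minimal filtering -->
<fieldType name="text_general" class="solr.TextField"
           positionIncrementGap="100" autoGeneratePhraseQueries="false">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
{noformat}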

> Improve the defaults for the "text" field type in default schema.xml
> 
>
> Key: SOLR-2519
> URL: https://issues.apache.org/jira/browse/SOLR-2519
> Project: Solr
>  Issue Type: Bug
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Fix For: 3.2, 4.0
>
> Attachments: SOLR-2519.patch
>
>
> Spinoff from: http://lucene.markmail.org/thread/ww6mhfi3rfpngmc5
> The text fieldType in schema.xml is unusable for non-whitespace
> languages, because it has the dangerous auto-phrase feature (of
> Lucene's QP -- see LUCENE-2458) enabled.
> Lucene leaves this off by default, as does ElasticSearch
> (http://www.elasticsearch.org/).
> Furthermore, the "text" fieldType uses WhitespaceTokenizer when
> StandardTokenizer is a better cross-language default.
> Until we have language specific field types, I think we should fix
> the "text" fieldType to work well for all languages, by:
>   * Switching from WhitespaceTokenizer to StandardTokenizer
>   * Turning off auto-phrase




[jira] [Updated] (SOLR-2520) JSONResponseWriter w/json.wrf can produce invalid javascript depending on unicode chars in response data

2011-05-16 Thread Hoss Man (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hoss Man updated SOLR-2520:
---

Summary: JSONResponseWriter w/json.wrf can produce invalid javascript 
depending on unicode chars in response data  (was: Solr creates invalid jsonp 
strings)

Benson: thanks for the clarification, I've updated the summary to attempt to 
clarify the root of the issue.

Would it make more sense to have a "JavascriptResponseWriter", or to have the 
JSONResponseWriter do unicode escaping/stripping if/when json.wrf is specified?

> JSONResponseWriter w/json.wrf can produce invalid javascript depending on 
> unicode chars in response data
> 
>
> Key: SOLR-2520
> URL: https://issues.apache.org/jira/browse/SOLR-2520
> Project: Solr
>  Issue Type: Bug
>Affects Versions: 4.0
>Reporter: Benson Margulies
>
> Please see http://timelessrepo.com/json-isnt-a-javascript-subset.
> If a stored field contains invalid Javascript characters, and you use the 
> query option to ask for jsonp, solr does *not* escape some invalid Unicode 
> characters, resulting in strings that explode on contact with browsers.




[jira] [Updated] (LUCENE-3102) Few issues with CachingCollector

2011-05-16 Thread Shai Erera (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shai Erera updated LUCENE-3102:
---

Component/s: (was: contrib/*)
 modules/grouping

> Few issues with CachingCollector
> 
>
> Key: LUCENE-3102
> URL: https://issues.apache.org/jira/browse/LUCENE-3102
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: modules/grouping
>Reporter: Shai Erera
>Priority: Minor
> Fix For: 3.2, 4.0
>
> Attachments: LUCENE-3102.patch, LUCENE-3102.patch
>
>
> CachingCollector (introduced in LUCENE-1421) has few issues:
> # Since the wrapped Collector may support out-of-order collection, the 
> document IDs cached may be out-of-order (depends on the Query) and thus 
> replay(Collector) will forward document IDs out-of-order to a Collector that 
> may not support it.
> # It does not clear cachedScores + cachedSegs upon exceeding RAM limits
> # I think that instead of comparing curScores to null, in order to determine 
> if scores are requested, we should have a specific boolean - for clarity
> # This check "if (base + nextLength > maxDocsToCache)" (line 168) can be 
> relaxed? E.g., what if nextLength is, say, 512K, and I cannot satisfy the 
> maxDocsToCache constraint, but if it was 10K I would? Wouldn't we still want 
> to try and cache them?
> Also:
> * The TODO in line 64 (having Collector specify needsScores()) -- why do we 
> need that if CachingCollector ctor already takes a boolean "cacheScores"? I 
> think it's better defined explicitly than implicitly?
> * Let's introduce a factory method for creating a specialized version if 
> scoring is requested / not (i.e., impl the TODO in line 189)
> * I think it's a useful collector, which stands on its own and not specific 
> to grouping. Can we move it to core?
> * How about using OpenBitSet instead of int[] for doc IDs?
> ** If the number of hits is big, we'd gain some RAM back, and be able to 
> cache more entries
> ** NOTE: OpenBitSet can only be used for in-order collection only. So we can 
> use that if the wrapped Collector does not support out-of-order
> * Do you think we can modify this Collector to not necessarily wrap another 
> Collector? We have such Collector which stores (in-memory) all matching doc 
> IDs + scores (if required). Those are later fed into several processes that 
> operate on them (e.g. fetch more info from the index etc.). I am thinking, we 
> can make CachingCollector *optionally* wrap another Collector and then 
> someone can reuse it by setting RAM limit to unlimited (we should have a 
> constant for that) in order to simply collect all matching docs + scores.
> * I think a set of dedicated unit tests for this class alone would be good.
> That's it so far. Perhaps, if we do all of the above, more things will pop up.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-2519) Improve the defaults for the "text" field type in default schema.xml

2011-05-16 Thread Hoss Man (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034176#comment-13034176
 ] 

Hoss Man commented on SOLR-2519:


bq. Also: existing users would be unaffected by this? They've already copied 
over / edited their own schema.xml? This is mainly about new users?

The trap we've seen with this type of thing in the past (ie: the numeric 
fields) is that people who tend to use the example configs w/o changing them 
much refer to the example field types by name when talking about them on the 
mailing list, not considering that those names can have different meanings 
depending on version.

if we make radical changes to a {{}} but leave the name alone, it 
could confuse a lot of people, ie: "i tried using the 'text' field but it 
didn't work"; "which version of solr are you using?"; "Solr 4.1"; "that should 
work, what exactly does your schema look like"; "..."; "that's the schema from 
3.6"; "yeah, i started with 3.6 nad then upgraded to 4.1 later", etc...

Bottom line: it's less confusing to *remove* {{}} and add new ones 
with new names than to make radical changes to existing ones.

> Improve the defaults for the "text" field type in default schema.xml
> 
>
> Key: SOLR-2519
> URL: https://issues.apache.org/jira/browse/SOLR-2519
> Project: Solr
>  Issue Type: Bug
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Fix For: 3.2, 4.0
>
> Attachments: SOLR-2519.patch
>
>
> Spinoff from: http://lucene.markmail.org/thread/ww6mhfi3rfpngmc5
> The text fieldType in schema.xml is unusable for non-whitespace
> languages, because it has the dangerous auto-phrase feature (of
> Lucene's QP -- see LUCENE-2458) enabled.
> Lucene leaves this off by default, as does ElasticSearch
> (http://www.elasticsearch.org/).
> Furthermore, the "text" fieldType uses WhitespaceTokenizer when
> StandardTokenizer is a better cross-language default.
> Until we have language specific field types, I think we should fix
> the "text" fieldType to work well for all languages, by:
>   * Switching from WhitespaceTokenizer to StandardTokenizer
>   * Turning off auto-phrase

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-2519) Improve the defaults for the "text" field type in default schema.xml

2011-05-16 Thread Hoss Man (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034172#comment-13034172
 ] 

Hoss Man commented on SOLR-2519:


I feel like we are conflating two issues here: the "default" behavior of 
TextField, and the example configs.

i don't have any strong opinions about changing the default behavior of 
TextField when {{autoGeneratePhraseQueries}} is not specified in the 
{{}}, but if we do make such a change, it should be contingent on 
the schema version property (which we should bump) so that people who upgrade 
will get consistent behavior with their existing configs (TextField.init 
already has an example of this for when we changed the default of {{omitNorms}}).
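
For illustration, a version-gated default might look roughly like this 
(a hypothetical sketch with assumed names and an assumed version cutoff, not 
an actual patch):

{noformat}
// inside a TextField-like FieldType subclass; schema.getVersion() and
// the 1.3 cutoff are assumptions for the sake of the example
protected void init(IndexSchema schema, Map<String,String> args) {
  if (schema.getVersion() > 1.3f) {
    autoGeneratePhraseQueries = false; // new default once the schema version is bumped
  } else {
    autoGeneratePhraseQueries = true;  // upgraded configs keep the old behavior
  }
  String apq = args.remove("autoGeneratePhraseQueries");
  if (apq != null) {
    autoGeneratePhraseQueries = Boolean.parseBoolean(apq); // explicit setting wins
  }
  super.init(schema, args);
}
{noformat}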

as far as the example configs: i agree with yonik that changing "text" at this 
point might be confusing ... i think the best way to iterate moving forward 
would probably be:

* rename {{}} and {{}} to something 
that makes their purpose more clear (text_en, or text_western, or 
text_european, or some other more general descriptive word for the types of 
languages where it makes sense) and switch all existing {{}} 
declarations that currently use field type "text" to use this new name.

* add a new {{}} which is designed (and 
documented) to be a general purpose field type for when the language is unknown 
(it may make sense to fix/repurpose the existing {{}} 
for this, since it already suggests that's what it's for)

* Audit all {{}} declarations that use "text_en" (or whatever name was 
chosen above) and the existing sample data for those fields to see if it makes 
more sense to change them to "text_general". also change any where, based on 
usage, it shouldn't matter.

The end result being that we have no {{}} named "text" in the 
example configs, so people won't get it confused with previous versions, and 
we'll have a new {{}} that works as well as possible with all 
languages, which we use as much as possible with the example data.
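
For concreteness, a rough sketch of what such a general {{}} 
declaration could look like (hypothetical, not a patch: StandardTokenizer, 
auto-phrase off, no WDF, no stemming):

{noformat}
<!-- hypothetical sketch of a language-agnostic "text_general" type -->
<fieldType name="text_general" class="solr.TextField"
           positionIncrementGap="100" autoGeneratePhraseQueries="false">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- deliberately no WordDelimiterFilter and no stemmer -->
  </analyzer>
</fieldType>
{noformat}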






> Improve the defaults for the "text" field type in default schema.xml
> 
>
> Key: SOLR-2519
> URL: https://issues.apache.org/jira/browse/SOLR-2519
> Project: Solr
>  Issue Type: Bug
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Fix For: 3.2, 4.0
>
> Attachments: SOLR-2519.patch
>
>
> Spinoff from: http://lucene.markmail.org/thread/ww6mhfi3rfpngmc5
> The text fieldType in schema.xml is unusable for non-whitespace
> languages, because it has the dangerous auto-phrase feature (of
> Lucene's QP -- see LUCENE-2458) enabled.
> Lucene leaves this off by default, as does ElasticSearch
> (http://www.elasticsearch.org/).
> Furthermore, the "text" fieldType uses WhitespaceTokenizer when
> StandardTokenizer is a better cross-language default.
> Until we have language specific field types, I think we should fix
> the "text" fieldType to work well for all languages, by:
>   * Switching from WhitespaceTokenizer to StandardTokenizer
>   * Turning off auto-phrase

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-3098) Grouped total count

2011-05-16 Thread Martijn van Groningen (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Martijn van Groningen updated LUCENE-3098:
--

Attachment: LUCENE-3098.patch

Attached patch with the discussed changes.
3x patch follows soon.

> Grouped total count
> ---
>
> Key: LUCENE-3098
> URL: https://issues.apache.org/jira/browse/LUCENE-3098
> Project: Lucene - Java
>  Issue Type: New Feature
>Reporter: Martijn van Groningen
> Fix For: 3.2, 4.0
>
> Attachments: LUCENE-3098-3x.patch, LUCENE-3098.patch, 
> LUCENE-3098.patch, LUCENE-3098.patch, LUCENE-3098.patch
>
>
> When grouping currently you can get two counts:
> * Total hit count. Which counts all documents that matched the query.
> * Total grouped hit count. Which counts all documents that have been grouped 
> in the top N groups.
> Since the end user gets groups in his search result instead of plain 
> documents with grouping. The total number of groups as total count makes more 
> sense in many situations. 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-2520) Solr creates invalid jsonp strings

2011-05-16 Thread Benson Margulies (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034159#comment-13034159
 ] 

Benson Margulies commented on SOLR-2520:


Fun happens when you specify something in json.wrf. This demands 'jsonp' 
instead of json, which results in the response being treated as javascript, not 
json.  wt=json&json.wrf=SOME_PREFIX will cause Solr to respond with

 SOME_PREFIX({whatever it was otherwise going to return})

instead of just

 {whatever it was otherwise going to return}

If there is then an interesting Unicode character in there, Chrome implodes and 
firefox quietly rejects.
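
For reference, the problem characters are U+2028 (LINE SEPARATOR) and U+2029 
(PARAGRAPH SEPARATOR): both are legal unescaped in JSON strings but illegal in 
javascript string literals. A minimal sketch of the kind of escaping a wrapped 
response would need (hypothetical helper, not Solr code):

{noformat}
// sketch only: escape the two code points that are valid in JSON strings
// but invalid in javascript string literals, so a json.wrf-wrapped
// (jsonp) response stays parseable in browsers
static String escapeForJavascript(String json) {
  StringBuilder sb = new StringBuilder(json.length());
  for (int i = 0; i < json.length(); i++) {
    char c = json.charAt(i);
    if (c == 0x2028) sb.append("\\u2028");      // LINE SEPARATOR
    else if (c == 0x2029) sb.append("\\u2029"); // PARAGRAPH SEPARATOR
    else sb.append(c);
  }
  return sb.toString();
}
{noformat}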



> Solr creates invalid jsonp strings
> --
>
> Key: SOLR-2520
> URL: https://issues.apache.org/jira/browse/SOLR-2520
> Project: Solr
>  Issue Type: Bug
>Affects Versions: 4.0
>Reporter: Benson Margulies
>
> Please see http://timelessrepo.com/json-isnt-a-javascript-subset.
> If a stored field contains invalid Javascript characters, and you use the 
> query option to ask for jsonp, solr does *not* escape some invalid Unicode 
> characters, resulting in strings that explode on contact with browsers.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-2519) Improve the defaults for the "text" field type in default schema.xml

2011-05-16 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034158#comment-13034158
 ] 

Michael McCandless commented on SOLR-2519:
--

It's also spooky that "text" fieldType has different index
time vs query time analyzers?  Ie, WDF is configured differently.

> Improve the defaults for the "text" field type in default schema.xml
> 
>
> Key: SOLR-2519
> URL: https://issues.apache.org/jira/browse/SOLR-2519
> Project: Solr
>  Issue Type: Bug
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Fix For: 3.2, 4.0
>
> Attachments: SOLR-2519.patch
>
>
> Spinoff from: http://lucene.markmail.org/thread/ww6mhfi3rfpngmc5
> The text fieldType in schema.xml is unusable for non-whitespace
> languages, because it has the dangerous auto-phrase feature (of
> Lucene's QP -- see LUCENE-2458) enabled.
> Lucene leaves this off by default, as does ElasticSearch
> (http://www.elasticsearch.org/).
> Furthermore, the "text" fieldType uses WhitespaceTokenizer when
> StandardTokenizer is a better cross-language default.
> Until we have language specific field types, I think we should fix
> the "text" fieldType to work well for all languages, by:
>   * Switching from WhitespaceTokenizer to StandardTokenizer
>   * Turning off auto-phrase

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-2519) Improve the defaults for the "text" field type in default schema.xml

2011-05-16 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034154#comment-13034154
 ] 

Michael McCandless commented on SOLR-2519:
--

bq. I think maybe there's a misconception that the fieldType named "text" was 
meant to be generic for all languages.

Regardless of what the original intention was, "text" today has become
the generic text fieldType new users use when starting with Solr.  I
mean, it has the perfect name for that :)

bq. As I said in the thread, if I had to do it over again, I would have named 
it "text_en" because that's what its purpose was.

Hindsight is 20/20... but, we can still fix this today.  We shouldn't
lock ourselves into poor defaults.

Especially, as things improve and we get better analyzers, etc., we
should be free to improve the defaults in schema.xml to take advantage
of these improvements.

bq. But at this point, it seems like the best way forward is to leave "text" as 
an english fieldType and simply add other fieldTypes that can support other 
languages.

I think this is a dangerous approach -- the name (ie, missing _en if
in fact it has such English-specific configuration) is misleading and
traps new users.

Ideally, in the future, we wouldn't even have a "text" fieldType, only
text_XX per-language examples and then maybe something like
text_general, which you use if you cannot find your language.

{quote}
Some downsides I see to this patch (i.e. trying to make the 'text' fieldType 
generic):

The current WordDelimiterFilter options on the fieldType feel like a trap for 
non-whitespace-delimited languages. WDF is configured to index catenations as 
well as splits... so all of the tokens (words?) that are split out are also 
catenated together and indexed (which seems like it could lead to some truly 
huge tokens erroneously being indexed.)
{quote}
Ahh good point.  I think we should remove WDF altogether from the
generic "text" fieldType.

{quote}
You left the english stemmer on the "text" fieldType... but if it's supposed to 
be generic, couldn't this be bad for some other western languages where it 
could cause stemming collisions of words not related to each other?
{quote}

+1, we should remove the stemming too from "text".

bq. Taking into account all the existing users (and all the existing 
documentation, examples, tutorial, etc), I favor a more conservative approach 
of adding new fieldTypes rather than radically changing the behavior of 
existing ones.

Can you point to specific examples (docs, examples, tutorial)?  I'd
like to understand how much work it is to fix these...

My feeling is we should simply do the work here (I'll sign up to it)
and fix any places that actually rely on the specifics of "text"
fieldType, eg autophrase.

We shouldn't avoid fixing things well because it's gonna be more work
today, especially if someone (me) is signing up to do it.

Also: existing users would be unaffected by this?  They've already
copied over / edited their own schema.xml?  This is mainly about new
users?


> Improve the defaults for the "text" field type in default schema.xml
> 
>
> Key: SOLR-2519
> URL: https://issues.apache.org/jira/browse/SOLR-2519
> Project: Solr
>  Issue Type: Bug
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Fix For: 3.2, 4.0
>
> Attachments: SOLR-2519.patch
>
>
> Spinoff from: http://lucene.markmail.org/thread/ww6mhfi3rfpngmc5
> The text fieldType in schema.xml is unusable for non-whitespace
> languages, because it has the dangerous auto-phrase feature (of
> Lucene's QP -- see LUCENE-2458) enabled.
> Lucene leaves this off by default, as does ElasticSearch
> (http://www.elasticsearch.org/).
> Furthermore, the "text" fieldType uses WhitespaceTokenizer when
> StandardTokenizer is a better cross-language default.
> Until we have language specific field types, I think we should fix
> the "text" fieldType to work well for all languages, by:
>   * Switching from WhitespaceTokenizer to StandardTokenizer
>   * Turning off auto-phrase

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-2520) Solr creates invalid jsonp strings

2011-05-16 Thread Hoss Man (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034151#comment-13034151
 ] 

Hoss Man commented on SOLR-2520:


I'm confused here: As far as i can tell, the JSONResponseWriter does in fact 
output valid JSON (the link mentioned points out that there are control 
characters valid in JSON which are not valid in javascript, but that's what the 
response writer produces -- JSON) ... so what is the bug?

And what do you mean by "the query option to ask for jsonp" ? ...  i don't see 
that option in the JSONResponseWriter

(is this bug about some third party response writer?)

> Solr creates invalid jsonp strings
> --
>
> Key: SOLR-2520
> URL: https://issues.apache.org/jira/browse/SOLR-2520
> Project: Solr
>  Issue Type: Bug
>Affects Versions: 4.0
>Reporter: Benson Margulies
>
> Please see http://timelessrepo.com/json-isnt-a-javascript-subset.
> If a stored field contains invalid Javascript characters, and you use the 
> query option to ask for jsonp, solr does *not* escape some invalid Unicode 
> characters, resulting in strings that explode on contact with browsers.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-2519) Improve the defaults for the "text" field type in default schema.xml

2011-05-16 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034120#comment-13034120
 ] 

Yonik Seeley commented on SOLR-2519:


I think maybe there's a misconception that the fieldType named "text" was meant 
to be generic for all languages.  As I said in the thread, if I had to do it 
over again, I would have named it "text_en" because that's what its purpose 
was.  But at this point, it seems like the best way forward is to leave "text" 
as an english fieldType and simply add other fieldTypes that can support other 
languages.

Some downsides I see to this patch (i.e. trying to make the 'text' fieldType 
generic):
- The current WordDelimiterFilter options on the fieldType feel like a trap for 
non-whitespace-delimited languages.  WDF is configured to index catenations as 
well as splits... so all of the tokens (words?) that are split out are also 
catenated together and indexed (which seems like it could lead to some truly 
huge tokens erroneously being indexed.)
- You left the english stemmer on the "text" fieldType... but if it's supposed 
to be generic, couldn't this be bad for some other western languages where it 
could cause stemming collisions of words not related to each other?

Taking into account all the existing users (and all the existing documentation, 
examples, tutorial, etc), I favor a more conservative approach of adding new 
fieldTypes rather than radically changing the behavior of existing ones.

Random question: what are the implications of changing from WhitespaceTokenizer 
to StandardTokenizer, esp w.r.t. WDF?
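
One way to see part of the difference, against the 3.x analysis API (a 
hypothetical snippet, not from the patch): WhitespaceTokenizer hands WDF whole 
punctuation-bearing tokens to split/catenate, while StandardTokenizer has 
already split them, so WDF sees quite different input.

{noformat}
import java.io.StringReader;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class TokenizerDiff {
  static void dump(TokenStream ts) throws Exception {
    CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
    ts.reset();
    while (ts.incrementToken()) System.out.print("[" + term + "] ");
    System.out.println();
    ts.close();
  }
  public static void main(String[] args) throws Exception {
    // WhitespaceTokenizer keeps "Wi-Fi" whole, so WDF gets to split/catenate it
    dump(new WhitespaceTokenizer(Version.LUCENE_31, new StringReader("Wi-Fi"))); // [Wi-Fi]
    // StandardTokenizer already splits on the hyphen; WDF never sees "Wi-Fi"
    dump(new StandardTokenizer(Version.LUCENE_31, new StringReader("Wi-Fi")));   // [Wi] [Fi]
  }
}
{noformat}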

> Improve the defaults for the "text" field type in default schema.xml
> 
>
> Key: SOLR-2519
> URL: https://issues.apache.org/jira/browse/SOLR-2519
> Project: Solr
>  Issue Type: Bug
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Fix For: 3.2, 4.0
>
> Attachments: SOLR-2519.patch
>
>
> Spinoff from: http://lucene.markmail.org/thread/ww6mhfi3rfpngmc5
> The text fieldType in schema.xml is unusable for non-whitespace
> languages, because it has the dangerous auto-phrase feature (of
> Lucene's QP -- see LUCENE-2458) enabled.
> Lucene leaves this off by default, as does ElasticSearch
> (http://www.elasticsearch.org/).
> Furthermore, the "text" fieldType uses WhitespaceTokenizer when
> StandardTokenizer is a better cross-language default.
> Until we have language specific field types, I think we should fix
> the "text" fieldType to work well for all languages, by:
>   * Switching from WhitespaceTokenizer to StandardTokenizer
>   * Turning off auto-phrase

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-3084) MergePolicy.OneMerge.segments should be List not SegmentInfos

2011-05-16 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler updated LUCENE-3084:
--

Attachment: LUCENE-3084-trunk-only.patch

New patch that also has BalancedMergePolicy from contrib refactored to the new 
API (sorry, that was missing).

> MergePolicy.OneMerge.segments should be List not SegmentInfos
> --
>
> Key: LUCENE-3084
> URL: https://issues.apache.org/jira/browse/LUCENE-3084
> Project: Lucene - Java
>  Issue Type: Improvement
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 3.2, 4.0
>
> Attachments: LUCENE-3084-trunk-only.patch, 
> LUCENE-3084-trunk-only.patch, LUCENE-3084-trunk-only.patch, 
> LUCENE-3084-trunk-only.patch, LUCENE-3084-trunk-only.patch, LUCENE-3084.patch
>
>
> SegmentInfos carries a bunch of fields beyond the list of SI, but for merging 
> purposes these fields are unused.
> We should cutover to List instead.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3090) DWFlushControl does not take active DWPT out of the loop on fullFlush

2011-05-16 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034103#comment-13034103
 ] 

Simon Willnauer commented on LUCENE-3090:
-

Thanks Mike for the review and testing!! It makes me feel better having those 
asserts in there now... I will commit tomorrow.

> DWFlushControl does not take active DWPT out of the loop on fullFlush
> -
>
> Key: LUCENE-3090
> URL: https://issues.apache.org/jira/browse/LUCENE-3090
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Index
>Affects Versions: 4.0
>Reporter: Simon Willnauer
>Assignee: Simon Willnauer
>Priority: Critical
> Fix For: 4.0
>
> Attachments: LUCENE-3090.patch, LUCENE-3090.patch, LUCENE-3090.patch
>
>
> We have seen several OOM on TestNRTThreads and all of them are caused by 
> DWFlushControl missing DWPT that are set as flushPending but can't flush due 
> to a full flush going on. Yet that means that those DWPT are filling up in 
> the background while they should actually be checked out and blocked until 
> the full flush finishes. Even further we currently stall on the 
> maxNumThreadStates while we should stall on the num of active thread states. 
> I will attach a patch tomorrow.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-2027) Deprecate Directory.touchFile

2011-05-16 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-2027:
---

Attachment: LUCENE-2027.patch

Patch, removing Dir.touchFile from trunk.

For 3.x I'll deprecate.

> Deprecate Directory.touchFile
> -
>
> Key: LUCENE-2027
> URL: https://issues.apache.org/jira/browse/LUCENE-2027
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Store
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>Priority: Trivial
> Fix For: 4.0
>
> Attachments: LUCENE-2027.patch
>
>
> Lucene doesn't use this method, and, FindBugs reports that FSDirectory's impl 
> shouldn't swallow the returned result from File.setLastModified.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Assigned] (LUCENE-2027) Deprecate Directory.touchFile

2011-05-16 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless reassigned LUCENE-2027:
--

Assignee: Michael McCandless

> Deprecate Directory.touchFile
> -
>
> Key: LUCENE-2027
> URL: https://issues.apache.org/jira/browse/LUCENE-2027
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Store
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>Priority: Trivial
> Fix For: 4.0
>
> Attachments: LUCENE-2027.patch
>
>
> Lucene doesn't use this method, and, FindBugs reports that FSDirectory's impl 
> shouldn't swallow the returned result from File.setLastModified.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-2519) Improve the defaults for the "text" field type in default schema.xml

2011-05-16 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034101#comment-13034101
 ] 

Michael McCandless commented on SOLR-2519:
--

I think the attached patch is a good starting point. It fixes the
generic "text" fieldType to have good all around defaults for all
languages, so that non-whitespace languages work fine.

Then, I think we should iteratively add in custom languages over time
(as separate issues).  We can eg add text_en_autophrase, text_en,
text_zh, etc.  We should at least do first sweep of nice analyzers
module and add fieldTypes for them.

This way we will eventually get to the ideal future when we have
text_XX coverage for many languages.


> Improve the defaults for the "text" field type in default schema.xml
> 
>
> Key: SOLR-2519
> URL: https://issues.apache.org/jira/browse/SOLR-2519
> Project: Solr
>  Issue Type: Bug
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Fix For: 3.2, 4.0
>
> Attachments: SOLR-2519.patch
>
>
> Spinoff from: http://lucene.markmail.org/thread/ww6mhfi3rfpngmc5
> The text fieldType in schema.xml is unusable for non-whitespace
> languages, because it has the dangerous auto-phrase feature (of
> Lucene's QP -- see LUCENE-2458) enabled.
> Lucene leaves this off by default, as does ElasticSearch
> (http://www.elasticsearch.org/).
> Furthermore, the "text" fieldType uses WhitespaceTokenizer when
> StandardTokenizer is a better cross-language default.
> Until we have language specific field types, I think we should fix
> the "text" fieldType to work well for all languages, by:
>   * Switching from WhitespaceTokenizer to StandardTokenizer
>   * Turning off auto-phrase

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Assigned] (LUCENE-3100) IW.commit() writes but fails to fsync the N.fnx file

2011-05-16 Thread Simon Willnauer (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Simon Willnauer reassigned LUCENE-3100:
---

Assignee: Simon Willnauer

> IW.commit() writes but fails to fsync the N.fnx file
> 
>
> Key: LUCENE-3100
> URL: https://issues.apache.org/jira/browse/LUCENE-3100
> Project: Lucene - Java
>  Issue Type: Bug
>Reporter: Michael McCandless
>Assignee: Simon Willnauer
> Fix For: 4.0
>
>
> In making a unit test for NRTCachingDir (LUCENE-3092) I hit this surprising 
> bug!
> Because the new N.fnx file is written at the "last minute" along with the 
> segments file, it's not included in the sis.files() that IW uses to figure 
> out which files to sync.
> This bug means one could call IW.commit(), successfully, return, and then the 
> machine could crash and when it comes back up your index could be corrupted.
> We should hopefully first fix TestCrash so that it hits this bug (maybe it 
> needs more/better randomization?), then fix the bug

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Created] (SOLR-2521) TestJoin.testRandom fails

2011-05-16 Thread Michael McCandless (JIRA)
TestJoin.testRandom fails
-

 Key: SOLR-2521
 URL: https://issues.apache.org/jira/browse/SOLR-2521
 Project: Solr
  Issue Type: Bug
Reporter: Michael McCandless
 Fix For: 4.0


Hit this random failure; it reproduces on trunk:

{noformat}

[junit] Testsuite: org.apache.solr.TestJoin
[junit] Tests run: 1, Failures: 1, Errors: 0, Time elapsed: 4.512 sec
[junit] 
[junit] - Standard Error -
[junit] 2011-05-16 12:51:46 org.apache.solr.TestJoin testRandomJoin
[junit] SEVERE: GROUPING MISMATCH: mismatch: '0'!='1' @ response/numFound
[junit] 
request=LocalSolrQueryRequest{echoParams=all&indent=true&q={!join+from%3Dsmall_i+to%3Dsmall3_is}*:*&wt=json}
[junit] result={
[junit]   "responseHeader":{
[junit] "status":0,
[junit] "QTime":0,
[junit] "params":{
[junit]   "echoParams":"all",
[junit]   "indent":"true",
[junit]   "q":"{!join from=small_i to=small3_is}*:*",
[junit]   "wt":"json"}},
[junit]   "response":{"numFound":1,"start":0,"docs":[
[junit]   {
[junit] "id":"NXEA",
[junit] "score_f":87.90162,
[junit] "small3_ss":["N",
[junit]   "v",
[junit]   "n"],
[junit] "small_i":4,
[junit] "small2_i":1,
[junit] "small2_is":[2],
[junit] "small3_is":[69,
[junit]   88,
[junit]   54,
[junit]   80,
[junit]   75,
[junit]   83,
[junit]   57,
[junit]   73,
[junit]   85,
[junit]   52,
[junit]   50,
[junit]   88,
[junit]   51,
[junit]   89,
[junit]   12,
[junit]   8,
[junit]   19,
[junit]   23,
[junit]   53,
[junit]   75,
[junit]   26,
[junit]   99,
[junit]   0,
[junit]   44]}]
[junit]   }}
[junit] expected={"numFound":0,"start":0,"docs":[]}
[junit] model={"NXEA":"Doc(0):[id=NXEA, score_f=87.90162, small3_ss=[N, 
v, n], small_i=4, small2_i=1, small2_is=2, small3_is=[69, 88, 54, 80, 75, 83, 
57, 73, 85, 52, 50, 88, 51, 89, 12, 8, 19, 23, 53, 75, 26, 99, 0, 
44]]","JSLZ":"Doc(1):[id=JSLZ, score_f=11.198811, small2_ss=[c, d], 
small3_ss=[b, R, H, Q, O, f, C, e, Z, u, z, u, w, I, f, _, Y, r, w, u], 
small_i=6, small2_is=[2, 3], small3_is=[22, 1]]","FAWX":"Doc(2):[id=FAWX, 
score_f=25.524109, small_s=d, small3_ss=[O, D, X, `, W, z, k, M, j, m, r, [, E, 
P, w, ^, y, T, e, R, V, H, g, e, I], small_i=2, small2_is=[2, 1], 
small3_is=[95, 42]]","GDDZ":"Doc(3):[id=GDDZ, score_f=8.483642, small2_ss=[b, 
e], small3_ss=[o, i, y, l, I, O, r, O, f, d, E, e, d, f, b, P], small2_is=[6, 
6], small3_is=[36, 48, 9, 8, 40, 40, 68]]","RBIQ":"Doc(4):[id=RBIQ, 
score_f=97.06258, small_s=b, small2_s=c, small2_ss=[e, e], small_i=2, 
small2_is=6, small3_is=[13, 77, 96, 45]]","LRDM":"Doc(5):[id=LRDM, 
score_f=82.302124, small_s=b, small2_s=a, small2_ss=d, small3_ss=[H, m, O, D, 
I, J, U, D, f, N, ^, m, I, j, L, s, F, h, A, `, c, j], small2_i=2, 
small2_is=[2, 7], small3_is=[81, 31, 78, 23, 88, 1, 7, 86, 20, 7, 40, 52, 100, 
81, 34, 45, 87, 72, 14, 5]]"}
[junit] NOTE: reproduce with: ant test -Dtestcase=TestJoin 
-Dtestmethod=testRandomJoin 
-Dtests.seed=-4998031941344546449:8541928265064992444
[junit] NOTE: test params are: codec=RandomCodecProvider: {id=MockRandom, 
small2_ss=Standard, small2_is=MockFixedIntBlock(blockSize=1738), 
small2_s=MockFixedIntBlock(blockSize=1738), 
small3_is=MockVariableIntBlock(baseBlockSize=77), 
small_i=MockFixedIntBlock(blockSize=1738), 
small_s=MockVariableIntBlock(baseBlockSize=77), score_f=MockSep, 
small2_i=Pulsing(freqCutoff=9), small3_ss=SimpleText}, locale=sr_BA, 
timezone=America/Barbados
[junit] NOTE: all tests run in this JVM:
[junit] [TestJoin]
[junit] NOTE: Linux 2.6.33.6-147.fc13.x86_64 amd64/Sun Microsystems Inc. 
1.6.0_21 (64-bit)/cpus=24,threads=1,free=252342544,total=308084736
[junit] -  ---
[junit] Testcase: testRandomJoin(org.apache.solr.TestJoin): FAILED
[junit] mismatch: '0'!='1' @ response/numFound
[junit] junit.framework.AssertionFailedError: mismatch: '0'!='1' @ 
response/numFound
[junit] at 
org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1282)
[junit] at 
org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1211)
[junit] at org.apache.solr.TestJoin.testRandomJoin(TestJoin.java:172)
[junit] 
[junit] 
[junit] Test org.apache.solr.TestJoin FAILED
{noformat}

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-

[jira] [Commented] (LUCENE-3090) DWFlushControl does not take active DWPT out of the loop on fullFlush

2011-05-16 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034095#comment-13034095
 ] 

Michael McCandless commented on LUCENE-3090:


Patch looks good but hairy Simon!

I ran 144 iters of all (Solr+lucene+lucene-contrib) tests.  I hit three fails 
(one in Solr's TestJoin.testRandomJoin, and two in Solr's HighlighterTest) but 
I don't think these are related to this patch.

> DWFlushControl does not take active DWPT out of the loop on fullFlush
> -
>
> Key: LUCENE-3090
> URL: https://issues.apache.org/jira/browse/LUCENE-3090
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Index
>Affects Versions: 4.0
>Reporter: Simon Willnauer
>Assignee: Simon Willnauer
>Priority: Critical
> Fix For: 4.0
>
> Attachments: LUCENE-3090.patch, LUCENE-3090.patch, LUCENE-3090.patch
>
>
> We have seen several OOM on TestNRTThreads and all of them are caused by 
> DWFlushControl missing DWPT that are set as flushPending but can't flush due 
> to a full flush going on. Yet that means that those DWPT are filling up in 
> the background while they should actually be checked out and blocked until 
> the full flush finishes. Even further we currently stall on the 
> maxNumThreadStates while we should stall on the num of active thread states. 
> I will attach a patch tomorrow.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3084) MergePolicy.OneMerge.segments should be List not SegmentInfos

2011-05-16 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034093#comment-13034093
 ] 

Michael McCandless commented on LUCENE-3084:


Uwe, this looks like a great step forward?  Even if there are other things to 
fix later, we should commit this first (progress not perfection)?  Thanks!

On backporting, this is an experimental API, and it's rather "expert" for code 
to be interacting with SegmentInfos, so I think we can just break it (and 
advertise we did so)?

> MergePolicy.OneMerge.segments should be List not SegmentInfos
> --
>
> Key: LUCENE-3084
> URL: https://issues.apache.org/jira/browse/LUCENE-3084
> Project: Lucene - Java
>  Issue Type: Improvement
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 3.2, 4.0
>
> Attachments: LUCENE-3084-trunk-only.patch, 
> LUCENE-3084-trunk-only.patch, LUCENE-3084-trunk-only.patch, 
> LUCENE-3084-trunk-only.patch, LUCENE-3084.patch
>
>
> SegmentInfos carries a bunch of fields beyond the list of SI, but for merging 
> purposes these fields are unused.
> We should cutover to List instead.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3102) Few issues with CachingCollector

2011-05-16 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034091#comment-13034091
 ] 

Michael McCandless commented on LUCENE-3102:


Patch looks great Shai -- +1 to commit!!

Yes that is very sneaky about the private fields in inner/outer classes -- it's 
good you added a comment explaining it!

> Few issues with CachingCollector
> 
>
> Key: LUCENE-3102
> URL: https://issues.apache.org/jira/browse/LUCENE-3102
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: contrib/*
>Reporter: Shai Erera
>Priority: Minor
> Fix For: 3.2, 4.0
>
> Attachments: LUCENE-3102.patch, LUCENE-3102.patch
>
>
> CachingCollector (introduced in LUCENE-1421) has few issues:
> # Since the wrapped Collector may support out-of-order collection, the 
> document IDs cached may be out-of-order (depends on the Query) and thus 
> replay(Collector) will forward document IDs out-of-order to a Collector that 
> may not support it.
> # It does not clear cachedScores + cachedSegs upon exceeding RAM limits
> # I think that instead of comparing curScores to null, in order to determine 
> if scores are requested, we should have a specific boolean - for clarity
> # This check "if (base + nextLength > maxDocsToCache)" (line 168) can be 
> relaxed? E.g., what if nextLength is, say, 512K, and I cannot satisfy the 
> maxDocsToCache constraint, but if it was 10K I would? Wouldn't we still want 
> to try and cache them?
> Also:
> * The TODO in line 64 (having Collector specify needsScores()) -- why do we 
> need that if CachingCollector ctor already takes a boolean "cacheScores"? I 
> think it's better defined explicitly than implicitly?
> * Let's introduce a factory method for creating a specialized version if 
> scoring is requested / not (i.e., impl the TODO in line 189)
> * I think it's a useful collector, which stands on its own and not specific 
> to grouping. Can we move it to core?
> * How about using OpenBitSet instead of int[] for doc IDs?
> ** If the number of hits is big, we'd gain some RAM back, and be able to 
> cache more entries
> ** NOTE: OpenBitSet can only be used for in-order collection only. So we can 
> use that if the wrapped Collector does not support out-of-order
> * Do you think we can modify this Collector to not necessarily wrap another 
> Collector? We have such Collector which stores (in-memory) all matching doc 
> IDs + scores (if required). Those are later fed into several processes that 
> operate on them (e.g. fetch more info from the index etc.). I am thinking, we 
> can make CachingCollector *optionally* wrap another Collector and then 
> someone can reuse it by setting RAM limit to unlimited (we should have a 
> constant for that) in order to simply collect all matching docs + scores.
> * I think a set of dedicated unit tests for this class alone would be good.
> That's it so far. Perhaps, if we do all of the above, more things will pop up.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-3084) MergePolicy.OneMerge.segments should be List not SegmentInfos

2011-05-16 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler updated LUCENE-3084:
--

Attachment: LUCENE-3084-trunk-only.patch

Here is an updated patch that removes some List usage from DirectoryReader and 
IndexWriter for rollback when commit fails. I am still not happy with 
IndexWriter code interacting directly with the list, but this should maybe be 
fixed later.

This patch could also be backported to clean up 3.x, but for backwards 
compatibility, the SegmentInfos class should still extend Vector, and we 
can make the field "segments" simply point to this. I am not sure how to 
"deprecate" extension of a class? A possibility would be to add each Vector 
method as an overridden, deprecated one-liner, but that's a no-brainer and 
stupid to do :(
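
A sketch of the one-liner idea, for what it's worth (hypothetical code, not 
the patch):

{noformat}
// there is no way to deprecate "extends Vector" itself, only the
// inherited surface, one overridden one-liner at a time
public class SegmentInfos extends Vector<SegmentInfo> {
  /** @deprecated access segments through the new List-based API instead */
  @Deprecated
  @Override
  public SegmentInfo get(int index) {
    return super.get(index);
  }
  // ...and so on for every public Vector method -- hence "stupid to do"
}
{noformat}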

> MergePolicy.OneMerge.segments should be List not SegmentInfos
> --
>
> Key: LUCENE-3084
> URL: https://issues.apache.org/jira/browse/LUCENE-3084
> Project: Lucene - Java
>  Issue Type: Improvement
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 3.2, 4.0
>
> Attachments: LUCENE-3084-trunk-only.patch, 
> LUCENE-3084-trunk-only.patch, LUCENE-3084-trunk-only.patch, 
> LUCENE-3084-trunk-only.patch, LUCENE-3084.patch
>
>
> SegmentInfos carries a bunch of fields beyond the list of SI, but for merging 
> purposes these fields are unused.
> We should cutover to List instead.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Resolved] (SOLR-2505) Output cluster scores

2011-05-16 Thread Stanislaw Osinski (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stanislaw Osinski resolved SOLR-2505.
-

Resolution: Fixed

Committed to trunk and branch_3x.

> Output cluster scores
> -
>
> Key: SOLR-2505
> URL: https://issues.apache.org/jira/browse/SOLR-2505
> Project: Solr
>  Issue Type: Improvement
>  Components: contrib - Clustering
>Reporter: Stanislaw Osinski
>Assignee: Stanislaw Osinski
>Priority: Minor
> Fix For: 3.2, 4.0
>
>
> Carrot2 algorithms compute cluster scores; we could expose them on the output 
> from Solr clustering component. Along with scores, we can output a boolean 
> flag that marks the Other Topics groups.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Resolved] (SOLR-2448) Upgrade Carrot2 to version 3.5.0

2011-05-16 Thread Stanislaw Osinski (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stanislaw Osinski resolved SOLR-2448.
-

Resolution: Fixed

Committed to trunk and branch_3x.

> Upgrade Carrot2 to version 3.5.0
> 
>
> Key: SOLR-2448
> URL: https://issues.apache.org/jira/browse/SOLR-2448
> Project: Solr
>  Issue Type: Task
>  Components: contrib - Clustering
>Reporter: Stanislaw Osinski
>Assignee: Stanislaw Osinski
>Priority: Minor
> Fix For: 3.2, 4.0
>
> Attachments: SOLR-2448-2449-2450-2505-branch_3x.patch, 
> SOLR-2448-2449-2450-2505-trunk.patch, carrot2-core-3.5.0.jar
>
>
> Carrot2 version 3.5.0 should be available very soon. After the upgrade, it 
> will be possible to implement a few improvements to the clustering plugin; 
> I'll file separate issues for these.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Resolved] (SOLR-2449) Loading of Carrot2 resources from Solr config directory

2011-05-16 Thread Stanislaw Osinski (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stanislaw Osinski resolved SOLR-2449.
-

Resolution: Fixed

Committed to trunk and branch_3x.

> Loading of Carrot2 resources from Solr config directory
> ---
>
> Key: SOLR-2449
> URL: https://issues.apache.org/jira/browse/SOLR-2449
> Project: Solr
>  Issue Type: Improvement
>  Components: contrib - Clustering
>Reporter: Stanislaw Osinski
>Assignee: Stanislaw Osinski
> Fix For: 3.2, 4.0
>
> Attachments: SOLR-2449.patch
>
>
> Currently, Carrot2 clustering algorithms read linguistic resources (stop 
> words, stop labels) from the classpath (Carrot2 JAR), which makes them 
> difficult to edit/override. The directory from which Carrot2 should read its 
> resources (absolute, or relative to Solr config dir) could be specified in 
> the {{engine}} element. By default, the path could be e.g. 
> {{/clustering/carrot2}}.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Resolved] (SOLR-2450) Carrot2 clustering should use both its own and Solr's stop words

2011-05-16 Thread Stanislaw Osinski (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stanislaw Osinski resolved SOLR-2450.
-

Resolution: Fixed

Committed to trunk and branch_3x.

> Carrot2 clustering should use both its own and Solr's stop words
> 
>
> Key: SOLR-2450
> URL: https://issues.apache.org/jira/browse/SOLR-2450
> Project: Solr
>  Issue Type: Improvement
>  Components: contrib - Clustering
>Reporter: Stanislaw Osinski
>Assignee: Stanislaw Osinski
>Priority: Minor
> Fix For: 3.2, 4.0
>
> Attachments: SOLR-2450.patch
>
>
> While using only Solr's stop words for clustering isn't a good idea (compared 
> to indexing, clustering needs more aggressive stop word removal to get 
> reasonable cluster labels), it would be good if Carrot2 used both its own and 
> Solr's stop words.
> I'm not sure what the best way to implement this would be though. My first 
> thought was to simply load {{stopwords.txt}} from Solr config dir and merge 
> them with Carrot2's. But then, maybe a better approach would be to get the 
> stop words from the StopFilter being used? Ideally, we should also consider 
> the per-field stop filters configured on the fields used for clustering.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Field should accept BytesRef?

2011-05-16 Thread Robert Muir
On Mon, May 16, 2011 at 11:29 AM, Jason Rutherglen
 wrote:
>> But when you create an untokenized field (or even a binary field, which is 
>> stored-only at the moment), you could theoretically index the bytes directly
>
> Right, if I already have a BytesRef of what needs to be indexed, then
> passing the BR into Field/able should reduce garbage collection of
> strings?
>

you can do this with a tokenstream, see
http://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/src/test/org/apache/lucene/index/Test2BTerms.java
for an example

(sorry i somehow was confused about your message earlier).

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-3102) Few issues with CachingCollector

2011-05-16 Thread Shai Erera (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shai Erera updated LUCENE-3102:
---

Attachment: LUCENE-3102.patch

bq. Only thing is: I would be careful about directly setting those private 
fields of the cachedScorer; I think (not sure) this incurs an "access" check on 
each assignment. Maybe make them package protected? Or use a setter?

Good catch Mike. I read about it some and found this nice webpage which 
explains the implications (http://www.glenmccl.com/jperf/). Indeed, if the 
member is private (whether it's in the inner or outer class), there is an 
access check. So the right thing to do is to declare it protected / 
package-private, which I did. Thanks for the opportunity to get some education!

Patch fixes this. I intend to commit this shortly + move the class to core + 
apply to trunk. Then, I'll continue w/ the rest of the improvements.
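
To illustrate the point (hypothetical names, not the actual patch):

{noformat}
// for a private field of a nested class, javac generates a synthetic
// accessor method, so writes from the enclosing class go through a
// method call; a package-private field is a direct putfield
public class Outer {
  static final class CachedScorer {
    private float hidden; // outer writes compile to an access$NNN()-style call
    float score;          // package-private: written directly
  }
  void set(CachedScorer s, float v) {
    s.hidden = v; // synthetic accessor under the hood
    s.score = v;  // direct field write
  }
}
{noformat}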

> Few issues with CachingCollector
> 
>
> Key: LUCENE-3102
> URL: https://issues.apache.org/jira/browse/LUCENE-3102
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: contrib/*
>Reporter: Shai Erera
>Priority: Minor
> Fix For: 3.2, 4.0
>
> Attachments: LUCENE-3102.patch, LUCENE-3102.patch
>
>
> CachingCollector (introduced in LUCENE-1421) has few issues:
> # Since the wrapped Collector may support out-of-order collection, the 
> document IDs cached may be out-of-order (depends on the Query) and thus 
> replay(Collector) will forward document IDs out-of-order to a Collector that 
> may not support it.
> # It does not clear cachedScores + cachedSegs upon exceeding RAM limits
> # I think that instead of comparing curScores to null, in order to determine 
> if scores are requested, we should have a specific boolean - for clarity
> # This check "if (base + nextLength > maxDocsToCache)" (line 168) can be 
> relaxed? E.g., what if nextLength is, say, 512K, and I cannot satisfy the 
> maxDocsToCache constraint, but if it was 10K I would? Wouldn't we still want 
> to try and cache them?
> Also:
> * The TODO in line 64 (having Collector specify needsScores()) -- why do we 
> need that if CachingCollector ctor already takes a boolean "cacheScores"? I 
> think it's better defined explicitly than implicitly?
> * Let's introduce a factory method for creating a specialized version if 
> scoring is requested / not (i.e., impl the TODO in line 189)
> * I think it's a useful collector, which stands on its own and not specific 
> to grouping. Can we move it to core?
> * How about using OpenBitSet instead of int[] for doc IDs?
> ** If the number of hits is big, we'd gain some RAM back, and be able to 
> cache more entries
> ** NOTE: OpenBitSet can only be used for in-order collection only. So we can 
> use that if the wrapped Collector does not support out-of-order
> * Do you think we can modify this Collector to not necessarily wrap another 
> Collector? We have such Collector which stores (in-memory) all matching doc 
> IDs + scores (if required). Those are later fed into several processes that 
> operate on them (e.g. fetch more info from the index etc.). I am thinking, we 
> can make CachingCollector *optionally* wrap another Collector and then 
> someone can reuse it by setting RAM limit to unlimited (we should have a 
> constant for that) in order to simply collect all matching docs + scores.
> * I think a set of dedicated unit tests for this class alone would be good.
> That's it so far. Perhaps, if we do all of the above, more things will pop up.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Field should accept BytesRef?

2011-05-16 Thread Jason Rutherglen
> But when you create an untokenized field (or even a binary field, which is 
> stored-only at the moment), you could theoretically index the bytes directly

Right, if I already have a BytesRef of what needs to be indexed, then
passing the BR into Field/able should reduce garbage collection of
strings?

On Sun, May 15, 2011 at 9:59 AM, Uwe Schindler  wrote:
> Hi,
>
> I think Jason meant the field value,  not the field name.
>
> Field names should stay Strings, as they are only "identifiers"; making them 
> BytesRefs is not really useful.
>
> But when you create an untokenized field (or even a binary field, which is 
> stored-only at the moment), you could theoretically index the bytes directly.
>
> -
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: u...@thetaphi.de
>
>
>> -Original Message-
>> From: Robert Muir [mailto:rcm...@gmail.com]
>> Sent: Sunday, May 15, 2011 6:22 PM
>> To: dev@lucene.apache.org
>> Subject: Re: Field should accept BytesRef?
>>
>> On Sun, May 15, 2011 at 12:05 PM, Jason Rutherglen
>>  wrote:
>> > In the Field object a text value must be of type string, however I
>> > think we can allow a BytesRef to be passed in?
>> >
>>
>> it would be nice if we sorted them in byte order too? I think right now 
>> fields
>> are sorted in utf-16 order, but terms are sorted in utf-8 order? (if so, 
>> this is
>> confusing)
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional
>> commands, e-mail: dev-h...@lucene.apache.org
>
>
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
>
>

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Moving towards Lucene 4.0

2011-05-16 Thread Shai Erera
We anyway seem to mark every new API as @lucene.experimental these days, so
we shouldn't have too much of a problem when 4.0 is out :).

Experimental API is subject to change at any time. We can consider that as
an option as well (maybe it adds another option to Robert's?).

Though personally, I'm not a big fan of this notion - I think we deceive
ourselves and users when we have @experimental on a "stable" branch. Any
@experimental API on trunk today falls into this bucket after 4.0 is out.
And I'm sure there are a couple in 3.x already.

Don't get me wrong - I don't suggest we should stop using it. But I think we
should consider reviewing the @experimental API before every "stable"
release, and reduce it over time, not increase it.

Shai

On Mon, May 16, 2011 at 4:20 PM, Robert Muir  wrote:

> On Mon, May 16, 2011 at 9:12 AM, Simon Willnauer
>  wrote:
> > I have to admit that branch is very rough and the API is super hard to
> > use. For now!
> > Let's not get dragged into a discussion of how this API should look;
> > there will be time for that.
>
> +1, this is what I really meant by "decide how to handle". I don't
> think we will be able to quickly "decide how to fix" the branch
> itself; I think it's really complicated. But we can admit it's really
> complicated and won't be solved very soon, and try to figure out a
> release strategy with this in mind.
>
> (P.S. sorry Simon, you got two copies of this message; I accidentally
> hit reply instead of reply-all.)
>


[jira] [Commented] (SOLR-1942) Ability to select codec per field

2011-05-16 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034059#comment-13034059
 ] 

Robert Muir commented on SOLR-1942:
---

OK, thanks Grant. I'll take a look through the patch today and post back what 
I think.

> Ability to select codec per field
> -
>
> Key: SOLR-1942
> URL: https://issues.apache.org/jira/browse/SOLR-1942
> Project: Solr
>  Issue Type: New Feature
>Affects Versions: 4.0
>Reporter: Yonik Seeley
>Assignee: Grant Ingersoll
> Fix For: 4.0
>
> Attachments: SOLR-1942.patch, SOLR-1942.patch, SOLR-1942.patch, 
> SOLR-1942.patch, SOLR-1942.patch, SOLR-1942.patch, SOLR-1942.patch
>
>
> We should use PerFieldCodecWrapper to allow users to select the codec 
> per-field.




[jira] [Commented] (LUCENE-3090) DWFlushControl does not take active DWPT out of the loop on fullFlush

2011-05-16 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034053#comment-13034053
 ] 

Simon Willnauer commented on LUCENE-3090:
-

I did 150 runs of all Lucene tests, incl. contrib - no failures so far. Seems 
to be good to go.

> DWFlushControl does not take active DWPT out of the loop on fullFlush
> -
>
> Key: LUCENE-3090
> URL: https://issues.apache.org/jira/browse/LUCENE-3090
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Index
>Affects Versions: 4.0
>Reporter: Simon Willnauer
>Assignee: Simon Willnauer
>Priority: Critical
> Fix For: 4.0
>
> Attachments: LUCENE-3090.patch, LUCENE-3090.patch, LUCENE-3090.patch
>
>
> We have seen several OOMs on TestNRTThreads, all of them caused by 
> DWFlushControl missing DWPTs that are set as flushPending but can't flush due 
> to a full flush going on. That means those DWPTs keep filling up in 
> the background while they should actually be checked out and blocked until 
> the full flush finishes. Furthermore, we currently stall on 
> maxNumThreadStates while we should stall on the number of active thread states. 
> I will attach a patch tomorrow.




[jira] [Commented] (SOLR-1942) Ability to select codec per field

2011-05-16 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034051#comment-13034051
 ] 

Grant Ingersoll commented on SOLR-1942:
---

I thought I would have time last week, but that turned out not to be the case. 
If you have time, Robert, feel free; otherwise I might be able to get to it 
later in the week (pending conf. prep). From the sounds of it, it likely just 
needs to be updated to trunk and then it should be ready to go (we should also 
document it on the wiki).

> Ability to select codec per field
> -
>
> Key: SOLR-1942
> URL: https://issues.apache.org/jira/browse/SOLR-1942
> Project: Solr
>  Issue Type: New Feature
>Affects Versions: 4.0
>Reporter: Yonik Seeley
>Assignee: Grant Ingersoll
> Fix For: 4.0
>
> Attachments: SOLR-1942.patch, SOLR-1942.patch, SOLR-1942.patch, 
> SOLR-1942.patch, SOLR-1942.patch, SOLR-1942.patch, SOLR-1942.patch
>
>
> We should use PerFieldCodecWrapper to allow users to select the codec 
> per-field.




[jira] [Commented] (LUCENE-3098) Grouped total count

2011-05-16 Thread Martijn van Groningen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034050#comment-13034050
 ] 

Martijn van Groningen commented on LUCENE-3098:
---

That is true. It is just a simple unordered collection of all values of the 
group field that match the query. I'll include this as well.
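
A minimal sketch of collecting that unordered set of group values (the class
name is hypothetical; 3.x-era Collector and FieldCache APIs assumed):

{noformat}
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.Collector;
import org.apache.lucene.search.FieldCache;
import org.apache.lucene.search.Scorer;

// Hypothetical collector: gathers the distinct values of the group field
// over all matching documents; the set's size is the total group count.
public class TotalGroupCountCollector extends Collector {
  private final String groupField;
  private final Set<String> groups = new HashSet<String>();
  private String[] values; // group field value per docId, from FieldCache

  public TotalGroupCountCollector(String groupField) {
    this.groupField = groupField;
  }

  @Override
  public void setScorer(Scorer scorer) {} // scores not needed

  @Override
  public void collect(int doc) {
    groups.add(values[doc]); // unordered; duplicates collapse in the set
  }

  @Override
  public void setNextReader(IndexReader reader, int docBase) throws IOException {
    values = FieldCache.DEFAULT.getStrings(reader, groupField);
  }

  @Override
  public boolean acceptsDocsOutOfOrder() {
    return true;
  }

  public Set<String> getGroups() {
    return groups;
  }
}
{noformat}

getGroups().size() would then be the total group count, and the set itself is
the unordered collection discussed above - it says nothing about group heads.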

> Grouped total count
> ---
>
> Key: LUCENE-3098
> URL: https://issues.apache.org/jira/browse/LUCENE-3098
> Project: Lucene - Java
>  Issue Type: New Feature
>Reporter: Martijn van Groningen
> Fix For: 3.2, 4.0
>
> Attachments: LUCENE-3098-3x.patch, LUCENE-3098.patch, 
> LUCENE-3098.patch, LUCENE-3098.patch
>
>
> When grouping currently you can get two counts:
> * Total hit count. Which counts all documents that matched the query.
> * Total grouped hit count. Which counts all documents that have been grouped 
> in the top N groups.
> Since the end user gets groups in his search result instead of plain 
> documents with grouping. The total number of groups as total count makes more 
> sense in many situations. 




[jira] [Commented] (LUCENE-3098) Grouped total count

2011-05-16 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034040#comment-13034040
 ] 

Michael McCandless commented on LUCENE-3098:


Right, we'd make it clear the collection is unordered.

It just seems like, since we are building up this collection anyway, we may as 
well give the consumer access to it?

> Grouped total count
> ---
>
> Key: LUCENE-3098
> URL: https://issues.apache.org/jira/browse/LUCENE-3098
> Project: Lucene - Java
>  Issue Type: New Feature
>Reporter: Martijn van Groningen
> Fix For: 3.2, 4.0
>
> Attachments: LUCENE-3098-3x.patch, LUCENE-3098.patch, 
> LUCENE-3098.patch, LUCENE-3098.patch
>
>
> When grouping currently you can get two counts:
> * Total hit count. Which counts all documents that matched the query.
> * Total grouped hit count. Which counts all documents that have been grouped 
> in the top N groups.
> Since the end user gets groups in his search result instead of plain 
> documents with grouping. The total number of groups as total count makes more 
> sense in many situations. 




Re: svn commit: r1103709 - in /lucene/java/site: docs/whoweare.html docs/whoweare.pdf src/documentation/content/xdocs/whoweare.xml

2011-05-16 Thread Stanislaw Osinski
Hi Mark,

Thanks for clarifying the difference between contrib and full committers; I
was probably too shy to count myself among the latter group right away :-)
For the time being, I'll most likely stick with maintaining the clustering
bit and will consult you guys if I have something to contribute in other
areas of the code.

S.

On Mon, May 16, 2011 at 15:41, Mark Miller  wrote:

>
> Stanislaw - we certainly nominated you in the spirit of maintaining the
> carrot2 contrib, but you are still a full committer. We have decided to stop
> adding new Contrib committers. A full committer may be someone that only
> works on part of the project. IMO, a full committer might be someone that
> only has commit bits so that he can update the website! We trust full
> committers to only mess with what they are comfortable with. So we trust
> that you will stick to Carrot2 or other areas you are strong in, and that if
> you want to move into other code, you will do so intelligently. Essentially,
> by making you a Committer, we are mostly just saying - "we trust you".
>
> But you are a full committer and not a contrib committer. We no longer mint
> new contrib committers.
>
> - Mark Miller
> lucidimagination.com
>
> Lucene/Solr User Conference
> May 25-26, San Francisco
> www.lucenerevolution.org
>


[jira] [Commented] (LUCENE-3098) Grouped total count

2011-05-16 Thread Martijn van Groningen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034025#comment-13034025
 ] 

Martijn van Groningen commented on LUCENE-3098:
---

Hmmm... So you get a list of all group values. That can be useful. Just 
remember that it doesn't tell you anything about the group head (the most 
relevant document of a group), since we don't sort inside the groups.

> Grouped total count
> ---
>
> Key: LUCENE-3098
> URL: https://issues.apache.org/jira/browse/LUCENE-3098
> Project: Lucene - Java
>  Issue Type: New Feature
>Reporter: Martijn van Groningen
> Fix For: 3.2, 4.0
>
> Attachments: LUCENE-3098-3x.patch, LUCENE-3098.patch, 
> LUCENE-3098.patch, LUCENE-3098.patch
>
>
> When grouping currently you can get two counts:
> * Total hit count. Which counts all documents that matched the query.
> * Total grouped hit count. Which counts all documents that have been grouped 
> in the top N groups.
> Since the end user gets groups in his search result instead of plain 
> documents with grouping. The total number of groups as total count makes more 
> sense in many situations. 




[jira] [Commented] (SOLR-1942) Ability to select codec per field

2011-05-16 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034018#comment-13034018
 ] 

Robert Muir commented on SOLR-1942:
---

any update on this? Would be nice to be able to hook in codecproviders and 
codecs this way.
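
For what it's worth, the hookup boils down to a per-field lookup with a
default; a sketch of the pattern only (class and method names here are
hypothetical, not the actual PerFieldCodecWrapper/CodecProvider API):

{noformat}
import java.util.HashMap;
import java.util.Map;

// Hypothetical per-field table: route each field to a configured codec
// name, falling back to a default. This shows the pattern, not the real API.
public class PerFieldCodecTable {
  private final Map<String, String> codecPerField = new HashMap<String, String>();
  private final String defaultCodec;

  public PerFieldCodecTable(String defaultCodec) {
    this.defaultCodec = defaultCodec;
  }

  public void setCodec(String field, String codecName) {
    codecPerField.put(field, codecName);
  }

  public String getCodec(String field) {
    String name = codecPerField.get(field);
    return name != null ? name : defaultCodec;
  }
}
{noformat}

The Solr side would then just populate such a table from the schema, field by
field.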

> Ability to select codec per field
> -
>
> Key: SOLR-1942
> URL: https://issues.apache.org/jira/browse/SOLR-1942
> Project: Solr
>  Issue Type: New Feature
>Affects Versions: 4.0
>Reporter: Yonik Seeley
>Assignee: Grant Ingersoll
> Fix For: 4.0
>
> Attachments: SOLR-1942.patch, SOLR-1942.patch, SOLR-1942.patch, 
> SOLR-1942.patch, SOLR-1942.patch, SOLR-1942.patch, SOLR-1942.patch
>
>
> We should use PerFieldCodecWrapper to allow users to select the codec 
> per-field.




Re: svn commit: r1103709 - in /lucene/java/site: docs/whoweare.html docs/whoweare.pdf src/documentation/content/xdocs/whoweare.xml

2011-05-16 Thread Mark Miller

On May 16, 2011, at 8:55 AM, Stanislaw Osinski wrote:

> Stanislaw, you are a full committer AFAIK?!
> 
> I've been working mostly on the clustering plugin for now, so I'm not sure if 
> it's right to move me to the core section right away :-)
> 
> Incidentally, I tried to svn up on /www/lucene.apache.org/java/docs at 
> people.apache.org to push the modifications live, but there is an SVN lock on 
> that directory. Am I missing anything? I'm assuming that's the right 
> directory for the committers list?
> 
> S.
> 
> 

Stanislaw - we certainly nominated you in the spirit of maintaining the carrot2 
contrib, but you are still a full committer. We have decided to stop adding new 
Contrib committers. A full committer may be someone that only works on part of 
the project. IMO, a full committer might be someone that only has commit bits 
so that he can update the website! We trust full committers to only mess with 
what they are comfortable with. So we trust that you will stick to Carrot2 or 
other areas you are strong in, and that if you want to move into other code, 
you will do so intelligently. Essentially, by making you a Committer, we are 
mostly just saying - "we trust you".

But you are a full committer and not a contrib committer. We no longer mint new 
contrib committers.

- Mark Miller
lucidimagination.com

Lucene/Solr User Conference
May 25-26, San Francisco
www.lucenerevolution.org









Re: svn commit: r1103709 - in /lucene/java/site: docs/whoweare.html docs/whoweare.pdf src/documentation/content/xdocs/whoweare.xml

2011-05-16 Thread Stanislaw Osinski
Hi Steve,

That explains everything, thanks! I somehow failed to locate that wiki page
and was looking at http://wiki.apache.org/solr/Website_Update_HOWTO instead.

S.

On Mon, May 16, 2011 at 15:25, Steven A Rowe  wrote:

> Hi Stanisław,
>
> You don't need to be logged into people.apache.org to update the website.
>
> Have you seen these instructions? The "unversioned website" section is
> what you want, I think:
>
> http://wiki.apache.org/lucene-java/HowToUpdateTheWebsite
>
> Steve
>
> From: stac...@gmail.com [mailto:stac...@gmail.com] On Behalf Of Stanislaw
> Osinski
> Sent: Monday, May 16, 2011 8:56 AM
> To: dev@lucene.apache.org; simon.willna...@gmail.com
> Cc: java-...@lucene.apache.org; java-comm...@lucene.apache.org
> Subject: Re: svn commit: r1103709 - in /lucene/java/site:
> docs/whoweare.html docs/whoweare.pdf
> src/documentation/content/xdocs/whoweare.xml
>
> Stanislaw, you are a full committer AFAIK?!
>
> I've been working mostly on the clustering plugin for now, so I'm not sure
> if it's right to move me to the core section right away :-)
>
> Incidentally, I tried to svn up on /www/lucene.apache.org/java/docs at
> people.apache.org to push the modifications live, but there is an SVN lock
> on that directory. Am I missing anything? I'm assuming that's the right
> directory for the committers list?
>
> S.
>

