Solr-trunk - Build # 1350 - Still Failing

2010-12-22 Thread Apache Hudson Server
Build: https://hudson.apache.org/hudson/job/Solr-trunk/1350/

1 tests failed.
FAILED:  
org.apache.solr.handler.clustering.DistributedClusteringComponentTest.testDistribSearch

Error Message:
Some threads threw uncaught exceptions!

Stack Trace:
junit.framework.AssertionFailedError: Some threads threw uncaught exceptions!
at 
org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1104)
at 
org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1042)
at 
org.apache.lucene.util.LuceneTestCase.tearDown(LuceneTestCase.java:499)
at org.apache.solr.SolrTestCaseJ4.tearDown(SolrTestCaseJ4.java:92)
at 
org.apache.solr.BaseDistributedSearchTestCase.tearDown(BaseDistributedSearchTestCase.java:144)




Build Log (for compile errors):
[...truncated 9643 lines...]



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (SOLR-2129) Provide a Solr module for dynamic metadata extraction/indexing with Apache UIMA

2010-12-22 Thread Tommaso Teofili (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12974114#action_12974114
 ] 

Tommaso Teofili commented on SOLR-2129:
---

Hi Kamil,
can you please take a look at your trunk/solr/contrib/uima: does the lib folder 
exist? Can you find the jars in there?
Let me know, and thanks for your feedback.

 Provide a Solr module for dynamic metadata extraction/indexing with Apache 
 UIMA
 ---

 Key: SOLR-2129
 URL: https://issues.apache.org/jira/browse/SOLR-2129
 Project: Solr
  Issue Type: New Feature
Reporter: Tommaso Teofili
Assignee: Robert Muir
 Attachments: lib-jars.zip, SOLR-2129-asf-headers.patch, 
 SOLR-2129-version2.patch, SOLR-2129-version3.patch, SOLR-2129.patch


 Provide components to enable Apache UIMA automatic metadata extraction to be 
 exploited when indexing documents.
 The purpose of this is to get unstructured information inside a document 
 and create structured metadata (as fields) to enrich each document.
 Basically this can be done with a custom UpdateRequestProcessor which 
 triggers UIMA while indexing documents.
 The basic UIMA implementation of UpdateRequestProcessor extracts sentences 
 (with a tokenizer and a hidden Markov model tagger), named entities, 
 language, suggested category, keywords and concepts (exploiting external 
 services from OpenCalais and AlchemyAPI). Such an implementation can be 
 easily extended by adding or selecting different UIMA analysis engines, 
 either from UIMA repositories on the web or created from scratch.
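
 In outline, such a processor hooks into processAdd(). A minimal sketch, 
 assuming hypothetical class, field and engine names (not the patch's actual 
 code):

   import java.io.IOException;
   import org.apache.solr.common.SolrInputDocument;
   import org.apache.solr.update.AddUpdateCommand;
   import org.apache.solr.update.processor.UpdateRequestProcessor;

   public class UIMAUpdateRequestProcessor extends UpdateRequestProcessor {
     public UIMAUpdateRequestProcessor(UpdateRequestProcessor next) {
       super(next);
     }

     @Override
     public void processAdd(AddUpdateCommand cmd) throws IOException {
       SolrInputDocument doc = cmd.getSolrInputDocument();
       String text = (String) doc.getFieldValue("text"); // illustrative field
       // run the configured UIMA analysis engines over 'text' here, then add
       // the extracted annotations (language, entities, concepts...) as fields:
       // doc.addField("language", detectedLanguage);
       super.processAdd(cmd); // pass the enriched document down the chain
     }
   }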

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Created: (LUCENE-2830) Use StringBuilder instead of StringBuffer in benchmark

2010-12-22 Thread Shai Erera (JIRA)
Use StringBuilder instead of StringBuffer in benchmark
--

 Key: LUCENE-2830
 URL: https://issues.apache.org/jira/browse/LUCENE-2830
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/benchmark
Reporter: Shai Erera
Assignee: Shai Erera
Priority: Minor
 Fix For: 3.1, 4.0


Minor change - use StringBuilder instead of StringBuffer in benchmark's code. 
We don't need the synchronization of StringBuffer in all the places that I've 
checked.

The only place where it _could_ be a problem is in HtmlParser's API - one 
method accepts a StringBuffer and it's an interface. But I think it's OK to 
change benchmark's API back-compat-wise, so I'd like to either change it to 
accept a String, or remove the method altogether -- no code in benchmark uses 
it, and anyone who needs it can pass a StringReader to the other method.
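
The change is mechanical; a sketch of its shape (title/body are placeholder 
variables, not benchmark's actual code):

  // before: StringBuffer synchronizes every append, though each instance
  // here is confined to a single thread
  StringBuffer buf = new StringBuffer();
  buf.append(title).append(' ').append(body);
  String joined = buf.toString();

  // after: identical API, no per-call lock overhead
  StringBuilder sb = new StringBuilder();
  sb.append(title).append(' ').append(body);
  String joined2 = sb.toString();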

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Lucene-Solr-tests-only-3.x - Build # 2822 - Still Failing

2010-12-22 Thread Apache Hudson Server
Build: https://hudson.apache.org/hudson/job/Lucene-Solr-tests-only-3.x/2822/

1 tests failed.
FAILED:  
org.apache.solr.handler.clustering.DistributedClusteringComponentTest.testDistribSearch

Error Message:
Some threads threw uncaught exceptions!

Stack Trace:
junit.framework.AssertionFailedError: Some threads threw uncaught exceptions!
at 
org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:950)
at 
org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:888)
at 
org.apache.lucene.util.LuceneTestCase.tearDown(LuceneTestCase.java:371)
at org.apache.solr.SolrTestCaseJ4.tearDown(SolrTestCaseJ4.java:78)
at 
org.apache.solr.BaseDistributedSearchTestCase.tearDown(BaseDistributedSearchTestCase.java:130)




Build Log (for compile errors):
[...truncated 10663 lines...]



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2830) Use StringBuilder instead of StringBuffer in benchmark

2010-12-22 Thread Shai Erera (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shai Erera updated LUCENE-2830:
---

Attachment: LUCENE-2830.patch

Patch replaces StringBuffer with StringBuilder. I did not yet remove the 
parse() method from HtmlParser - if people are ok with it, I'll remove it. For 
now, I changed the parameter to String.

All tests pass.

 Use StringBuilder instead of StringBuffer in benchmark
 --

 Key: LUCENE-2830
 URL: https://issues.apache.org/jira/browse/LUCENE-2830
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/benchmark
Reporter: Shai Erera
Assignee: Shai Erera
Priority: Minor
 Fix For: 3.1, 4.0

 Attachments: LUCENE-2830.patch


 Minor change - use StringBuilder instead of StringBuffer in benchmark's code. 
 We don't need the synchronization of StringBuffer in all the places that I've 
 checked.
 The only place where it _could_ be a problem is in HtmlParser's API - one 
 method accepts a StringBuffer and it's an interface. But I think it's OK to 
 change benchmark's API back-compat-wise, so I'd like to either change it to 
 accept a String, or remove the method altogether -- no code in benchmark 
 uses it, and anyone who needs it can pass a StringReader to the other method.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Lucene-Solr-tests-only-trunk - Build # 2849 - Still Failing

2010-12-22 Thread Apache Hudson Server
Build: https://hudson.apache.org/hudson/job/Lucene-Solr-tests-only-trunk/2849/

1 tests failed.
FAILED:  
org.apache.solr.handler.clustering.DistributedClusteringComponentTest.testDistribSearch

Error Message:
Some threads threw uncaught exceptions!

Stack Trace:
junit.framework.AssertionFailedError: Some threads threw uncaught exceptions!
at 
org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1104)
at 
org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1042)
at 
org.apache.lucene.util.LuceneTestCase.tearDown(LuceneTestCase.java:499)
at org.apache.solr.SolrTestCaseJ4.tearDown(SolrTestCaseJ4.java:92)
at 
org.apache.solr.BaseDistributedSearchTestCase.tearDown(BaseDistributedSearchTestCase.java:144)




Build Log (for compile errors):
[...truncated 9739 lines...]



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2694) MTQ rewrite + weight/scorer init should be single pass

2010-12-22 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12974121#action_12974121
 ] 

Simon Willnauer commented on LUCENE-2694:
-

{quote}
I think instead of ReaderView we could change Weight.scorer API so that instead 
of receiving IndexReader reader, it receives a struct that has parent reader, 
sub reader, ord of that sub?
It's easy to be back compat because we could just forward to prior scorer 
method with only the sub?
{quote}
Mike, I am not sure that helps us here. If you use this method you cannot 
disambiguate between the set of readers that were used to create the 
PerReaderTermState and the ones that have a certain ord assigned to them. 
Disambiguation would be more difficult if we do that. IMO sharing a ReaderView 
seems to be the best solution so far. I don't think we should bind it to an IR 
directly since users can easily build a ReaderView from a composite reader. 
Yet, for searching it would be nice to have a ReaderView on Searcher / 
IndexSearcher which can be triggered upon weight creation.
That way we can also disambiguate between PerReaderTermState instances given 
to the TermQuery ctor when we create the weight, so that if the view doesn't 
match we either create a new PerReaderTermState or just don't use it for this 
weight.

I thought about TermsEnum#ord() again. I don't think we should really add it 
back though. It's really an implementation detail, and folks that wanna use it 
should be aware of that and cast correctly. On the other hand, I don't like 
having seek(ord) in TermsEnum either if we remove #ord(). I think we should 
remove it from the interface entirely.

simon

 MTQ rewrite + weight/scorer init should be single pass
 --

 Key: LUCENE-2694
 URL: https://issues.apache.org/jira/browse/LUCENE-2694
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Reporter: Michael McCandless
Assignee: Simon Willnauer
 Fix For: 4.0

 Attachments: LUCENE-2694-FTE.patch, LUCENE-2694.patch, 
 LUCENE-2694.patch, LUCENE-2694.patch, LUCENE-2694.patch, LUCENE-2694.patch, 
 LUCENE-2694.patch


 Spinoff of LUCENE-2690 (see the hacked patch on that issue)...
 Once we fix MTQ rewrite to be per-segment, we should take it further and make 
 weight/scorer init also run in the same single pass as rewrite.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Lucene-Solr-tests-only-3.x - Build # 2823 - Still Failing

2010-12-22 Thread Apache Hudson Server
Build: https://hudson.apache.org/hudson/job/Lucene-Solr-tests-only-3.x/2823/

1 tests failed.
FAILED:  
org.apache.solr.handler.clustering.DistributedClusteringComponentTest.testDistribSearch

Error Message:
Some threads threw uncaught exceptions!

Stack Trace:
junit.framework.AssertionFailedError: Some threads threw uncaught exceptions!
at 
org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:950)
at 
org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:888)
at 
org.apache.lucene.util.LuceneTestCase.tearDown(LuceneTestCase.java:371)
at org.apache.solr.SolrTestCaseJ4.tearDown(SolrTestCaseJ4.java:78)
at 
org.apache.solr.BaseDistributedSearchTestCase.tearDown(BaseDistributedSearchTestCase.java:130)




Build Log (for compile errors):
[...truncated 10602 lines...]



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Lucene-Solr-tests-only-trunk - Build # 2850 - Still Failing

2010-12-22 Thread Apache Hudson Server
Build: https://hudson.apache.org/hudson/job/Lucene-Solr-tests-only-trunk/2850/

1 tests failed.
FAILED:  
org.apache.solr.handler.clustering.DistributedClusteringComponentTest.testDistribSearch

Error Message:
Some threads threw uncaught exceptions!

Stack Trace:
junit.framework.AssertionFailedError: Some threads threw uncaught exceptions!
at 
org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1104)
at 
org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1042)
at 
org.apache.lucene.util.LuceneTestCase.tearDown(LuceneTestCase.java:499)
at org.apache.solr.SolrTestCaseJ4.tearDown(SolrTestCaseJ4.java:92)
at 
org.apache.solr.BaseDistributedSearchTestCase.tearDown(BaseDistributedSearchTestCase.java:144)




Build Log (for compile errors):
[...truncated 9665 lines...]



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Lucene-Solr-tests-only-3.x - Build # 2824 - Still Failing

2010-12-22 Thread Apache Hudson Server
Build: https://hudson.apache.org/hudson/job/Lucene-Solr-tests-only-3.x/2824/

1 tests failed.
FAILED:  
org.apache.solr.handler.clustering.DistributedClusteringComponentTest.testDistribSearch

Error Message:
Some threads threw uncaught exceptions!

Stack Trace:
junit.framework.AssertionFailedError: Some threads threw uncaught exceptions!
at 
org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:950)
at 
org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:888)
at 
org.apache.lucene.util.LuceneTestCase.tearDown(LuceneTestCase.java:371)
at org.apache.solr.SolrTestCaseJ4.tearDown(SolrTestCaseJ4.java:78)
at 
org.apache.solr.BaseDistributedSearchTestCase.tearDown(BaseDistributedSearchTestCase.java:130)




Build Log (for compile errors):
[...truncated 10666 lines...]



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Lucene-Solr-tests-only-trunk - Build # 2851 - Still Failing

2010-12-22 Thread Apache Hudson Server
Build: https://hudson.apache.org/hudson/job/Lucene-Solr-tests-only-trunk/2851/

1 tests failed.
FAILED:  
org.apache.solr.handler.clustering.DistributedClusteringComponentTest.testDistribSearch

Error Message:
Some threads threw uncaught exceptions!

Stack Trace:
junit.framework.AssertionFailedError: Some threads threw uncaught exceptions!
at 
org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1104)
at 
org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1042)
at 
org.apache.lucene.util.LuceneTestCase.tearDown(LuceneTestCase.java:499)
at org.apache.solr.SolrTestCaseJ4.tearDown(SolrTestCaseJ4.java:92)
at 
org.apache.solr.BaseDistributedSearchTestCase.tearDown(BaseDistributedSearchTestCase.java:144)




Build Log (for compile errors):
[...truncated 9870 lines...]



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Lucene-Solr-tests-only-3.x - Build # 2825 - Still Failing

2010-12-22 Thread Apache Hudson Server
Build: https://hudson.apache.org/hudson/job/Lucene-Solr-tests-only-3.x/2825/

1 tests failed.
FAILED:  
org.apache.solr.handler.clustering.DistributedClusteringComponentTest.testDistribSearch

Error Message:
Some threads threw uncaught exceptions!

Stack Trace:
junit.framework.AssertionFailedError: Some threads threw uncaught exceptions!
at 
org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:950)
at 
org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:888)
at 
org.apache.lucene.util.LuceneTestCase.tearDown(LuceneTestCase.java:371)
at org.apache.solr.SolrTestCaseJ4.tearDown(SolrTestCaseJ4.java:78)
at 
org.apache.solr.BaseDistributedSearchTestCase.tearDown(BaseDistributedSearchTestCase.java:130)




Build Log (for compile errors):
[...truncated 10804 lines...]



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Lucene-Solr-tests-only-trunk - Build # 2852 - Still Failing

2010-12-22 Thread Apache Hudson Server
Build: https://hudson.apache.org/hudson/job/Lucene-Solr-tests-only-trunk/2852/

1 tests failed.
FAILED:  
org.apache.solr.handler.clustering.DistributedClusteringComponentTest.testDistribSearch

Error Message:
Some threads threw uncaught exceptions!

Stack Trace:
junit.framework.AssertionFailedError: Some threads threw uncaught exceptions!
at 
org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1104)
at 
org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1042)
at 
org.apache.lucene.util.LuceneTestCase.tearDown(LuceneTestCase.java:499)
at org.apache.solr.SolrTestCaseJ4.tearDown(SolrTestCaseJ4.java:92)
at 
org.apache.solr.BaseDistributedSearchTestCase.tearDown(BaseDistributedSearchTestCase.java:144)




Build Log (for compile errors):
[...truncated 9787 lines...]



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Lucene-Solr-tests-only-3.x - Build # 2826 - Still Failing

2010-12-22 Thread Apache Hudson Server
Build: https://hudson.apache.org/hudson/job/Lucene-Solr-tests-only-3.x/2826/

1 tests failed.
FAILED:  
org.apache.solr.handler.clustering.DistributedClusteringComponentTest.testDistribSearch

Error Message:
Some threads threw uncaught exceptions!

Stack Trace:
junit.framework.AssertionFailedError: Some threads threw uncaught exceptions!
at 
org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:950)
at 
org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:888)
at 
org.apache.lucene.util.LuceneTestCase.tearDown(LuceneTestCase.java:371)
at org.apache.solr.SolrTestCaseJ4.tearDown(SolrTestCaseJ4.java:78)
at 
org.apache.solr.BaseDistributedSearchTestCase.tearDown(BaseDistributedSearchTestCase.java:130)




Build Log (for compile errors):
[...truncated 10609 lines...]



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Lucene-Solr-tests-only-trunk - Build # 2853 - Still Failing

2010-12-22 Thread Apache Hudson Server
Build: https://hudson.apache.org/hudson/job/Lucene-Solr-tests-only-trunk/2853/

1 tests failed.
FAILED:  
org.apache.solr.handler.clustering.DistributedClusteringComponentTest.testDistribSearch

Error Message:
Some threads threw uncaught exceptions!

Stack Trace:
junit.framework.AssertionFailedError: Some threads threw uncaught exceptions!
at 
org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1104)
at 
org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1042)
at 
org.apache.lucene.util.LuceneTestCase.tearDown(LuceneTestCase.java:499)
at org.apache.solr.SolrTestCaseJ4.tearDown(SolrTestCaseJ4.java:92)
at 
org.apache.solr.BaseDistributedSearchTestCase.tearDown(BaseDistributedSearchTestCase.java:144)




Build Log (for compile errors):
[...truncated 9781 lines...]



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Lucene-Solr-tests-only-3.x - Build # 2827 - Still Failing

2010-12-22 Thread Apache Hudson Server
Build: https://hudson.apache.org/hudson/job/Lucene-Solr-tests-only-3.x/2827/

1 tests failed.
FAILED:  
org.apache.solr.handler.clustering.DistributedClusteringComponentTest.testDistribSearch

Error Message:
Some threads threw uncaught exceptions!

Stack Trace:
junit.framework.AssertionFailedError: Some threads threw uncaught exceptions!
at 
org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:950)
at 
org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:888)
at 
org.apache.lucene.util.LuceneTestCase.tearDown(LuceneTestCase.java:371)
at org.apache.solr.SolrTestCaseJ4.tearDown(SolrTestCaseJ4.java:78)
at 
org.apache.solr.BaseDistributedSearchTestCase.tearDown(BaseDistributedSearchTestCase.java:130)




Build Log (for compile errors):
[...truncated 10742 lines...]



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Lucene-Solr-tests-only-3.x - Build # 2828 - Still Failing

2010-12-22 Thread Apache Hudson Server
Build: https://hudson.apache.org/hudson/job/Lucene-Solr-tests-only-3.x/2828/

1 tests failed.
FAILED:  
org.apache.solr.handler.clustering.DistributedClusteringComponentTest.testDistribSearch

Error Message:
Some threads threw uncaught exceptions!

Stack Trace:
junit.framework.AssertionFailedError: Some threads threw uncaught exceptions!
at 
org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:950)
at 
org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:888)
at 
org.apache.lucene.util.LuceneTestCase.tearDown(LuceneTestCase.java:371)
at org.apache.solr.SolrTestCaseJ4.tearDown(SolrTestCaseJ4.java:78)
at 
org.apache.solr.BaseDistributedSearchTestCase.tearDown(BaseDistributedSearchTestCase.java:130)




Build Log (for compile errors):
[...truncated 10697 lines...]



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Lucene-Solr-tests-only-trunk - Build # 2855 - Failure

2010-12-22 Thread Apache Hudson Server
Build: https://hudson.apache.org/hudson/job/Lucene-Solr-tests-only-trunk/2855/

1 tests failed.
REGRESSION:  
org.apache.solr.handler.clustering.DistributedClusteringComponentTest.testDistribSearch

Error Message:
Some threads threw uncaught exceptions!

Stack Trace:
junit.framework.AssertionFailedError: Some threads threw uncaught exceptions!
at 
org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1104)
at 
org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1042)
at 
org.apache.lucene.util.LuceneTestCase.tearDown(LuceneTestCase.java:499)
at org.apache.solr.SolrTestCaseJ4.tearDown(SolrTestCaseJ4.java:92)
at 
org.apache.solr.BaseDistributedSearchTestCase.tearDown(BaseDistributedSearchTestCase.java:144)




Build Log (for compile errors):
[...truncated 9786 lines...]



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Lucene-Solr-tests-only-trunk - Build # 2839 - Still Failing

2010-12-22 Thread Robert Muir
On Wed, Dec 22, 2010 at 2:10 AM, Chris Hostetter
hossman_luc...@fucit.org wrote:

 : NOTE: reproduce with: ant test -Dtestcase=DistributedClusteringComponentTest
 : -Dtestmethod=testDistribSearch
 : -Dtests.seed=4959909076277587079:-8952133138041211916 -Dtests.multiplier=3
 :
 : But I couldn't reproduce it on my mac.

 It's failing consistently on both the trunk and 3x hudson jobs, for the
 past ~10 builds (as of right now) since you added the test, with a
 consistent SEVERE error in the logs -- i don't think it has anything to do
 with the random seed.

 I personally can't reproduce the failure on either trunk or 3x; regardless
 of whether i try to run just a single test, or all tests in parallel.


This test always fails on my computer too... it's not just Hudson.

I added an @Ignore until it can be resolved.

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: svn commit: r1051872 - /lucene/dev/trunk/solr/contrib/clustering/src/test/java/org/apache/solr/handler/clustering/DistributedClusteringComponentTest.java

2010-12-22 Thread Koji Sekiguchi
Thank you!

Koji Sekiguchi from mobile


On 2010/12/22, at 21:27, rm...@apache.org wrote:

 Author: rmuir
 Date: Wed Dec 22 12:27:06 2010
 New Revision: 1051872
 
 URL: http://svn.apache.org/viewvc?rev=1051872&view=rev
 Log:
 SOLR-2282: disable failing test
 
 Modified:

 lucene/dev/trunk/solr/contrib/clustering/src/test/java/org/apache/solr/handler/clustering/DistributedClusteringComponentTest.java
 
 Modified: 
 lucene/dev/trunk/solr/contrib/clustering/src/test/java/org/apache/solr/handler/clustering/DistributedClusteringComponentTest.java
 URL: 
 http://svn.apache.org/viewvc/lucene/dev/trunk/solr/contrib/clustering/src/test/java/org/apache/solr/handler/clustering/DistributedClusteringComponentTest.java?rev=1051872&r1=1051871&r2=1051872&view=diff
 ==
 --- 
 lucene/dev/trunk/solr/contrib/clustering/src/test/java/org/apache/solr/handler/clustering/DistributedClusteringComponentTest.java
  (original)
 +++ 
 lucene/dev/trunk/solr/contrib/clustering/src/test/java/org/apache/solr/handler/clustering/DistributedClusteringComponentTest.java
  Wed Dec 22 12:27:06 2010
 @@ -20,6 +20,9 @@ package org.apache.solr.handler.clusteri
 import org.apache.solr.BaseDistributedSearchTestCase;
 import org.apache.solr.common.params.CommonParams;
 
 +import org.junit.Ignore;
 +
 +@Ignore("FIXME: test fails on hudson")
 public class DistributedClusteringComponentTest extends
 BaseDistributedSearchTestCase {
 
 
 

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (SOLR-1526) Client Side Tika integration

2010-12-22 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SOLR-1526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12974176#action_12974176
 ] 

Jan Høydahl commented on SOLR-1526:
---

I linked this issue to SOLR-1763, as they attempt to solve the same thing, on 
the client vs. the server side.

Instead of creating two solutions, we should base these two on the same code 
base and config, so that it is easy to switch between them. Perhaps someone 
starts with server-side extraction but then wants to optimize performance by 
going client-side. The switch should be intuitive.

Thus, should we consider porting the whole UpdateProcessorChain to SolrJ? How 
cool would it be to choose whether to execute an UP on the client or server 
side simply by a configuration change? I realize that some UPs may depend on 
SolrCore or have other difficult dependencies, but it should be possible to 
work around that, no?

 Client Side Tika integration
 

 Key: SOLR-1526
 URL: https://issues.apache.org/jira/browse/SOLR-1526
 Project: Solr
  Issue Type: New Feature
  Components: clients - java
Reporter: Grant Ingersoll
Priority: Minor
 Fix For: Next


 Often times it is cost prohibitive to send full, rich documents over the 
 wire.  The contrib/extraction library has server side integration with Tika, 
 but it would be nice to have a client side implementation as well.  It should 
 support both metadata and content or just metadata.
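
 A client-side sketch of the idea, assuming Tika and SolrJ are on the 
 classpath; the field names and the solrServer variable are illustrative, 
 not an API of this issue:

   import java.io.FileInputStream;
   import java.io.InputStream;
   import org.apache.solr.common.SolrInputDocument;
   import org.apache.tika.metadata.Metadata;
   import org.apache.tika.parser.AutoDetectParser;
   import org.apache.tika.sax.BodyContentHandler;

   InputStream in = new FileInputStream("report.pdf");
   Metadata metadata = new Metadata();
   BodyContentHandler handler = new BodyContentHandler();
   new AutoDetectParser().parse(in, handler, metadata); // extract on the client

   SolrInputDocument doc = new SolrInputDocument();
   doc.addField("id", "report.pdf");
   doc.addField("content_type", metadata.get(Metadata.CONTENT_TYPE));
   doc.addField("content", handler.toString()); // send only the extracted text
   solrServer.add(doc); // SolrJ server instance created elsewhere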

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Created: (SOLR-2293) SolrCloud distributed indexing

2010-12-22 Thread JIRA
SolrCloud distributed indexing
--

 Key: SOLR-2293
 URL: https://issues.apache.org/jira/browse/SOLR-2293
 Project: Solr
  Issue Type: New Feature
  Components: SolrCloud
Reporter: Jan Høydahl


Add SolrCloud support for distributed indexing, as described in 
http://wiki.apache.org/solr/DistributedSearch#Distributed_Indexing and the 
"Support user specified partitioning" paragraph of 
http://wiki.apache.org/solr/SolrCloud#High_level_design_goals

Currently, the client needs to decide which shard indexer to talk to for each 
document. Common partitioning strategies include hash-based, date-based and 
custom.

Solr should have the capability of accepting a document update on any of the 
nodes in a cluster, and perform partitioning and distribution of updates to 
the correct shard, based on the current ZK config. The ShardDistributionPolicy 
should be pluggable, with the most common strategies provided out of the box.
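
A hypothetical shape for the pluggable policy (all names here are assumptions, 
nothing is committed code):

  import org.apache.solr.common.SolrInputDocument;

  public interface ShardDistributionPolicy {
    /** Pick which shard should index this document. */
    int shardFor(SolrInputDocument doc, int numShards);
  }

  /** Hash-based partitioning on the unique key, one out-of-the-box strategy. */
  public class HashShardDistributionPolicy implements ShardDistributionPolicy {
    public int shardFor(SolrInputDocument doc, int numShards) {
      Object id = doc.getFieldValue("id"); // assumes "id" is the unique key
      return (id.hashCode() & 0x7fffffff) % numShards; // non-negative bucket
    }
  }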

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2829) improve termquery pk lookup performance

2010-12-22 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12974212#action_12974212
 ] 

Yonik Seeley commented on LUCENE-2829:
--

Why not keep the TermState cache and use it for all queries except MTQ, while 
using a different mechanism for MTQ to avoid trashing the cache?

The cache has a number of advantages that may never be duplicated in a 
different type of API, including
- actually cache frequently used terms across different requests
- cache terms reused in the same request.  term proximity boosting is an 
example:   +united +states "united states"^10

 improve termquery pk lookup performance
 -

 Key: LUCENE-2829
 URL: https://issues.apache.org/jira/browse/LUCENE-2829
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Reporter: Robert Muir
 Attachments: LUCENE-2829.patch


 For things that are like primary keys and don't exist in some segments (worst 
 case is primary/unique key that only exists in 1)
 we do wasted seeks.
 While LUCENE-2694 tries to solve some of this issue with TermState, I'm 
 concerned whether we could ever backport that to 3.1, for example.
 This is a simpler solution here just to solve this one problem in 
 termquery... we could just revert it in trunk when we resolve LUCENE-2694,
 but I don't think we should leave things as they are in 3.x

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Issue Comment Edited: (LUCENE-2829) improve termquery pk lookup performance

2010-12-22 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12974212#action_12974212
 ] 

Yonik Seeley edited comment on LUCENE-2829 at 12/22/10 9:24 AM:


Why not keep the TermState cache and use it for all queries except MTQ, while 
using a different mechanism for MTQ to avoid trashing the cache?

The cache has a number of advantages that may never be duplicated in a 
different type of API, including
- actually cache frequently used terms across different requests
- cache terms reused in the same request.  term proximity boosting is an 
example:   +united +states "united states"^10

edit: and as robert previously pointed out, if we cached misses as well, then 
we could avoid needless seeks on segments that don't contain the term.

  was (Author: ysee...@gmail.com):
Why not keep the TermState cache and use it for all queries except MTQ, 
while using a different mechanism for MTQ to avoid trashing the cache?

The cache has a number of advantages that may never be duplicated in a 
different type of API, including
- actually cache frequently used terms across different requests
- cache terms reused in the same request.  term proximity boosting is an 
example:   +united +states "united states"^10
  
 improve termquery pk lookup performance
 -

 Key: LUCENE-2829
 URL: https://issues.apache.org/jira/browse/LUCENE-2829
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Reporter: Robert Muir
 Attachments: LUCENE-2829.patch


 For things that are like primary keys and don't exist in some segments (worst 
 case is primary/unique key that only exists in 1)
 we do wasted seeks.
 While LUCENE-2694 tries to solve some of this issue with TermState, I'm 
 concerned whether we could ever backport that to 3.1, for example.
 This is a simpler solution here just to solve this one problem in 
 termquery... we could just revert it in trunk when we resolve LUCENE-2694,
 but I don't think we should leave things as they are in 3.x

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2723) Speed up Lucene's low level bulk postings read API

2010-12-22 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12974221#action_12974221
 ] 

Yonik Seeley commented on LUCENE-2723:
--

Should we keep MultiBulkPostingsEnum?
Even when someone writes their code to work per-segment, not all IndexReader 
implementations may be able to provide segment-level readers.  ParallelReader 
is one that can't currently?

 Speed up Lucene's low level bulk postings read API
 --

 Key: LUCENE-2723
 URL: https://issues.apache.org/jira/browse/LUCENE-2723
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Michael McCandless
Assignee: Michael McCandless
 Fix For: 4.0

 Attachments: LUCENE-2723-termscorer.patch, 
 LUCENE-2723-termscorer.patch, LUCENE-2723-termscorer.patch, 
 LUCENE-2723.patch, LUCENE-2723.patch, LUCENE-2723.patch, LUCENE-2723.patch, 
 LUCENE-2723.patch, LUCENE-2723_bulkvint.patch, LUCENE-2723_facetPerSeg.patch, 
 LUCENE-2723_facetPerSeg.patch, LUCENE-2723_openEnum.patch, 
 LUCENE-2723_termscorer.patch, LUCENE-2723_wastedint.patch


 Spinoff from LUCENE-1410.
 The flex DocsEnum has a simple bulk-read API that reads the next chunk
 of docs/freqs.  But it's a poor fit for intblock codecs like FOR/PFOR
 (from LUCENE-1410).  This is not unlike sucking coffee through those
 tiny plastic coffee stirrers they hand out on airplanes that,
 surprisingly, also happen to function as a straw.
 As a result we see no perf gain from using FOR/PFOR.
 I had hacked up a fix for this, described in my blog post at
 http://chbits.blogspot.com/2010/08/lucene-performance-with-pfordelta-codec.html
 I'm opening this issue to get that work to a committable point.
 So... I've worked out a new bulk-read API to address the performance
 bottleneck.  It has some big changes over the current bulk-read API:
   * You can now also bulk-read positions (but not payloads), but I
  have yet to cut over positional queries.
   * The buffer contains doc deltas, not absolute values, for docIDs
 and positions (freqs are absolute).
   * Deleted docs are not filtered out.
   * The doc & freq buffers need not be aligned.  For fixed intblock
 codecs (FOR/PFOR) they will be, but for varint codecs (Simple9/16,
 Group varint, etc.) they won't be.
 It's still a work in progress...
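
 For example, a consumer of such an API would rebuild absolute docIDs itself; 
 a sketch based only on the buffer layout described above (docDeltas, count 
 and collect() are placeholders):

   int doc = 0;
   for (int i = 0; i < count; i++) {
     doc += docDeltas[i]; // the buffer holds deltas, not absolute docIDs
     // deleted docs are NOT filtered out by the API -- the consumer must
     // check them against the segment's deleted-docs bits itself
     collect(doc);
   }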

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2829) improve termquery pk lookup performance

2010-12-22 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12974223#action_12974223
 ] 

Robert Muir commented on LUCENE-2829:
-

bq. edit: and as robert previously pointed out, if we cached misses as well, 
then we could avoid needless seeks on segments that don't contain the term.

True, this is a good idea, just a little trickier:
* In trunk, we have TermsEnum.seek(BytesRef text, boolean useCache), 
defaulting to true.
* FilteredTermsEnum passes false here, so the multitermqueries don't populate 
the cache with garbage while enumerating (eg foo*), only explicitly at the end 
with cacheTerm() (per-segment) for the ones that were actually accepted. They 
sum up their docFreq themselves to prevent the first wasted seek in TermQuery.
* So this solution would make MTQ worse, as it would cause them to trash the 
caches in the second wasted seek (the docsenum) where they do not today, with 
negative entries for the segments where the term doesn't exist. Today they do 
this wasted seek, but they don't trash the cache here. The only solution to 
prevent that is the PerReaderTermState (or something equally complicated).
* We would have to look at other places where negative entries would hurt; for 
example, rebuilding spellcheck indexes uses this 'termExists()' method 
implemented with docFreq. So we would likely have to change spellcheck's code 
to use a TermsEnum and seek(term, false)... using a termsenum in parallel with 
the spellcheck dictionary would obviously be more efficient for the 
index-based spellcheck case (forget about caching) versus docFreq()'ing every 
term... *but* we cannot assume the spellcheck Dictionary is actually in term 
order (imagine the File-based dictionary case), so we can't implement this 
today.

On 3.x I think it's slightly less complicated, as there is already a hack in 
the cache to prevent sequential termsenums from trashing it (e.g. foo*), and 
pretty much all the MTQs just enumerate sequentially anyway... (except NRQ, 
which doesn't enum many terms anyway, likely not a problem).

But we would have to at least fix the spellcheck case there too, I think.

Not saying I don't like your idea... just saying there's more work to do it.
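
A sketch of the seek flavor described in the bullets above, using only the 
calls named there (trunk-era API; the surrounding variables are placeholders):

  // enumerate without polluting the terms-dictionary cache
  TermsEnum termsEnum = terms.iterator();
  if (termsEnum.seek(text, false) == TermsEnum.SeekStatus.FOUND) { // useCache=false
    totalDocFreq += termsEnum.docFreq(); // sum docFreq up front, as the MTQs do
  }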


 improve termquery pk lookup performance
 -

 Key: LUCENE-2829
 URL: https://issues.apache.org/jira/browse/LUCENE-2829
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Reporter: Robert Muir
 Attachments: LUCENE-2829.patch


 For things that are like primary keys and don't exist in some segments (worst 
 case is primary/unique key that only exists in 1)
 we do wasted seeks.
 While LUCENE-2694 tries to solve some of this issue with TermState, I'm 
 concerned whether we could ever backport that to 3.1, for example.
 This is a simpler solution here just to solve this one problem in 
 termquery... we could just revert it in trunk when we resolve LUCENE-2694,
 but I don't think we should leave things as they are in 3.x

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (SOLR-2290) the termsInfosDivisor for readers opened by indexWriter should be configurable in Solr

2010-12-22 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12974224#action_12974224
 ] 

Jason Rutherglen commented on SOLR-2290:


I think it'll require creating a new sub-element of mainIndex and indexDefaults 
called perhaps indexWriterConfig?  Because attributes such as unlockOnStartup 
and reopenReaders cannot be injected in, and we probably don't want to mix 
injected properties with non-injected properties?
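
On the Lucene side, the 3.x knob the issue description below points at looks 
roughly like this (a sketch; the divisor value, analyzer and dir are 
illustrative):

  import org.apache.lucene.index.IndexWriter;
  import org.apache.lucene.index.IndexWriterConfig;
  import org.apache.lucene.util.Version;

  IndexWriterConfig conf = new IndexWriterConfig(Version.LUCENE_31, analyzer);
  conf.setReaderTermsIndexDivisor(4); // readers opened by IW load 1/4 of the
                                      // terms index, trading seek speed for RAM
  IndexWriter writer = new IndexWriter(dir, conf);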

 the termsInfosDivisor for readers opened by indexWriter should be 
 configurable in Solr
 --

 Key: SOLR-2290
 URL: https://issues.apache.org/jira/browse/SOLR-2290
 Project: Solr
  Issue Type: New Feature
Reporter: Tom Burton-West
Priority: Minor

 Solr allows users to set the termInfosIndexDivisor used by the  indexReader 
 during search time  in solrconfig.xml, but not in the  indexReader opened by 
 the IndexWriter when indexing/merging.
 When dealing with an index with a large number of unique terms, setting the 
 termInfosIndexDivisor at search time is helpful in  reducing memory use.  It 
 would also be helpful in reducing memory use during indexing/merging if it 
 was made configurable for indexReaders opened by indexWriter during 
 indexing/merging.
 This thread contains some background:
 http://www.lucidimagination.com/search/document/b5c756a366e1a0d6/memory_use_during_merges_oom
 In the Lucene 3.x branch it looks like this is done in 
 IndexWriterConfig.setReaderTermsIndexDivisor, although there is also this 
 method signature in IndexWriter.java: IndexReader getReader(int 
 termInfosIndexDivisor)
   

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2829) improve termquery pk lookup performance

2010-12-22 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12974229#action_12974229
 ] 

Robert Muir commented on LUCENE-2829:
-

On further thought Yonik, your idea is really completely unrelated.

We shouldn't be seeking to terms/relying upon the terms dictionary cache 
internally when we don't need to...

Whether or not it's populated with negative entries for the more general case 
is unrelated; even if we go that route, we shouldn't be lazy and rely upon 
that.


 improve termquery pk lookup performance
 -

 Key: LUCENE-2829
 URL: https://issues.apache.org/jira/browse/LUCENE-2829
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Reporter: Robert Muir
 Attachments: LUCENE-2829.patch


 For things that are like primary keys and don't exist in some segments (worst 
 case is primary/unique key that only exists in 1)
 we do wasted seeks.
 While LUCENE-2694 tries to solve some of this issue with TermState, I'm 
 concerned whether we could ever backport that to 3.1, for example.
 This is a simpler solution here just to solve this one problem in 
 termquery... we could just revert it in trunk when we resolve LUCENE-2694,
 but I don't think we should leave things as they are in 3.x

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Updated: (SOLR-2275) Spaces around mm parameter in dismax configuration cause NumberFormatException

2010-12-22 Thread Erick Erickson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Erick Erickson updated SOLR-2275:
-

Attachment: SOLR-2275-3_x.patch

Hoss: 

Thanks for committing the trunk version; here's the patch for the current 
(22-Dec) 3_x branch. It's ready to apply as far as I can tell. All tests pass.

 Spaces around mm parameter in dismax configuration cause NumberFormatException
 --

 Key: SOLR-2275
 URL: https://issues.apache.org/jira/browse/SOLR-2275
 Project: Solr
  Issue Type: Bug
  Components: Schema and Analysis
Affects Versions: Next
Reporter: Erick Erickson
Assignee: Erick Erickson
Priority: Minor
 Fix For: 4.0

 Attachments: SOLR-2275-3_x.patch, SOLR-2275.patch

   Original Estimate: 2h
  Remaining Estimate: 2h

 Any whitespace around simple mm parameters in the configuration file produces 
 a NumberFormatException at SolrPluginUtils.java:625. E.g. <str> 2 </str>. 
 Adding whitespace in tests also causes this error to occur.
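
 The essence of a fix is simply to tolerate the surrounding whitespace before 
 parsing; a sketch (params is a placeholder for wherever mm is read from):

   // mm may arrive as " 2 " when the config has <str name="mm"> 2 </str>
   String mm = params.get("mm");
   int minShouldMatch = Integer.parseInt(mm.trim()); // trim() avoids the NFE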

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2312) Search on IndexWriter's RAM Buffer

2010-12-22 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12974247#action_12974247
 ] 

Jason Rutherglen commented on LUCENE-2312:
--

bq. Obtaining a normal fieldcache entry should work the same on an RT reader as 
any other reader

Yes.  I'm still confused as to how DocValues fits into all of this.  

bq. TOVC should continue to work as it does today

It should, otherwise there'll be performance considerations.  The main proposal 
here is incrementally updating FC values and how to continue to use 
DocTermsIndex for non-RT readers mixed with DocTerms for RT readers, either in 
TOVC or somewhere else.

 Search on IndexWriter's RAM Buffer
 --

 Key: LUCENE-2312
 URL: https://issues.apache.org/jira/browse/LUCENE-2312
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Search
Affects Versions: Realtime Branch
Reporter: Jason Rutherglen
Assignee: Michael Busch
 Fix For: Realtime Branch

 Attachments: LUCENE-2312-FC.patch, LUCENE-2312.patch


 In order to offer users near-realtime search, without incurring
 an indexing performance penalty, we can implement search on
 IndexWriter's RAM buffer. This is the buffer that is filled in
 RAM as documents are indexed. Currently the RAM buffer is
 flushed to the underlying directory (usually disk) before being
 made searchable. 
 Today's Lucene-based NRT systems must incur the cost of merging
 segments, which can slow indexing. 
 Michael Busch has good suggestions regarding how to handle deletes using max 
 doc ids.  
 https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841923&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841923
 The area that isn't fully fleshed out is the terms dictionary,
 which needs to be sorted prior to queries executing. Currently
 IW implements a specialized hash table. Michael B has a
 suggestion here: 
 https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841915&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841915

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: [jira] Commented: (LUCENENET-385) Searching string with Special character not working

2010-12-22 Thread Peter Mateja
It looks like you're going to have to build from source:

https://svn.apache.org/repos/asf/lucene/lucene.net/tags/Lucene.Net_2_9_2/
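
For example, checking out that tag:

  svn checkout https://svn.apache.org/repos/asf/lucene/lucene.net/tags/Lucene.Net_2_9_2/ Lucene.Net_2_9_2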

Peter Mateja
peter.mat...@gmail.com



On Wed, Dec 22, 2010 at 4:22 AM, Abhilash C R (JIRA) j...@apache.org wrote:


[
 https://issues.apache.org/jira/browse/LUCENENET-385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12974136#action_12974136]

 Abhilash C R commented on LUCENENET-385:
 

 Hi,
 How can I download Lucene.Net 2.9.2?
 I couldn't do it from the website.
 Please guide me.
 thanks
 Abhilash




  Searching string with Special character not working
  ---
 
  Key: LUCENENET-385
  URL: https://issues.apache.org/jira/browse/LUCENENET-385
  Project: Lucene.Net
   Issue Type: Task
  Environment: .NET Framework 2.0+, C#.NET, ASP.NET, Webservices
 Reporter: Abhilash C R
 
  I have come across an issue with the search option in our application, which
 uses Lucene.Net version 2.0.
  The scenario is: if I try to search the text Test&Test (it is actually
 Test&Test.doc that is being searched for), it returns 0 hits. While
 debugging I could see that the line written to parse the query is causing
 the problem.
  Here is the error line code:
  Query q = null;
  q = new global::Lucene.Net.QueryParsers.QueryParser("content", new
 StandardAnalyzer()).Parse(query);
   The variable query at the above point contains this:
  (title:(Test&Test) shorttitle:(Test&Test) content:(Test&Test)
 keywords:(Test&Test) description:(Test&Test) )
  and q ends up as this:
  title:"test test" shorttitle:"test test" content:"test test"
 keywords:"test test" description:"test test"
  And hence the hit length will be 0 at
  IndexSearcher searcher = new IndexSearcher(indexPath);
  Hits hits = searcher.Search(q);
  I tried adding \ before &, tried escaping, and tried enclosing the text in
 quotes, but all produce the same outcome.
  Could anyone please help me with a fix for it?
  If required I can post the full code here.
  Hope to hear from Lucene.Net.
  Many thanks
  Abhilash

 --
 This message is automatically generated by JIRA.
 -
 You can reply to this email to add a comment to the issue online.




[jira] Commented: (LUCENE-2312) Search on IndexWriter's RAM Buffer

2010-12-22 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12974260#action_12974260
 ] 

Simon Willnauer commented on LUCENE-2312:
-

bq. Yes. I'm still confused as to how DocValues fits into all of this.
DocValues == column stride fields 

does that help ?

simon

 Search on IndexWriter's RAM Buffer
 --

 Key: LUCENE-2312
 URL: https://issues.apache.org/jira/browse/LUCENE-2312
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Search
Affects Versions: Realtime Branch
Reporter: Jason Rutherglen
Assignee: Michael Busch
 Fix For: Realtime Branch

 Attachments: LUCENE-2312-FC.patch, LUCENE-2312.patch


 In order to offer users near-realtime search, without incurring
 an indexing performance penalty, we can implement search on
 IndexWriter's RAM buffer. This is the buffer that is filled in
 RAM as documents are indexed. Currently the RAM buffer is
 flushed to the underlying directory (usually disk) before being
 made searchable. 
 Today's Lucene-based NRT systems must incur the cost of merging
 segments, which can slow indexing. 
 Michael Busch has good suggestions regarding how to handle deletes using max 
 doc ids.  
 https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841923&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841923
 The area that isn't fully fleshed out is the terms dictionary,
 which needs to be sorted prior to queries executing. Currently
 IW implements a specialized hash table. Michael B has a
 suggestion here: 
 https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841915&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841915

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: strange problem of PForDelta decoder

2010-12-22 Thread Michael McCandless
Those are nice speedups!

Did you use the 4.0 branch (ie trunk) or the bulkpostings branch for this test?

Mike

On Tue, Dec 21, 2010 at 9:59 PM, Li Li fancye...@gmail.com wrote:
 great improvement!
 I did a test on our data set. Doc count is about 2M+, and index size
 after optimization is about 13.3GB (including fdt).
 It seems lucene4's index format is better than lucene2.9.3, and PFor
 gives good results.
 Besides BlockEncoder for frq and pos, is there any other modification
 for lucene 4?

  decoder \ avg time          single word (ms)   and query (ms)   or query (ms)
  VINT in lucene 2.9                11.2              36.5             38.6
  VINT in lucene 4 branch           10.6              26.5             35.4
  PFor in lucene 4 branch            8.1              22.5             30.7
 2010/12/21 Li Li fancye...@gmail.com:
 OK we should have a look at that one still.  We need to converge on a
 good default codec for 4.0.  Fortunately it's trivial to take any int
 block encoder (fixed or variable block) and make a Lucene codec out of
 it!

 I suggest you not use this one; I fixed dozens of bugs but it
 still failed with random tests. Its code is hand-coded rather
 than generated by a program. But we may learn something from it.


 -
 To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: dev-h...@lucene.apache.org



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2829) improve termquery pk lookup performance

2010-12-22 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12974262#action_12974262
 ] 

Michael McCandless commented on LUCENE-2829:


bq. The cache has a number of advantages that may never be duplicated in a 
different type of API

+1 -- I agree we should keep the TermState cache.  It has benefits outside of 
re-use within a single query.

But allowing term-lookup-intensive clients like MTQ  to do their own caching 
(ie pulling the TermState from the enum) is also important.  I think we need 
both.

On caching misses... that makes me nervous.  If there are apps out there that 
do a lot of checking for terms that don't exist, that can destroy the cache.

The cache is a great safety net but I think our core queries should be good 
consumers, when possible, and hold their own TermState.

 improve termquery pk lookup performance
 -

 Key: LUCENE-2829
 URL: https://issues.apache.org/jira/browse/LUCENE-2829
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Reporter: Robert Muir
 Attachments: LUCENE-2829.patch


 For things that are like primary keys and don't exist in some segments (worst 
 case is primary/unique key that only exists in 1)
 we do wasted seeks.
 While LUCENE-2694 tries to solve some of this issue with TermState, I'm 
 concerned whether we could ever backport that to 3.1, for example.
 This is a simpler solution here just to solve this one problem in 
 termquery... we could just revert it in trunk when we resolve LUCENE-2694,
 but I don't think we should leave things as they are in 3.x

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2312) Search on IndexWriter's RAM Buffer

2010-12-22 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12974264#action_12974264
 ] 

Jason Rutherglen commented on LUCENE-2312:
--

bq. DocValues == column stride fields 

Ok, that makes sense! 

I'm going to leave this alone for now, however I agree that ideally we'd leave
TOVC alone and at a higher level intermix the ord and non-ord doc terms. It's
hard to immediately determine how that'd work given the slot concept, which
seems to be an ord or value per reader that's directly comparable? Is there an
example of mixing multiple comparators for a given field?

 Search on IndexWriter's RAM Buffer
 --

 Key: LUCENE-2312
 URL: https://issues.apache.org/jira/browse/LUCENE-2312
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Search
Affects Versions: Realtime Branch
Reporter: Jason Rutherglen
Assignee: Michael Busch
 Fix For: Realtime Branch

 Attachments: LUCENE-2312-FC.patch, LUCENE-2312.patch


 In order to offer users near-realtime search, without incurring
 an indexing performance penalty, we can implement search on
 IndexWriter's RAM buffer. This is the buffer that is filled in
 RAM as documents are indexed. Currently the RAM buffer is
 flushed to the underlying directory (usually disk) before being
 made searchable. 
 Today's Lucene-based NRT systems must incur the cost of merging
 segments, which can slow indexing. 
 Michael Busch has good suggestions regarding how to handle deletes using max 
 doc ids.  
 https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841923&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841923
 The area that isn't fully fleshed out is the terms dictionary,
 which needs to be sorted prior to queries executing. Currently
 IW implements a specialized hash table. Michael B has a
 suggestion here: 
 https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841915&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841915

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2694) MTQ rewrite + weight/scorer init should be single pass

2010-12-22 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12974265#action_12974265
 ] 

Michael McCandless commented on LUCENE-2694:


bq. after all I think this must be done in a different issue though

+1

If, when we now pass a naked IndexReader (eg to Weight.scorer, Weight.explain, 
Filter.getDocIdSet), we replace that with a ReaderContext which has the reader, 
its parent, and its ord, then this precursor makes both TermState (this issue) 
and the awesome PK speedup (LUCENE-2829) much simpler.  And I agree we should 
break it out as its own issue: it's a rote API cutover -- we are passing a 
struct instead of a naked reader, but otherwise no change.
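
For illustration, the context could be as small as this (just a sketch -- the 
ReaderContext/ord names come from the discussion above; everything else is my 
assumption, not a committed API):

{code}
import org.apache.lucene.index.IndexReader;

// Sketch: a reader plus enough context to know where it sits in the tree.
public final class ReaderContext {
  public final IndexReader reader;   // the (sub-)reader itself
  public final ReaderContext parent; // null for the top-level reader
  public final int ord;              // position among the parent's subreaders

  public ReaderContext(IndexReader reader, ReaderContext parent, int ord) {
    this.reader = reader;
    this.parent = parent;
    this.ord = ord;
  }

  public boolean isTopLevel() {
    return parent == null;
  }
}
{code}

A Weight or Filter handed such a struct knows exactly which slice of the 
top-level searcher it is working on.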

This also lets us solve cases where the Filter needs the full context, eg 
LUCENE-2348.

Also, with this I think we should sharpen in the jdocs that when you call 
Query.rewrite, the returned query must be searched only against the same reader 
you rewrote against.  Similarly when you create a Weight, it should only be 
used against the same Searcher used to create it from a Query.

 MTQ rewrite + weight/scorer init should be single pass
 --

 Key: LUCENE-2694
 URL: https://issues.apache.org/jira/browse/LUCENE-2694
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Reporter: Michael McCandless
Assignee: Simon Willnauer
 Fix For: 4.0

 Attachments: LUCENE-2694-FTE.patch, LUCENE-2694.patch, 
 LUCENE-2694.patch, LUCENE-2694.patch, LUCENE-2694.patch, LUCENE-2694.patch, 
 LUCENE-2694.patch


 Spinoff of LUCENE-2690 (see the hacked patch on that issue)...
 Once we fix MTQ rewrite to be per-segment, we should take it further and make 
 weight/scorer init also run in the same single pass as rewrite.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2723) Speed up Lucene's low level bulk postings read API

2010-12-22 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12974272#action_12974272
 ] 

Michael McCandless commented on LUCENE-2723:


bq. Should we keep MultiBulkPostingsEnum?

I think we have to keep it.  EG if someone makes a SlowMultiReaderWrapper and 
then runs searches on it...

bq. ParallelReader is one that can't currently?

ParallelReader is a tricky one.

If your ParallelReader only contains SegmentReaders (and eg you make a 
MultiReader on top), then everything's great, because ParallelReader dispatches 
by field to a unique SegmentReader.

But if instead you make a ParallelReader whose child readers are themselves 
MultiReaders, then, yes it's basically the same as wrapping all of these subs 
in a SlowMultiReaderWrapper.

 Speed up Lucene's low level bulk postings read API
 --

 Key: LUCENE-2723
 URL: https://issues.apache.org/jira/browse/LUCENE-2723
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Michael McCandless
Assignee: Michael McCandless
 Fix For: 4.0

 Attachments: LUCENE-2723-termscorer.patch, 
 LUCENE-2723-termscorer.patch, LUCENE-2723-termscorer.patch, 
 LUCENE-2723.patch, LUCENE-2723.patch, LUCENE-2723.patch, LUCENE-2723.patch, 
 LUCENE-2723.patch, LUCENE-2723_bulkvint.patch, LUCENE-2723_facetPerSeg.patch, 
 LUCENE-2723_facetPerSeg.patch, LUCENE-2723_openEnum.patch, 
 LUCENE-2723_termscorer.patch, LUCENE-2723_wastedint.patch


 Spinoff from LUCENE-1410.
 The flex DocsEnum has a simple bulk-read API that reads the next chunk
 of docs/freqs.  But it's a poor fit for intblock codecs like FOR/PFOR
 (from LUCENE-1410).  This is not unlike sucking coffee through those
 tiny plastic coffee stirrers they hand out on airplanes that,
 surprisingly, also happen to function as a straw.
 As a result we see no perf gain from using FOR/PFOR.
 I had hacked up a fix for this, described at in my blog post at
 http://chbits.blogspot.com/2010/08/lucene-performance-with-pfordelta-codec.html
 I'm opening this issue to get that work to a committable point.
 So... I've worked out a new bulk-read API to address this performance
 bottleneck.  It has some big changes over the current bulk-read API:
   * You can now also bulk-read positions (but not payloads), but, I
  have yet to cutover positional queries.
   * The buffer contains doc deltas, not absolute values, for docIDs
 and positions (freqs are absolute).
   * Deleted docs are not filtered out.
   * The doc & freq buffers need not be aligned.  For fixed intblock
 codecs (FOR/PFOR) they will be, but for varint codecs (Simple9/16,
 Group varint, etc.) they won't be.
 It's still a work in progress...
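
 For a feel of what a consumer of these buffers has to do, something like
 this (a sketch -- docDeltas/freqs/deleted are placeholder names, not the
 actual API):

 import java.util.BitSet;

 public class BulkConsumerSketch {
   // Re-accumulate absolute docIDs from deltas and apply deletes ourselves,
   // since the enum no longer does either of those things for us.
   static void collect(int[] docDeltas, int[] freqs, int count, BitSet deleted) {
     int doc = 0;
     for (int i = 0; i < count; i++) {
       doc += docDeltas[i];                      // buffer holds deltas
       if (deleted != null && deleted.get(doc)) {
         continue;                               // deletes are not pre-filtered
       }
       System.out.println("doc=" + doc + " freq=" + freqs[i]);
     }
   }
 }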

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2829) improve termquery pk lookup performance

2010-12-22 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12974274#action_12974274
 ] 

Earwin Burrfoot commented on LUCENE-2829:
-

Term lookup misses can be alleviated by a simple Bloom Filter.
No caching misses required, helps both PK and near-PK queries.
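
For concreteness, the filter itself can be tiny (a from-scratch sketch, not a 
Lucene class -- the sizing and hashing here are just assumptions):

{code}
// Build one per segment from its terms; a negative answer is exact, so a
// miss lets TermQuery skip the term-dictionary seek for that segment.
final class TermBloomFilter {
  private final long[] bits;
  private final int mask; // numBits must be a power of two (and >= 64)

  TermBloomFilter(int numBits) {
    bits = new long[numBits >>> 6];
    mask = numBits - 1;
  }

  void add(String term) {
    int h = mix(term.hashCode());
    set(h & mask);
    set((h >>> 13) & mask); // second probe from different bits
  }

  /** false = term definitely absent; true = maybe present, must seek. */
  boolean mayContain(String term) {
    int h = mix(term.hashCode());
    return get(h & mask) && get((h >>> 13) & mask);
  }

  private static int mix(int h) {
    h ^= (h >>> 20) ^ (h >>> 12);
    return h ^ (h >>> 7) ^ (h >>> 4);
  }

  private void set(int i) { bits[i >>> 6] |= 1L << (i & 63); }
  private boolean get(int i) { return (bits[i >>> 6] & (1L << (i & 63))) != 0; }
}
{code}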

 improve termquery pk lookup performance
 -

 Key: LUCENE-2829
 URL: https://issues.apache.org/jira/browse/LUCENE-2829
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Reporter: Robert Muir
 Attachments: LUCENE-2829.patch


 For things that are like primary keys and don't exist in some segments (worst 
 case is a primary/unique key that only exists in one segment)
 we do wasted seeks.
 While LUCENE-2694 tries to solve some of this issue with TermState, I'm 
 concerned whether we could ever backport that to 3.1, for example.
 This is a simpler solution just to solve this one problem in 
 termquery... we could just revert it in trunk when we resolve LUCENE-2694,
 but I don't think we should leave things as they are in 3.x

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2391) Spellchecker uses default IW mergefactor/ramMB settings of 300/10

2010-12-22 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-2391:


Attachment: LUCENE-2391.patch

Here's a patch to speed up the spellchecker build.

* I wired the default RamMB to IWConfig's default.
* I didn't mess with the mergefactor for now (because the default is still to 
optimize).
* But I added an additional 'optimize' parameter so you can update your 
spellcheck index without re-optimizing.
* When updating, I changed the exists() check to work per-segment (sketched 
below), so it's reasonable if the index isn't optimized.
* The exists() check now bypasses the term dictionary cache, which is stupid 
here and just slows it down.
* We don't do any of the exists() logic if the index is empty (this is the 
case for, I think, Solr, which completely rebuilds and doesn't do an 
incremental update).
* The startXXX, endXXX, and word fields can only contain one term per 
document; I turned off norms, positions, and tf for these.
* The gramXXX field is unchanged; I didn't want to change spellchecker scoring 
in any way. But we could likely omit norms here too in the future, since I 
think it's gonna be very short.
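
The per-segment exists() is roughly this shape (a sketch under assumptions, 
not the actual patch):

{code}
import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;

public class ExistsSketch {
  // Probe each segment and stop at the first hit, so the check stays cheap
  // even when the spellcheck index is not optimized.
  static boolean exists(IndexReader top, String field, String word)
      throws IOException {
    Term t = new Term(field, word);
    IndexReader[] subs = top.getSequentialSubReaders();
    if (subs == null) {               // atomic reader: no sub-readers
      return top.docFreq(t) > 0;
    }
    for (IndexReader sub : subs) {
      if (sub.docFreq(t) > 0) {
        return true;
      }
    }
    return false;
  }
}
{code}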

{noformat}
trunk:
scratch build time: 229,803ms
index size: 214,322,200 bytes
no-op update time (updating but there is no new terms to add): 4,619ms

patch:
scratch build time: 99,214ms
index size: 177,781,273 bytes
no-op update time: 2,504ms
{noformat}

I still left the optimize default on, but really I think most users (e.g. 
solr) should set 
mergefactor to be maybe a bit more reasonable and set optimize to false; the 
scratch build 
is then much faster (60,000 ms), but the no-op update time is heavier (eg 
16,000ms). Still, 
if you are rebuilding on every commit for smallish updates, something like 
20-30 seconds 
is a lot better than 100 seconds, but for now I kept the defaults as is 
(optimizing every time).


 Spellchecker uses default IW mergefactor/ramMB settings of 300/10
 -

 Key: LUCENE-2391
 URL: https://issues.apache.org/jira/browse/LUCENE-2391
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/spellchecker
Reporter: Mark Miller
Priority: Trivial
 Attachments: LUCENE-2391.patch


 These settings seem odd - I'd like to investigate what makes most sense here.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1410) PFOR implementation

2010-12-22 Thread Paul Elschot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12974303#action_12974303
 ] 

Paul Elschot commented on LUCENE-1410:
--

bq. ... is it possible to encode # of exception bytes in header?

In the first implementation the start index of the exception chain is in the 
header (5 or 6 bits iirc). In the second implementation (by Hoa Yan) there is 
no exception chain, so the number of exceptions must somehow be encoded in the 
header.
That means encoding the # exception bytes in the header would be easier in the 
second implementation, but it is also possible in the first one.

I would expect that a few bits for the number of encoded integers would also be 
added in the header (think 32, 64, 128...).
The number of frame bits takes 5 bits.
That means that there are about 2 bytes unused in the header now, and I'd 
expect 1 byte to be enough to encode the number of bytes for the exceptions. 
For example a bad case in the first implementation of 10 exceptions of 4 bytes 
means 40 bytes data, that fits in 6 bits, the same
bad case in the second implementation would also need to store the indexes of 
the exceptions in 10*5 bits, totalling 90 bytes that can be
encoded in 7 bits. However, I don't know what the worst case # exceptions is. 
(This gets into vsencoding...)

For the moment I'll just leave this unchanged and get the tests working on the 
current first implementation.
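
To make the bit budget concrete, here is one way such a header could be packed 
into a single int (purely illustrative -- not the actual layout of either 
implementation):

{code}
// Illustrative packing: 5 bits for the frame width, 2 bits for a block-size
// code (32/64/128/256 ints), 8 bits for the number of exception bytes.
public class PForHeaderSketch {
  static int pack(int numFrameBits, int blockSizeCode, int exceptionBytes) {
    assert numFrameBits >= 1 && numFrameBits <= 32;
    assert blockSizeCode >= 0 && blockSizeCode <= 3;
    assert exceptionBytes >= 0 && exceptionBytes <= 255;
    return (numFrameBits - 1) | (blockSizeCode << 5) | (exceptionBytes << 7);
  }

  static int frameBits(int header)      { return (header & 0x1F) + 1; }
  static int blockSize(int header)      { return 32 << ((header >>> 5) & 0x3); }
  static int exceptionBytes(int header) { return (header >>> 7) & 0xFF; }
}
{code}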

 PFOR implementation
 ---

 Key: LUCENE-1410
 URL: https://issues.apache.org/jira/browse/LUCENE-1410
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Index
Reporter: Paul Elschot
Priority: Minor
 Fix For: Bulk Postings branch

 Attachments: autogen.tgz, for-summary.txt, 
 LUCENE-1410-codecs.tar.bz2, LUCENE-1410.patch, LUCENE-1410.patch, 
 LUCENE-1410.patch, LUCENE-1410.patch, LUCENE-1410b.patch, LUCENE-1410c.patch, 
 LUCENE-1410d.patch, LUCENE-1410e.patch, TermQueryTests.tgz, TestPFor2.java, 
 TestPFor2.java, TestPFor2.java

   Original Estimate: 21840h
  Remaining Estimate: 21840h

 Implementation of Patched Frame of Reference.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1410) PFOR implementation

2010-12-22 Thread Paul Elschot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12974307#action_12974307
 ] 

Paul Elschot commented on LUCENE-1410:
--

bq. I've tested everything I can think of and it seems this nio 
ByteBuffer/IntBuffer approach is always the fastest ...

Did you also test without a copy (without the readBytes() call) into the 
underlying byte array for the IntBuffer? That might be even faster,
and it could be possible when using for example a BufferedIndexInput or an 
MMapDirectory.
For decent buffer.get() speed the starting byte would need to be aligned at an 
int border.
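
Concretely, the copy-free read I have in mind looks like this with plain NIO 
(a standalone sketch, not Lucene's Directory code; it assumes big-endian ints 
and an int-aligned startOffset):

{code}
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.IntBuffer;
import java.nio.channels.FileChannel;

public class NoCopyIntsSketch {
  // Decode ints straight from a mapped region, with no readBytes() copy into
  // an intermediate byte[].
  static int[] readInts(String path, long startOffset, int count)
      throws IOException {
    RandomAccessFile raf = new RandomAccessFile(path, "r");
    try {
      ByteBuffer mapped = raf.getChannel()
          .map(FileChannel.MapMode.READ_ONLY, startOffset, count * 4L);
      IntBuffer view = mapped.asIntBuffer(); // a view, not a copy
      int[] dst = new int[count];
      view.get(dst);
      return dst;
    } finally {
      raf.close();
    }
  }
}
{code}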



 PFOR implementation
 ---

 Key: LUCENE-1410
 URL: https://issues.apache.org/jira/browse/LUCENE-1410
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Index
Reporter: Paul Elschot
Priority: Minor
 Fix For: Bulk Postings branch

 Attachments: autogen.tgz, for-summary.txt, 
 LUCENE-1410-codecs.tar.bz2, LUCENE-1410.patch, LUCENE-1410.patch, 
 LUCENE-1410.patch, LUCENE-1410.patch, LUCENE-1410b.patch, LUCENE-1410c.patch, 
 LUCENE-1410d.patch, LUCENE-1410e.patch, TermQueryTests.tgz, TestPFor2.java, 
 TestPFor2.java, TestPFor2.java

   Original Estimate: 21840h
  Remaining Estimate: 21840h

 Implementation of Patched Frame of Reference.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2829) improve termquery pk lookup performance

2010-12-22 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12974310#action_12974310
 ] 

Robert Muir commented on LUCENE-2829:
-

Bloom filters and negative caches are nice, but please open separate issues!
I am starting to feel like it's mandatory to refactor the entirety of lucene to 
make a single incremental improvement.

So, I'd like to proceed with this issue as-is, to make TermWeight explicitly do 
fewer seeks.
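
The shape of it is roughly this (a sketch of the idea only, not the attached 
patch):

{code}
import java.io.IOException;
import java.util.IdentityHashMap;
import java.util.Map;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;

// Probe each segment once up front and remember which segments actually
// contain the term, so scorer creation can skip guaranteed misses instead
// of seeking the term dictionary a second time.
final class TermSeekCache {
  private final Map<IndexReader,Boolean> hasTerm =
      new IdentityHashMap<IndexReader,Boolean>();

  TermSeekCache(IndexReader[] segments, Term term) throws IOException {
    for (IndexReader segment : segments) {
      hasTerm.put(segment, segment.docFreq(term) > 0); // one seek per segment
    }
  }

  /** false = the term is absent here, don't bother creating a scorer. */
  boolean exists(IndexReader segment) {
    Boolean b = hasTerm.get(segment);
    return b != null && b.booleanValue();
  }
}
{code}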


 improve termquery pk lookup performance
 -

 Key: LUCENE-2829
 URL: https://issues.apache.org/jira/browse/LUCENE-2829
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Reporter: Robert Muir
 Attachments: LUCENE-2829.patch


 For things that are like primary keys and don't exist in some segments (worst 
 case is a primary/unique key that only exists in one segment)
 we do wasted seeks.
 While LUCENE-2694 tries to solve some of this issue with TermState, I'm 
 concerned whether we could ever backport that to 3.1, for example.
 This is a simpler solution just to solve this one problem in 
 termquery... we could just revert it in trunk when we resolve LUCENE-2694,
 but I don't think we should leave things as they are in 3.x

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Reopened: (SOLR-2282) Distributed Support for Search Result Clustering

2010-12-22 Thread Hoss Man (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hoss Man reopened SOLR-2282:



Reopening issue.

The new test added by this issue...

org.apache.solr.handler.clustering.DistributedClusteringComponentTest.testDistribSearch

...was failing consistently on both hudson and robert muir's machine, so rmuir 
disabled it with @Ignore.

we should get to the bottom of this before resolving

error from hudson...

{quote}
Error Message

Some threads threw uncaught exceptions!

Stacktrace

junit.framework.AssertionFailedError: Some threads threw uncaught exceptions!
at 
org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:950)
at 
org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:888)
at 
org.apache.lucene.util.LuceneTestCase.tearDown(LuceneTestCase.java:371)
at org.apache.solr.SolrTestCaseJ4.tearDown(SolrTestCaseJ4.java:78)
at 
org.apache.solr.BaseDistributedSearchTestCase.tearDown(BaseDistributedSearchTestCase.java:130)

Standard Error

22-Dec-2010 6:27:38 AM org.apache.solr.common.SolrException log
SEVERE: java.lang.Error: Error: could not match input
at 
org.carrot2.text.analysis.ExtendedWhitespaceTokenizerImpl.zzScanError(ExtendedWhitespaceTokenizerImpl.java:687)
at 
org.carrot2.text.analysis.ExtendedWhitespaceTokenizerImpl.getNextToken(ExtendedWhitespaceTokenizerImpl.java:836)
at 
org.carrot2.text.analysis.ExtendedWhitespaceTokenizer.nextToken(ExtendedWhitespaceTokenizer.java:46)
at org.carrot2.text.preprocessing.Tokenizer.tokenize(Tokenizer.java:147)
at 
org.carrot2.text.preprocessing.pipeline.CompletePreprocessingPipeline.preprocess(CompletePreprocessingPipeline.java:54)
at 
org.carrot2.text.preprocessing.pipeline.BasicPreprocessingPipeline.preprocess(BasicPreprocessingPipeline.java:92)
at 
org.carrot2.clustering.lingo.LingoClusteringAlgorithm.cluster(LingoClusteringAlgorithm.java:199)
at 
org.carrot2.clustering.lingo.LingoClusteringAlgorithm.access$000(LingoClusteringAlgorithm.java:44)
at 
org.carrot2.clustering.lingo.LingoClusteringAlgorithm$1.process(LingoClusteringAlgorithm.java:178)
at 
org.carrot2.text.clustering.MultilingualClustering.clusterByLanguage(MultilingualClustering.java:222)
at 
org.carrot2.text.clustering.MultilingualClustering.process(MultilingualClustering.java:110)
at 
org.carrot2.clustering.lingo.LingoClusteringAlgorithm.process(LingoClusteringAlgorithm.java:171)
at 
org.carrot2.core.ControllerUtils.performProcessing(ControllerUtils.java:101)
at org.carrot2.core.Controller.process(Controller.java:287)
at org.carrot2.core.Controller.process(Controller.java:180)
at 
org.apache.solr.handler.clustering.carrot2.CarrotClusteringEngine.cluster(CarrotClusteringEngine.java:105)
at 
org.apache.solr.handler.clustering.ClusteringComponent.finishStage(ClusteringComponent.java:171)
at 
org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:296)
at 
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1358)
at 
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:341)
at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:244)
at 
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
at 
org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
at 
org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
at 
org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
at 
org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
at org.mortbay.jetty.Server.handle(Server.java:326)
at 
org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
at 
org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:928)
at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549)
at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
at 
org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:228)
at 
org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)

NOTE: reproduce with: ant test -Dtestcase=DistributedClusteringComponentTest 
-Dtestmethod=testDistribSearch 
-Dtests.seed=41204997274180:6405396687385598457 -Dtests.multiplier=3
The following exceptions were thrown by threads:
*** Thread: Thread-13 ***
junit.framework.AssertionFailedError: .clusters.length:4!=5
at 

[jira] Issue Comment Edited: (LUCENE-1410) PFOR implementation

2010-12-22 Thread Paul Elschot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12974303#action_12974303
 ] 

Paul Elschot edited comment on LUCENE-1410 at 12/22/10 1:13 PM:


bq. ... is it possible to encode # of exception bytes in header?

In the first implementation the start index of the exception chain is in the 
header (5 or 6 bits iirc). In the second implementation (by Hoa Yan) there is 
no exception chain, so the number of exceptions must somehow be encoded in the 
header.
That means encoding the # exception bytes in the header would be easier in the 
second implementation, but it is also possible in the first one.

I would expect that a few bits for the number of encoded integers would also be 
added in the header (think 32, 64, 128...).
The number of frame bits takes 5 bits.
That means that there are about 2 bytes unused in the header now, and I'd 
expect 1 byte to be enough to encode the number of bytes for the exceptions. 
For example a bad case in the first implementation of 10 exceptions of 4 bytes 
means 40 bytes data, that fits in 6 bits, the same
bad case in the second implementation would also need to store the indexes of 
the exceptions in 10*5 bits, for a total of about 48 bytes, which can still 
be encoded in 6 bits. However, I don't know what the worst case # exceptions 
is. (This gets into vsencoding...)

For the moment I'll just leave this unchanged and get the tests working on the 
current first implementation.

  was (Author: paul.elsc...@xs4all.nl):
bq. ... is it possible to encode # of exception bytes in header?

In the first implementation the start index of the exception chain is in the 
header (5 or 6 bits iirc). In the second implementation (by Hoa Yan) there is 
no exception chain, so the number of exceptions must somehow be encoded in the 
header.
That means encoding the # exception bytes in the header would be easier in the 
second implementation, but it is also possible in the first one.

I would expect that a few bits for the number of encoded integers would also be 
added in the header (think 32, 64, 128...).
The number of frame bits takes 5 bits.
That means that there are about 2 bytes unused in the header now, and I'd 
expect 1 byte to be enough to encode the number of bytes for the exceptions. 
For example a bad case in the first implementation of 10 exceptions of 4 bytes 
means 40 bytes data, that fits in 6 bits, the same
bad case in the second implementation would also need to store the indexes of 
the exceptions in 10*5 bits, totalling 90 bytes that can be
encoded in 7 bits. However, I don't know what the worst case # exceptions is. 
(This gets into vsencoding...)

For the moment I'll just leave this unchanged and get the tests working on the 
current first implementation.
  
 PFOR implementation
 ---

 Key: LUCENE-1410
 URL: https://issues.apache.org/jira/browse/LUCENE-1410
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Index
Reporter: Paul Elschot
Priority: Minor
 Fix For: Bulk Postings branch

 Attachments: autogen.tgz, for-summary.txt, 
 LUCENE-1410-codecs.tar.bz2, LUCENE-1410.patch, LUCENE-1410.patch, 
 LUCENE-1410.patch, LUCENE-1410.patch, LUCENE-1410b.patch, LUCENE-1410c.patch, 
 LUCENE-1410d.patch, LUCENE-1410e.patch, TermQueryTests.tgz, TestPFor2.java, 
 TestPFor2.java, TestPFor2.java

   Original Estimate: 21840h
  Remaining Estimate: 21840h

 Implementation of Patched Frame of Reference.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (SOLR-2282) Distributed Support for Search Result Clustering

2010-12-22 Thread Stanislaw Osinski (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12974314#action_12974314
 ] 

Stanislaw Osinski commented on SOLR-2282:
-

This may be related to a concurrency bug we fixed in the latest (3.4.2) release 
of Carrot2. Tomorrow morning I can prepare a Carrot2 upgrade patch, which 
should hopefully fix the problem.

 Distributed Support for Search Result Clustering
 

 Key: SOLR-2282
 URL: https://issues.apache.org/jira/browse/SOLR-2282
 Project: Solr
  Issue Type: New Feature
  Components: contrib - Clustering
Affects Versions: 1.4, 1.4.1
Reporter: Koji Sekiguchi
Assignee: Koji Sekiguchi
Priority: Minor
 Fix For: 3.1, 4.0

 Attachments: SOLR-2282.patch, SOLR-2282.patch, SOLR-2282.patch, 
 SOLR-2282.patch, SOLR-2282.patch


 Brad Giaccio contributed a patch for this in SOLR-769. I'd like to 
 incorporate it.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1410) PFOR implementation

2010-12-22 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12974315#action_12974315
 ] 

Robert Muir commented on LUCENE-1410:
-

{quote}
Did you also test without a copy (without the readbytes() call) into the 
underlying byte array for the IntBuffer? That might be even faster,
and it could be possible when using for example a BufferedIndexInput or an 
MMapDirectory.
For decent buffer.get() speed the starting byte would need to be aligned at an 
int border.
{quote}

Yes, for the mmap case I tried the original dangerous hack, exposing an 
IntBuffer view of its internal mapped byte buffer.
I also tried MMapIndexInput keeping track of its own IntBuffer view.

We might be able to have some gains by allowing a directory to return an 
IntBufferIndexInput of some sort (separate from DataInput/IndexInput)
that basically just positions an IntBuffer view (the default implementation 
would fill from an IndexInput into a ByteBuffer like we do now),
but I haven't tested this across all the directories yet... it might help NIOFS 
though, as it would bypass the double-buffering of BufferedIndexInput.
For SimpleFS it would be the same, and for MMap I'm not very hopeful it would 
be better, but maybe not worse.

If that worked, maybe we could do the same with Long, for things like simple-8b 
(http://onlinelibrary.wiley.com/doi/10.1002/spe.948/abstract).
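
Rough shape of what I mean (hypothetical names, just a sketch):

{code}
import java.io.IOException;
import java.nio.IntBuffer;

// Hypothetical shape (names are not real Lucene API): a directory-level
// input that hands out a positioned IntBuffer view, so an int-block codec
// can decode without BufferedIndexInput's double-buffering.
interface IntBufferIndexInput {
  // Position a view over count ints starting at offset (measured in ints).
  // A mmap-backed impl can return a view of the mapped region; the default
  // impl would fill a heap ByteBuffer from a normal IndexInput, as today.
  IntBuffer seek(long offset, int count) throws IOException;

  void close() throws IOException;
}
{code}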


 PFOR implementation
 ---

 Key: LUCENE-1410
 URL: https://issues.apache.org/jira/browse/LUCENE-1410
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Index
Reporter: Paul Elschot
Priority: Minor
 Fix For: Bulk Postings branch

 Attachments: autogen.tgz, for-summary.txt, 
 LUCENE-1410-codecs.tar.bz2, LUCENE-1410.patch, LUCENE-1410.patch, 
 LUCENE-1410.patch, LUCENE-1410.patch, LUCENE-1410b.patch, LUCENE-1410c.patch, 
 LUCENE-1410d.patch, LUCENE-1410e.patch, TermQueryTests.tgz, TestPFor2.java, 
 TestPFor2.java, TestPFor2.java

   Original Estimate: 21840h
  Remaining Estimate: 21840h

 Implementation of Patched Frame of Reference.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2830) Use StringBuilder instead of StringBuffer in benchmark

2010-12-22 Thread Shai Erera (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shai Erera updated LUCENE-2830:
---

Attachment: LUCENE-2830.patch

Since parse(*, StringBuffer, *) is not used, and whoever wants to use it can 
use the Reader variant and pass new StringReader(), I removed it.

I plan to commit tomorrow.

 Use StringBuilder instead of StringBuffer in benchmark
 --

 Key: LUCENE-2830
 URL: https://issues.apache.org/jira/browse/LUCENE-2830
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/benchmark
Reporter: Shai Erera
Assignee: Shai Erera
Priority: Minor
 Fix For: 3.1, 4.0

 Attachments: LUCENE-2830.patch, LUCENE-2830.patch


 Minor change - use StringBuilder instead of StringBuffer in benchmark's code. 
 We don't need the synchronization of StringBuffer in all the places that I've 
 checked.
 The only place where it _could_ be a problem is in HtmlParser's API - one 
 method accepts a StringBuffer and it's an interface. But I think it's ok to 
 change benchmark's API, back-compat wise and so I'd like to either change it 
 to accept a String, or remove the method altogether -- no code in benchmark 
 uses it, and if anyone needs it, he can pass StringReader to the other method.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



LuceneTestCase.threadCleanup incorrectly reports left running threads

2010-12-22 Thread Shai Erera
Hi

I noticed that some tests report threads are left running, even when those
tests never create and start a Thread. Digging deeper I found out that the
tests report "Signal Dispatcher" and "Attach handler" as two threads that
are left running. If I run the test from eclipse, then a "ReaderThread" and
"Signal Dispatcher" are reported. ReaderThread belongs to the JUnit framework
and the other two are initiated by some framework, and definitely not from
our tests.

So I was thinking that instead of reporting those threads, we should inspect
each running Thread's stack trace and report it only if it contains an
org.apache.lucene/solr package; otherwise it cannot have been started from our
tests.
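
Something like this is what I have in mind (just a sketch):

static boolean startedByOurCode(Thread t) {
  for (StackTraceElement frame : t.getStackTrace()) {
    String cls = frame.getClassName();
    if (cls.startsWith("org.apache.lucene.")
        || cls.startsWith("org.apache.solr.")) {
      return true;  // worth reporting as a leftover thread
    }
  }
  return false;     // JVM/JUnit/IDE housekeeping thread -- ignore it
}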

What do you think?

Shai


Re: LuceneTestCase.threadCleanup incorrectly reports left running threads

2010-12-22 Thread Robert Muir
On Wed, Dec 22, 2010 at 2:14 PM, Shai Erera ser...@gmail.com wrote:
 Hi

 I noticed that some tests report threads are left running, even when those
 tests never create and start a Thread. Digging deeper I found out that the
 tests report "Signal Dispatcher" and "Attach handler" as two threads that
 are left running. If I run the test from eclipse, then a "ReaderThread" and
 "Signal Dispatcher" are reported. ReaderThread belongs to the JUnit framework
 and the other two are initiated by some framework, and definitely not from
 our tests.

 So I was thinking that instead of reporting those threads, we should inspect
 each running Thread's stack trace and report it only if it contains an
 org.apache.lucene/solr package; otherwise it cannot have been started from our
 tests.

 What do you think?

Are you running the tests from eclipse or something in this case? (I
think I've seen these from eclipse.)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: LuceneTestCase.threadCleanup incorrectly reports left running threads

2010-12-22 Thread Robert Muir
Here is an (imperfect) patch for eclipse; can you try this? Any threads
running at this point are not our own.

Index: lucene/src/test/org/apache/lucene/util/LuceneTestCase.java
===
--- lucene/src/test/org/apache/lucene/util/LuceneTestCase.java
(revision 1051872)
+++ lucene/src/test/org/apache/lucene/util/LuceneTestCase.java  (working copy)
@@ -522,6 +522,13 @@
   // jvm-wide list of 'rogue threads' we found, so they only get reported once.
   private final static IdentityHashMap<Thread,Boolean> rogueThreads =
new IdentityHashMap<Thread,Boolean>();

+  static {
+// just a hack for things like eclipse threads
+for (Thread t : Thread.getAllStackTraces().keySet()) {
+  rogueThreads.put(t, true);
+}
+  }
+
   /**
* Looks for leftover running threads, trying to kill them off,
* so they don't fail future tests.

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2829) improve termquery pk lookup performance

2010-12-22 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12974350#action_12974350
 ] 

Earwin Burrfoot commented on LUCENE-2829:
-

Nobody is halting your progress; we're merely discussing.

I, on the other hand, have a feeling that Lucene is overflowing with single 
incremental improvements aka hacks, as they are easier and faster to 
implement than trying to get a bigger picture, and, yes, rebuilding everything 
:)
For example, better term dict code will make this issue (somewhat hackish, 
admit it?) irrelevant -- whether we implement bloom filters, or just guarantee 
to keep the whole term dict in memory with a reasonable lookup routine (e.g. 
as an FST).

Having said that, I reiterate, I'm not here to stop you or turn this issue into 
something else.

 improve termquery pk lookup performance
 -

 Key: LUCENE-2829
 URL: https://issues.apache.org/jira/browse/LUCENE-2829
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Reporter: Robert Muir
 Attachments: LUCENE-2829.patch


 For things that are like primary keys and don't exist in some segments (worst 
 case is a primary/unique key that only exists in one segment)
 we do wasted seeks.
 While LUCENE-2694 tries to solve some of this issue with TermState, I'm 
 concerned whether we could ever backport that to 3.1, for example.
 This is a simpler solution just to solve this one problem in 
 termquery... we could just revert it in trunk when we resolve LUCENE-2694,
 but I don't think we should leave things as they are in 3.x

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2829) improve termquery pk lookup performance

2010-12-22 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12974354#action_12974354
 ] 

Robert Muir commented on LUCENE-2829:
-

bq. For example, better term dict code will make this issue (somewhat hackish, 
admit it?) irrelevant. 

Right, it is hackish, but a worse hack is wasted seeks in our next 3.1 
release because we can't
keep scope under control and fix small problems without rewriting everything, 
which means less 
gets backported to our stable branch.

Anyway, I'm just gonna mark this "won't fix" so I don't have to deal with it 
anymore.

 improve termquery pk lookup performance
 -

 Key: LUCENE-2829
 URL: https://issues.apache.org/jira/browse/LUCENE-2829
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Reporter: Robert Muir
 Attachments: LUCENE-2829.patch


 For things that are like primary keys and don't exist in some segments (worst 
 case is a primary/unique key that only exists in one segment)
 we do wasted seeks.
 While LUCENE-2694 tries to solve some of this issue with TermState, I'm 
 concerned whether we could ever backport that to 3.1, for example.
 This is a simpler solution just to solve this one problem in 
 termquery... we could just revert it in trunk when we resolve LUCENE-2694,
 but I don't think we should leave things as they are in 3.x

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Created: (SOLR-2294) How to combine OR with geofilt

2010-12-22 Thread Bill Bell (JIRA)
How to combine OR with geofilt
--

 Key: SOLR-2294
 URL: https://issues.apache.org/jira/browse/SOLR-2294
 Project: Solr
  Issue Type: Bug
Affects Versions: 3.1
Reporter: Bill Bell
 Fix For: 3.1


We would like to combine fq={!geofilt} OR state:CO...

This generates an error.

Are there other ways to do an OR between fq= ?



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Lucene-3.x - Build # 219 - Still Failing

2010-12-22 Thread Apache Hudson Server
Build: https://hudson.apache.org/hudson/job/Lucene-3.x/219/

All tests passed

Build Log (for compile errors):
[...truncated 21431 lines...]



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: RT branch status

2010-12-22 Thread Earwin Burrfoot
Cool! I'm getting to this on a weekend.

On Tue, Dec 21, 2010 at 11:44, Michael Busch busch...@gmail.com wrote:
 After merging trunk into the RT branch it's finally compiling again and
 up-to-date.

 Several tests are failing now after the merge (43 out of 1427 are failing),
 which is not too surprising, because so many things have changed
 (segment-deletes, flush control, termsHash refactoring, removal of doc
 stores, etc).

 Especially IndexWriter and DocumentsWriter are in a somewhat messy state,
 but I wanted to share my current state, so I committed the merge.  I'll try
 this week to understand the new changes (especially deletes) and make them
 work with the DWPT.  The following areas need work:
  * deletes
  * thread-safety
  * error handling and aborting
  * flush-by-ram (LUCENE-2573)

 Also, some tests deadlock.  Not surprising either, since flush control etc.
 introduce new synchronized blocks.

 Before the merge all tests were passing, except the ones testing
 flush-by-ram functionality.  I'll keep working on getting the branch back
 into that state again soon.

 Help is definitely welcome!  I'd love to get this branch ready so that we
 can merge it into trunk as soon as possible.  As Mike's experiments show
 having DWPTs will not only be beneficial for RT search, but also increase
 indexing performance in general.

  Michael

 PS: Thanks for the patience!

 -
 To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: dev-h...@lucene.apache.org





-- 
Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
Phone: +7 (495) 683-567-4
ICQ: 104465785

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: strange problem of PForDelta decoder

2010-12-22 Thread Li Li
I used the bulkpostings
branch (https://svn.apache.org/repos/asf/lucene/dev/branches/bulkpostings/lucene).
Does trunk have a PForDelta decoder/encoder?

2010/12/23 Michael McCandless luc...@mikemccandless.com:
 Those are nice speedups!

 Did you use the 4.0 branch (ie trunk) or the bulkpostings branch for this 
 test?

 Mike

 On Tue, Dec 21, 2010 at 9:59 PM, Li Li fancye...@gmail.com wrote:
 great improvement!
 I did a test on our data set. Doc count is about 2M+ and index size
 after optimization is about 13.3GB (including fdt).
 It seems lucene4's index format is better than lucene2.9.3, and PFor
 gives good results.
 Besides the BlockEncoder for frq and pos, is there any other modification
 for lucene 4?

 decoder \ avg time        single word(ms)   and query(ms)   or query(ms)
 VINT in lucene 2.9        11.2              36.5            38.6
 VINT in lucene 4 branch   10.6              26.5            35.4
 PFor in lucene 4 branch   8.1               22.5            30.7
 2010/12/21 Li Li fancye...@gmail.com:
 OK we should have a look at that one still.  We need to converge on a
 good default codec for 4.0.  Fortunately it's trivial to take any int
 block encoder (fixed or variable block) and make a Lucene codec out of
 it!

 I suggest you not use this one; I fixed dozens of bugs but it
 still failed with random tests. Its code is hand-coded rather
 than generated by a program. But we may learn something from it.


 -
 To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: dev-h...@lucene.apache.org



 -
 To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: dev-h...@lucene.apache.org



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments

2010-12-22 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12974484#action_12974484
 ] 

Jason Rutherglen commented on LUCENE-2324:
--

Also, it'd be great if we could summarize the changes trunk -> DWPT branch.

 Per thread DocumentsWriters that write their own private segments
 -

 Key: LUCENE-2324
 URL: https://issues.apache.org/jira/browse/LUCENE-2324
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Michael Busch
Assignee: Michael Busch
Priority: Minor
 Fix For: Realtime Branch

 Attachments: lucene-2324.patch, lucene-2324.patch, LUCENE-2324.patch


 See LUCENE-2293 for motivation and more details.
 I'm copying here Mike's summary he posted on 2293:
 Change the approach for how we buffer in RAM to a more isolated
 approach, whereby IW has N fully independent RAM segments
 in-process and when a doc needs to be indexed it's added to one of
 them. Each segment would also write its own doc stores and
 normal segment merging (not the inefficient merge we now do on
 flush) would merge them. This should be a good simplification in
 the chain (eg maybe we can remove the *PerThread classes). The
 segments can flush independently, letting us make much better
 concurrent use of IO  CPU.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments

2010-12-22 Thread Jason Rutherglen (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Rutherglen updated LUCENE-2324:
-

Attachment: test.out

Here's ant test-core output.  Looks like it's deadlocking in TestIndexWriter?  
There are some IR.reopen failures, a null pointer, and a delete-count issue 
I'll look at.

 Per thread DocumentsWriters that write their own private segments
 -

 Key: LUCENE-2324
 URL: https://issues.apache.org/jira/browse/LUCENE-2324
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Michael Busch
Assignee: Michael Busch
Priority: Minor
 Fix For: Realtime Branch

 Attachments: lucene-2324.patch, lucene-2324.patch, LUCENE-2324.patch, 
 test.out


 See LUCENE-2293 for motivation and more details.
 I'm copying here Mike's summary he posted on 2293:
 Change the approach for how we buffer in RAM to a more isolated
 approach, whereby IW has N fully independent RAM segments
 in-process and when a doc needs to be indexed it's added to one of
 them. Each segment would also write its own doc stores and
 normal segment merging (not the inefficient merge we now do on
 flush) would merge them. This should be a good simplification in
 the chain (eg maybe we can remove the *PerThread classes). The
 segments can flush independently, letting us make much better
 concurrent use of IO  CPU.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments

2010-12-22 Thread Jason Rutherglen (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Rutherglen updated LUCENE-2324:
-

Attachment: LUCENE-2324-SMALL.patch

Small patch fixing the num deletes test null pointer. 

The TestIndexReaderReopen failure seems to have something to do with flushing 
deletes.

 Per thread DocumentsWriters that write their own private segments
 -

 Key: LUCENE-2324
 URL: https://issues.apache.org/jira/browse/LUCENE-2324
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Michael Busch
Assignee: Michael Busch
Priority: Minor
 Fix For: Realtime Branch

 Attachments: LUCENE-2324-SMALL.patch, lucene-2324.patch, 
 lucene-2324.patch, LUCENE-2324.patch, test.out


 See LUCENE-2293 for motivation and more details.
 I'm copying here Mike's summary he posted on 2293:
 Change the approach for how we buffer in RAM to a more isolated
 approach, whereby IW has N fully independent RAM segments
 in-process and when a doc needs to be indexed it's added to one of
 them. Each segment would also write its own doc stores and
 normal segment merging (not the inefficient merge we now do on
 flush) would merge them. This should be a good simplification in
 the chain (eg maybe we can remove the *PerThread classes). The
 segments can flush independently, letting us make much better
 concurrent use of IO  CPU.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Resolved: (LUCENE-2830) Use StringBuilder instead of StringBuffer in benchmark

2010-12-22 Thread Shai Erera (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shai Erera resolved LUCENE-2830.


Resolution: Fixed

Committed revision 1052180 (3x).
Committed revision 1052182 (trunk).

 Use StringBuilder instead of StringBuffer in benchmark
 --

 Key: LUCENE-2830
 URL: https://issues.apache.org/jira/browse/LUCENE-2830
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/benchmark
Reporter: Shai Erera
Assignee: Shai Erera
Priority: Minor
 Fix For: 3.1, 4.0

 Attachments: LUCENE-2830.patch, LUCENE-2830.patch


 Minor change - use StringBuilder instead of StringBuffer in benchmark's code. 
 We don't need the synchronization of StringBuffer in all the places that I've 
 checked.
 The only place where it _could_ be a problem is in HtmlParser's API - one 
 method accepts a StringBuffer and it's an interface. But I think it's ok to 
 change benchmark's API, back-compat wise and so I'd like to either change it 
 to accept a String, or remove the method altogether -- no code in benchmark 
 uses it, and if anyone needs it, he can pass StringReader to the other method.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (SOLR-2294) How to combine OR with geofilt

2010-12-22 Thread Bill Bell (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12974551#action_12974551
 ] 

Bill Bell commented on SOLR-2294:
-

I did find a way to do it. It only works in this order:

{code}
http://localhost:8983/solr/select?q=*:*&qt=standard&fq=state:CO OR 
_query_:{!geofilt} ...
{code}

This does not work:

{code}
http://localhost:8983/solr/select?q=*:*&qt=standard&fq={!geofilt} OR state:CO 
...
{code}

 How to combine OR with geofilt
 --

 Key: SOLR-2294
 URL: https://issues.apache.org/jira/browse/SOLR-2294
 Project: Solr
  Issue Type: Bug
Affects Versions: 3.1
Reporter: Bill Bell
 Fix For: 3.1


 We would like to combine fq={!geofilt} OR state:CO...
 This generates an error.
 Are there other ways to do an OR between fq= ?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: LuceneTestCase.threadCleanup incorrectly reports left running threads

2010-12-22 Thread Shai Erera
I ran the test from both eclipse and Ant, and got similar warnings.

With your patch most of the 'false alarms' do not show up again, but I still
see a strange failure. I added this after the System.err.print("left thread
running") call: System.err.println(Arrays.toString(t.getStackTrace())); -- it
prints the stack trace. And here is what I get:

[junit] - Standard Error -
[junit] WARNING: test method: 'testIndexAndSearchTasks' left thread
running: Thread[file lock watchdog,6,main]
[junit] [java.lang.Object.wait(Native Method),
java.lang.Object.wait(Object.java:167),
java.util.Timer$TimerImpl.run(Timer.java:226)]
[junit] RESOURCE LEAK: test method: 'testIndexAndSearchTasks' left 1
thread(s) running
[junit] NOTE: reproduce with: ant test -Dtestcase=TestPerfTasksLogic
-Dtestmethod=testIndexAndSearchTasks
-Dtests.seed=-792089523312439823:1164084411683706634

I don't know where this Timer is created, but I'll dig more.

At any rate, I think your patch is good, and perhaps we should add the
stacktrace print as well, to help with the debugging?

Shai

On Wed, Dec 22, 2010 at 9:35 PM, Robert Muir rcm...@gmail.com wrote:

  static {
 +// just a hack for things like eclipse threads
 +for (Thread t : Thread.getAllStackTraces().keySet()) {
 +  rogueThreads.put(t, true);
 +}
 +  }



Solr-3.x - Build # 205 - Still Failing

2010-12-22 Thread Apache Hudson Server
Build: https://hudson.apache.org/hudson/job/Solr-3.x/205/

All tests passed

Build Log (for compile errors):
[...truncated 20638 lines...]



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org