RE: Incubator Infra: svn and Committers

2011-02-06 Thread Digy
I copied Lucene.Net to the new location
(https://svn.apache.org/repos/asf/incubator/lucene.net/) and changed some
links in the index.html page.

DIGY

-Original Message-
From: Stefan Bodewig [mailto:bode...@apache.org] 
Sent: Sunday, February 06, 2011 9:55 AM
To: lucene-net-dev@lucene.apache.org
Subject: Re: Incubator Infra: svn and Committers

Hi all,

the vote is closed and has passed now, so I've started to kick off the
infrastructure tasks (more tomorrow, need to have a Family Sunday now
8-).

DIGY, you should have write access to the incubator svn area now.

I've created an empty status template at the incubator site which should
become visible in a few hours.

Cheers

Stefan



Re: Arabic Analyzer

2011-02-06 Thread Ben Foster
Is it still possible to use fixed term queries in Arabic (i.e. NOT using an
Analyzer)?

Thanks
Ben
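For context, the reason unanalyzed term queries keep working is that a term-level query matches the indexed token byte-for-byte, with no analysis at query time. A minimal sketch in plain Python (a toy inverted index with illustrative names only, not Lucene's API):

```python
from collections import defaultdict

def build_index(docs):
    """Map each whitespace-separated token to the ids of the docs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            index[token].add(doc_id)
    return index

def term_query(index, term):
    """Exact-match lookup: the query term must equal an indexed token exactly."""
    return sorted(index.get(term, set()))

docs = {1: "كتاب جديد", 2: "كتاب قديم"}
index = build_index(docs)
print(term_query(index, "كتاب"))  # exact token match: found in both docs
```

So as long as the query term equals what was indexed, no Arabic analyzer is needed; an analyzer only matters when you want stemming/normalization at index and query time.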

On 6 February 2011 00:51, Prescott Nasser geobmx...@hotmail.com wrote:


 Unfortunately, I don't think we have that. We're working on creating a new
 port of the java lucene code, but I don't know the timeline yet - I'm sure
 there will be a lot of chatter on this mailing list soon.

 ~Prescott





 
  Date: Sat, 5 Feb 2011 22:57:11 +
  Subject: Arabic Analyzer
  From: b...@planetcloud.co.uk
  To: lucene-net-dev@lucene.apache.org
 
  Is there an Arabic Analyzer available for Lucene.NET? I see there has been
  one contributed to the Java project but wasn't sure if this has been ported.
 
  Thanks,
 
  Ben




-- 

Ben Foster

planetcloud
The Elms, Hawton
Newark-on-Trent
Nottinghamshire
NG24 3RL

www.planetcloud.co.uk


[jira] Updated: (LUCENENET-391) Luke.Net for Lucene.Net

2011-02-06 Thread Pasha Bizhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENENET-391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pasha Bizhan updated LUCENENET-391:
---

Attachment: luke-net-src.zip
luke-net-bin.zip

binary and source code

 Luke.Net for Lucene.Net
 ---

 Key: LUCENENET-391
 URL: https://issues.apache.org/jira/browse/LUCENENET-391
 Project: Lucene.Net
  Issue Type: New Feature
Reporter: Pasha Bizhan
Priority: Minor
 Attachments: luke-net-bin.zip, luke-net-src.zip


 .net port of Luke.Net for Lucene.Net 1.4

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] Created: (LUCENENET-392) Arabic Analyzer

2011-02-06 Thread Digy (JIRA)
Arabic Analyzer
---

 Key: LUCENENET-392
 URL: https://issues.apache.org/jira/browse/LUCENENET-392
 Project: Lucene.Net
  Issue Type: New Feature
 Environment: Lucene.Net 2.9.2 VS2010
Reporter: Digy
Priority: Trivial
 Attachments: Lucene.Net.Analyzers.zip

A quick port of Lucene.Java's Arabic analyzer.
All unit tests pass.

DIGY

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




Re: Arabic Analyzer

2011-02-06 Thread Ben Foster
Thanks Digy, I'll give it a whirl tomorrow.
-Original Message-
From: Digy digyd...@gmail.com
Date: Sun, 6 Feb 2011 23:58:01 
To: lucene-net-dev@lucene.apache.org
Reply-To: lucene-net-dev@lucene.apache.org
Subject: RE: Arabic Analyzer

Here is a port of lucene.java's arabic analyzer (
https://issues.apache.org/jira/browse/LUCENENET-392 )

You can safely remove nunit dependency and test cases from the project.

DIGY

-Original Message-
From: Ben Foster [mailto:b...@planetcloud.co.uk] 
Sent: Sunday, February 06, 2011 5:47 PM
To: lucene-net-dev@lucene.apache.org
Subject: Re: Arabic Analyzer

Is it still possible to use fixed term queries in Arabic (i.e. NOT using an
Analyzer)?

Thanks
Ben

On 6 February 2011 00:51, Prescott Nasser geobmx...@hotmail.com wrote:


 Unfortunately, I don't think we have that. We're working on creating a new
 port of the java lucene code, but I don't know the timeline yet - I'm sure
 there will be a lot of chatter on this mailing list soon.

 ~Prescott





 
  Date: Sat, 5 Feb 2011 22:57:11 +
  Subject: Arabic Analyzer
  From: b...@planetcloud.co.uk
  To: lucene-net-dev@lucene.apache.org
 
  Is there an Arabic Analyzer available for Lucene.NET? I see there has been
  one contributed to the Java project but wasn't sure if this has been ported.
 
  Thanks,
 
  Ben




-- 

Ben Foster

planetcloud
The Elms, Hawton
Newark-on-Trent
Nottinghamshire
NG24 3RL

www.planetcloud.co.uk



RE: Arabic Analyzer

2011-02-06 Thread Digy
By Hand. Find-Replace is your best friend :)

DIGY


-Original Message-
From: Prescott Nasser [mailto:geobmx...@hotmail.com] 
Sent: Monday, February 07, 2011 12:02 AM
To: lucene-net-dev@lucene.apache.org
Subject: RE: Arabic Analyzer


Hey Digy,
 
Do you use Sharpen (or some other conversion tool), or for such minimal
amounts of code do you just port it by hand?
 
~Prescott






 From: digyd...@gmail.com
 To: lucene-net-dev@lucene.apache.org
 Subject: RE: Arabic Analyzer
 Date: Sun, 6 Feb 2011 23:58:01 +0200

 Here is a port of lucene.java's arabic analyzer (
 https://issues.apache.org/jira/browse/LUCENENET-392 )

 You can safely remove nunit dependency and test cases from the project.

 DIGY

 -Original Message-
 From: Ben Foster [mailto:b...@planetcloud.co.uk]
 Sent: Sunday, February 06, 2011 5:47 PM
 To: lucene-net-dev@lucene.apache.org
 Subject: Re: Arabic Analyzer

 Is it still possible to use fixed term queries in Arabic (i.e. NOT using
an
 Analyzer)?

 Thanks
 Ben

 On 6 February 2011 00:51, Prescott Nasser wrote:

 
  Unfortunately, I don't think we have that. We're working on creating a
new
  port of the java lucene code, but I don't know the timeline yet - I'm
sure
  there will be a lot of chatter on this mailing list soon.
 
  ~Prescott
 
 
 
 
 
  
   Date: Sat, 5 Feb 2011 22:57:11 +
   Subject: Arabic Analyzer
   From: b...@planetcloud.co.uk
   To: lucene-net-dev@lucene.apache.org
  
   Is there an Arabic Analyzer available for Lucene.NET? I see there has been
   one contributed to the Java project but wasn't sure if this has been ported.
  
   Thanks,
  
   Ben
 



 --

 Ben Foster

 planetcloud
 The Elms, Hawton
 Newark-on-Trent
 Nottinghamshire
 NG24 3RL

 www.planetcloud.co.uk



Re: Incubator Infra: svn and Committers

2011-02-06 Thread Stefan Bodewig
On 2011-02-07, Stefan Bodewig wrote:

 On 2011-02-06, Digy wrote:

 I copied Lucene.Net to the new location
 (https://svn.apache.org/repos/asf/incubator/lucene.net/)

 Isn't that one lucene.net too many?

Forget that, I didn't look closely enough.

Stefan


[jira] Commented: (LUCENE-2609) Generate jar containing test classes.

2011-02-06 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12991132#comment-12991132
 ] 

Shai Erera commented on LUCENE-2609:


Thanks Steven !

Committed revision 1067623 (3x).

Merging to trunk now ...

 Generate jar containing test classes.
 -

 Key: LUCENE-2609
 URL: https://issues.apache.org/jira/browse/LUCENE-2609
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Build
Affects Versions: 3.0.2
Reporter: Drew Farris
Assignee: Shai Erera
Priority: Minor
 Fix For: 3.1, 4.0

 Attachments: LUCENE-2609.patch, LUCENE-2609.patch, LUCENE-2609.patch, 
 LUCENE-2609.patch, LUCENE-2609.patch, LUCENE-2609.patch, LUCENE-2609.patch, 
 LUCENE-2609.patch, LUCENE-2609.patch, LUCENE-2609.patch


 The test classes are useful for writing unit tests for code external to the 
 Lucene project. It would be helpful to build a jar of these classes and 
 publish them as a maven dependency.

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2907) termsenum bug when running with multithreaded search

2011-02-06 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12991133#comment-12991133
 ] 

Robert Muir commented on LUCENE-2907:
-

bq. Have you found out what happens or where a thread-safety issue could be?

Yes, I found the bug... unfortunately it is actually my automaton problem :(
I will create a nice patch today.

bq. The information on this issue is too small, there seems to be lots of 
IRC/GTalk communication in parallel.

What do you mean? Mike was working on the bug for a long time, but quickly had to 
stop working on it, so he emailed me all of his state. I took over from there 
for a while, and I opened this issue with my debugging... though I didn't have 
much time to work on it yesterday (only about an hour), because I already had 
plans.

I tried to be completely open and dump all of my state/debugging 
information/brainstorming on this JIRA issue, but it only resulted in me 
reporting misleading and confusing information... so I think the information on 
this issue is actually too much?


 termsenum bug when running with multithreaded search
 

 Key: LUCENE-2907
 URL: https://issues.apache.org/jira/browse/LUCENE-2907
 Project: Lucene - Java
  Issue Type: Bug
Reporter: Robert Muir
 Attachments: LUCENE-2907_repro.patch, correct_seeks.txt, 
 incorrect_seeks.txt, seeks_diff.txt


 This one popped in hudson (with a test that runs the same query against 
 fieldcache, and with a filter rewrite, and compares results)
 However, it's actually worse and unrelated to the fieldcache: you can set both 
 to filter rewrite and it will still fail.

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2907) termsenum bug when running with multithreaded search

2011-02-06 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12991137#comment-12991137
 ] 

Uwe Schindler commented on LUCENE-2907:
---

A bug in automaton that only happens in multi-threaded use? So it's the cache there?

 termsenum bug when running with multithreaded search
 

 Key: LUCENE-2907
 URL: https://issues.apache.org/jira/browse/LUCENE-2907
 Project: Lucene - Java
  Issue Type: Bug
Reporter: Robert Muir
 Attachments: LUCENE-2907_repro.patch, correct_seeks.txt, 
 incorrect_seeks.txt, seeks_diff.txt


 This one popped in hudson (with a test that runs the same query against 
 fieldcache, and with a filter rewrite, and compares results)
 However, it's actually worse and unrelated to the fieldcache: you can set both 
 to filter rewrite and it will still fail.

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2907) termsenum bug when running with multithreaded search

2011-02-06 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12991140#comment-12991140
 ] 

Robert Muir commented on LUCENE-2907:
-

In combination with other things. In my opinion the problem is the cache in 
getNumberedStates.

But the real solution (in my opinion) is to clean up all this crap so the 
termsenum only takes a completely immutable view of what it needs and for the 
Query to compile once in its ctor, and remove any stupid caching.

So, this is what I am working on now.
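The pattern described here — compile once in the constructor, hand every consumer an immutable snapshot, and drop lazy caching — can be sketched generically. The names below are hypothetical plain-Python stand-ins, not Lucene's actual classes:

```python
class CompiledAutomaton:
    """Immutable, pre-sorted view handed to each terms enumerator."""
    def __init__(self, transitions):
        # Freeze as a sorted tuple of tuples: no consumer can mutate shared state.
        self.transitions = tuple(sorted(tuple(t) for t in transitions))

class AutomatonQuery:
    """Compiles eagerly in the ctor -- no lazy, synchronized cache to race on."""
    def __init__(self, transitions):
        self.compiled = CompiledAutomaton(transitions)

    def terms_enum(self):
        # Every searching thread sees the same immutable compiled form.
        return self.compiled

q = AutomatonQuery([(2, "b"), (1, "a")])
print(q.terms_enum().transitions)  # identical frozen state for every caller
```

Because nothing is mutated after construction, concurrent searches cannot observe each other's half-built state, which is the failure mode a lazily populated cache invites.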


 termsenum bug when running with multithreaded search
 

 Key: LUCENE-2907
 URL: https://issues.apache.org/jira/browse/LUCENE-2907
 Project: Lucene - Java
  Issue Type: Bug
Reporter: Robert Muir
 Attachments: LUCENE-2907_repro.patch, correct_seeks.txt, 
 incorrect_seeks.txt, seeks_diff.txt


 This one popped in hudson (with a test that runs the same query against 
 fieldcache, and with a filter rewrite, and compares results)
 However, it's actually worse and unrelated to the fieldcache: you can set both 
 to filter rewrite and it will still fail.

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[HUDSON] Lucene-Solr-tests-only-3.x - Build # 4555 - Failure

2011-02-06 Thread Apache Hudson Server
Build: https://hudson.apache.org/hudson/job/Lucene-Solr-tests-only-3.x/4555/

1 tests failed.
REGRESSION:  
org.apache.lucene.benchmark.byTask.feeds.TrecContentSourceTest.testTrecFeedDirAllTypes

Error Message:
expected:<TEST-00[0]> but was:<TEST-00[1]>

Stack Trace:
at 
org.apache.lucene.benchmark.byTask.feeds.TrecContentSourceTest.assertDocData(TrecContentSourceTest.java:70)
at 
org.apache.lucene.benchmark.byTask.feeds.TrecContentSourceTest.testTrecFeedDirAllTypes(TrecContentSourceTest.java:369)
at 
org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1045)
at 
org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:977)




Build Log (for compile errors):
[...truncated 6504 lines...]



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1540) Improvements to contrib.benchmark for TREC collections

2011-02-06 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12991145#comment-12991145
 ] 

Michael McCandless commented on LUCENE-1540:


I think this commit has caused a failure on at least 3.x?
{noformat}
[junit] Testcase: 
testTrecFeedDirAllTypes(org.apache.lucene.benchmark.byTask.feeds.TrecContentSourceTest):
  Caused an ERROR
[junit] expected:<TEST-00[0]> but was:<TEST-00[1]>
[junit] at 
org.apache.lucene.benchmark.byTask.feeds.TrecContentSourceTest.assertDocData(TrecContentSourceTest.java:70)
[junit] at 
org.apache.lucene.benchmark.byTask.feeds.TrecContentSourceTest.testTrecFeedDirAllTypes(TrecContentSourceTest.java:369)
[junit] at 
org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1045)
[junit] at 
org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:977)
[junit] 
[junit] 
[junit] Tests run: 6, Failures: 0, Errors: 1, Time elapsed: 0.488 sec
[junit] 
[junit] - Standard Error -
[junit] WARNING: test method: 'testBadDate' left thread running: 
Thread[Thread-6,5,main]
[junit] RESOURCE LEAK: test method: 'testBadDate' left 1 thread(s) running
[junit] NOTE: reproduce with: ant test -Dtestcase=TrecContentSourceTest 
-Dtestmethod=testBadDate -Dtests.seed=-1485993969467368126:6510043524258948665 
-Dtests.multiplier=5
[junit] NOTE: reproduce with: ant test -Dtestcase=TrecContentSourceTest 
-Dtestmethod=testTrecFeedDirAllTypes 
-Dtests.seed=-1485993969467368126:-9055415333820766139 -Dtests.multiplier=5
[junit] NOTE: test params are: locale=tr_TR, timezone=Europe/Zagreb
[junit] NOTE: all tests run in this JVM:
[junit] [TrecContentSourceTest]
[junit] NOTE: FreeBSD 8.2-RC2 amd64/Sun Microsystems Inc. 1.6.0 
(64-bit)/cpus=16,threads=1,free=66439840,total=86376448
[junit] -  ---
{noformat}

 Improvements to contrib.benchmark for TREC collections
 --

 Key: LUCENE-1540
 URL: https://issues.apache.org/jira/browse/LUCENE-1540
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/benchmark
Reporter: Tim Armstrong
Assignee: Doron Cohen
Priority: Minor
 Fix For: 3.1, 4.0

 Attachments: LUCENE-1540.patch, LUCENE-1540.patch, LUCENE-1540.patch, 
 LUCENE-1540.patch, trecdocs.zip


 The benchmarking utilities for  TREC test collections (http://trec.nist.gov) 
 are quite limited and do not support some of the variations in format of 
 older TREC collections.  
 I have been doing some benchmarking work with Lucene and have had to modify 
 the package to support:
 * Older TREC document formats, which the current parser fails on due to 
 missing document headers.
 * Variations in query format - newlines after title tag causing the query 
 parser to get confused.
 * Ability to detect and read in uncompressed text collections
 * Storage of document numbers by default without storing full text.
 I can submit a patch if there is interest, although I will probably want to 
 write unit tests for the new functionality first.

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2907) automaton termsenum bug when running with multithreaded search

2011-02-06 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-2907:


Summary: automaton termsenum bug when running with multithreaded search  
(was: termsenum bug when running with multithreaded search)

editing the description so it's not confusing, sorry :)

 automaton termsenum bug when running with multithreaded search
 --

 Key: LUCENE-2907
 URL: https://issues.apache.org/jira/browse/LUCENE-2907
 Project: Lucene - Java
  Issue Type: Bug
Reporter: Robert Muir
 Attachments: LUCENE-2907_repro.patch, correct_seeks.txt, 
 incorrect_seeks.txt, seeks_diff.txt


 This one popped in hudson (with a test that runs the same query against 
 fieldcache, and with a filter rewrite, and compares results)
 However, it's actually worse and unrelated to the fieldcache: you can set both 
 to filter rewrite and it will still fail.

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2907) automaton termsenum bug when running with multithreaded search

2011-02-06 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-2907:


Attachment: LUCENE-2907.patch

Attached is a patch. I removed all the transient/synchronized stuff from the 
query.

Instead: AutomatonTermsEnum only takes an immutable, compiled form of the 
automaton (essentially a sorted transitions array).

The query computes this compiled form (or any other simpler rewritten form) in 
its ctor.


 automaton termsenum bug when running with multithreaded search
 --

 Key: LUCENE-2907
 URL: https://issues.apache.org/jira/browse/LUCENE-2907
 Project: Lucene - Java
  Issue Type: Bug
Reporter: Robert Muir
 Attachments: LUCENE-2907.patch, LUCENE-2907_repro.patch, 
 correct_seeks.txt, incorrect_seeks.txt, seeks_diff.txt


 This one popped in hudson (with a test that runs the same query against 
 fieldcache, and with a filter rewrite, and compares results)
 However, it's actually worse and unrelated to the fieldcache: you can set both 
 to filter rewrite and it will still fail.

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2906) Filter to process output of ICUTokenizer and create overlapping bigrams for CJK

2011-02-06 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-2906:


Attachment: LUCENE-2906.patch

Here's a patch going in a slightly different direction (though we can still add 
some special ICU-only stuff here).

Instead, the patch synchronizes the token types of ICUTokenizer with 
StandardTokenizer, adds the necessary types to both, and then adds the 
bigramming logic to StandardFilter.

This way, CJK works easily out of the box, for all of Unicode (e.g. 
supplementaries) and plays well with other languages. I deprecated CJKTokenizer 
in the patch and pulled out its special full-width filter into a separate 
TokenFilter.
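The bigramming logic itself is simple: for a run of Han/Kana characters, every adjacent pair becomes a token. A sketch of the overlapping-bigram idea in plain Python (illustrative only, not the actual filter code):

```python
def overlapping_bigrams(run):
    """Return overlapping character bigrams for one CJK token run;
    a lone character falls through as a unigram."""
    if len(run) < 2:
        return [run] if run else []
    # "ABCD" -> AB, BC, CD: each window of two adjacent characters
    return [run[i:i + 2] for i in range(len(run) - 1)]

print(overlapping_bigrams("東京都庁"))  # ['東京', '京都', '都庁']
```

The overlap is what lets a two-character query match anywhere inside a longer run, at the cost of roughly doubling the token count for CJK text.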


 Filter to process output of ICUTokenizer and create overlapping bigrams for 
 CJK 
 

 Key: LUCENE-2906
 URL: https://issues.apache.org/jira/browse/LUCENE-2906
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Analysis
Reporter: Tom Burton-West
Priority: Minor
 Attachments: LUCENE-2906.patch


 The ICUTokenizer produces unigrams for CJK. We would like to use the 
 ICUTokenizer but have overlapping bigrams created for CJK as in the CJK 
 Analyzer.  This filter would take the output of the ICUtokenizer, read the 
 ScriptAttribute and for selected scripts (Han, Kana), would produce 
 overlapping bigrams.

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[HUDSON] Lucene-Solr-tests-only-3.x - Build # 4561 - Failure

2011-02-06 Thread Apache Hudson Server
Build: https://hudson.apache.org/hudson/job/Lucene-Solr-tests-only-3.x/4561/

1 tests failed.
REGRESSION:  org.apache.solr.client.solrj.TestLBHttpSolrServer.testReliability

Error Message:
No live SolrServers available to handle this request

Stack Trace:
org.apache.solr.client.solrj.SolrServerException: No live SolrServers available 
to handle this request
at 
org.apache.solr.client.solrj.impl.LBHttpSolrServer.request(LBHttpSolrServer.java:222)
at 
org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:89)
at org.apache.solr.client.solrj.SolrServer.query(SolrServer.java:118)
at 
org.apache.solr.client.solrj.TestLBHttpSolrServer.testReliability(TestLBHttpSolrServer.java:177)
at 
org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1045)
at 
org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:977)
Caused by: org.apache.solr.client.solrj.SolrServerException: 
java.net.SocketTimeoutException: Read timed out
at 
org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:484)
at 
org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:245)
at 
org.apache.solr.client.solrj.impl.LBHttpSolrServer.request(LBHttpSolrServer.java:206)
Caused by: java.net.SocketTimeoutException: Read timed out
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.read(SocketInputStream.java:146)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
at java.io.BufferedInputStream.read(BufferedInputStream.java:254)
at 
org.apache.commons.httpclient.HttpParser.readRawLine(HttpParser.java:78)
at 
org.apache.commons.httpclient.HttpParser.readLine(HttpParser.java:106)
at 
org.apache.commons.httpclient.HttpConnection.readLine(HttpConnection.java:1116)
at 
org.apache.commons.httpclient.MultiThreadedHttpConnectionManager$HttpConnectionAdapter.readLine(MultiThreadedHttpConnectionManager.java:1413)
at 
org.apache.commons.httpclient.HttpMethodBase.readStatusLine(HttpMethodBase.java:1973)
at 
org.apache.commons.httpclient.HttpMethodBase.readResponse(HttpMethodBase.java:1735)
at 
org.apache.commons.httpclient.HttpMethodBase.execute(HttpMethodBase.java:1098)
at 
org.apache.commons.httpclient.HttpMethodDirector.executeWithRetry(HttpMethodDirector.java:398)
at 
org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:171)
at 
org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:397)
at 
org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:323)
at 
org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:428)




Build Log (for compile errors):
[...truncated 10090 lines...]



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[HUDSON-MAVEN] Lucene-Solr-Maven-trunk #17: POMs out of sync

2011-02-06 Thread Apache Hudson Server
Build: https://hudson.apache.org/hudson/job/Lucene-Solr-Maven-trunk/17/

No tests ran.

Build Log (for compile errors):
[...truncated 7757 lines...]



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Reopened: (LUCENE-2894) Use of google-code-prettify for Lucene/Solr Javadoc

2011-02-06 Thread Koji Sekiguchi (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Sekiguchi reopened LUCENE-2894:



Reopening the issue.

Lucene javadoc on hudson looks fine (syntax highlighting works correctly):

https://hudson.apache.org/hudson/job/Lucene-trunk/javadoc/all/overview-summary.html

but the Solr javadoc on hudson does not look good:

https://hudson.apache.org/hudson/job/Solr-trunk/javadoc/org/apache/solr/handler/component/TermsComponent.html

Building both javadocs locally works fine.

 Use of google-code-prettify for Lucene/Solr Javadoc
 ---

 Key: LUCENE-2894
 URL: https://issues.apache.org/jira/browse/LUCENE-2894
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Javadocs
Reporter: Koji Sekiguchi
Assignee: Koji Sekiguchi
Priority: Minor
 Fix For: 3.1, 4.0

 Attachments: LUCENE-2894.patch, LUCENE-2894.patch, LUCENE-2894.patch, 
 LUCENE-2894.patch


 My company, RONDHUIT uses google-code-prettify (Apache License 2.0) in 
 Javadoc for syntax highlighting:
 http://www.rondhuit-demo.com/RCSS/api/com/rondhuit/solr/analysis/JaReadingSynonymFilterFactory.html
 I think we can use it for Lucene javadoc (java sample code in overview.html 
 etc) and Solr javadoc (Analyzer Factories etc) to improve or simplify our 
 life.

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2894) Use of google-code-prettify for Lucene/Solr Javadoc

2011-02-06 Thread Koji Sekiguchi (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12991165#comment-12991165
 ] 

Koji Sekiguchi commented on LUCENE-2894:


On my Mac, the prettify directory is correctly present under the api directory after ant package:

{code}
$ cd solr
$ ant clean set-fsdir package
$ ls build/docs/api/
allclasses-frame.html  deprecated-list.html   package-list
allclasses-noframe.htmlhelp-doc.html  prettify
constant-values.html   index-all.html resources
contrib-solr-analysis-extras   index.html 
serialized-form.html
contrib-solr-cell  org  solr
contrib-solr-clusteringoverview-frame.htmlsolrj
contrib-solr-dataimporthandler overview-summary.html  
stylesheet+prettify.css
contrib-solr-uima  overview-tree.html
{code}


 Use of google-code-prettify for Lucene/Solr Javadoc
 ---

 Key: LUCENE-2894
 URL: https://issues.apache.org/jira/browse/LUCENE-2894
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Javadocs
Reporter: Koji Sekiguchi
Assignee: Koji Sekiguchi
Priority: Minor
 Fix For: 3.1, 4.0

 Attachments: LUCENE-2894.patch, LUCENE-2894.patch, LUCENE-2894.patch, 
 LUCENE-2894.patch


 My company, RONDHUIT uses google-code-prettify (Apache License 2.0) in 
 Javadoc for syntax highlighting:
 http://www.rondhuit-demo.com/RCSS/api/com/rondhuit/solr/analysis/JaReadingSynonymFilterFactory.html
 I think we can use it for Lucene javadoc (java sample code in overview.html 
 etc) and Solr javadoc (Analyzer Factories etc) to improve or simplify our 
 life.

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2908) clean up serialization in the codebase

2011-02-06 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-2908:


Attachment: LUCENE-2908.patch

Attached is a patch. All tests pass.

 clean up serialization in the codebase
 --

 Key: LUCENE-2908
 URL: https://issues.apache.org/jira/browse/LUCENE-2908
 Project: Lucene - Java
  Issue Type: Task
Reporter: Robert Muir
 Fix For: 4.0

 Attachments: LUCENE-2908.patch


 We removed contrib/remote, but forgot to clean up serialization hell 
 everywhere.
 This is no longer needed, never really worked (e.g. across versions), and 
 slows development (e.g. I wasted a long time debugging stupid serialization of 
 Similarity.idfExplain when trying to make a patch for the scoring system).

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1799) Unicode compression

2011-02-06 Thread DM Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12991170#comment-12991170
 ] 

DM Smith commented on LUCENE-1799:
--

Any idea as to when this will be released?

 Unicode compression
 ---

 Key: LUCENE-1799
 URL: https://issues.apache.org/jira/browse/LUCENE-1799
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Store
Affects Versions: 2.4.1
Reporter: DM Smith
Priority: Minor
 Attachments: Benchmark.java, Benchmark.java, Benchmark.java, 
 LUCENE-1779.patch, LUCENE-1799.patch, LUCENE-1799.patch, LUCENE-1799.patch, 
 LUCENE-1799.patch, LUCENE-1799.patch, LUCENE-1799.patch, LUCENE-1799.patch, 
 LUCENE-1799.patch, LUCENE-1799.patch, LUCENE-1799.patch, LUCENE-1799.patch, 
 LUCENE-1799.patch, LUCENE-1799.patch, LUCENE-1799.patch, LUCENE-1799_big.patch


 In lucene-1793, there is the off-topic suggestion to provide compression of 
 Unicode data. The motivation was a custom encoding in a Russian analyzer. The 
 original supposition was that it provided a more compact index.
 This led to the comment that a different or compressed encoding would be a 
 generally useful feature. 
 BOCU-1 was suggested as a possibility. This is a patented algorithm by IBM 
 with an implementation in ICU. If Lucene provides its own implementation, a 
 freely available, royalty-free license would need to be obtained.
 SCSU is another Unicode compression algorithm that could be used. 
 An advantage of these methods is that they work on the whole of Unicode. If 
 that is not needed an encoding such as iso8859-1 (or whatever covers the 
 input) could be used.

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2894) Use of google-code-prettify for Lucene/Solr Javadoc

2011-02-06 Thread Steven Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12991173#comment-12991173
 ] 

Steven Rowe commented on LUCENE-2894:
-

Both of the nightly Hudson Maven builds failed because javadoc jars were not 
produced by the Ant build (scroll down to the bottom to see the error about 
javadoc jars not being available to deploy): 

https://hudson.apache.org/hudson/job/Lucene-Solr-Maven-trunk/17/consoleText
https://hudson.apache.org/hudson/job/Lucene-Solr-Maven-3.x/16/consoleText

 Use of google-code-prettify for Lucene/Solr Javadoc
 ---

 Key: LUCENE-2894
 URL: https://issues.apache.org/jira/browse/LUCENE-2894
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Javadocs
Reporter: Koji Sekiguchi
Assignee: Koji Sekiguchi
Priority: Minor
 Fix For: 3.1, 4.0

 Attachments: LUCENE-2894.patch, LUCENE-2894.patch, LUCENE-2894.patch, 
 LUCENE-2894.patch


 My company, RONDHUIT uses google-code-prettify (Apache License 2.0) in 
 Javadoc for syntax highlighting:
 http://www.rondhuit-demo.com/RCSS/api/com/rondhuit/solr/analysis/JaReadingSynonymFilterFactory.html
 I think we can use it for Lucene javadoc (java sample code in overview.html 
 etc) and Solr javadoc (Analyzer Factories etc) to improve or simplify our 
 life.




[jira] Updated: (LUCENE-2907) automaton termsenum bug when running with multithreaded search

2011-02-06 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-2907:


Attachment: LUCENE-2907.patch

Here's the same patch, but cleaned up a bit (e.g. making some things private, 
final, etc.).

 automaton termsenum bug when running with multithreaded search
 --

 Key: LUCENE-2907
 URL: https://issues.apache.org/jira/browse/LUCENE-2907
 Project: Lucene - Java
  Issue Type: Bug
Reporter: Robert Muir
 Attachments: LUCENE-2907.patch, LUCENE-2907.patch, 
 LUCENE-2907_repro.patch, correct_seeks.txt, incorrect_seeks.txt, 
 seeks_diff.txt


 This one popped up in Hudson (with a test that runs the same query against 
 fieldcache, and with a filter rewrite, and compares results).
 However, it's actually worse and unrelated to the fieldcache: you can set both 
 to filter rewrite and it will still fail.




[jira] Commented: (LUCENE-1540) Improvements to contrib.benchmark for TREC collections

2011-02-06 Thread Doron Cohen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12991176#comment-12991176
 ] 

Doron Cohen commented on LUCENE-1540:
-

I am able to reproduce this on Linux.
The test fails with *locale tr_TR* because TrecDocParser was upper-casing the 
file names for deciding which parser to apply.
The problem is that toUpperCase() is locale-sensitive, so the upper-cased file 
name no longer matched the enum name.
Fixed by adding a lower-case dirName member to the enums.
Also recreated the test files zip with '-UN u' for UTF8 handling of file names 
in the zip.

Committed at r1067699 for 3x.

In trunk the test passes with the same args on Linux as well, but fails if you 
pass the locale that was randomly selected in 3x, e.g.:
ant test -Dtestcase=TrecContentSourceTest -Dtestmethod=testTrecFeedDirAllTypes 
-Dtests.locale=tr_TR

Will merge the fix to trunk shortly.
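The underlying locale trap is easy to reproduce in isolation (a minimal sketch, 
not the benchmark code itself): under the Turkish locale, 'i' upper-cases to the 
dotted capital 'İ' (U+0130), so the upper-cased directory name no longer matches 
the enum constant.

```java
import java.util.Locale;

public class TurkishUpperCase {
    public static void main(String[] args) {
        String dirName = "fbis"; // a TREC directory name, as in TrecDocParser
        // Under tr_TR, 'i' maps to the dotted capital 'İ' (U+0130), not 'I'
        String turkish = dirName.toUpperCase(new Locale("tr", "TR"));
        // Locale.ENGLISH gives the locale-insensitive result the lookup expects
        String english = dirName.toUpperCase(Locale.ENGLISH);
        System.out.println(turkish.equals("FBIS")); // false: "FBİS"
        System.out.println(english.equals("FBIS")); // true
    }
}
```

This is why either storing a canonical lower-case dirName (as in the fix) or 
calling toUpperCase(Locale.ENGLISH) avoids the locale sensitivity.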

 Improvements to contrib.benchmark for TREC collections
 --

 Key: LUCENE-1540
 URL: https://issues.apache.org/jira/browse/LUCENE-1540
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/benchmark
Reporter: Tim Armstrong
Assignee: Doron Cohen
Priority: Minor
 Fix For: 3.1, 4.0

 Attachments: LUCENE-1540.patch, LUCENE-1540.patch, LUCENE-1540.patch, 
 LUCENE-1540.patch, trecdocs.zip


 The benchmarking utilities for  TREC test collections (http://trec.nist.gov) 
 are quite limited and do not support some of the variations in format of 
 older TREC collections.  
 I have been doing some benchmarking work with Lucene and have had to modify 
 the package to support:
 * Older TREC document formats, which the current parser fails on due to 
 missing document headers.
 * Variations in query format - newlines after title tag causing the query 
 parser to get confused.
 * Ability to detect and read in uncompressed text collections
 * Storage of document numbers by default without storing full text.
 I can submit a patch if there is interest, although I will probably want to 
 write unit tests for the new functionality first.




[jira] Commented: (LUCENE-2908) clean up serialization in the codebase

2011-02-06 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12991177#comment-12991177
 ] 

Simon Willnauer commented on LUCENE-2908:
-

Big +1 to get rid of Serializable; it's broken anyway, slow, and not really 
working across versions! Folks that want to send stuff over the wire using 
Java serialization should put API sugar on top.



 clean up serialization in the codebase
 --

 Key: LUCENE-2908
 URL: https://issues.apache.org/jira/browse/LUCENE-2908
 Project: Lucene - Java
  Issue Type: Task
Reporter: Robert Muir
 Fix For: 4.0

 Attachments: LUCENE-2908.patch


 We removed contrib/remote, but forgot to clean up serialization hell 
 everywhere.
 This is no longer needed, never really worked (e.g. across versions), and 
 slows development (e.g. I wasted a long time debugging stupid serialization of 
 Similarity.idfExplain when trying to make a patch for the scoring system).




[jira] Commented: (LUCENE-2907) automaton termsenum bug when running with multithreaded search

2011-02-06 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12991178#comment-12991178
 ] 

Simon Willnauer commented on LUCENE-2907:
-

Patch looks good. Just being super picky: you don't need all the this.bla in 
CompiledAutomaton ;)

I am not sure if CompiledAutomaton is a good name, since it is not really an 
automaton, is it?

simon

 automaton termsenum bug when running with multithreaded search
 --

 Key: LUCENE-2907
 URL: https://issues.apache.org/jira/browse/LUCENE-2907
 Project: Lucene - Java
  Issue Type: Bug
Reporter: Robert Muir
 Attachments: LUCENE-2907.patch, LUCENE-2907.patch, 
 LUCENE-2907_repro.patch, correct_seeks.txt, incorrect_seeks.txt, 
 seeks_diff.txt






[jira] Commented: (LUCENE-1540) Improvements to contrib.benchmark for TREC collections

2011-02-06 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12991179#comment-12991179
 ] 

Robert Muir commented on LUCENE-1540:
-

Hi Doron, about the test random seeds:

It is complicated (though maybe we could fix this!) for the same random seed in 
trunk to work just like 3.x

But for the locales: the random locale is picked from the available system 
locales. These change from JRE to JRE, so unfortunately we cannot guarantee 
that the same seed chooses the same locale... It's the same with time zones, 
and these even change in minor JDK updates!

I wish we knew of a good solution, because I hate it when things aren't 
completely reproducible everywhere.


 Improvements to contrib.benchmark for TREC collections
 --

 Key: LUCENE-1540
 URL: https://issues.apache.org/jira/browse/LUCENE-1540
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/benchmark
Reporter: Tim Armstrong
Assignee: Doron Cohen
Priority: Minor
 Fix For: 3.1, 4.0

 Attachments: LUCENE-1540.patch, LUCENE-1540.patch, LUCENE-1540.patch, 
 LUCENE-1540.patch, trecdocs.zip






[jira] Commented: (LUCENE-2907) automaton termsenum bug when running with multithreaded search

2011-02-06 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12991180#comment-12991180
 ] 

Robert Muir commented on LUCENE-2907:
-

bq. I am not sure if CompiledAutomaton is a good name, since it is not really 
an automaton, is it?

It is a compiled form of the automaton... and it is a DFA, mathematically.

At the end of the day this CompiledAutomaton is an internal API; we can change 
its name at any time.


 automaton termsenum bug when running with multithreaded search
 --

 Key: LUCENE-2907
 URL: https://issues.apache.org/jira/browse/LUCENE-2907
 Project: Lucene - Java
  Issue Type: Bug
Reporter: Robert Muir
 Attachments: LUCENE-2907.patch, LUCENE-2907.patch, 
 LUCENE-2907_repro.patch, correct_seeks.txt, incorrect_seeks.txt, 
 seeks_diff.txt






[jira] Commented: (LUCENE-1540) Improvements to contrib.benchmark for TREC collections

2011-02-06 Thread Doron Cohen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12991181#comment-12991181
 ] 

Doron Cohen commented on LUCENE-1540:
-

Fix for the locale issue merged to trunk at r1076605.
Keeping open for a day or so to make sure there are no more failures.

 Improvements to contrib.benchmark for TREC collections
 --

 Key: LUCENE-1540
 URL: https://issues.apache.org/jira/browse/LUCENE-1540
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/benchmark
Reporter: Tim Armstrong
Assignee: Doron Cohen
Priority: Minor
 Fix For: 3.1, 4.0

 Attachments: LUCENE-1540.patch, LUCENE-1540.patch, LUCENE-1540.patch, 
 LUCENE-1540.patch, trecdocs.zip






[jira] Commented: (LUCENE-2906) Filter to process output of ICUTokenizer and create overlapping bigrams for CJK

2011-02-06 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12991186#comment-12991186
 ] 

Robert Muir commented on LUCENE-2906:
-

{quote}
How will this differ from the SmartChineseAnalyzer?
{quote}

The SmartChineseAnalyzer is for Simplified Chinese only... this is about the 
language-independent technique similar to what CJKAnalyzer does today.

{quote}
I doubt it but can this be in 3.1?
{quote}

Well, I hate the way CJKAnalyzer treats things like supplementary characters 
(wrongly).
This is definitely a bug, and it is fixed here. Part of me wants to fix this as 
quickly as possible.

At the same time though, I would prefer 3.2... otherwise I would feel like I am 
rushing things.

I don't think 3.2 needs to come a year after 3.1... in fact, since we have a 
stable branch, I think it's stupid to make bugfix releases like 3.1.1 when we 
could just push out a new minor version (3.2) with bugfixes instead. The whole 
branch is intended to be stable changes, so I think this is a better use of our 
time. But this is just my opinion; we can discuss it later on the list as one 
idea to promote more rapid releases.


 Filter to process output of ICUTokenizer and create overlapping bigrams for 
 CJK 
 

 Key: LUCENE-2906
 URL: https://issues.apache.org/jira/browse/LUCENE-2906
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Analysis
Reporter: Tom Burton-West
Priority: Minor
 Attachments: LUCENE-2906.patch


 The ICUTokenizer produces unigrams for CJK. We would like to use the 
 ICUTokenizer but have overlapping bigrams created for CJK as in the CJK 
 Analyzer.  This filter would take the output of the ICUtokenizer, read the 
 ScriptAttribute and for selected scripts (Han, Kana), would produce 
 overlapping bigrams.
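 The overlapping-bigram idea itself is simple; here is a standalone sketch in 
 plain Java, deliberately ignoring Lucene's TokenFilter/ScriptAttribute 
 machinery, of what such a filter would emit for a run of Han unigrams:

```java
import java.util.ArrayList;
import java.util.List;

public class OverlappingBigrams {
    // Join each adjacent pair of unigrams, as CJKAnalyzer's bigram scheme does
    static List<String> bigrams(List<String> unigrams) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + 1 < unigrams.size(); i++) {
            out.add(unigrams.get(i) + unigrams.get(i + 1));
        }
        return out;
    }

    public static void main(String[] args) {
        // Three Han unigrams from the tokenizer become two overlapping bigrams
        System.out.println(bigrams(List.of("中", "文", "字"))); // [中文, 文字]
    }
}
```

 The real filter would apply this only to tokens whose ScriptAttribute is Han or 
 Kana and pass other scripts through unchanged.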




[jira] Commented: (LUCENE-2908) clean up serialization in the codebase

2011-02-06 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12991187#comment-12991187
 ] 

Uwe Schindler commented on LUCENE-2908:
---

+1

 clean up serialization in the codebase
 --

 Key: LUCENE-2908
 URL: https://issues.apache.org/jira/browse/LUCENE-2908
 Project: Lucene - Java
  Issue Type: Task
Reporter: Robert Muir
 Fix For: 4.0

 Attachments: LUCENE-2908.patch






[jira] Commented: (SOLR-2256) CommonsHttpSolrServer.deleteById(emptyList) causes SolrException: missing_content_stream

2011-02-06 Thread Stevo Slavic (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12991191#comment-12991191
 ] 

Stevo Slavic commented on SOLR-2256:


I've experienced similar behavior with SolrJ 1.4.1; I later discovered that the 
actual problem was that the index schema was outdated: it was missing a field 
that was present in the document.

 CommonsHttpSolrServer.deleteById(emptyList) causes SolrException: 
 missing_content_stream
 

 Key: SOLR-2256
 URL: https://issues.apache.org/jira/browse/SOLR-2256
 Project: Solr
  Issue Type: Bug
  Components: clients - java
Affects Versions: 1.4.1
Reporter: Maxim Valyanskiy
Priority: Minor

 A call to the deleteById method of CommonsHttpSolrServer with an empty list 
 causes the following exception:
 missing_content_stream
 request: http://127.0.0.1:8983/solr/update/javabin
 at 
 org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:435)
 at 
 org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:244)
 at 
 org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:105)
 at org.apache.solr.client.solrj.SolrServer.deleteById(SolrServer.java:106)
 at 
 ru.org.linux.spring.SearchQueueListener.reindexMessage(SearchQueueListener.java:89)
 Here is TCP stream captured by Wireshark:
 =
 POST /solr/update HTTP/1.1
 Content-Type: application/x-www-form-urlencoded; charset=UTF-8
 User-Agent: Solr[org.apache.solr.client.solrj.impl.CommonsHttpSolrServer] 1.0
 Host: 127.0.0.1:8983
 Content-Length: 20
 wt=javabin&version=1
 =
 HTTP/1.1 400 missing_content_stream
 Content-Type: text/html; charset=iso-8859-1
 Content-Length: 1401
 Server: Jetty(6.1.3)
 = [ html reply skipped ] ===
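 Until this is handled inside SolrJ, the obvious client-side workaround is to 
 guard the call; a trivial sketch (the helper name is hypothetical, not part of 
 SolrJ):

```java
import java.util.List;

public class DeleteByIdGuard {
    // Skip the HTTP request entirely when there is nothing to delete,
    // avoiding the server's 400 "missing_content_stream" response
    static boolean shouldSendDelete(List<String> ids) {
        return ids != null && !ids.isEmpty();
    }

    public static void main(String[] args) {
        System.out.println(shouldSendDelete(List.of()));       // false
        System.out.println(shouldSendDelete(List.of("doc1"))); // true
    }
}
```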




[jira] Resolved: (LUCENE-2907) automaton termsenum bug when running with multithreaded search

2011-02-06 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir resolved LUCENE-2907.
-

Resolution: Fixed
  Assignee: Robert Muir

Committed revision 1067720.

 automaton termsenum bug when running with multithreaded search
 --

 Key: LUCENE-2907
 URL: https://issues.apache.org/jira/browse/LUCENE-2907
 Project: Lucene - Java
  Issue Type: Bug
Reporter: Robert Muir
Assignee: Robert Muir
 Attachments: LUCENE-2907.patch, LUCENE-2907.patch, 
 LUCENE-2907_repro.patch, correct_seeks.txt, incorrect_seeks.txt, 
 seeks_diff.txt






[jira] Commented: (LUCENE-2907) automaton termsenum bug when running with multithreaded search

2011-02-06 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12991198#comment-12991198
 ] 

Uwe Schindler commented on LUCENE-2907:
---

Thanks, really nice now :-)

 automaton termsenum bug when running with multithreaded search
 --

 Key: LUCENE-2907
 URL: https://issues.apache.org/jira/browse/LUCENE-2907
 Project: Lucene - Java
  Issue Type: Bug
Reporter: Robert Muir
Assignee: Robert Muir
 Attachments: LUCENE-2907.patch, LUCENE-2907.patch, 
 LUCENE-2907_repro.patch, correct_seeks.txt, incorrect_seeks.txt, 
 seeks_diff.txt






[jira] Resolved: (LUCENE-2609) Generate jar containing test classes.

2011-02-06 Thread Shai Erera (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shai Erera resolved LUCENE-2609.


Resolution: Fixed

Committed revision 1067738.

Thanks all for your comments and help!

 Generate jar containing test classes.
 -

 Key: LUCENE-2609
 URL: https://issues.apache.org/jira/browse/LUCENE-2609
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Build
Affects Versions: 3.0.2
Reporter: Drew Farris
Assignee: Shai Erera
Priority: Minor
 Fix For: 3.1, 4.0

 Attachments: LUCENE-2609.patch, LUCENE-2609.patch, LUCENE-2609.patch, 
 LUCENE-2609.patch, LUCENE-2609.patch, LUCENE-2609.patch, LUCENE-2609.patch, 
 LUCENE-2609.patch, LUCENE-2609.patch, LUCENE-2609.patch


 The test classes are useful for writing unit tests for code external to the 
 Lucene project. It would be helpful to build a jar of these classes and 
 publish them as a maven dependency.




[jira] Commented: (LUCENE-1540) Improvements to contrib.benchmark for TREC collections

2011-02-06 Thread Doron Cohen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12991210#comment-12991210
 ] 

Doron Cohen commented on LUCENE-1540:
-

bq. I wish we knew of a good solution, because I hate it when things aren't 
completely reproducible everywhere.

Thanks Robert, I am actually very pleased with this array of testing with 
various randomly selected parameters like locale and others. It is very 
powerful, and since the failure printed all the parameters used, and even the 
ant line to reproduce(!), it was possible to reproduce in 3x and, once the 
problem was understood, also in trunk. To me this is testing heaven...

 Improvements to contrib.benchmark for TREC collections
 --

 Key: LUCENE-1540
 URL: https://issues.apache.org/jira/browse/LUCENE-1540
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/benchmark
Reporter: Tim Armstrong
Assignee: Doron Cohen
Priority: Minor
 Fix For: 3.1, 4.0

 Attachments: LUCENE-1540.patch, LUCENE-1540.patch, LUCENE-1540.patch, 
 LUCENE-1540.patch, trecdocs.zip






Re: svn commit: r1067699 - in /lucene/dev/branches/branch_3x/lucene/contrib/benchmark/src: java/org/apache/lucene/benchmark/byTask/feeds/TrecDocParser.java test/org/apache/lucene/benchmark/byTask/feed

2011-02-06 Thread Doron Cohen
Interesting... Thanks Robert for pointing this out!

 To obtain correct results for locale insensitive strings, use
toUpperCase(Locale.ENGLISH)

Actually this is one of the things I tried, and it did solve it, with
toUpperCase(Locale.US): not exactly Locale.ENGLISH but quite similar, I
assume. And, as you suggest, it felt wrong, for the wrong reasons...

Perhaps I'll change it like this; case insensitivity is a good thing when
running on various OSes.

On Sun, Feb 6, 2011 at 6:55 PM, Robert Muir rcm...@gmail.com wrote:

 Thanks for catching this Doron. Another option if you want to keep the
 case-insensitive feature here would be to use
 toUpperCase(Locale.ENGLISH)

 It might look bad, but it's actually recommended by the JDK for
 locale-insensitive strings:

 http://download.oracle.com/javase/6/docs/api/java/lang/String.html#toUpperCase()

 On Sun, Feb 6, 2011 at 11:43 AM,  dor...@apache.org wrote:
  Author: doronc
  Date: Sun Feb  6 16:43:54 2011
  New Revision: 1067699
 
  URL: http://svn.apache.org/viewvc?rev=1067699&view=rev
  Log:
  LUCENE-1540: Improvements to contrib.benchmark for TREC collections - fix
 test failures in some locales due to toUpperCase()
 
  Modified:
 
  
 lucene/dev/branches/branch_3x/lucene/contrib/benchmark/src/java/org/apache/lucene/benchmark/byTask/feeds/TrecDocParser.java
 
  
 lucene/dev/branches/branch_3x/lucene/contrib/benchmark/src/test/org/apache/lucene/benchmark/byTask/feeds/trecdocs.zip
 
  Modified:
 lucene/dev/branches/branch_3x/lucene/contrib/benchmark/src/java/org/apache/lucene/benchmark/byTask/feeds/TrecDocParser.java
  URL:
 http://svn.apache.org/viewvc/lucene/dev/branches/branch_3x/lucene/contrib/benchmark/src/java/org/apache/lucene/benchmark/byTask/feeds/TrecDocParser.java?rev=1067699&r1=1067698&r2=1067699&view=diff
 
 ==
  ---
 lucene/dev/branches/branch_3x/lucene/contrib/benchmark/src/java/org/apache/lucene/benchmark/byTask/feeds/TrecDocParser.java
 (original)
  +++
 lucene/dev/branches/branch_3x/lucene/contrib/benchmark/src/java/org/apache/lucene/benchmark/byTask/feeds/TrecDocParser.java
 Sun Feb  6 16:43:54 2011
  @@ -29,7 +29,12 @@ import java.util.Map;
   public abstract class TrecDocParser {
 
/** Types of trec parse paths, */
  -  public enum ParsePathType { GOV2, FBIS, FT, FR94, LATIMES }
  +  public enum ParsePathType { GOV2("gov2"), FBIS("fbis"), FT("ft"),
 FR94("fr94"), LATIMES("latimes");
  +public final String dirName;
  +private ParsePathType(String dirName) {
  +  this.dirName = dirName;
  +}
  +  }
 
/** trec parser type used for unknown extensions */
public static final ParsePathType DEFAULT_PATH_TYPE  =
 ParsePathType.GOV2;
  @@ -46,7 +51,7 @@ public abstract class TrecDocParser {
    static final Map<String,ParsePathType> pathName2Type = new
 HashMap<String,ParsePathType>();
static {
  for (ParsePathType ppt : ParsePathType.values()) {
  -  pathName2Type.put(ppt.name(),ppt);
  +  pathName2Type.put(ppt.dirName,ppt);
  }
}
 
  @@ -59,7 +64,7 @@ public abstract class TrecDocParser {
public static ParsePathType pathType(File f) {
  int pathLength = 0;
   while (f != null && ++pathLength < MAX_PATH_LENGTH) {
  -  ParsePathType ppt = pathName2Type.get(f.getName().toUpperCase());
  +  ParsePathType ppt = pathName2Type.get(f.getName());
if (ppt!=null) {
  return ppt;
}
 
  Modified:
 lucene/dev/branches/branch_3x/lucene/contrib/benchmark/src/test/org/apache/lucene/benchmark/byTask/feeds/trecdocs.zip
  URL:
 http://svn.apache.org/viewvc/lucene/dev/branches/branch_3x/lucene/contrib/benchmark/src/test/org/apache/lucene/benchmark/byTask/feeds/trecdocs.zip?rev=1067699&r1=1067698&r2=1067699&view=diff
 
 ==
  Binary files - no diff available.
 
 
 





[jira] Commented: (LUCENE-2903) Improvement of PForDelta Codec

2011-02-06 Thread hao yan (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12991220#comment-12991220
 ] 

hao yan commented on LUCENE-2903:
-

Hi Michael,

Did you try FrameOfRef and PatchedFrameOfRef?

 Improvement of PForDelta Codec
 --

 Key: LUCENE-2903
 URL: https://issues.apache.org/jira/browse/LUCENE-2903
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: hao yan
 Attachments: LUCENE_2903.patch, LUCENE_2903.patch


 There are 3 versions of PForDelta implementations in the Bulk Branch: 
 FrameOfRef, PatchedFrameOfRef, and PatchedFrameOfRef2.
 The FrameOfRef is a very basic one which is essentially a binary encoding 
 (may result in huge index size).
 The PatchedFrameOfRef is the implementation based on the original version of 
 PForDelta in the literature.
 The PatchedFrameOfRef2 is my previous implementation, which is improved this 
 time. (The codec name is changed to NewPForDelta.)
 In particular, the changes are:
 1. I fixed the bug of my previous version (in Lucene-1410.patch), where the 
 old PForDelta does not support very large exceptions (since
 the Simple16 does not support very large numbers). Now this has been fixed in 
 the new LCPForDelta.
 2. I changed the PForDeltaFixedIntBlockCodec. Now it is faster than the other 
 two PForDelta implementation in the bulk branch (FrameOfRef and 
 PatchedFrameOfRef). The codec's name is NewPForDelta, as you can see in the 
 CodecProvider and PForDeltaFixedIntBlockCodec.
 3. The performance test results are:
  1) My NewPForDelta codec is faster than FrameOfRef and PatchedFrameOfRef 
  for almost all kinds of queries, slightly worse than BulkVInt.
  2) My NewPForDelta codec can result in the smallest index size among all 4 
  methods (FrameOfRef, PatchedFrameOfRef, BulkVInt, and itself).
 3) All performance test results are achieved by running with -server 
 instead of -client
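 For readers unfamiliar with the codecs being compared: a frame-of-reference 
 encoder, in its most basic form, stores a per-block base value plus small 
 offsets. A minimal round-trip sketch (illustrative only; the patch's code also 
 bit-packs the offsets and handles exceptions):

```java
import java.util.Arrays;

public class FrameOfRefSketch {
    // Encode a block as [base, v0-base, v1-base, ...] so the offsets are small
    static int[] encode(int[] block) {
        int base = Arrays.stream(block).min().orElse(0);
        int[] enc = new int[block.length + 1];
        enc[0] = base;
        for (int i = 0; i < block.length; i++) {
            enc[i + 1] = block[i] - base;
        }
        return enc;
    }

    static int[] decode(int[] enc) {
        int[] out = new int[enc.length - 1];
        for (int i = 1; i < enc.length; i++) {
            out[i - 1] = enc[i] + enc[0];
        }
        return out;
    }

    public static void main(String[] args) {
        int[] block = {1000, 1003, 1001, 1007};
        System.out.println(Arrays.toString(decode(encode(block))));
        // [1000, 1003, 1001, 1007]
    }
}
```

 PForDelta ("patched" frame of reference) extends this by bit-packing the common 
 small offsets and storing the rare large ones as exceptions.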




[jira] Commented: (LUCENE-2903) Improvement of PForDelta Codec

2011-02-06 Thread hao yan (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12991222#comment-12991222
 ] 

hao yan commented on LUCENE-2903:
-

And it sure complicates the PForDelta algorithm a lot by using IntBuffer.set/get.

 Improvement of PForDelta Codec
 --

 Key: LUCENE-2903
 URL: https://issues.apache.org/jira/browse/LUCENE-2903
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: hao yan
 Attachments: LUCENE_2903.patch, LUCENE_2903.patch


 There are 3 versions of PForDelta implementations in the Bulk Branch: 
 FrameOfRef, PatchedFrameOfRef, and PatchedFrameOfRef2.
 The FrameOfRef is a very basic one which is essentially a binary encoding 
 (may result in huge index size).
 The PatchedFrameOfRef is the implementation based on the original version of 
 PForDelta in the literature.
 The PatchedFrameOfRef2 is my previous implementation, which has been improved 
 this time. (The codec name is changed to NewPForDelta.)
 In particular, the changes are:
 1. I fixed the bug of my previous version (in Lucene-1410.patch), where the 
 old PForDelta does not support very large exceptions (since
 the Simple16 does not support very large numbers). Now this has been fixed in 
 the new LCPForDelta.
 2. I changed the PForDeltaFixedIntBlockCodec. Now it is faster than the other 
 two PForDelta implementation in the bulk branch (FrameOfRef and 
 PatchedFrameOfRef). The codec's name is NewPForDelta, as you can see in the 
 CodecProvider and PForDeltaFixedIntBlockCodec.
 3. The performance test results are:
 1) My NewPForDelta codec is faster than FrameOfRef and PatchedFrameOfRef 
 for almost all kinds of queries, and slightly worse than BulkVInt.
 2) My NewPForDelta codec results in the smallest index size of all four 
 methods (FrameOfRef, PatchedFrameOfRef, BulkVInt, and itself).
 3) All performance test results are achieved by running with -server 
 instead of -client




[jira] Commented: (LUCENE-2903) Improvement of PForDelta Codec

2011-02-06 Thread hao yan (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12991221#comment-12991221
 ] 

hao yan commented on LUCENE-2903:
-

Hi, Paul

I tested ByteBuffer -> IntBuffer; it is not faster than converting int[] -> 
byte[].
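For reference, the two conversion styles being compared are both plain JDK
calls; a minimal correctness harness (the micro-benchmark itself is omitted)
might look like:

```java
import java.nio.ByteBuffer;

public class IntBytesDemo {
    // View-based: wrap the byte[] and read through an IntBuffer view.
    static void viaIntBuffer(byte[] src, int[] dst) {
        ByteBuffer.wrap(src).asIntBuffer().get(dst);
    }

    // Manual: assemble each int from four bytes (big-endian, matching
    // ByteBuffer's default byte order).
    static void viaShifts(byte[] src, int[] dst) {
        for (int i = 0, p = 0; i < dst.length; i++, p += 4) {
            dst[i] = ((src[p] & 0xFF) << 24) | ((src[p + 1] & 0xFF) << 16)
                   | ((src[p + 2] & 0xFF) << 8) | (src[p + 3] & 0xFF);
        }
    }

    public static void main(String[] args) {
        byte[] raw = {0, 0, 0, 7, 0, 0, 1, 0};
        int[] a = new int[2], b = new int[2];
        viaIntBuffer(raw, a);
        viaShifts(raw, b);
        System.out.println(a[0] + " " + a[1] + " "
                + java.util.Arrays.equals(a, b));   // prints 7 256 true
    }
}
```

Which of the two wins in a real benchmark depends on the JIT, which is why
measuring (as done here) rather than assuming is the right call.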

 Improvement of PForDelta Codec
 --

 Key: LUCENE-2903
 URL: https://issues.apache.org/jira/browse/LUCENE-2903
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: hao yan
 Attachments: LUCENE_2903.patch, LUCENE_2903.patch


 There are 3 versions of PForDelta implementations in the Bulk Branch: 
 FrameOfRef, PatchedFrameOfRef, and PatchedFrameOfRef2.
 The FrameOfRef is a very basic one which is essentially a binary encoding 
 (may result in huge index size).
 The PatchedFrameOfRef is the implmentation based on the original version of 
 PForDelta in the literatures.
 The PatchedFrameOfRef2 is my previous implementation which are improved this 
 time. (The Codec name is changed to NewPForDelta.).
 In particular, the changes are:
 1. I fixed the bug of my previous version (in Lucene-1410.patch), where the 
 old PForDelta does not support very large exceptions (since
 the Simple16 does not support very large numbers). Now this has been fixed in 
 the new LCPForDelta.
 2. I changed the PForDeltaFixedIntBlockCodec. Now it is faster than the other 
 two PForDelta implementation in the bulk branch (FrameOfRef and 
 PatchedFrameOfRef). The codec's name is NewPForDelta, as you can see in the 
 CodecProvider and PForDeltaFixedIntBlockCodec.
 3. The performance test results are:
 1) My NewPForDelta codec is faster than FrameOfRef and PatchedFrameOfRef 
 for almost all kinds of queries, and slightly worse than BulkVInt.
 2) My NewPForDelta codec results in the smallest index size of all four 
 methods (FrameOfRef, PatchedFrameOfRef, BulkVInt, and itself).
 3) All performance test results are achieved by running with -server 
 instead of -client




[jira] Commented: (LUCENE-1540) Improvements to contrib.benchmark for TREC collections

2011-02-06 Thread Doron Cohen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12991224#comment-12991224
 ] 

Doron Cohen commented on LUCENE-1540:
-

Following suggestions by Robert, brought back case-insensitivity of path names 
by upper-casing with Locale.ENGLISH, as suggested in 
[toUpperCase()|http://download.oracle.com/javase/6/docs/api/java/lang/String.html#toUpperCase%28%29].
 
Committed:
- r1067764 - 3x
- r1067772 - trunk
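The reason for pinning Locale.ENGLISH fits in a two-line JDK example: the
no-argument toUpperCase() uses the JVM's default locale, and under a Turkish
locale "i" upper-cases to dotted capital İ (U+0130), so case-insensitive path
comparisons silently fail:

```java
import java.util.Locale;

public class LocaleSafeUpperCase {
    public static void main(String[] args) {
        // Explicit Locale.ENGLISH is deterministic on every JVM.
        String english = "file".toUpperCase(Locale.ENGLISH);
        // Simulating a Turkish default locale: "i" becomes İ, not I.
        String turkish = "file".toUpperCase(new Locale("tr"));
        System.out.println(english + " " + turkish.equals("FILE"));
        // prints FILE false
    }
}
```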

 Improvements to contrib.benchmark for TREC collections
 --

 Key: LUCENE-1540
 URL: https://issues.apache.org/jira/browse/LUCENE-1540
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/benchmark
Reporter: Tim Armstrong
Assignee: Doron Cohen
Priority: Minor
 Fix For: 3.1, 4.0

 Attachments: LUCENE-1540.patch, LUCENE-1540.patch, LUCENE-1540.patch, 
 LUCENE-1540.patch, trecdocs.zip


 The benchmarking utilities for  TREC test collections (http://trec.nist.gov) 
 are quite limited and do not support some of the variations in format of 
 older TREC collections.  
 I have been doing some benchmarking work with Lucene and have had to modify 
 the package to support:
 * Older TREC document formats, which the current parser fails on due to 
 missing document headers.
 * Variations in query format - newlines after the <title> tag causing the 
 query parser to get confused.
 * Ability to detect and read in uncompressed text collections
 * Storage of document numbers by default without storing full text.
 I can submit a patch if there is interest, although I will probably want to 
 write unit tests for the new functionality first.




[jira] Updated: (SOLR-2341) Shard distribution policy

2011-02-06 Thread William Mayor (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

William Mayor updated SOLR-2341:


Attachment: SOLR-2341.patch

This patch makes the implemented policy deterministic, which was missing from 
the previous patch. The policy code has also been refactored into its own 
package.
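A deterministic policy of the kind described can be sketched as a pure
function of the document id and the shard list. The interface and class names
below are hypothetical illustrations, not the actual SOLR-2341 API:

```java
import java.util.List;

// A deterministic shard-selection policy: the same document id always maps
// to the same shard, so re-submitting a document cannot scatter copies.
interface ShardDistributionPolicy {
    /** Must be a pure function of its inputs (no randomness, no state). */
    String selectShard(String docId, List<String> shards);
}

class HashedPolicy implements ShardDistributionPolicy {
    public String selectShard(String docId, List<String> shards) {
        // floorMod guards against negative hashCode() values.
        return shards.get(Math.floorMod(docId.hashCode(), shards.size()));
    }
}

public class PolicyDemo {
    public static void main(String[] args) {
        ShardDistributionPolicy p = new HashedPolicy();
        List<String> shards = List.of("shard1", "shard2", "shard3");
        String a = p.selectShard("doc-42", shards);
        String b = p.selectShard("doc-42", shards);
        System.out.println(a.equals(b));   // prints true
    }
}
```

Determinism is what makes later operations like overwrite-by-id and delete
reach the shard that holds the original document.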

 Shard distribution policy
 -

 Key: SOLR-2341
 URL: https://issues.apache.org/jira/browse/SOLR-2341
 Project: Solr
  Issue Type: New Feature
Reporter: William Mayor
Priority: Minor
 Attachments: SOLR-2341.patch, SOLR-2341.patch


 A first crack at creating policies to be used for determining to which of a 
 list of shards a document should go. See discussion on Distributed Indexing 
 on dev-list.




Re: Distributed Indexing

2011-02-06 Thread William Mayor
Hi

Good call about the policies being deterministic, should've thought of that
earlier.

We've changed the patch to include this and I've removed the random
assignment one (for obvious reasons).

Take a look and let me know what's to do. (
https://issues.apache.org/jira/browse/SOLR-2341)

Cheers

William

On Thu, Feb 3, 2011 at 5:00 PM, Upayavira u...@odoko.co.uk wrote:


  On Thu, 03 Feb 2011 15:12 +, Alex Cowell alxc...@gmail.com wrote:

 Hi all,

 Just a couple of questions that have arisen.

 1. For handling non-distributed update requests (shards param is not
 present or is invalid), our code currently

- assumes the user would like the data indexed, so gets the request
handler assigned to /update
- executes the request using core.execute() for the SolrCore associated
with the original request

 Is this what we want it to do and is using core.execute() from within a
 request handler a valid method of passing on the update request?


 Take a look at how it is done in
 handler.component.SearchHandler.handleRequestBody(). I'd say try to follow 
 as similar an approach as possible. E.g. it is the SearchHandler that does much
 of the work, branching depending on whether it found a shards parameter.


 2. We have partially implemented an update processor which actually
 generates and sends the split update requests to each specified shard (as
 designated by the policy). As it stands, the code shares a lot in common
 with the HttpCommComponent class used for distributed search. Should we look
 at opening up the HttpCommComponent class so it could be used by our
 request handler as well or should we continue with our current
 implementation and worry about that later?


 I agree that you are going to want to implement an UpdateRequestProcessor.
 However, it would seem to me that, unlike search, you're not going to want
 to bother with the existing processor and associated component chain, you're
 going to want to replace the processor with a distributed version.

 As to the HttpCommComponent, I'd suggest you make your own educated
 decision. How similar is the class? Could one serve both needs effectively?


 3. Our update processor uses a MultiThreadedHttpConnectionManager to send
 parallel updates to shards, can anyone give some appropriate values to be
 used for the defaultMaxConnectionsPerHost and maxTotalConnections params?
 Won't the  values used for distributed search be a little high for
 distributed indexing?


 You are right, these will likely be lower for distributed indexing, however
 I'd suggest not worrying about it for now, as it is easy to tweak later.

 Upayavira

  ---
 Enterprise Search Consultant at Sourcesense UK,
 Making Sense of Open Source
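The connection-manager tuning discussed in question 3 uses the HttpClient 3.x
API that the distributed-search code already relies on. The sketch below shows
where the two parameters plug in; the factory name and the limit values are
illustrative placeholders, not recommendations:

```java
import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.MultiThreadedHttpConnectionManager;

// Configuration sketch: one shared, thread-safe client for parallel
// shard updates. Indexing typically needs fewer concurrent connections
// per shard than distributed search, hence lower limits.
public class UpdateHttpClientFactory {
    public static HttpClient create(int maxPerHost, int maxTotal) {
        MultiThreadedHttpConnectionManager mgr =
                new MultiThreadedHttpConnectionManager();
        mgr.getParams().setDefaultMaxConnectionsPerHost(maxPerHost);
        mgr.getParams().setMaxTotalConnections(maxTotal);
        return new HttpClient(mgr);   // safe to share across threads
    }
}
```

Since search and indexing would build the client the same way, a shared
factory like this is one argument for opening up HttpCommComponent rather
than duplicating it.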



RE: Arabic Analyzer

2011-02-06 Thread Digy
Here is a port of Lucene Java's Arabic analyzer (
https://issues.apache.org/jira/browse/LUCENENET-392 )

You can safely remove nunit dependency and test cases from the project.

DIGY

-Original Message-
From: Ben Foster [mailto:b...@planetcloud.co.uk] 
Sent: Sunday, February 06, 2011 5:47 PM
To: lucene-net-...@lucene.apache.org
Subject: Re: Arabic Analyzer

Is it still possible to use fixed term queries in Arabic (i.e. NOT using an
Analyzer)?

Thanks
Ben

On 6 February 2011 00:51, Prescott Nasser geobmx...@hotmail.com wrote:


 Unfortunately, I don't think we have that. We're working on creating a new
 port of the java lucene code, but I don't know the timeline yet - I'm sure
 there will be a lot of chatter on this mailing list soon.

 ~Prescott





 
  Date: Sat, 5 Feb 2011 22:57:11 +
  Subject: Arabic Analyzer
  From: b...@planetcloud.co.uk
  To: lucene-net-...@lucene.apache.org
 
  Is there an Arabic Analyzer available for Lucene.NET. I see there has
 been
  one contributed to the Java project but wasn't sure if this has been
 ported.
 
  Thanks,
 
  Ben




-- 

Ben Foster

planetcloud
The Elms, Hawton
Newark-on-Trent
Nottinghamshire
NG24 3RL

www.planetcloud.co.uk



[jira] Issue Comment Edited: (SOLR-2341) Shard distribution policy

2011-02-06 Thread William Mayor (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12991225#comment-12991225
 ] 

William Mayor edited comment on SOLR-2341 at 2/6/11 10:00 PM:
--

This patch makes the implemented policy deterministic. This is missing from the 
previous patch. The policy code has also been refactored into its own package.

  was (Author: williammayor):
This patch makes the implemented policy deterministic. This is missing from 
the previous patch. The policy code has also been refactored into it's own 
package.
  
 Shard distribution policy
 -

 Key: SOLR-2341
 URL: https://issues.apache.org/jira/browse/SOLR-2341
 Project: Solr
  Issue Type: New Feature
Reporter: William Mayor
Priority: Minor
 Attachments: SOLR-2341.patch, SOLR-2341.patch


 A first crack at creating policies to be used for determining to which of a 
 list of shards a document should go. See discussion on Distributed Indexing 
 on dev-list.




Re: Distributed Indexing

2011-02-06 Thread Alex Cowell
Hey,

We're making good progress, but our DistributedUpdateRequestHandler is
having a bit of an identity crisis, so we thought we'd ask what other
people's opinions are. The current situation is as follows:

We've added a method to ContentStreamHandlerBase to check if an update
request is distributed or not (based on the presence/validity of the
'shards' parameter). So a non-distributed request will proceed as normal but
a distributed request would be passed on to the
DistributedUpdateRequestHandler to deal with.

The reason this choice is made in the ContentStreamHandlerBase is so that
the DistributedUpdateRequestHandler can use the URL the request came in on
to determine where to distribute update requests. Eg. an update request is
sent to:
http://localhost:8983/solr/update/csv?shards=shard1,shard2...
then the DistributedUpdateRequestHandler knows to send requests to:
shard1/update/csv
shard2/update/csv

Alternatively, if the request wasn't distributed, it would simply be handled
by whichever request handler /update/csv uses.

Herein lies the problem. The DistributedUpdateRequestHandler is not really a
request handler in the same way as the CSVRequestHandler or
XmlUpdateRequestHandlers are. If anything, it's more like a plugin for the
various existing update request handlers, to allow them to deal with
distributed requests - a distributor if you will. It isn't designed to be
able to receive and handle requests directly.

We would like this DistributedUpdateRequestHandler to be defined in the
solrconfig to allow flexibility for setting up multiple different
DistributedUpdateRequestHandlers with different ShardDistributionPolicies
etc., and also to allow us to get the appropriate instance from the core in
the code. There seem to be two paths for doing this:

1. Leave it as an implementation of SolrRequestHandler and hope the user
doesn't directly send update requests to it (i.e. a request to
http://localhost:8983/solr/<distrib update handler path> would most likely
cripple something). So it would be defined in the solrconfig something like:
<requestHandler name="distrib-update"
class="solr.DistributedUpdateRequestHandler" />

2. Create a new plugin type for the solrconfig, say
<updateRequestDistributor>, which would involve creating a new interface for
the DistributedUpdateRequestHandler to implement, then registering it with
the core. It would be defined in the solrconfig something like:
<updateRequestDistributor name="distrib-update"
class="solr.DistributedUpdateRequestHandler">
  <lst name="defaults">
    <str name="policy">solr.HashedDistributionPolicy</str>
  </lst>
</updateRequestDistributor>

This would mean that it couldn't directly receive requests, but that an
instance could still easily be retrieved from the core to handle the
distribution of update requests.

Any thoughts on the above issue (or a more succinct, descriptive name for
the class) are most welcome!

Alex
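The branching check described at the start of this message (treat a request as
distributed iff a usable shards parameter is present) can be sketched in a few
lines. The class and method names here are hypothetical, not the actual
ContentStreamHandlerBase code:

```java
import java.util.Map;

// Sketch of the distributed/local routing decision for update requests:
// a request carries request parameters; a non-empty "shards" parameter
// routes it to the distributor, otherwise it is handled locally.
public class ShardsBranchDemo {
    static boolean isDistributed(Map<String, String> params) {
        String shards = params.get("shards");
        return shards != null && !shards.trim().isEmpty();
    }

    public static void main(String[] args) {
        System.out.println(isDistributed(Map.of("shards", "shard1,shard2")));
        System.out.println(isDistributed(Map.of("commit", "true")));
        // prints true, then false
    }
}
```

Keeping this predicate in the shared base class is what lets every update
format (CSV, XML, ...) gain distribution without its handler changing.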


[jira] Created: (LUCENE-2909) NGramTokenFilter may generate offsets that exceed the length of original text

2011-02-06 Thread Shinya Kasatani (JIRA)
NGramTokenFilter may generate offsets that exceed the length of original text
-

 Key: LUCENE-2909
 URL: https://issues.apache.org/jira/browse/LUCENE-2909
 Project: Lucene - Java
  Issue Type: Bug
  Components: contrib/analyzers
Affects Versions: 2.9.4
Reporter: Shinya Kasatani
Priority: Minor


When using NGramTokenFilter combined with CharFilters that lengthen the 
original text (such as ß -> ss), the generated offsets exceed the length of 
the original text.
This causes InvalidTokenOffsetsException when you try to highlight the text in 
Solr.

While it is not possible to know the accurate offset of each character once you 
tokenize the whole text with tokenizers like KeywordTokenizer, NGramTokenFilter 
should at least avoid generating invalid offsets.
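The defensive behavior suggested in the last paragraph amounts to clamping the
computed end offset to the original text length. This is an illustration of
the idea only, not the attached patch:

```java
// Sketch: a CharFilter can lengthen text (e.g. ß -> ss), so offsets computed
// over the filtered text may point past the end of the original input, which
// breaks highlighters. Clamping keeps offsets valid, if slightly imprecise.
public class OffsetClampDemo {
    static int clampEndOffset(int computedEnd, int originalTextLength) {
        return Math.min(computedEnd, originalTextLength);
    }

    public static void main(String[] args) {
        // "weiß" (length 4) expands to "weiss" (length 5); an n-gram over
        // the expanded text could claim endOffset 5, past the original input.
        System.out.println(clampEndOffset(5, 4));   // prints 4
    }
}
```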





[jira] Updated: (LUCENE-2909) NGramTokenFilter may generate offsets that exceed the length of original text

2011-02-06 Thread Shinya Kasatani (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shinya Kasatani updated LUCENE-2909:


Attachment: TokenFilterOffset.patch

The patch that fixes the problem, including tests.

 NGramTokenFilter may generate offsets that exceed the length of original text
 -

 Key: LUCENE-2909
 URL: https://issues.apache.org/jira/browse/LUCENE-2909
 Project: Lucene - Java
  Issue Type: Bug
  Components: contrib/analyzers
Affects Versions: 2.9.4
Reporter: Shinya Kasatani
Priority: Minor
 Attachments: TokenFilterOffset.patch


 When using NGramTokenFilter combined with CharFilters that lengthen the 
 original text (such as ß -> ss), the generated offsets exceed the length 
 of the original text.
 This causes InvalidTokenOffsetsException when you try to highlight the text 
 in Solr.
 While it is not possible to know the accurate offset of each character once 
 you tokenize the whole text with tokenizers like KeywordTokenizer, 
 NGramTokenFilter should at least avoid generating invalid offsets.




[jira] Assigned: (LUCENE-2909) NGramTokenFilter may generate offsets that exceed the length of original text

2011-02-06 Thread Koji Sekiguchi (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Sekiguchi reassigned LUCENE-2909:
--

Assignee: Koji Sekiguchi

 NGramTokenFilter may generate offsets that exceed the length of original text
 -

 Key: LUCENE-2909
 URL: https://issues.apache.org/jira/browse/LUCENE-2909
 Project: Lucene - Java
  Issue Type: Bug
  Components: contrib/analyzers
Affects Versions: 2.9.4
Reporter: Shinya Kasatani
Assignee: Koji Sekiguchi
Priority: Minor
 Attachments: TokenFilterOffset.patch


 When using NGramTokenFilter combined with CharFilters that lengthen the 
 original text (such as ß -> ss), the generated offsets exceed the length 
 of the original text.
 This causes InvalidTokenOffsetsException when you try to highlight the text 
 in Solr.
 While it is not possible to know the accurate offset of each character once 
 you tokenize the whole text with tokenizers like KeywordTokenizer, 
 NGramTokenFilter should at least avoid generating invalid offsets.




[jira] Commented: (SOLR-1395) Integrate Katta

2011-02-06 Thread JohnWu (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12991279#comment-12991279
 ] 

JohnWu commented on SOLR-1395:
--

TomLiu:

As you said, QueryComponent returns a DocSlice, but XMLWriter or EmbeddedServer 
returns a SolrDocumentList from a DocList.

I set the requestHandler to solr.MultiEmbeddedSearchHandler, but the 
QueryComponent still returns a DocSlice.

Can you give me some advice?

Thanks!

JohnWu


 Integrate Katta
 ---

 Key: SOLR-1395
 URL: https://issues.apache.org/jira/browse/SOLR-1395
 Project: Solr
  Issue Type: New Feature
Affects Versions: 1.4
Reporter: Jason Rutherglen
Priority: Minor
 Fix For: Next

 Attachments: SOLR-1395.patch, SOLR-1395.patch, SOLR-1395.patch, 
 back-end.log, front-end.log, hadoop-core-0.19.0.jar, katta-core-0.6-dev.jar, 
 katta-solrcores.jpg, katta.node.properties, katta.zk.properties, 
 log4j-1.2.13.jar, solr-1395-1431-3.patch, solr-1395-1431-4.patch, 
 solr-1395-1431-katta0.6.patch, solr-1395-1431-katta0.6.patch, 
 solr-1395-1431.patch, solr-1395-katta-0.6.2-1.patch, 
 solr-1395-katta-0.6.2-2.patch, solr-1395-katta-0.6.2-3.patch, 
 solr-1395-katta-0.6.2.patch, test-katta-core-0.6-dev.jar, 
 zkclient-0.1-dev.jar, zookeeper-3.2.1.jar

   Original Estimate: 336h
  Remaining Estimate: 336h

 We'll integrate Katta into Solr so that:
 * Distributed search uses Hadoop RPC
 * Shard/SolrCore distribution and management
 * Zookeeper based failover
 * Indexes may be built using Hadoop




[jira] Resolved: (LUCENE-1540) Improvements to contrib.benchmark for TREC collections

2011-02-06 Thread Doron Cohen (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Doron Cohen resolved LUCENE-1540.
-

Resolution: Fixed

OK, no new failures; closing as fixed. Thanks Shai and Robert for your help here!

 Improvements to contrib.benchmark for TREC collections
 --

 Key: LUCENE-1540
 URL: https://issues.apache.org/jira/browse/LUCENE-1540
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/benchmark
Reporter: Tim Armstrong
Assignee: Doron Cohen
Priority: Minor
 Fix For: 3.1, 4.0

 Attachments: LUCENE-1540.patch, LUCENE-1540.patch, LUCENE-1540.patch, 
 LUCENE-1540.patch, trecdocs.zip


 The benchmarking utilities for  TREC test collections (http://trec.nist.gov) 
 are quite limited and do not support some of the variations in format of 
 older TREC collections.  
 I have been doing some benchmarking work with Lucene and have had to modify 
 the package to support:
 * Older TREC document formats, which the current parser fails on due to 
 missing document headers.
 * Variations in query format - newlines after the <title> tag causing the 
 query parser to get confused.
 * Ability to detect and read in uncompressed text collections
 * Storage of document numbers by default without storing full text.
 I can submit a patch if there is interest, although I will probably want to 
 write unit tests for the new functionality first.
