[Lucene.Net] hudkins update

2011-03-13 Thread Michael Herndon
For everyone's benefit:

I started putting together a home server this weekend (not because we're
going to use it as a build server, but because I needed a sandbox where I can
install Jenkins to see which plugins work and which do not).

It seems like there has been a constant trickle of issues on the build list
from upgrading to Jenkins, so it would be wise to do a due-diligence mock
setup first. It will probably take me a little over a week to work through this.
 I'll either screenshot the results or see if I can somehow make a
demo available for a 72-hour period.

- Michael


Use JCC wrapper from C++ (no python)

2011-03-13 Thread Ian McCullough
Hello all,

I'm curious how closely tied the JCC/Lucene wrappers are to Python. I've
been looking for a way to use Lucene from C/C++.  I'm aware of CLucene, but
it appears to have some serious issues, and I was pointed to PyLucene. Is it
conceivable that one could use the JCC-generated code from C/C++ in the
absence of Python? Or is the JCC code too closely tied to the Python interfaces
to use without heavy modification?

Thanks,
Ian


Re: GPU acceleration

2011-03-13 Thread Earwin Burrfoot
On Sun, Mar 13, 2011 at 00:15, Ken O'Brien k...@kenobrien.org wrote:
 To clarify, I've not yet written any code. I aim to bring a large speedup to
 any functionality that is computationally expensive. I'm wondering which
 components are candidates for this.

 I'll be looking through the code but if anyone is aware of parallelizable
 code, I'll start with that.
More like 'vectorizable' code, huh?

Guys from Yandex use a modified group varint encoding plus handcrafted
SSE magic to decode/intersect posting lists, and claim tremendous
speedups over the original group varint.
They also use SSE to run the decision trees used in ranking.

There were experiments with moving both pieces of code to the GPU, and
the GPU did well in terms of speed, but they say getting data in and out
of the GPU made the approach infeasible.
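
For context, a minimal sketch of plain (scalar) group varint decoding, the
baseline such SSE tricks accelerate (illustrative Java, not Yandex's or
Lucene's actual code): one selector byte carries four 2-bit lengths, and each
value then occupies length + 1 bytes, little-endian.

static int decodeGroup(byte[] buf, int pos, int[] out, int outOffset) {
  int sel = buf[pos++] & 0xFF;                    // 2 bits per value
  for (int i = 0; i < 4; i++) {
    int numBytes = ((sel >>> (2 * i)) & 3) + 1;   // 1..4 bytes per value
    int v = 0;
    for (int j = 0; j < numBytes; j++) {
      v |= (buf[pos++] & 0xFF) << (8 * j);
    }
    out[outOffset + i] = v;
  }
  return pos;                                     // start of the next group
}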

 I'll basically replicate existing functionality to run on the GPU.

 On 12/03/11 21:08, Simon Willnauer wrote:

 On Sat, Mar 12, 2011 at 9:21 PM, Ken O'Brien k...@kenobrien.org wrote:

 Hi,

 Is anyone looking at GPU acceleration for Solr? If not, I'd like to
 contribute code which adds this functionality.

 As I'm not familiar with the codebase, does anyone know which areas of
 functionality could benefit from high degrees of parallelism?

 Very interesting! Can you elaborate a little more on what kind of
 functionality you are trying to expose to the GPU?

 simon

 Regards,

 Ken








-- 
Kirill Zakharenko/Кирилл Захаренко
E-Mail/Jabber: ear...@gmail.com
Phone: +7 (495) 683-567-4
ICQ: 104465785




[HUDSON] Lucene-Solr-tests-only-trunk - Build # 5887 - Failure

2011-03-13 Thread Apache Hudson Server
Build: https://hudson.apache.org/hudson/job/Lucene-Solr-tests-only-trunk/5887/

2 tests failed.
REGRESSION:  org.apache.solr.uima.processor.UIMAUpdateRequestProcessorTest.testProcessing

Error Message:
org.apache.uima.analysis_engine.AnalysisEngineProcessException

Stack Trace:
java.lang.RuntimeException: org.apache.uima.analysis_engine.AnalysisEngineProcessException
    at org.apache.solr.uima.processor.UIMAUpdateRequestProcessor.processAdd(UIMAUpdateRequestProcessor.java:82)
    at org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:127)
    at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:77)
    at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:55)
    at org.apache.solr.uima.processor.UIMAUpdateRequestProcessorTest.addDoc(UIMAUpdateRequestProcessorTest.java:120)
    at org.apache.solr.uima.processor.UIMAUpdateRequestProcessorTest.testProcessing(UIMAUpdateRequestProcessorTest.java:76)
    at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1214)
    at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1146)
Caused by: org.apache.uima.analysis_engine.AnalysisEngineProcessException
    at org.apache.uima.annotator.calais.OpenCalaisAnnotator.process(OpenCalaisAnnotator.java:206)
    at org.apache.uima.analysis_component.CasAnnotator_ImplBase.process(CasAnnotator_ImplBase.java:56)
    at org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.callAnalysisComponentProcess(PrimitiveAnalysisEngine_impl.java:377)
    at org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.processAndOutputNewCASes(PrimitiveAnalysisEngine_impl.java:295)
    at org.apache.uima.analysis_engine.asb.impl.ASB_impl$AggregateCasIterator.processUntilNextOutputCas(ASB_impl.java:567)
    at org.apache.uima.analysis_engine.asb.impl.ASB_impl$AggregateCasIterator.init(ASB_impl.java:409)
    at org.apache.uima.analysis_engine.asb.impl.ASB_impl.process(ASB_impl.java:342)
    at org.apache.uima.analysis_engine.impl.AggregateAnalysisEngine_impl.processAndOutputNewCASes(AggregateAnalysisEngine_impl.java:267)
    at org.apache.uima.analysis_engine.impl.AnalysisEngineImplBase.process(AnalysisEngineImplBase.java:267)
    at org.apache.uima.analysis_engine.impl.AnalysisEngineImplBase.process(AnalysisEngineImplBase.java:280)
    at org.apache.solr.uima.processor.UIMAUpdateRequestProcessor.processText(UIMAUpdateRequestProcessor.java:122)
    at org.apache.solr.uima.processor.UIMAUpdateRequestProcessor.processAdd(UIMAUpdateRequestProcessor.java:69)
Caused by: java.io.IOException: Server returned HTTP response code: 503 for URL: http://api.opencalais.com/enlighten/calais.asmx/Enlighten
    at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1269)
    at org.apache.uima.annotator.calais.OpenCalaisAnnotator.callServiceOnText(OpenCalaisAnnotator.java:234)
    at org.apache.uima.annotator.calais.OpenCalaisAnnotator.process(OpenCalaisAnnotator.java:126)


REGRESSION:  org.apache.solr.uima.processor.UIMAUpdateRequestProcessorTest.testTwoUpdates

Error Message:
org.apache.uima.analysis_engine.AnalysisEngineProcessException

Stack Trace:
java.lang.RuntimeException: org.apache.uima.analysis_engine.AnalysisEngineProcessException
    at org.apache.solr.uima.processor.UIMAUpdateRequestProcessor.processAdd(UIMAUpdateRequestProcessor.java:82)
    at org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:127)
    at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:77)
    at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:55)
    at org.apache.solr.uima.processor.UIMAUpdateRequestProcessorTest.addDoc(UIMAUpdateRequestProcessorTest.java:120)
    at org.apache.solr.uima.processor.UIMAUpdateRequestProcessorTest.testTwoUpdates(UIMAUpdateRequestProcessorTest.java:94)
    at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1214)
    at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1146)
Caused by: org.apache.uima.analysis_engine.AnalysisEngineProcessException
    at org.apache.uima.annotator.calais.OpenCalaisAnnotator.process(OpenCalaisAnnotator.java:206)
    at org.apache.uima.analysis_component.CasAnnotator_ImplBase.process(CasAnnotator_ImplBase.java:56)
    at org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.callAnalysisComponentProcess(PrimitiveAnalysisEngine_impl.java:377)
    at org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.processAndOutputNewCASes(PrimitiveAnalysisEngine_impl.java:295)
    at org.apache.uima.analysis_engine.asb.impl.ASB_impl$AggregateCasIterator.processUntilNextOutputCas(ASB_impl.java:567)

[jira] Commented: (LUCENE-2308) Separately specify a field's type

2011-03-13 Thread Simon Willnauer (JIRA)

[ https://issues.apache.org/jira/browse/LUCENE-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13006182#comment-13006182 ]

Simon Willnauer commented on LUCENE-2308:
-

Brief Summary for GSoC Students:

FieldType aims, on the one hand, to separate field properties from the
actual value and, on the other, to make Field easier to extend. Both
seem equally important, while far from easy to achieve. Fieldable and
Field are a core API, and changes to them need to be well thought out.
Further, this issue can easily cause drastic performance degradation if not
done right. Consider this a massive change, since fields are used
almost everywhere in Lucene and Solr.

 Separately specify a field's type
 -

 Key: LUCENE-2308
 URL: https://issues.apache.org/jira/browse/LUCENE-2308
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Michael McCandless
  Labels: gsoc2011, lucene-gsoc-11
 Fix For: 4.0


 This came up from discussions on IRC.  I'm summarizing here...
 Today when you make a Field to add to a document you can set things like
 indexed or not, stored or not, analyzed or not, and details like omitTfAP,
 omitNorms, indexing term vectors (separately controlling
 offsets/positions), etc.
 I think we should factor these out into a new class (FieldType?).
 Then you could re-use this FieldType instance across multiple fields.
 The Field instance would still hold the actual value.
 We could then do per-field analyzers by adding a setAnalyzer on the
 FieldType, instead of the separate PerFieldAnalyzerWrapper (likewise
 for per-field codecs (with flex), where we now have
 PerFieldCodecWrapper).
 This would NOT be a schema!  It's just refactoring what we already
 specify today.  E.g. it's not serialized into the index.
 This has been discussed before, and I know Michael Busch opened a more
 ambitious (I think?) issue.  I think this is a good first baby step.  We could
 consider a hierarchy of FieldType (NumericFieldType, etc.) but maybe hold
 off on that for starters...
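
A minimal sketch of the proposed separation (hypothetical API, since FieldType
does not exist yet; the names follow the issue text):

{noformat}
// One shared, reusable FieldType carries the per-field flags...
FieldType bodyType = new FieldType();
bodyType.setStored(false);
bodyType.setAnalyzed(true);
bodyType.setOmitNorms(true);

// ...while each Field holds only its value and points at the shared type.
Field f1 = new Field("body", "first document text", bodyType);
Field f2 = new Field("body", "second document text", bodyType);
{noformat}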




[jira] Commented: (LUCENE-2621) Extend Codec to handle also stored fields and term vectors

2011-03-13 Thread Simon Willnauer (JIRA)

[ https://issues.apache.org/jira/browse/LUCENE-2621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13006181#comment-13006181 ]

Simon Willnauer commented on LUCENE-2621:
-

Brief Summary for GSoC Students:

This issue is about extending Codec to also handle stored fields and term
vectors. This is a very interesting and at the same time much-needed
feature which involves API design, refactoring, and an in-depth
understanding of how IndexWriter and its internals work. The API which
needs to be refactored (the Codec API) was made to consume posting lists
once an in-memory index segment is flushed to disk. Yet, to expose
stored fields to this API, we need to prepare it to consume data for
every document while we build the in-memory segment. So there is a
little paradigm mismatch here which needs to be addressed.
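
To make the mismatch concrete, here is a rough sketch of the shape such an
extension might take (an assumption for illustration, not an actual API):

{noformat}
public abstract class Codec {
  // Existing style: postings are consumed once, when the in-memory
  // segment is flushed to disk.
  public abstract FieldsConsumer fieldsConsumer(SegmentWriteState state)
      throws IOException;

  // Proposed style (hypothetical): stored fields must be consumed per
  // document, while the in-memory segment is still being built.
  public abstract StoredFieldsWriter storedFieldsWriter(SegmentWriteState state)
      throws IOException;
}
{noformat}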

 Extend Codec to handle also stored fields and term vectors
 --

 Key: LUCENE-2621
 URL: https://issues.apache.org/jira/browse/LUCENE-2621
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: 4.0
Reporter: Andrzej Bialecki 
  Labels: gsoc2011, lucene-gsoc-11

 Currently the Codec API handles only the writing/reading of term-related data,
 while stored fields data and term frequency vector data are written/read
 elsewhere.
 I propose to extend the Codec API to handle this data as well.





Re: Participating in GSoC'11 with Lucene

2011-03-13 Thread Simon Willnauer
On Sun, Mar 13, 2011 at 12:11 AM, Michael McCandless
luc...@mikemccandless.com wrote:
 Simon these are great summaries -- can you post them on the issues too?  
 Thanks!

done!

simon

 On Sat, Mar 12, 2011 at 4:35 PM, Simon Willnauer
 simon.willna...@googlemail.com wrote:
 Hey,

 On Sat, Mar 12, 2011 at 5:32 PM, Zhijie Shen zjshe...@gmail.com wrote:
 Hi developers,

 I'm a graduate student from the National University of Singapore, majoring in
 Computer Science. My enthusiasm for open source and information retrieval
 drives me to participate in GSoC'11 with your community. I first got to know
 Lucene when I was a software engineering intern at IBM, working on Lotus
 Connections.

 Awesome and welcome to Lucene :)

 Now I've already checked out the source code and successfully built it
 locally. Meanwhile, I have begun to read through the JIRA issues, and am most
 interested in issues 2308, 2309 and 2621, which seem to be refactoring
 tasks (please correct me if I'm wrong). My personal feeling is that these
 tasks will be more appropriate for a beginner to get into. Moreover, I think
 that to start with such a big project, it is more efficient to read through the
 discussion on JIRA to understand the problem, and then dive into the related
 code with the problem kept in mind. What is your opinion? I'm looking
 forward to your guidance.

 Apparently you survived the first steps to get into Lucene and Solr!
 Great! You also looked at JIRA, which is even better. So lemme tell you
 a few words about the issues you have listed.

 LUCENE-2621 - Extend Codec to handle also stored fields and term vectors
 This is a very interesting and at the same time much-needed
 feature which involves API design, refactoring, and an in-depth
 understanding of how IndexWriter and its internals work. The API which
 needs to be refactored (the Codec API) was made to consume posting lists
 once an in-memory index segment is flushed to disk. Yet, to expose
 stored fields to this API, we need to prepare it to consume data for
 every document while we build the in-memory segment. So there is a
 little paradigm mismatch here which needs to be addressed.

 LUCENE-2309 - Fully decouple IndexWriter from analyzers

 This one is something I have been looking forward to for quite a while; it
 would pave the way for analysis capabilities other than the ones
 Lucene offers today. It seems to be refactoring-heavier than the
 others but might require less knowledge about the IndexWriter (IW)
 internals than the codec one. Yet, it still is a very interesting
 issue / project to work on, and fairly self-contained.
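
As a sketch of the gist: Lucene already has a Field constructor that accepts a
pre-analyzed TokenStream, so one can picture the decoupled world as making this
the general path (myCustomAnalysisChain below is hypothetical):

Document doc = new Document();
TokenStream tokens = myCustomAnalysisChain(reader); // analysis done outside IW
doc.add(new Field("body", tokens));  // Field(String, TokenStream): pre-analyzed
writer.addDocument(doc);             // IW consumes tokens, never calls an Analyzer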

 LUCENE-2308 - Separately specify a field's type

 FieldType aims, on the one hand, to separate field properties from the
 actual value and, on the other, to make Field easier to extend. Both
 seem equally important, while far from easy to achieve. Fieldable and
 Field are a core API, and changes to them need to be well thought out.
 Further, this issue can easily cause drastic performance degradation if not
 done right. Consider this a massive change, since fields are used
 almost everywhere in Lucene and Solr.

 I wrote those little summaries not to scare you away, not at all! I
 rather tried to lay out what to expect from the issues and to make it
 easier for you to pick one or another which you would like to
 work on. I will try to update the descriptions of those issues if they
 are not already clear enough (LUCENE-2621 seems kind of too brief,
 though) in the next couple of days.

 If you have any questions regarding those issues or any others, feel
 free to ask here on the list or on the issue directly (if you don't have
 a JIRA account already, you should get one :)).
 Reading the JIRA issues might help you understand what those issues are
 about, but those are usually written by core devs or long-time
 contributors, so please ask any questions you have and don't hesitate to
 ask if you have problems with anything.

 Simon

 Regards,
 Zhijie

 --
 Zhijie Shen
 School of Computing
 National University of Singapore








 --
 Mike

 http://blog.mikemccandless.com





[jira] Commented: (LUCENE-2308) Separately specify a field's type

2011-03-13 Thread Robert Muir (JIRA)

[ https://issues.apache.org/jira/browse/LUCENE-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13006185#comment-13006185 ]

Robert Muir commented on LUCENE-2308:
-

{quote}
Moving stuff from Solr to Lucene involves lots of politics. It is way easier to 
let Solr adopt eventually than fight your way through the politics (this is my 
opinion though.)
{quote}

Then why do we still have merged codebases?
If this is the way things are, then we should un-merge the two projects. 

Because, as a Lucene developer, I spend a lot of time trying to do my part to
fix various things in Solr... if it's a one-way street then we need to un-merge.


 Separately specify a field's type
 -

 Key: LUCENE-2308
 URL: https://issues.apache.org/jira/browse/LUCENE-2308
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Michael McCandless
  Labels: gsoc2011, lucene-gsoc-11
 Fix For: 4.0


 This came up from discussions on IRC.  I'm summarizing here...
 Today when you make a Field to add to a document you can set things like
 indexed or not, stored or not, analyzed or not, and details like omitTfAP,
 omitNorms, indexing term vectors (separately controlling
 offsets/positions), etc.
 I think we should factor these out into a new class (FieldType?).
 Then you could re-use this FieldType instance across multiple fields.
 The Field instance would still hold the actual value.
 We could then do per-field analyzers by adding a setAnalyzer on the
 FieldType, instead of the separate PerFieldAnalyzerWrapper (likewise
 for per-field codecs (with flex), where we now have
 PerFieldCodecWrapper).
 This would NOT be a schema!  It's just refactoring what we already
 specify today.  E.g. it's not serialized into the index.
 This has been discussed before, and I know Michael Busch opened a more
 ambitious (I think?) issue.  I think this is a good first baby step.  We could
 consider a hierarchy of FieldType (NumericFieldType, etc.) but maybe hold
 off on that for starters...




[jira] Commented: (LUCENE-2308) Separately specify a field's type

2011-03-13 Thread Chris Male (JIRA)

[ https://issues.apache.org/jira/browse/LUCENE-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13006187#comment-13006187 ]

Chris Male commented on LUCENE-2308:


bq. I'm surprised to barely even see a mention of Solr here, which, of course,
already has a FieldType. Might it be ported?

I think there is a lot of overlap, but Solr's FieldTypes also integrate with its
schema through SchemaField, so maybe it's an option to port the overlap and then
let Solr extend whatever is created, to provide its schema integration and
Solr-specific functions?

 Separately specify a field's type
 -

 Key: LUCENE-2308
 URL: https://issues.apache.org/jira/browse/LUCENE-2308
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Michael McCandless
  Labels: gsoc2011, lucene-gsoc-11
 Fix For: 4.0


 This came up from discussions on IRC.  I'm summarizing here...
 Today when you make a Field to add to a document you can set things like
 indexed or not, stored or not, analyzed or not, and details like omitTfAP,
 omitNorms, indexing term vectors (separately controlling
 offsets/positions), etc.
 I think we should factor these out into a new class (FieldType?).
 Then you could re-use this FieldType instance across multiple fields.
 The Field instance would still hold the actual value.
 We could then do per-field analyzers by adding a setAnalyzer on the
 FieldType, instead of the separate PerFieldAnalyzerWrapper (likewise
 for per-field codecs (with flex), where we now have
 PerFieldCodecWrapper).
 This would NOT be a schema!  It's just refactoring what we already
 specify today.  E.g. it's not serialized into the index.
 This has been discussed before, and I know Michael Busch opened a more
 ambitious (I think?) issue.  I think this is a good first baby step.  We could
 consider a hierarchy of FieldType (NumericFieldType, etc.) but maybe hold
 off on that for starters...




[HUDSON] Lucene-Solr-tests-only-3.x - Build # 5874 - Failure

2011-03-13 Thread Apache Hudson Server
Build: https://hudson.apache.org/hudson/job/Lucene-Solr-tests-only-3.x/5874/

1 tests failed.
REGRESSION:  org.apache.solr.client.solrj.embedded.SolrExampleStreamingTest.testCommitWithin

Error Message:
expected:<0> but was:<1>

Stack Trace:
junit.framework.AssertionFailedError: expected:<0> but was:<1>
    at org.apache.solr.client.solrj.SolrExampleTests.testCommitWithin(SolrExampleTests.java:327)
    at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1075)
    at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1007)




Build Log (for compile errors):
[...truncated 10129 lines...]






[jira] Commented: (LUCENE-2944) BytesRef reuse bugs in QueryParser and analysis.jsp

2011-03-13 Thread Simon Willnauer (JIRA)

[ https://issues.apache.org/jira/browse/LUCENE-2944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13006192#comment-13006192 ]

Simon Willnauer commented on LUCENE-2944:
-

I did a brief review - the naming looks good from my side, though...

simon

 BytesRef reuse bugs in QueryParser and analysis.jsp
 ---

 Key: LUCENE-2944
 URL: https://issues.apache.org/jira/browse/LUCENE-2944
 Project: Lucene - Java
  Issue Type: Bug
Reporter: Robert Muir
Assignee: Robert Muir
 Fix For: 4.0

 Attachments: LUCENE-2944.patch, LUCENE-2944.patch, LUCENE-2944.patch, 
 LUCENE-2944.patch, LUCENE-2944_option2.patch


 Some code uses BytesRef as if it were a String; in this case, consumers of
 TermToBytesRefAttribute.
 The thing is, while our general implementation works on char[] and then
 populates the consumer's BytesRef,
 not all TermToBytesRefAttribute implementations do this. Specifically, ICU
 collation reuses the bytes and simply sets the pointers:
 {noformat}
   @Override
   public int toBytesRef(BytesRef target) {
     collator.getRawCollationKey(toString(), key);
     target.bytes = key.bytes;
     target.offset = 0;
     target.length = key.size;
     return target.hashCode();
   }
 {noformat}
 Most of the blame falls on me, as I added this to the query parser in
 LUCENE-2514.
 Attached is a patch so that these consumers re-use a 'spare' and copy the
 bytes when they are going to make a long-lasting object such as a Term.
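
The consumer-side pattern the patch implies looks roughly like this (a sketch;
the exact copy method name here is an assumption):

{noformat}
BytesRef spare = new BytesRef();   // reused across tokens
termAtt.toBytesRef(spare);         // may merely point into shared bytes
BytesRef copy = new BytesRef();
copy.copyBytes(spare);             // deep-copy before any long-term use
Term term = new Term(field, copy); // safe: the Term owns its own bytes
{noformat}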




Lucene Solr a one way street?

2011-03-13 Thread Simon Willnauer
Hey folks,

I have recently tried to push some refactorings towards moving stuff
from Solr to modules land, to enable users of Lucene to benefit from
the developments that have been made in Solr land in the past, with
very little success. Actually, it was a really disappointing
experience whenever I tried to kick off issues in this
direction. On LUCENE-2308 David asked a good question: why is FieldType
not ported to Lucene rather than being a new development?
I replied with:

{quote}
Moving stuff from Solr to Lucene involves lots of politics. It is way
easier to let Solr adopt eventually than fight your way through the
politics (this is my opinion though.)
{quote}

The answer to his question doesn't matter in this context,
but it raised another question from Robert's side:

{quote}
Then why do we still have merged codebases?
If this is the way things are, then we should un-merge the two projects.

Because, as a Lucene developer, I spend a lot of time trying to do my
part to fix various things in Solr... if it's a one-way street then we
need to un-merge.
{quote}

The discussions on LUCENE-2883 changed my personal perception of how
things work here quite dramatically. I lost almost all enthusiasm to
even try to push developments towards moving things out of Solr and
into modules, since literally every movement in this direction starts a
lot of politics (at least this is my understanding, drawn from the
rather non-technical disagreements I have seen folks mention on
this issue). I don't care where those politics come from, but in my
opinion we need to find an agreement on how we deal with stealing
functionality from Solr and making it available to Lucene users. My
personal opinion is that any refactoring, consolidation of APIs, etc.
should NOT be influenced by the fact that they have been Solr-private
and might influence further development on Solr with regards to
backwards compatibility etc.

Moving features to modules should be first priority, and please correct
me if I am wrong, but this was one of the major reasons why we merged the
code base. All users should benefit from the nice features which are
currently hidden in the Solr code base. FunctionQuery is a tiny one,
IMO, and the frustration it caused on my side was immense. I don't even
wanna try to suggest making replication, faceting or even the cloud
feature decoupled from Solr (I don't want to argue about the
feasibility or the amount of work this would be here! Just lemme say
one thing: we can always branch, and there is a large workforce out
there that is willing to work on stuff like that).

I can only agree with Robert that if this is a one-way street, the
merge makes no sense anymore. Refactoring will bring a lot of benefits
to both Lucene AND Solr. I also wanna quote Earwin here (I can't find
the issue right now - this quote is from memory): we should try to
decouple things rather than couple them even more tightly. I have moved
out of all Solr issues since then, and I am not willing to do any work on
them until we find an agreement on this.

I should have raised this issue earlier, I guess, but I'd rather do it now
than never.

thanks for reading,

simon




Re: Lucene Solr a one way street?

2011-03-13 Thread Grant Ingersoll

On Mar 13, 2011, at 10:23 AM, Simon Willnauer wrote:

 Hey folks,
 
 I have recently tried to push some refactorings towards moving stuff
 from Solr to modules land, to enable users of Lucene to benefit from
 the developments that have been made in Solr land in the past, with
 very little success. Actually, it was a really disappointing
 experience whenever I tried to kick off issues in this
 direction. On LUCENE-2308 David asked a good question: why is FieldType
 not ported to Lucene rather than being a new development?
 I replied with:
 
 {quote}
 Moving stuff from Solr to Lucene involves lots of politics. It is way
 easier to let Solr adopt eventually than fight your way through the
 politics (this is my opinion though.)
 {quote}

I hadn't looked at 2308, but to me, if there are well-written patches, then
they should be considered.  More modules make a lot of sense to me, as long as
everyone is kept whole and there are no performance losses.  Moving FieldTypes to
Lucene seems like a lot of work for little benefit, but maybe that is just me.
It seems like most Lucene users like to roll their own in that stuff or use Spring.

 
 The answer to his question doesn't matter in this context,
 but it raised another question from Robert's side:
 
 {quote}
 Then why do we still have merged codebases?
 If this is the way things are, then we should un-merge the two projects.
 
 Because, as a Lucene developer, I spend a lot of time trying to do my
 part to fix various things in Solr... if it's a one-way street then we
 need to un-merge.
 {quote}
 
 The discussions on LUCENE-2883 changed my personal perception of how
 things work here quite dramatically. I lost almost all enthusiasm to
 even try to push developments towards moving things out of Solr and
 into modules, since literally every movement in this direction starts a
 lot of politics (at least this is my understanding, drawn from the
 rather non-technical disagreements I have seen folks mention on
 this issue). I don't care where those politics come from, but in my
 opinion we need to find an agreement on how we deal with stealing
 functionality from Solr and making it available to Lucene users. My
 personal opinion is that any refactoring, consolidation of APIs, etc.
 should NOT be influenced by the fact that they have been Solr-private
 and might influence further development on Solr with regards to
 backwards compatibility etc.
 

I actually thought 2883 was a pretty good discussion.  The sum takeaway from
it for me was: go for it.  One person was hesitant about it.  I think
sometimes you need to just put up patches instead of having lengthy discussions.


 Moving features to modules should be first priority, and please correct
 me if I am wrong, but this was one of the major reasons why we merged the
 code base.

I don't think it is a first priority, but it is a benefit.  I also don't think
it was the major reason for the merge.  I think the major reason was that
most of the Solr committers were also Lucene committers, and there was a fair
amount of duplicated work and a desire to be on the same version.
Modularization was/is also a benefit.

FWIW, I think the merge for the most part has been successful in most places.  
We have better tested code, faster code, etc.

 All users should benefit from the nice features which are
 currently hidden in the Solr code base. FunctionQuery is a tiny one,
 IMO, and the frustration it caused on my side was immense. I don't even
 wanna try to suggest making replication, faceting or even the cloud
 feature decoupled from Solr (I don't want to argue about the
 feasibility or the amount of work this would be here! Just lemme say
 one thing: we can always branch, and there is a large workforce out
 there that is willing to work on stuff like that).

 I can only agree with Robert that if this is a one-way street, the
 merge makes no sense anymore.

I guess the question people w/ Solr-only hats on have (if there are such
people) is: which way is that street going?  It seems like most people want to
pull stuff out of Solr, but they don't seem to want to put into it.  That's
probably where some of the resistance comes from.  If you want to modularize
everything so that you can consume it outside of Solr, it usually means you
don't use Solr, which sometimes comes across as if you don't care whether the
modularization actually has a negative effect on Solr.  I'm all for
modularization and enabling everyone, but not at the cost of a loss of
performance in Solr.  As tightly coupled as Solr is, it's pretty damn fast and
resilient.  Show me that you keep that whole and I'll be +1 on everything.

You also have to keep in mind that some of these things, replication for
instance, rely on Solr things.  Are you really going to move the HTTP protocols
to just Lucene?  What does that even mean?  Lucene is a Java API.  It doesn't
assume containers, etc.  Solr is the thing that delivers Lucene in a container.

To me, lower 

[jira] Commented: (LUCENE-2308) Separately specify a field's type

2011-03-13 Thread Yonik Seeley (JIRA)

[ https://issues.apache.org/jira/browse/LUCENE-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13006213#comment-13006213 ]

Yonik Seeley commented on LUCENE-2308:
--

bq. I think there is a lot of overlap, but Solr's FieldTypes also integrate with
its schema through SchemaField, so maybe it's an option to port the overlap and
then let Solr extend whatever is created, to provide its schema integration and
Solr-specific functions?

Yeah, that seems reasonable.

 Separately specify a field's type
 -

 Key: LUCENE-2308
 URL: https://issues.apache.org/jira/browse/LUCENE-2308
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Michael McCandless
  Labels: gsoc2011, lucene-gsoc-11
 Fix For: 4.0


 This came up from discussions on IRC.  I'm summarizing here...
 Today when you make a Field to add to a document you can set things like
 indexed or not, stored or not, analyzed or not, and details like omitTfAP,
 omitNorms, indexing term vectors (separately controlling
 offsets/positions), etc.
 I think we should factor these out into a new class (FieldType?).
 Then you could re-use this FieldType instance across multiple fields.
 The Field instance would still hold the actual value.
 We could then do per-field analyzers by adding a setAnalyzer on the
 FieldType, instead of the separate PerFieldAnalyzerWrapper (likewise
 for per-field codecs (with flex), where we now have
 PerFieldCodecWrapper).
 This would NOT be a schema!  It's just refactoring what we already
 specify today.  E.g. it's not serialized into the index.
 This has been discussed before, and I know Michael Busch opened a more
 ambitious (I think?) issue.  I think this is a good first baby step.  We could
 consider a hierarchy of FieldType (NumericFieldType, etc.) but maybe hold
 off on that for starters...




[jira] Commented: (LUCENE-2960) Allow (or bring back) the ability to setRAMBufferSizeMB on an open IndexWriter

2011-03-13 Thread Shay Banon (JIRA)

[ https://issues.apache.org/jira/browse/LUCENE-2960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13006214#comment-13006214 ]

Shay Banon commented on LUCENE-2960:


Just a note regarding the IWC and being able to consult it for live changes: it
feels strange to me that setting something on the config will affect the IW in
real time. Maybe it's just me, but it feels nicer to have the live setters on
IW compared to IWC.

I also like the ability to decouple construction-time configuration (through
IWC) from live settings (through setters on IW). It is then very clear what can
be set at construction time, and what can be set on a live IW. It also allows
for a compile-time / static check in the code of what can be changed at which
lifecycle phase.
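
For illustration, the split described above might look like this (a sketch
only; the live setter is exactly what this issue requests, not the current
trunk API):

{noformat}
// Construction-time, fixed configuration goes through IWC:
IndexWriterConfig conf = new IndexWriterConfig(Version.LUCENE_40, analyzer)
    .setOpenMode(IndexWriterConfig.OpenMode.CREATE_OR_APPEND);
IndexWriter writer = new IndexWriter(dir, conf);

// Runtime-tunable settings would live on the writer itself:
writer.setRAMBufferSizeMB(64.0);   // proposed live setter
{noformat}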

 Allow (or bring back) the ability to setRAMBufferSizeMB on an open IndexWriter
 --

 Key: LUCENE-2960
 URL: https://issues.apache.org/jira/browse/LUCENE-2960
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Shay Banon
Priority: Blocker
 Fix For: 3.1, 4.0


 In 3.1 the ability to setRAMBufferSizeMB is deprecated, and it is removed in
 trunk. It would be great to be able to control that on a live IndexWriter.
 Two other methods that it would be great to bring back are
 setTermIndexInterval and setReaderTermsIndexDivisor. Most of the other
 setters can actually be set on the MergePolicy itself, so no need for setters
 for those (I think).




[jira] Commented: (LUCENE-2308) Separately specify a field's type

2011-03-13 Thread Simon Willnauer (JIRA)

[ https://issues.apache.org/jira/browse/LUCENE-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13006215#comment-13006215 ]

Simon Willnauer commented on LUCENE-2308:
-

{quote}
I think there is a lot of overlap, but Solr's FieldTypes also integrate with its
schema through SchemaField, so maybe it's an option to port the overlap and then
let Solr extend whatever is created, to provide its schema integration and
Solr-specific functions?
{quote}

+1

 Separately specify a field's type
 -

 Key: LUCENE-2308
 URL: https://issues.apache.org/jira/browse/LUCENE-2308
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Michael McCandless
  Labels: gsoc2011, lucene-gsoc-11
 Fix For: 4.0


 This came up from discussions on IRC.  I'm summarizing here...
 Today when you make a Field to add to a document you can set things like
 indexed or not, stored or not, analyzed or not, and details like omitTfAP,
 omitNorms, indexing term vectors (separately controlling
 offsets/positions), etc.
 I think we should factor these out into a new class (FieldType?).
 Then you could re-use this FieldType instance across multiple fields.
 The Field instance would still hold the actual value.
 We could then do per-field analyzers by adding a setAnalyzer on the
 FieldType, instead of the separate PerFieldAnalyzerWrapper (likewise
 for per-field codecs (with flex), where we now have
 PerFieldCodecWrapper).
 This would NOT be a schema!  It's just refactoring what we already
 specify today.  E.g. it's not serialized into the index.
 This has been discussed before, and I know Michael Busch opened a more
 ambitious (I think?) issue.  I think this is a good first baby step.  We could
 consider a hierarchy of FieldType (NumericFieldType, etc.) but maybe hold
 off on that for starters...




Re: Lucene Solr a one way street?

2011-03-13 Thread Robert Muir
On Sun, Mar 13, 2011 at 11:47 AM, Grant Ingersoll gsing...@apache.org wrote:
 I guess the question people w/ Solr-only hats on have (if there are such
 people) is: which way is that street going?  It seems like most people want
 to pull stuff out of Solr, but they don't seem to want to put into it.
 That's probably where some of the resistance comes from.  If you want to
 modularize everything so that you can consume it outside of Solr, it usually
 means you don't use Solr, which sometimes comes across as if you don't care
 whether the modularization actually has a negative effect on Solr.  I'm all
 for modularization and enabling everyone, but not at the cost of a loss of
 performance in Solr.  As tightly coupled as Solr is, it's pretty damn fast
 and resilient.  Show me that you keep that whole and I'll be +1 on everything.

Do you have any facts to back up these baseless accusations?

Because I'll tell you how it seems to me: Lucene committers are
going well beyond what's required (fixing Solr) to commit changes to
Lucene.

Take a look at the commits list; we are the ones doing Solr's dirty work:
* Like Uwe Schindler fixing up tons of XML-related bugs in Solr,
fixing analysis.jsp and the related request handlers.
* Like Simon Willnauer doing the necessary improvements to IndexReader
such that SolrIndexReader need not exist, and trying to add good codec
support to Solr so it can take advantage of flexible indexing.

And I guess I didn't put any effort into Solr when I spent a huge
chunk of this weekend tracking down JRE crashes and test bugs in a
Solr cloud test?!

As far as modularization having a negative performance effect on Solr,
how is this the case? Again, do you have any concrete examples, or is
this just more baseless accusations?

Do you have specific benchmarks showing that Solr's analysis is now
somehow slower due to the refactoring (since this is the only
modularization that's happened from Solr)?!
It doesn't look slower to me:
http://www.lucidimagination.com/search/document/46a8351089a98aec/protwords_txt_support_in_stemmers#46a8351089a98aec




[jira] Commented: (LUCENE-2749) Co-occurrence filter

2011-03-13 Thread Elmar Pitschke (JIRA)

[ https://issues.apache.org/jira/browse/LUCENE-2749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13006222#comment-13006222 ]

Elmar Pitschke commented on LUCENE-2749:


Hi,
I am fairly new to Lucene development, but I have plenty of experience using it
:). I would like to make some contribution, and I think this would be a good task
for me to start with, as I am fairly interested in the analysis part. Can I work
on this task, or has any work been done on it yet?
Regards
   Elmar

 Co-occurrence filter
 

 Key: LUCENE-2749
 URL: https://issues.apache.org/jira/browse/LUCENE-2749
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Analysis
Affects Versions: 3.1, 4.0
Reporter: Steven Rowe
Priority: Minor
 Fix For: 4.0


 The co-occurrence filter to be developed here will output sets of tokens that 
 co-occur within a given window onto a token stream.  
 These token sets can be ordered either lexically (to allow order-independent 
 matching/counting) or positionally (e.g. sliding windows of positionally 
 ordered co-occurring terms that include all terms in the window are called 
 n-grams or shingles). 
 The parameters to this filter will be: 
 * window size: this can be a fixed sequence length, sentence/paragraph 
 context (these will require sentence/paragraph segmentation, which is not in 
 Lucene yet), or over the entire token stream (full field width)
 * minimum number of co-occurring terms: >= 2
 * maximum number of co-occurring terms: <= window size
 * token set ordering (lexical or positional)
 One use case for co-occurring token sets is as candidates for collocations.
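
As a concrete toy example of the simplest configuration (window size n, exactly
two co-occurring terms, lexical ordering; a sketch, not the proposed filter):

{noformat}
import java.util.*;

// Emits one term per lexically ordered pair co-occurring within 'window'.
static List<String> pairTerms(List<String> tokens, int window) {
  List<String> out = new ArrayList<String>();
  for (int i = 0; i < tokens.size(); i++) {
    for (int j = i + 1; j < tokens.size() && j < i + window; j++) {
      String a = tokens.get(i), b = tokens.get(j);
      // lexical ordering makes matching order-independent
      out.add(a.compareTo(b) <= 0 ? a + "_" + b : b + "_" + a);
    }
  }
  return out;
}
{noformat}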




[jira] Commented: (LUCENE-2621) Extend Codec to handle also stored fields and term vectors

2011-03-13 Thread Zhijie Shen (JIRA)

[ https://issues.apache.org/jira/browse/LUCENE-2621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13006223#comment-13006223 ]

Zhijie Shen commented on LUCENE-2621:
-

I have some questions to make sure I'm looking at the correct code:
When you mention the Codec API, do you mean the abstract class
org.apache.lucene.index.codecs.Codec?
Term vectors refer to org.apache.lucene.index.TermFreqVector, and they are
processed by TermVectorsWriter now, correct?
But what are the stored fields? I cannot find them immediately.

BTW, is there any design document for Lucene in the wiki?


 Extend Codec to handle also stored fields and term vectors
 --

 Key: LUCENE-2621
 URL: https://issues.apache.org/jira/browse/LUCENE-2621
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: 4.0
Reporter: Andrzej Bialecki 
  Labels: gsoc2011, lucene-gsoc-11

 Currently the Codec API handles only the writing/reading of term-related data,
 while stored fields data and term frequency vector data are written/read
 elsewhere.
 I propose to extend the Codec API to handle this data as well.




[jira] Commented: (LUCENE-2960) Allow (or bring back) the ability to setRAMBufferSizeMB on an open IndexWriter

2011-03-13 Thread Earwin Burrfoot (JIRA)

[ https://issues.apache.org/jira/browse/LUCENE-2960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13006227#comment-13006227 ]

Earwin Burrfoot commented on LUCENE-2960:
-

{quote}
Why such purity? What do we gain?

I'm all for purity, but only if it doesn't interfere w/ functionality.
Here, it's taking away freedom...
{quote}
We gain consistency and predictability. And there are a lot of freedoms that
are dangerous for developers.

{quote}
In fact it should be fine to share an IWC across multiple writers; you
can change the RAM buffer for all of them at once.
{quote}

You've brought up a purrfect example of how NOT to do things.
This is called 'action at a distance' and is a damn bug. A very annoying one.
I thoroughly experienced it with the previous major version of Apache HTTPClient
- it had an API that suggested you could set per-request timeouts, while these
were actually global for a single Client instance.
I fried my brain trying to understand why the hell random user requests timed
out at a hundred times their intended duration.
Oh! It was an occasional admin request changing the global.

<irony>You know, you can actually instantiate some DateRangeFilter with a
couple of Dates, and then change these Dates (they are writeable) before each
request. Isn't it an exciting kind of programming freedom?
Or, back to our current discussion - we could pass RAMBufferSizeMB as an
AtomicDouble instead of the current double; then we could use .set() on the
instance we passed, and have our live reconfigurability. What's more,
AtomicDouble protects us from word tearing!</irony>

bq. I doubt there's any JVM out there where our lack-of-volatile infoStream
causes any problems.
Er.. While I have never personally witnessed unsynchronized long/double tearing,
I've seen the consequences of unsafely publishing a HashMap - an endless loop on
get().
It happened on your run-of-the-mill Sun 1.6 JVM.
So the bug is there, lying in wait. Maybe nobody ever actually used the freedom
to change infoStream in flight, or the guy was lucky, or in his particular
situation the field was guarded by some unrelated sync.




While I see banishing live reconfiguration from IW as a lost cause, I ask that
we at least make IWC immutable. As Shay said, this will provide a clear
barrier between mutable and immutable properties.

 Allow (or bring back) the ability to setRAMBufferSizeMB on an open IndexWriter
 --

 Key: LUCENE-2960
 URL: https://issues.apache.org/jira/browse/LUCENE-2960
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Shay Banon
Priority: Blocker
 Fix For: 3.1, 4.0


 In 3.1 the ability to setRAMBufferSizeMB is deprecated, and it is removed in
 trunk. It would be great to be able to control that on a live IndexWriter.
 Two other methods that it would be great to bring back are
 setTermIndexInterval and setReaderTermsIndexDivisor. Most of the other
 setters can actually be set on the MergePolicy itself, so no need for setters
 for those (I think).




[jira] Commented: (LUCENE-2958) WriteLineDocTask improvements

2011-03-13 Thread Shai Erera (JIRA)

[ https://issues.apache.org/jira/browse/LUCENE-2958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13006251#comment-13006251 ]

Shai Erera commented on LUCENE-2958:


A thought -- how about we do the following:
# LineDocSource remains basically as it is today, with an extendable
getDocData(String line) method for whoever wants it.
# Instead of introducing those Fillers, you create a HeaderLineDocSource which
assumes the first line read is the header line, and parses the following ones
accordingly. It will extend LDS, overriding getDocData (see the sketch below).

This will not introduce a Filler into LDS, and those who don't care about it
don't need to know about it at all. Also, it will showcase the extensibility
of LDS.

Will that be simpler?
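
A self-contained toy of the header-line idea (illustrative names, not the
actual benchmark-contrib API):

{noformat}
import java.io.*;
import java.util.*;

class HeaderLineReader {
  private final BufferedReader in;
  private String[] fieldNames;   // parsed lazily from the first line

  HeaderLineReader(Reader r) { in = new BufferedReader(r); }

  /** Returns fieldName-to-value for the next doc line, or null at EOF. */
  Map<String,String> next() throws IOException {
    String line = in.readLine();
    if (line == null) return null;
    if (fieldNames == null) {    // the first line read is the header
      fieldNames = line.split("\t");
      return next();
    }
    String[] vals = line.split("\t", -1);
    Map<String,String> doc = new LinkedHashMap<String,String>();
    for (int i = 0; i < fieldNames.length && i < vals.length; i++) {
      doc.put(fieldNames[i], vals[i]);
    }
    return doc;
  }
}
{noformat}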

 WriteLineDocTask improvements
 -

 Key: LUCENE-2958
 URL: https://issues.apache.org/jira/browse/LUCENE-2958
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/benchmark
Reporter: Doron Cohen
Assignee: Doron Cohen
Priority: Minor
 Fix For: 3.2, 4.0

 Attachments: LUCENE-2958.patch, LUCENE-2958.patch, LUCENE-2958.patch


 Make WriteLineDocTask and LineDocSource more flexible/extendable:
 * allow to emit lines also for empty docs (keep current behavior as default)
 * allow more/less/other fields




[jira] Commented: (LUCENE-2749) Co-occurrence filter

2011-03-13 Thread Steven Rowe (JIRA)

[ https://issues.apache.org/jira/browse/LUCENE-2749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13006254#comment-13006254 ]

Steven Rowe commented on LUCENE-2749:
-

Hi Elmar,

I haven't had a chance to do more than an hour or two of work on this, and that 
was a while back, so please feel free to run with it.

You should know, though, that Robert Muir and Yonik Seeley (both Lucene/Solr 
developers) expressed skepticism (on #lucene IRC) about whether this filter 
belongs in Lucene itself, because in their experience, collocations are used by 
non-search software, and they believe that Lucene should remain focused 
exclusively on search.  

Robert Muir also thinks that components that support Boolean search (i.e., not 
ranked search) should go elsewhere.  

I personally disagree with these restrictions in general, and I think that a 
co-occurrence filter could directly support search.  See this 
solr-u...@lucene.apache.org mailing list discussion for an example I gave (and 
one of the reasons I made this issue): 
http://www.lucidimagination.com/search/document/f69f877e0fa05d17/how_do_i_this_in_solr#d9d5932e7074d356
 . In this thread, I described a way to solve the original poster's problem 
using a co-occurrence filter exactly like the one proposed here.

I mention all this to caution you that work you put in here may never be 
committed to Lucene itself.

The mailing list thread I mentioned above describes the main limitations a 
filter like this will have: combinatoric explosion of generated terms.  I 
haven't figured out how to manage this, but it occurs to me that the 
two-term-collocation case is less problematic in this regard than the 
generalized case (whole-field window, all possible combinations).  I had a 
vague implementation conception of incrementing a fixed-width integer to 
iterate over the combinations, using the integer's bits to include/exclude 
input terms in the output termset tokens.  Using a 32-bit integer to track 
combinations would limit the length of an input token stream to 32 tokens, but 
in the generalized case of all combinations, I'm pretty sure that the number of 
bits available would not be the limiting factor, but rather the number of 
generated terms.  I guess the question is how to handle cases that produce 
fewer terms than all combinations of terms from an input token stream, e.g. the 
two-term-collocation case, without imposing the restrictions necessary in the 
generalized case.
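
For what it's worth, the fixed-width-integer idea in code (a sketch of the
combination enumeration only, not a TokenFilter):

{noformat}
import java.util.*;

// Masks 1 .. 2^n - 1 enumerate every non-empty subset of the n input
// tokens; bit i of the mask includes/excludes token i. n must stay very
// small, since the output size is 2^n - 1 term sets.
static List<List<String>> allTermSets(List<String> tokens) {
  int n = tokens.size();   // assume n <= 31 here
  List<List<String>> sets = new ArrayList<List<String>>();
  for (int mask = 1; mask < (1 << n); mask++) {
    List<String> set = new ArrayList<String>();
    for (int i = 0; i < n; i++) {
      if ((mask & (1 << i)) != 0) set.add(tokens.get(i));
    }
    sets.add(set);
  }
  return sets;
}
{noformat}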

Here are a couple of recent information retrieval papers using "termset" to
mean an indexed token containing multiple input terms:

TSS: Efficient Term Set Search in Large Peer-to-Peer Textual Collections
http://www.cs.ust.hk/~liu/TSS-TC.pdf

Termset-based Indexing and Query Processing in P2P Search
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5384831

(Sorry, I couldn't find a free public location for the second paper.)

 Co-occurrence filter
 

 Key: LUCENE-2749
 URL: https://issues.apache.org/jira/browse/LUCENE-2749
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Analysis
Affects Versions: 3.1, 4.0
Reporter: Steven Rowe
Priority: Minor
 Fix For: 4.0


 The co-occurrence filter to be developed here will output sets of tokens that 
 co-occur within a given window onto a token stream.  
 These token sets can be ordered either lexically (to allow order-independent 
 matching/counting) or positionally (e.g. sliding windows of positionally 
 ordered co-occurring terms that include all terms in the window are called 
 n-grams or shingles). 
 The parameters to this filter will be: 
 * window size: this can be a fixed sequence length, sentence/paragraph 
 context (these will require sentence/paragraph segmentation, which is not in 
 Lucene yet), or over the entire token stream (full field width)
 * minimum number of co-occurring terms: >= 2
 * maximum number of co-occurring terms: <= window size
 * token set ordering (lexical or positional)
 One use case for co-occurring token sets is as candidates for collocations.




[jira] Updated: (LUCENE-2958) WriteLineDocTask improvements

2011-03-13 Thread Doron Cohen (JIRA)

 [ https://issues.apache.org/jira/browse/LUCENE-2958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doron Cohen updated LUCENE-2958:


Attachment: LUCENE-2958.patch

bq. Will that be simpler?
It will be simpler, I admit, but it will be harder to manage:
* when re-reading the input file (with repeat=true), special treatment of the
header line is needed. Also, we cannot assume that the header line exists,
because there are line files out there without it, and I would not like to
force recreating those.
* the simple LDS of today handles no header line. As such, if there is one, it
will wrongly treat it as a regular line. But I would like it to be able to
handle both old files (with no header) and new files (with the header). Hmm...
we could, for that, write the header only if it differs from the default
header. Perhaps that will work.

I'll take another look at this; meanwhile I'm attaching an updated patch with
the two inner DocDataLineReaders.

 WriteLineDocTask improvements
 -

 Key: LUCENE-2958
 URL: https://issues.apache.org/jira/browse/LUCENE-2958
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/benchmark
Reporter: Doron Cohen
Assignee: Doron Cohen
Priority: Minor
 Fix For: 3.2, 4.0

 Attachments: LUCENE-2958.patch, LUCENE-2958.patch, LUCENE-2958.patch, 
 LUCENE-2958.patch


 Make WriteLineDocTask and LineDocSource more flexible/extendable:
 * allow emitting lines also for empty docs (keep current behavior as default)
 * allow more/less/other fields

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2958) WriteLineDocTask improvements

2011-03-13 Thread Doron Cohen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13006269#comment-13006269
 ] 

Doron Cohen commented on LUCENE-2958:
-

Rethinking this suggestion, I am afraid it will easily lead users to 
errors/mistakes - users would have to keep track: 

did I create that file with a header? Hmm... then I must use the source that 
handles the header; but that file uses the default settings, so it needs the 
simple reader; but then, did I set it to create the header anyhow... I don't 
remember, I'll recreate the file... 

Maybe some users will remember such things, but I know that I will not, and a 
line reader that correctly handles all inputs out-of-the-box is much more 
convenient... which is what I liked in the header suggestion.

 WriteLineDocTask improvements
 -

 Key: LUCENE-2958
 URL: https://issues.apache.org/jira/browse/LUCENE-2958
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/benchmark
Reporter: Doron Cohen
Assignee: Doron Cohen
Priority: Minor
 Fix For: 3.2, 4.0

 Attachments: LUCENE-2958.patch, LUCENE-2958.patch, LUCENE-2958.patch, 
 LUCENE-2958.patch


 Make WriteLineDocTask and LineDocSource more flexible/extendable:
 * allow emitting lines also for empty docs (keep current behavior as default)
 * allow more/less/other fields

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Lucene Solr a one way street?

2011-03-13 Thread David Smiley (@MITRE.org)
It's good to see this discussion finally happen.  Some things are in Solr
(e.g. faceting, function queries, Yonik's recent join patch, ...) that
probably belong in Lucene.  As someone contributing functionality to
Lucene/Solr in SOLR-2155 (Geospatial search via geohash prefix techniques),
anecdotally I find it most convenient for the code to span both Lucene and
Solr, particularly Solr on the testing side.  (Of course each patch is
different, but this is my experience.)  The testing infrastructure on the
Solr side is excellent.  As I look to implement sorting, it's going to be
difficult to have a non-Solr user take the code, since the sorting capability
is wrapped up in Solr concepts like function queries.

~ David Smiley

-
 Author: https://www.packtpub.com/solr-1-4-enterprise-search-server/book
--
View this message in context: 
http://lucene.472066.n3.nabble.com/Lucene-Solr-a-one-way-street-tp2672821p2674116.html
Sent from the Lucene - Java Developer mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: GPU acceleration

2011-03-13 Thread k...@kenobrien.org
Vectorizable code would be the major target, yes.

Do you know if the guys from Yandex were using CUDA or OpenCL approaches, or
the old kind of OpenGL hacks?
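
For background, group varint - the encoding mentioned below - packs four 
integers behind one selector byte whose 2-bit fields give each value's byte 
length. A scalar decode of one common layout, sketched purely as an 
illustration (not Yandex's code):

{code}
// Scalar group varint decode, sketched for illustration (one common layout,
// not Yandex's code): 1 selector byte, then four values of 1-4 bytes each,
// where bits 2i..2i+1 of the selector give (length - 1) of value i.
public class GroupVarintSketch {
  /** Decodes 4 ints starting at buf[pos]; returns the position after them. */
  static int decode4(byte[] buf, int pos, int[] out) {
    int sel = buf[pos++] & 0xFF;
    for (int i = 0; i < 4; i++) {
      int len = ((sel >>> (i * 2)) & 3) + 1;    // byte length of value i
      int v = 0;
      for (int b = 0; b < len; b++) {
        v |= (buf[pos++] & 0xFF) << (8 * b);    // little-endian reassembly
      }
      out[i] = v;
    }
    return pos;
  }

  public static void main(String[] args) {
    // selector 0b00_00_01_00: value 1 is 2 bytes, the rest 1 byte each
    byte[] buf = { 0x04, 7, (byte) 0x2C, 0x01, 3, 9 };
    int[] out = new int[4];
    decode4(buf, 0, out);
    System.out.println(java.util.Arrays.toString(out)); // [7, 300, 3, 9]
  }
}
{code}

The data-dependent inner loop is the branchy part that SSE implementations 
typically replace with a single shuffle driven by a 256-entry mask table 
indexed on the selector byte.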


On Sun, Mar 13, 2011 at 8:31 AM, Earwin Burrfoot ear...@gmail.com wrote:

 On Sun, Mar 13, 2011 at 00:15, Ken O'Brien k...@kenobrien.org wrote:
  To clarify, I've not yet written any code. I aim to bring a large speedup
 to
  any functionality that is computationally expensive. I'm wondering which
  components are candidates for this.
 
  I'll be looking through the code but if anyone is aware of parallelizable
  code, I'll start with that.
 More like 'vectorizable' code, huh?

 Guys from Yandex use modified group varint encoding plus handcrafted
 SSE magic to decode/intersect posting lists and claim tremendous
 speedups over original group varint.
 They also use SSE to run the decision trees used in ranking.

 There were experiments with moving both pieces of code to the GPU, and
 GPU did well in terms of speed, but they say getting data in and out
 of GPU made the approach unfeasible.

  I'll basically replicate existing functionality to run on the gpu.
 
  On 12/03/11 21:08, Simon Willnauer wrote:
 
  On Sat, Mar 12, 2011 at 9:21 PM, Ken O'Brienk...@kenobrien.org  wrote:
 
  Hi,
 
  Is anyone looking at GPU acceleration for Solr? If not, I'd like to
  contribute code which adds this functionality.
 
  As I'm not familiar with the codebase, does anyone know which areas of
  functionality could benefit from high degrees of parallelism.
 
  Very interesting can you elaborate a little more what kind of
  functionality you exposed / try to expose to the GPU?
 
  simon
 
  Regards,
 
  Ken
 
 
 
  -
  To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
  For additional commands, e-mail: dev-h...@lucene.apache.org
 
 
 
  -
  To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
  For additional commands, e-mail: dev-h...@lucene.apache.org
 
 



 --
 Kirill Zakharenko/Кирилл Захаренко
 E-Mail/Jabber: ear...@gmail.com
 Phone: +7 (495) 683-567-4
 ICQ: 104465785

 -
 To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: dev-h...@lucene.apache.org




[jira] Commented: (SOLR-1725) Script based UpdateRequestProcessorFactory

2011-03-13 Thread David Smiley (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13006284#comment-13006284
 ] 

David Smiley commented on SOLR-1725:


This capability is awesome, just as its DIH equivalent is.  It's been over a 
year since any activity here.  As time passes, the case for moving to Java 6 
increases.  Short of that, I don't see a problem with this working like the 
DIH ScriptTransformer does, by using the Java 6 features via reflection.
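
For reference, the JDK 6 API in question is javax.script. A minimal sketch of 
the pattern (the script body, the exposed global, and the function name are 
illustrative only, not the patch's actual contract):

{code}
// Minimal javax.script usage (JDK 6+): evaluate a script, then call a
// function defined in it. Script content and names are illustrative only.
import javax.script.Invocable;
import javax.script.ScriptEngine;
import javax.script.ScriptEngineManager;

public class ScriptProcessorSketch {
  public static void main(String[] args) throws Exception {
    ScriptEngine engine =
        new ScriptEngineManager().getEngineByExtension("js");
    engine.put("logger", System.out);          // expose a global, as the
                                               // factory does for req/rsp/logger
    engine.eval("function processAdd(doc) { logger.println(doc); }");
    Invocable inv = (Invocable) engine;
    inv.invokeFunction("processAdd", "doc-1"); // dispatch by function name
  }
}
{code}

(The reflection approach mentioned above would issue the same calls via 
java.lang.reflect so the code still compiles under Java 5.)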

 Script based UpdateRequestProcessorFactory
 --

 Key: SOLR-1725
 URL: https://issues.apache.org/jira/browse/SOLR-1725
 Project: Solr
  Issue Type: New Feature
  Components: update
Affects Versions: 1.4
Reporter: Uri Boness
 Attachments: SOLR-1725.patch, SOLR-1725.patch, SOLR-1725.patch, 
 SOLR-1725.patch, SOLR-1725.patch


 A script based UpdateRequestProcessorFactory (Uses JDK6 script engine 
 support). The main goal of this plugin is to be able to configure/write 
 update processors without the need to write and package Java code.
 The update request processor factory enables writing update processors in 
 scripts located in the {{solr.solr.home}} directory. The factory accepts one 
 (mandatory) configuration parameter named {{scripts}}, which accepts a 
 comma-separated list of file names. It will look for these files under the 
 {{conf}} directory in solr home. When multiple scripts are defined, their 
 execution order is defined by the lexicographical order of the script file 
 names (so {{scriptA.js}} will be executed before {{scriptB.js}}).
 The script language is resolved based on the script file extension (that is, 
 a *.js file will be treated as a JavaScript script); therefore an extension 
 is mandatory.
 Each script file is expected to have one or more methods with the same 
 signature as the methods in the {{UpdateRequestProcessor}} interface. It is 
 *not* required to define all methods, only those that are required by the 
 processing logic.
 The following variables are defined as global variables for each script:
  * {{req}} - The SolrQueryRequest
  * {{rsp}} - The SolrQueryResponse
  * {{logger}} - A logger that can be used for logging purposes in the script

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Updated: (SOLR-2347) Use InputStream and not Reader for XML parsing

2011-03-13 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler updated SOLR-2347:


  Component/s: contrib - DataImportHandler
Fix Version/s: 4.0
   3.2

 Use InputStream and not Reader for XML parsing
 --

 Key: SOLR-2347
 URL: https://issues.apache.org/jira/browse/SOLR-2347
 Project: Solr
  Issue Type: Bug
  Components: contrib - DataImportHandler
Reporter: Uwe Schindler
Assignee: Uwe Schindler
 Fix For: 3.2, 4.0


 Followup to SOLR-96:
 Solr mostly uses java.io.Reader and passes this Reader to the XML parser. 
 According to the XML spec, an XML file should initially be seen as a binary 
 stream with a default charset of UTF-8 or another charset given by the 
 network protocol (like the Content-Type header in HTTP). But, very 
 importantly, this default charset is only a hint to the parser - the charset 
 from the XML header processing instruction is mandatory. Because of this, 
 the parser must be able to change the charset while reading the XML header 
 (possibly also when seeing BOM markers). This is not possible if the XML 
 parser gets a java.io.Reader instead of a java.io.InputStream. SOLR-96 
 already fixed this for the XmlUpdateRequestHandler and the 
 DocumentAnalysisRequestHandler. This issue should fix the rest to conform to 
 the XML spec (open schema.xml and config.xml as InputStream, not Reader, and 
 others).
 This change would not break anything in Solr (perhaps only backwards 
 compatibility in the API), as the default used by XML parsers is UTF-8.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (SOLR-2347) Use InputStream and not Reader for XML parsing

2011-03-13 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13006285#comment-13006285
 ] 

Uwe Schindler commented on SOLR-2347:
-

This now only affects DIH.

DIH has to be changed so that all the base classes that open files or read 
from the network use InputStreams rather than Readers. This is easy to do, 
but breaks backwards compatibility (which is no problem, as DIH is contrib 
and experimental).
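
To illustrate the difference this issue is about: the parser can only honor 
the encoding declaration when it controls the byte-to-char decoding. A 
minimal JAXP sketch (illustration only, not the DIH change itself):

{code}
// The parser can only honor <?xml ... encoding="..."?> when it controls
// the byte-to-char decoding, i.e. when it is given an InputStream.
import java.io.InputStream;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class XmlOpenSketch {
  static Document parse(InputStream in) throws Exception {
    DocumentBuilder db =
        DocumentBuilderFactory.newInstance().newDocumentBuilder();
    // Right: bytes in, the parser sniffs BOM + encoding declaration itself.
    return db.parse(new InputSource(in));
    // Wrong (what this issue removes): new InputSource(new
    // InputStreamReader(in, "UTF-8")) pins the charset before the parser
    // ever sees the declaration.
  }
}
{code}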

 Use InputStream and not Reader for XML parsing
 --

 Key: SOLR-2347
 URL: https://issues.apache.org/jira/browse/SOLR-2347
 Project: Solr
  Issue Type: Bug
  Components: contrib - DataImportHandler
Reporter: Uwe Schindler
Assignee: Uwe Schindler
 Fix For: 3.2, 4.0


 Followup to SOLR-96:
 Solr mostly uses java.io.Reader and passes this Reader to the XML parser. 
 According to the XML spec, an XML file should initially be seen as a binary 
 stream with a default charset of UTF-8 or another charset given by the 
 network protocol (like the Content-Type header in HTTP). But, very 
 importantly, this default charset is only a hint to the parser - the charset 
 from the XML header processing instruction is mandatory. Because of this, 
 the parser must be able to change the charset while reading the XML header 
 (possibly also when seeing BOM markers). This is not possible if the XML 
 parser gets a java.io.Reader instead of a java.io.InputStream. SOLR-96 
 already fixed this for the XmlUpdateRequestHandler and the 
 DocumentAnalysisRequestHandler. This issue should fix the rest to conform to 
 the XML spec (open schema.xml and config.xml as InputStream, not Reader, and 
 others).
 This change would not break anything in Solr (perhaps only backwards 
 compatibility in the API), as the default used by XML parsers is UTF-8.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (SOLR-2347) Use InputStream and not Reader for XML parsing

2011-03-13 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13006286#comment-13006286
 ] 

Mark Miller commented on SOLR-2347:
---

bq. which is no problem as DIH is contrib and experimental

+1

 Use InputStream and not Reader for XML parsing
 --

 Key: SOLR-2347
 URL: https://issues.apache.org/jira/browse/SOLR-2347
 Project: Solr
  Issue Type: Bug
  Components: contrib - DataImportHandler
Reporter: Uwe Schindler
Assignee: Uwe Schindler
 Fix For: 3.2, 4.0


 Followup to SOLR-96:
 Solr mostly uses java.io.Reader and passes this Reader to the XML parser. 
 According to the XML spec, an XML file should initially be seen as a binary 
 stream with a default charset of UTF-8 or another charset given by the 
 network protocol (like the Content-Type header in HTTP). But, very 
 importantly, this default charset is only a hint to the parser - the charset 
 from the XML header processing instruction is mandatory. Because of this, 
 the parser must be able to change the charset while reading the XML header 
 (possibly also when seeing BOM markers). This is not possible if the XML 
 parser gets a java.io.Reader instead of a java.io.InputStream. SOLR-96 
 already fixed this for the XmlUpdateRequestHandler and the 
 DocumentAnalysisRequestHandler. This issue should fix the rest to conform to 
 the XML spec (open schema.xml and config.xml as InputStream, not Reader, and 
 others).
 This change would not break anything in Solr (perhaps only backwards 
 compatibility in the API), as the default used by XML parsers is UTF-8.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Created: (SOLR-2425) firstSearcher Listener of SpellChecker can be never called

2011-03-13 Thread Koji Sekiguchi (JIRA)
firstSearcher Listener of SpellChecker can be never called
--

 Key: SOLR-2425
 URL: https://issues.apache.org/jira/browse/SOLR-2425
 Project: Solr
  Issue Type: Bug
  Components: spellchecker
Affects Versions: 1.4.1, 3.1, 4.0
Reporter: Koji Sekiguchi
Priority: Minor


mail thread:
http://www.lucidimagination.com/search/document/65e73468958faf09/known_problem_firstsearcher_event_of_spellchecker_is_never_called

The firstSearcher listener of SpellChecker is never called when there is no 
<listener event="firstSearcher"/> registered in solrconfig.xml.

The reason is the sequence of procedures in the SolrCore constructor:

# initListeners();
# getSearcher(false,false,null); <= register the (general) firstSearcher 
listener if it exists
# call SolrCoreAware.inform(); <= register SpellChecker's firstSearcher listener

After that, Callable.call() is called to execute the firstSearcher event:

{code}
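// note: SolrCoreAware.inform() has not run yet at this point, so
// SpellChecker's firstSearcher listener may be missing when the
// size() check below is evaluated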
if (currSearcher==null && firstSearcherListeners.size() > 0) {
  future = searcherExecutor.submit(
      new Callable() {
        public Object call() throws Exception {
          try {
            for (SolrEventListener listener : firstSearcherListeners) {
              listener.newSearcher(newSearcher,null);
            }
          } catch (Throwable e) {
            SolrException.logOnce(log,null,e);
          }
          return null;
        }
      }
  );
}
{code}

At that time, firstSearcherListeners should include SpellChecker's 
firstSearcherListener, registered by procedure 3 above. But if you have no 
<listener event="firstSearcher"/> registered in solrconfig.xml, then at 
procedure 2, searcherExecutor.submit() is never called, because 
firstSearcherListeners.size() is zero at that moment.
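
One possible direction, sketched here only as an illustration and not as a 
committed fix: submit the Callable unconditionally and let it iterate 
whatever listeners exist when it runs:

{code}
// Sketch of one possible direction, not a committed fix: submit the task
// unconditionally and move the emptiness concern inside call(), so listeners
// registered later (e.g. by SolrCoreAware.inform()) can still be seen -
// provided the executor runs the task after inform() has completed.
if (currSearcher == null) {
  future = searcherExecutor.submit(
      new Callable() {
        public Object call() throws Exception {
          try {
            for (SolrEventListener listener : firstSearcherListeners) {
              listener.newSearcher(newSearcher, null);
            }
          } catch (Throwable e) {
            SolrException.logOnce(log, null, e);
          }
          return null;
        }
      });
}
{code}

Whether this suffices depends on when the executor runs the task relative to 
SolrCoreAware.inform() - see the timing note in the following comment.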


--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (SOLR-2425) firstSearcher Listener of SpellChecker can be never called

2011-03-13 Thread Koji Sekiguchi (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13006313#comment-13006313
 ] 

Koji Sekiguchi commented on SOLR-2425:
--

bq. At that time, firstSearcherListeners should include SpellChecker's 
firstSearcherListener, registered by procedure 3 above. But if you have no 
<listener event="firstSearcher"/> registered in solrconfig.xml, then at 
procedure 2, searcherExecutor.submit() is never called, because 
firstSearcherListeners.size() is zero at that moment.

This was a bit misleading.

I think there is a timing issue. Regardless of the existence of <listener 
event="firstSearcher"/> in solrconfig.xml, SpellChecker's firstSearcher 
listener may never be called, because Callable.call() can be called before 
SolrCoreAware.inform() executes.


 firstSearcher Listener of SpellChecker can be never called
 --

 Key: SOLR-2425
 URL: https://issues.apache.org/jira/browse/SOLR-2425
 Project: Solr
  Issue Type: Bug
  Components: spellchecker
Affects Versions: 1.4.1, 3.1, 4.0
Reporter: Koji Sekiguchi
Priority: Minor

 mail thread:
 http://www.lucidimagination.com/search/document/65e73468958faf09/known_problem_firstsearcher_event_of_spellchecker_is_never_called
 The firstSearcher listener of SpellChecker is never called when there is no 
 <listener event="firstSearcher"/> registered in solrconfig.xml.
 The reason is the sequence of procedures in the SolrCore constructor:
 # initListeners();
 # getSearcher(false,false,null); <= register the (general) firstSearcher 
 listener if it exists
 # call SolrCoreAware.inform(); <= register SpellChecker's firstSearcher 
 listener
 After that, Callable.call() is called to execute the firstSearcher event:
 {code}
 if (currSearcher==null && firstSearcherListeners.size() > 0) {
   future = searcherExecutor.submit(
       new Callable() {
         public Object call() throws Exception {
           try {
             for (SolrEventListener listener : firstSearcherListeners) {
               listener.newSearcher(newSearcher,null);
             }
           } catch (Throwable e) {
             SolrException.logOnce(log,null,e);
           }
           return null;
         }
       }
   );
 }
 {code}
 At that time, firstSearcherListeners should include SpellChecker's 
 firstSearcherListener, registered by procedure 3 above. But if you have no 
 <listener event="firstSearcher"/> registered in solrconfig.xml, then at 
 procedure 2, searcherExecutor.submit() is never called, because 
 firstSearcherListeners.size() is zero at that moment.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: GPU acceleration

2011-03-13 Thread Lance Norskog
There were experiments with moving both pieces of code to the GPU, and
GPU did well in terms of speed, but they say getting data in and out
of GPU made the approach unfeasible.

That would be the problem with GPUs :) Co-processors always have this
problem. GPUs were designed for 3D video games - the data flow into the
GPU is low-volume (textures & triangles) and the data flow out is (I
assume) pretty minimal.
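
A back-of-the-envelope sketch of that budget (all figures below are 
assumptions for illustration, not measurements):

{code}
// Back-of-the-envelope only - every figure below is an assumption, not a
// measurement. Compares bus-transfer time against just decoding on the CPU.
public class TransferBudget {
  public static void main(String[] args) {
    double compressedMB = 256;     // assumed compressed postings per batch
    double decodedMB = 1024;       // assumed ~4x expansion after decoding
    double pcieGBps = 6.0;         // assumed effective PCIe bandwidth
    double cpuDecodeGBps = 2.0;    // assumed CPU decode throughput

    double toGpuMs = compressedMB / (pcieGBps * 1024) * 1000;     // ~42 ms
    double fromGpuMs = decodedMB / (pcieGBps * 1024) * 1000;      // ~167 ms
    double cpuMs = compressedMB / (cpuDecodeGBps * 1024) * 1000;  // ~125 ms

    // Even with free GPU compute, ~209 ms of bus traffic loses to ~125 ms
    // of CPU work on data that is already in RAM.
    System.out.printf("gpu transfer %.0f ms vs cpu decode %.0f ms%n",
        toGpuMs + fromGpuMs, cpuMs);
  }
}
{code}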

2011/3/13 k...@kenobrien.org k...@kenobrien.org:
 Vectorizable code would be the major target, yes.

 Do you know if the guys from Yandex were using CUDA or OpenCL approaches? or
 the old kind of opengl hacks.


 On Sun, Mar 13, 2011 at 8:31 AM, Earwin Burrfoot ear...@gmail.com wrote:

 On Sun, Mar 13, 2011 at 00:15, Ken O'Brien k...@kenobrien.org wrote:
  To clarify, I've not yet written any code. I aim to bring a large
  speedup to
  any functionality that is computationally expensive. I'm wondering which
  components are candidates for this.
 
  I'll be looking through the code but if anyone is aware of
  parallelizable
  code, I'll start with that.
 More like 'vectorizable' code, huh?

 Guys from Yandex use modified group varint encoding plus handcrafted
 SSE magic to decode/intersect posting lists and claim tremendous
 speedups over original group varint.
 They also use SSE to run the decision trees used in ranking.

 There were experiments with moving both pieces of code to the GPU, and
 GPU did well in terms of speed, but they say getting data in and out
 of GPU made the approach unfeasible.

  I'll basically replicate existing functionality to run on the gpu.
 
  On 12/03/11 21:08, Simon Willnauer wrote:
 
  On Sat, Mar 12, 2011 at 9:21 PM, Ken O'Brienk...@kenobrien.org  wrote:
 
  Hi,
 
  Is anyone looking at GPU acceleration for Solr? If not, I'd like to
  contribute code which adds this functionality.
 
  As I'm not familiar with the codebase, does anyone know which areas of
  functionality could benefit from high degrees of parallelism.
 
  Very interesting can you elaborate a little more what kind of
  functionality you exposed / try to expose to the GPU?
 
  simon
 
  Regards,
 
  Ken
 
 
 
  -
  To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
  For additional commands, e-mail: dev-h...@lucene.apache.org
 
 
 
  -
  To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
  For additional commands, e-mail: dev-h...@lucene.apache.org
 
 



 --
 Kirill Zakharenko/Кирилл Захаренко
 E-Mail/Jabber: ear...@gmail.com
 Phone: +7 (495) 683-567-4
 ICQ: 104465785

 -
 To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: dev-h...@lucene.apache.org






-- 
Lance Norskog
goks...@gmail.com

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: GPU acceleration

2011-03-13 Thread k...@kenobrien.org
I see. I will conduct my own experiments with more recent GPU libraries.
Thanks for everyone's input.
On Mar 14, 2011 2:30 AM, Lance Norskog goks...@gmail.com wrote:
 There were experiments with moving both pieces of code to the GPU, and
 GPU did well in terms of speed, but they say getting data in and out
 of GPU made the approach unfeasible.

 That would be the problem with GPUs :) Co-processors always have this
 problem. GPUs were designed for 3d video games- the data flow into the
 GPU is low-volume (textures  triangles) and the data flow out is (I
 assume) pretty minimal.

 2011/3/13 k...@kenobrien.org k...@kenobrien.org:
 Vectorizable code would be the major target, yes.

 Do you know if the guys from Yandex were using CUDA or OpenCL approaches?
or
 the old kind of opengl hacks.


 On Sun, Mar 13, 2011 at 8:31 AM, Earwin Burrfoot ear...@gmail.com
wrote:

 On Sun, Mar 13, 2011 at 00:15, Ken O'Brien k...@kenobrien.org wrote:
  To clarify, I've not yet written any code. I aim to bring a large
  speedup to
  any functionality that is computationally expensive. I'm wondering
which
  components are candidates for this.
 
  I'll be looking through the code but if anyone is aware of
  parallelizable
  code, I'll start with that.
 More like 'vectorizable' code, huh?

 Guys from Yandex use modified group varint encoding plus handcrafted
 SSE magic to decode/intersect posting lists and claim tremendous
 speedups over original group varint.
 They also use SSE to run the decision trees used in ranking.

 There were experiments with moving both pieces of code to the GPU, and
 GPU did well in terms of speed, but they say getting data in and out
 of GPU made the approach unfeasible.

  I'll basically replicate existing functionality to run on the gpu.
 
  On 12/03/11 21:08, Simon Willnauer wrote:
 
  On Sat, Mar 12, 2011 at 9:21 PM, Ken O'Brienk...@kenobrien.org
 wrote:
 
  Hi,
 
  Is anyone looking at GPU acceleration for Solr? If not, I'd like to
  contribute code which adds this functionality.
 
  As I'm not familiar with the codebase, does anyone know which areas
of
  functionality could benefit from high degrees of parallelism.
 
  Very interesting can you elaborate a little more what kind of
  functionality you exposed / try to expose to the GPU?
 
  simon
 
  Regards,
 
  Ken
 
 
 
 
-
  To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
  For additional commands, e-mail: dev-h...@lucene.apache.org
 
 
 
  -
  To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
  For additional commands, e-mail: dev-h...@lucene.apache.org
 
 



 --
 Kirill Zakharenko/Кирилл Захаренко
 E-Mail/Jabber: ear...@gmail.com
 Phone: +7 (495) 683-567-4
 ICQ: 104465785

 -
 To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: dev-h...@lucene.apache.org






 --
 Lance Norskog
 goks...@gmail.com

 -
 To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (SOLR-2337) Solr needs hits= added to the log when using grouping

2011-03-13 Thread Bill Bell (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13006330#comment-13006330
 ] 

Bill Bell commented on SOLR-2337:
-

The only downfall with this patch is that hits=10 means the following:

1. if there are 20 rows that match and are grouped, and rows=10, then hits 
will be 10, since the short circuit is still obeyed (otherwise there could be 
a performance issue with looping until the end of the results).
2. if there are 4 rows that match and are grouped, and you say rows=10, hits 
will be 4 (normal behavior).
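
For the record, a sketch of the kind of change the description asks for (the 
Grouping.Command accessors here are assumed names for illustration; the 
attached patch is authoritative):

{code}
// Sketch only - Grouping.Command#getMatches() and getCommands() are assumed
// accessors for illustration; see the attached SOLR.2337.patch for the real
// change. The idea: sum per-command matches after grouping.execute() and
// put the total into the response's toLog, which is printed as hits=.
int hits = 0;
for (Grouping.Command cmd : grouping.getCommands()) {
  hits += cmd.getMatches();          // matches counted during execute()
}
rb.rsp.getToLog().add("hits", hits); // shows up as hits= in the request log
{code}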


 Solr needs hits= added to the log when using grouping 
 --

 Key: SOLR-2337
 URL: https://issues.apache.org/jira/browse/SOLR-2337
 Project: Solr
  Issue Type: Bug
  Components: SearchComponents - other
Affects Versions: 4.0
Reporter: Bill Bell
 Fix For: 4.0

 Attachments: SOLR.2337.patch


 We monitor the Solr logs to try to review queries that have hits=0. This 
 enables us to improve relevancy since they are easy to find and review.
 When using group=true, hits= does not show up:
 {code}
 2011-01-27 01:10:16,117 INFO  core.SolrCore  - [collection1] webapp= 
 path=/select params={group=true&group.field=gender&group.field=id&q=*:*} 
 status=0 QTime=15
 {code}
 The code in QueryComponent.java needs to add up the matches() values after 
 calling grouping.execute() to get the total.
 It does return hits= in the log for mainResult, but not for standard grouping.
 This should be easy to add since matches are defined...

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org