[Lucene.Net] hudkins update
For everyone's benefit: I started putting together a home server this weekend (not because we're going to use that as a build server, but because I needed a sandbox to install Jenkins on and see which plugins work and which do not). There has been a constant trickle of issues on the build list from upgrading to Jenkins, so it seems wise to do a due-diligence mock setup. It will probably take me a little over a week to work through this. I'll either screenshot the results or see if I can somehow make a demo available within a 72-hour period. - Michael
Use JCC wrapper from C++ (no python)
Hello all, I'm curious how closely tied the JCC/Lucene wrappers are to Python. I've been looking for a way to use Lucene from C/C++. I'm aware of CLucene, but it appears to have some serious issues, and I was pointed to PyLucene. Is it conceivable that one could use the JCC-generated code from C/C++ in the absence of Python? Or is the JCC code too closely tied to the Python interfaces to use without heavy modification? Thanks, Ian
Re: GPU acceleration
On Sun, Mar 13, 2011 at 00:15, Ken O'Brien k...@kenobrien.org wrote:
> To clarify, I've not yet written any code. I aim to bring a large speedup to any functionality that is computationally expensive. I'm wondering which components are candidates for this. I'll be looking through the code, but if anyone is aware of parallelizable code, I'll start with that.

More like 'vectorizable' code, huh? Guys from Yandex use modified group varint encoding plus handcrafted SSE magic to decode/intersect posting lists and claim tremendous speedups over the original group varint. They also use SSE to run the decision trees used in ranking. There were experiments with moving both pieces of code to the GPU, and the GPU did well in terms of speed, but they say getting data in and out of the GPU made the approach infeasible.

> I'll basically replicate existing functionality to run on the GPU.
>
> On 12/03/11 21:08, Simon Willnauer wrote:
>> On Sat, Mar 12, 2011 at 9:21 PM, Ken O'Brien k...@kenobrien.org wrote:
>>> Hi, Is anyone looking at GPU acceleration for Solr? If not, I'd like to contribute code which adds this functionality. As I'm not familiar with the codebase, does anyone know which areas of functionality could benefit from high degrees of parallelism?
>>
>> Very interesting - can you elaborate a little more on what kind of functionality you exposed / try to expose to the GPU? simon
>>
>>> Regards, Ken

-- Kirill Zakharenko/Кирилл Захаренко E-Mail/Jabber: ear...@gmail.com Phone: +7 (495) 683-567-4 ICQ: 104465785

- To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
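For readers unfamiliar with the "group varint" scheme mentioned above, here is a minimal scalar sketch in Java. This is illustrative only - it is not Lucene or Yandex code, and all names are invented for the example; the SSE variants discussed in the thread accelerate decoding of essentially this same byte layout:

```java
import java.io.ByteArrayOutputStream;

public class GroupVarint {
    // Encode a group of 4 ints: one tag byte holding 2 bits per value
    // (byte length minus one), then each value little-endian in that many bytes.
    static byte[] encode(int[] v) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int tag = 0;
        byte[] body = new byte[16];
        int pos = 0;
        for (int i = 0; i < 4; i++) {
            int len = bytesNeeded(v[i]);
            tag |= (len - 1) << (i * 2);
            for (int b = 0; b < len; b++) {
                body[pos++] = (byte) (v[i] >>> (8 * b));
            }
        }
        out.write(tag);
        out.write(body, 0, pos);
        return out.toByteArray();
    }

    static int bytesNeeded(int x) {
        if ((x >>> 8) == 0) return 1;
        if ((x >>> 16) == 0) return 2;
        if ((x >>> 24) == 0) return 3;
        return 4;
    }

    // Decode one group of 4 ints starting at offset into dst;
    // returns the number of bytes consumed.
    static int decode(byte[] in, int offset, int[] dst) {
        int tag = in[offset] & 0xFF;
        int pos = offset + 1;
        for (int i = 0; i < 4; i++) {
            int len = ((tag >>> (i * 2)) & 3) + 1;
            int val = 0;
            for (int b = 0; b < len; b++) {
                val |= (in[pos++] & 0xFF) << (8 * b);
            }
            dst[i] = val;
        }
        return pos - offset;
    }
}
```

The appeal for SIMD is that a single tag byte describes a whole group, so the shuffle pattern for four values can be looked up at once instead of branching per byte as plain varint does.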
[HUDSON] Lucene-Solr-tests-only-trunk - Build # 5887 - Failure
Build: https://hudson.apache.org/hudson/job/Lucene-Solr-tests-only-trunk/5887/

2 tests failed.

REGRESSION: org.apache.solr.uima.processor.UIMAUpdateRequestProcessorTest.testProcessing

Error Message: org.apache.uima.analysis_engine.AnalysisEngineProcessException

Stack Trace:
java.lang.RuntimeException: org.apache.uima.analysis_engine.AnalysisEngineProcessException
    at org.apache.solr.uima.processor.UIMAUpdateRequestProcessor.processAdd(UIMAUpdateRequestProcessor.java:82)
    at org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:127)
    at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:77)
    at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:55)
    at org.apache.solr.uima.processor.UIMAUpdateRequestProcessorTest.addDoc(UIMAUpdateRequestProcessorTest.java:120)
    at org.apache.solr.uima.processor.UIMAUpdateRequestProcessorTest.testProcessing(UIMAUpdateRequestProcessorTest.java:76)
    at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1214)
    at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1146)
Caused by: org.apache.uima.analysis_engine.AnalysisEngineProcessException
    at org.apache.uima.annotator.calais.OpenCalaisAnnotator.process(OpenCalaisAnnotator.java:206)
    at org.apache.uima.analysis_component.CasAnnotator_ImplBase.process(CasAnnotator_ImplBase.java:56)
    at org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.callAnalysisComponentProcess(PrimitiveAnalysisEngine_impl.java:377)
    at org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.processAndOutputNewCASes(PrimitiveAnalysisEngine_impl.java:295)
    at org.apache.uima.analysis_engine.asb.impl.ASB_impl$AggregateCasIterator.processUntilNextOutputCas(ASB_impl.java:567)
    at org.apache.uima.analysis_engine.asb.impl.ASB_impl$AggregateCasIterator.init(ASB_impl.java:409)
    at org.apache.uima.analysis_engine.asb.impl.ASB_impl.process(ASB_impl.java:342)
    at org.apache.uima.analysis_engine.impl.AggregateAnalysisEngine_impl.processAndOutputNewCASes(AggregateAnalysisEngine_impl.java:267)
    at org.apache.uima.analysis_engine.impl.AnalysisEngineImplBase.process(AnalysisEngineImplBase.java:267)
    at org.apache.uima.analysis_engine.impl.AnalysisEngineImplBase.process(AnalysisEngineImplBase.java:280)
    at org.apache.solr.uima.processor.UIMAUpdateRequestProcessor.processText(UIMAUpdateRequestProcessor.java:122)
    at org.apache.solr.uima.processor.UIMAUpdateRequestProcessor.processAdd(UIMAUpdateRequestProcessor.java:69)
Caused by: java.io.IOException: Server returned HTTP response code: 503 for URL: http://api.opencalais.com/enlighten/calais.asmx/Enlighten
    at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1269)
    at org.apache.uima.annotator.calais.OpenCalaisAnnotator.callServiceOnText(OpenCalaisAnnotator.java:234)
    at org.apache.uima.annotator.calais.OpenCalaisAnnotator.process(OpenCalaisAnnotator.java:126)

REGRESSION: org.apache.solr.uima.processor.UIMAUpdateRequestProcessorTest.testTwoUpdates

Error Message: org.apache.uima.analysis_engine.AnalysisEngineProcessException

Stack Trace:
java.lang.RuntimeException: org.apache.uima.analysis_engine.AnalysisEngineProcessException
    at org.apache.solr.uima.processor.UIMAUpdateRequestProcessor.processAdd(UIMAUpdateRequestProcessor.java:82)
    at org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:127)
    at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:77)
    at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:55)
    at org.apache.solr.uima.processor.UIMAUpdateRequestProcessorTest.addDoc(UIMAUpdateRequestProcessorTest.java:120)
    at org.apache.solr.uima.processor.UIMAUpdateRequestProcessorTest.testTwoUpdates(UIMAUpdateRequestProcessorTest.java:94)
    at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1214)
    at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1146)
Caused by: org.apache.uima.analysis_engine.AnalysisEngineProcessException
    at org.apache.uima.annotator.calais.OpenCalaisAnnotator.process(OpenCalaisAnnotator.java:206)
    at org.apache.uima.analysis_component.CasAnnotator_ImplBase.process(CasAnnotator_ImplBase.java:56)
    at org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.callAnalysisComponentProcess(PrimitiveAnalysisEngine_impl.java:377)
    at org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.processAndOutputNewCASes(PrimitiveAnalysisEngine_impl.java:295)
    at org.apache.uima.analysis_engine.asb.impl.ASB_impl$AggregateCasIterator.processUntilNextOutputCas(ASB_impl.java:567)
[jira] Commented: (LUCENE-2308) Separately specify a field's type
[ https://issues.apache.org/jira/browse/LUCENE-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13006182#comment-13006182 ] Simon Willnauer commented on LUCENE-2308:

Brief Summary for GSoC Students: FieldType aims on the one hand to separate field properties from the actual value, and on the other to make Field's extensibility easier. Both seem equally important while far from easy to achieve. Fieldable and Field are a core API, and changes to them need to be well thought out. Further, this issue can easily cause drastic performance degradation if not done right. Consider this a massive change, since fields are used almost all over Lucene and Solr.

Separately specify a field's type
Key: LUCENE-2308
URL: https://issues.apache.org/jira/browse/LUCENE-2308
Project: Lucene - Java
Issue Type: Improvement
Components: Index
Reporter: Michael McCandless
Labels: gsoc2011, lucene-gsoc-11
Fix For: 4.0

This came up from discussions on IRC. I'm summarizing here... Today when you make a Field to add to a document you can set things like indexed or not, stored or not, analyzed or not, and details like omitTfAP, omitNorms, and indexing term vectors (separately controlling offsets/positions), etc. I think we should factor these out into a new class (FieldType?). Then you could re-use this FieldType instance across multiple fields. The Field instance would still hold the actual value. We could then do per-field analyzers by adding a setAnalyzer on the FieldType, instead of the separate PerFieldAnalyzerWrapper (likewise for per-field codecs (with flex), where we now have PerFieldCodecWrapper). This would NOT be a schema! It's just refactoring what we already specify today. E.g. it's not serialized into the index. This has been discussed before, and I know Michael Busch opened a more ambitious (I think?) issue. I think this is a good first baby step. We could consider a hierarchy of FieldType (NumericFieldType, etc.) but maybe hold off on that for starters...
-- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
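To make the proposal above concrete, here is a rough sketch of what "factoring the flags out into a FieldType" could look like. All names here are hypothetical - this is a reading of the issue description as written, not the API Lucene eventually shipped:

```java
// Hypothetical sketch of LUCENE-2308's idea: the index-time flags move out
// of Field into an immutable, shareable FieldType; Field keeps only the value.
public class FieldTypeSketch {
    static final class FieldType {
        final boolean indexed, stored, tokenized, omitNorms;
        FieldType(boolean indexed, boolean stored, boolean tokenized, boolean omitNorms) {
            this.indexed = indexed;
            this.stored = stored;
            this.tokenized = tokenized;
            this.omitNorms = omitNorms;
        }
    }

    static final class Field {
        final String name;
        final String value;
        final FieldType type;  // one FieldType instance reused across many fields
        Field(String name, String value, FieldType type) {
            this.name = name;
            this.value = value;
            this.type = type;
        }
    }
}
```

The point of the refactoring is that the FieldType instance is created once and shared across fields (and could later carry a per-field analyzer or codec), rather than every Field carrying its own copy of the flags; nothing here is serialized into the index, so it is not a schema.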
[jira] Commented: (LUCENE-2621) Extend Codec to handle also stored fields and term vectors
[ https://issues.apache.org/jira/browse/LUCENE-2621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13006181#comment-13006181 ] Simon Willnauer commented on LUCENE-2621:

Brief Summary for GSoC Students: This issue is about extending Codec to also handle stored fields and term vectors. This is a very interesting and at the same time very much needed feature, which involves API design, refactoring, and an in-depth understanding of how IndexWriter and its internals work. The API which needs to be refactored (the Codec API) was made to consume posting lists once an in-memory index segment is flushed to disk. Yet, to expose stored fields to this API we need to prepare it to consume data for every document while we build the in-memory segment. So there is a little paradigm mismatch here which needs to be addressed.

Extend Codec to handle also stored fields and term vectors
Key: LUCENE-2621
URL: https://issues.apache.org/jira/browse/LUCENE-2621
Project: Lucene - Java
Issue Type: Improvement
Components: Index
Affects Versions: 4.0
Reporter: Andrzej Bialecki
Labels: gsoc2011, lucene-gsoc-11

Currently the Codec API handles only writing/reading of term-related data, while stored fields data and term frequency vector data writing/reading is handled elsewhere. I propose to extend the Codec API to handle this data as well.
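The "paradigm mismatch" Simon describes can be sketched as a consumer that must be driven per document during indexing, rather than handed the whole segment once at flush time as the postings-oriented Codec API was. The types below are purely illustrative and are not real Lucene APIs:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: the shape a stored-fields consumer would need - called
// document by document while the in-memory segment is built, unlike the
// postings consumer, which sees all data once at flush time.
public class CodecSketch {
    interface StoredFieldsConsumer {
        void startDocument(int docID);
        void writeField(String name, String value);
        void finishDocument();
    }

    // Trivial implementation that records the per-document call sequence.
    static final class RecordingConsumer implements StoredFieldsConsumer {
        final List<String> events = new ArrayList<>();
        public void startDocument(int docID) { events.add("start:" + docID); }
        public void writeField(String name, String value) { events.add(name + "=" + value); }
        public void finishDocument() { events.add("end"); }
    }
}
```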
Re: Participating in GSoC'11 with Lucene
On Sun, Mar 13, 2011 at 12:11 AM, Michael McCandless luc...@mikemccandless.com wrote:
> Simon these are great summaries -- can you post them on the issues too? Thanks!

done! simon

On Sat, Mar 12, 2011 at 4:35 PM, Simon Willnauer simon.willna...@googlemail.com wrote:
> Hey,
>
> On Sat, Mar 12, 2011 at 5:32 PM, Zhijie Shen zjshe...@gmail.com wrote:
>> Hi developers, I'm a graduate student from the National University of Singapore, majoring in Computer Science. Enthusiasm for open source and information retrieval drives me to participate in GSoC'11 with your community. I first got to know Lucene when I was a software engineering intern at IBM, working on Lotus Connections.
>
> Awesome, and welcome to Lucene :)
>
>> Now I've already checked out the source code and successfully built it locally. Meanwhile, I have begun to read through the JIRA issues, and am most interested in issues 2308, 2309 and 2621, which seem to be the refactoring tasks (please correct me if I'm wrong). My personal feeling is that these tasks will be more appropriate for a beginner to get in with. Moreover, I think that to start with such a big project, it is more efficient to read through the discussion on JIRA to understand the problem, and then dive into the related code with the problem kept in mind. What is your opinion? I'm looking forward to your guidance.
>
> Apparently you survived the first steps of getting into Lucene and Solr! Great! You also looked at JIRA, which is even better. So let me tell you a few words about the issues you have listed.
>
> LUCENE-2621 - Extend Codec to handle also stored fields and term vectors: This is a very interesting and at the same time very much needed feature, which involves API design, refactoring, and an in-depth understanding of how IndexWriter and its internals work. The API which needs to be refactored (the Codec API) was made to consume posting lists once an in-memory index segment is flushed to disk. Yet, to expose stored fields to this API we need to prepare it to consume data for every document while we build the in-memory segment. So there is a little paradigm mismatch here which needs to be addressed.
>
> LUCENE-2309 - Fully decouple IndexWriter from analyzers: This one is something I have looked forward to for quite a while; it would pave the way for other analysis capabilities than the ones Lucene offers today. This seems to be refactoring-heavier than the others but might require less knowledge about the IndexWriter (IW) internals than the codec one. Yet, it still is a very interesting issue / project to work on, and fairly self-contained.
>
> LUCENE-2308 - Separately specify a field's type: FieldType aims on the one hand to separate field properties from the actual value, and on the other to make Field's extensibility easier. Both seem equally important while far from easy to achieve. Fieldable and Field are a core API, and changes to them need to be well thought out. Further, this issue can easily cause drastic performance degradation if not done right. Consider this a massive change, since fields are used almost all over Lucene and Solr.
>
> I wrote those little summaries not to scare you away, not at all! I rather tried to spell out what to expect from the issues and to make it easier for you to pick the one you would like to work on. I will try to update the descriptions of those issues if they are not already clear enough (LUCENE-2621 seems kind of too brief, though) in the next couple of days. If you have any question regarding those issues or any other, feel free to ask here on the list or on the issue directly (you might need a JIRA account; if you don't have one already, you should get one :)). Reading the JIRA issues might help you to understand what they are about, but those are usually written by core devs or long-time contributors, so please ask any question you have and don't hesitate to say so if you have problems with anything.
>
> Simon
>
>> Regards, Zhijie
>> --
>> Zhijie Shen
>> School of Computing
>> National University of Singapore

-- Mike http://blog.mikemccandless.com
[jira] Commented: (LUCENE-2308) Separately specify a field's type
[ https://issues.apache.org/jira/browse/LUCENE-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13006185#comment-13006185 ] Robert Muir commented on LUCENE-2308:

{quote} Moving stuff from Solr to Lucene involves lots of politics. It is way easier to let Solr adopt eventually than fight your way through the politics (this is my opinion though.) {quote}

Then why do we still have merged codebases? If this is the way things are, then we should un-merge the two projects. Because as a Lucene developer, I spend a lot of time trying to do my part to fix various things in Solr... if it's a one-way street, then we need to un-merge.

Separately specify a field's type
Key: LUCENE-2308
URL: https://issues.apache.org/jira/browse/LUCENE-2308
Project: Lucene - Java
Issue Type: Improvement
Components: Index
Reporter: Michael McCandless
Labels: gsoc2011, lucene-gsoc-11
Fix For: 4.0

This came up from discussions on IRC. I'm summarizing here... Today when you make a Field to add to a document you can set things like indexed or not, stored or not, analyzed or not, and details like omitTfAP, omitNorms, and indexing term vectors (separately controlling offsets/positions), etc. I think we should factor these out into a new class (FieldType?). Then you could re-use this FieldType instance across multiple fields. The Field instance would still hold the actual value. We could then do per-field analyzers by adding a setAnalyzer on the FieldType, instead of the separate PerFieldAnalyzerWrapper (likewise for per-field codecs (with flex), where we now have PerFieldCodecWrapper). This would NOT be a schema! It's just refactoring what we already specify today. E.g. it's not serialized into the index. This has been discussed before, and I know Michael Busch opened a more ambitious (I think?) issue. I think this is a good first baby step. We could consider a hierarchy of FieldType (NumericFieldType, etc.) but maybe hold off on that for starters...
[jira] Commented: (LUCENE-2308) Separately specify a field's type
[ https://issues.apache.org/jira/browse/LUCENE-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13006187#comment-13006187 ] Chris Male commented on LUCENE-2308:

bq. I'm surprised to barely even see a mention of Solr here, which of course already has a FieldType. Might it be ported?

I think there is a lot of overlap, but Solr's FieldTypes also integrate with its schema through SchemaField. So maybe it's an option to port the overlap and then let Solr extend whatever is created, to provide its schema integration / Solr-specific functions?

Separately specify a field's type
Key: LUCENE-2308
URL: https://issues.apache.org/jira/browse/LUCENE-2308
Project: Lucene - Java
Issue Type: Improvement
Components: Index
Reporter: Michael McCandless
Labels: gsoc2011, lucene-gsoc-11
Fix For: 4.0

This came up from discussions on IRC. I'm summarizing here... Today when you make a Field to add to a document you can set things like indexed or not, stored or not, analyzed or not, and details like omitTfAP, omitNorms, and indexing term vectors (separately controlling offsets/positions), etc. I think we should factor these out into a new class (FieldType?). Then you could re-use this FieldType instance across multiple fields. The Field instance would still hold the actual value. We could then do per-field analyzers by adding a setAnalyzer on the FieldType, instead of the separate PerFieldAnalyzerWrapper (likewise for per-field codecs (with flex), where we now have PerFieldCodecWrapper). This would NOT be a schema! It's just refactoring what we already specify today. E.g. it's not serialized into the index. This has been discussed before, and I know Michael Busch opened a more ambitious (I think?) issue. I think this is a good first baby step. We could consider a hierarchy of FieldType (NumericFieldType, etc.) but maybe hold off on that for starters...
[HUDSON] Lucene-Solr-tests-only-3.x - Build # 5874 - Failure
Build: https://hudson.apache.org/hudson/job/Lucene-Solr-tests-only-3.x/5874/

1 test failed.

REGRESSION: org.apache.solr.client.solrj.embedded.SolrExampleStreamingTest.testCommitWithin

Error Message: expected:0 but was:1

Stack Trace:
junit.framework.AssertionFailedError: expected:0 but was:1
    at org.apache.solr.client.solrj.SolrExampleTests.testCommitWithin(SolrExampleTests.java:327)
    at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1075)
    at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1007)

Build Log (for compile errors): [...truncated 10129 lines...]
[jira] Commented: (LUCENE-2944) BytesRef reuse bugs in QueryParser and analysis.jsp
[ https://issues.apache.org/jira/browse/LUCENE-2944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13006192#comment-13006192 ] Simon Willnauer commented on LUCENE-2944:

I had a brief review - naming looks good from my side though... simon

BytesRef reuse bugs in QueryParser and analysis.jsp
Key: LUCENE-2944
URL: https://issues.apache.org/jira/browse/LUCENE-2944
Project: Lucene - Java
Issue Type: Bug
Reporter: Robert Muir
Assignee: Robert Muir
Fix For: 4.0
Attachments: LUCENE-2944.patch, LUCENE-2944.patch, LUCENE-2944.patch, LUCENE-2944.patch, LUCENE-2944_option2.patch

Some code uses BytesRef as if it were a String, in this case consumers of TermToBytesRefAttribute. The thing is, while our general implementation works on char[] and then populates the consumer's BytesRef, not all TermToBytesRefAttribute implementations do this - specifically ICU collation, which reuses the bytes and simply sets the pointers:

{noformat}
@Override
public int toBytesRef(BytesRef target) {
  collator.getRawCollationKey(toString(), key);
  target.bytes = key.bytes;
  target.offset = 0;
  target.length = key.size;
  return target.hashCode();
}
{noformat}

Most of the blame falls on me, as I added this to the query parser in LUCENE-2514. Attached is a patch so that these consumers re-use a 'spare' and copy the bytes when they are going to make a long-lasting object such as a Term.
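To illustrate the bug class described above, here is a standalone toy - not the actual patch from this issue, and a deliberately simplified stand-in for Lucene's real BytesRef: a producer that aliases its internal buffer (like the ICU collation attribute), and the deep copy a consumer must make before keeping the ref long-term.

```java
// Minimal illustration of the BytesRef reuse hazard: if the producer only
// sets pointers into its own scratch buffer, a consumer that keeps the ref
// will see it silently change on the next call.
public class BytesRefReuse {
    static final class BytesRef {
        byte[] bytes;
        int offset;
        int length;
        BytesRef(byte[] bytes) { this.bytes = bytes; this.length = bytes.length; }
        // Deep copy: what a consumer must do before building a long-lasting
        // object (such as a Term) from the bytes.
        BytesRef deepCopy() {
            byte[] copy = new byte[length];
            System.arraycopy(bytes, offset, copy, 0, length);
            return new BytesRef(copy);
        }
    }

    // A producer that reuses one internal buffer across calls.
    static final class ReusingProducer {
        private final byte[] scratch = new byte[1];
        void toBytesRef(byte value, BytesRef target) {
            scratch[0] = value;
            target.bytes = scratch;  // aliasing: no copy is made
            target.offset = 0;
            target.length = 1;
        }
    }
}
```

The fix pattern in the issue is exactly this shape: consumers keep a reusable 'spare' for reading, and copy the bytes out of it whenever the value must outlive the next producer call.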
Lucene Solr a one way street?
Hey folks,

I have recently tried to push some refactorings towards moving stuff from Solr to modules land, to enable users of Lucene to benefit from the developments that have been made in Solr land during the past - with very little success. Actually, it was a really disappointing experience whenever I tried to kick off issues in this direction. On LUCENE-2308 David asked a good question: why is FieldType not ported to Lucene rather than being a new development? I replied with:

{quote} Moving stuff from Solr to Lucene involves lots of politics. It is way easier to let Solr adopt eventually than fight your way through the politics (this is my opinion though.) {quote}

While the answer to his question doesn't matter in this context, it raised another question from Robert's side:

{quote} Then why do we still have merged codebases? If this is the way things are, then we should un-merge the two projects. because as a lucene developer, i spend a lot of time trying to do my part to fix various things in Solr... if its a one-way-street then we need to un-merge. {quote}

The discussions on LUCENE-2883 changed my personal perception of how things work here quite dramatically. I lost almost all enthusiasm to even try to push developments towards moving things out of Solr and into modules, since literally every movement in this direction starts a lot of politics (at least this is my understanding, drawn from the rather non-technical disagreements I have seen folks mentioning on this issue). I don't care where those politics come from, but in my opinion we need to find an agreement on how we deal with "stealing" functionality from Solr and making it available to Lucene users. My personal opinion is that any refactoring, consolidation of APIs, etc. should NOT be influenced by the fact that they have been Solr-private, nor by how they might influence further development on Solr with regard to backwards compatibility etc.
Moving features to modules should be a first priority, and - please correct me if I am wrong - this was one of the major reasons why we merged the codebases. All users should benefit from the nice features which are currently hidden in the Solr codebase. FunctionQuery is a tiny one, IMO, and the frustration it caused on my side was immense. I don't even want to try to suggest decoupling replication, faceting, or even the cloud feature from Solr (I don't want to argue about the feasibility or the amount of work this would be here! Just let me say one thing: we can always branch, and there is a large workforce out there that is willing to work on stuff like that). I can only agree with Robert that if this is a one-way street, the merge makes no sense anymore. Refactoring will bring a lot of benefits to both Lucene AND Solr. I also want to quote Earwin here (I can't find the issue right now - this quote is from memory): we should try to decouple things rather than couple them even more tightly.

I have moved out of all Solr issues since then and am not willing to do any work on them until we find an agreement on this. I should have raised this issue earlier, I guess, but I'd rather do it now than never.

thanks for reading,
simon
Re: Lucene Solr a one way street?
On Mar 13, 2011, at 10:23 AM, Simon Willnauer wrote:
> Hey folks, I have recently tried to push some refactorings towards moving stuff from Solr to modules land to enable users of Lucene to benefit from the developments that have been made in Solr land during the past, with very little success. Actually, it was a really disappointing experience whenever I tried to kick off issues towards this direction. On LUCENE-2308 David asked a good question why FieldType is not ported to Lucene rather than a new development. I replied with:
>
> {quote} Moving stuff from Solr to Lucene involves lots of politics. It is way easier to let Solr adopt eventually than fight your way through the politics (this is my opinion though.) {quote}

I hadn't looked at 2308, but to me, if there are well-written patches, then they should be considered. More modules make a lot of sense to me, as long as everyone is kept whole and there are no performance losses. Moving FieldTypes to Lucene seems like a lot of work for little benefit, but maybe that is just me. It seems like most Lucene users like to roll their own in that stuff, or use Spring.

> While the answer to his question doesn't matter in this context, it raised another question from Robert's side:
>
> {quote} Then why do we still have merged codebases? If this is the way things are, then we should un-merge the two projects. because as a lucene developer, i spend a lot of time trying to do my part to fix various things in Solr... if its a one-way-street then we need to un-merge. {quote}
>
> The discussions on LUCENE-2883 changed my personal perception of how things work here quite dramatically. I lost almost all enthusiasm to even try to push developments towards moving things out of Solr and into modules, since literally every movement in this direction starts a lot of politics (at least this is my understanding drawn from the rather non-technical disagreements I have seen folks mentioning on this issue).
>
> I don't care where those politics come from, but in my opinion we need to find an agreement on how we deal with stealing functionality from Solr and making it available to Lucene users. My personal opinion is that any refactoring, consolidation of APIs etc. should NOT be influenced by the fact that they have been Solr private and might influence further development on Solr with regards to backwards compatibility etc.

I actually thought 2883 was a pretty good discussion. The sum takeaway from it for me was "go for it". One person was hesitant about it. I think sometimes you need to just put up patches instead of having lengthy discussions.

> Moving features to modules should be a first priority, and please correct me if I am wrong, this was one of the major reasons why we merged the code base.

I don't think it is a first priority, but it is a benefit. I also don't think it was the majority reason for the merge. I think the majority reason was that most of the Solr committers were also Lucene committers, there was a fair amount of duplicated work, and there was a desire to be on the same version. Modularization was/is also a benefit. FWIW, I think the merge for the most part has been successful in most places. We have better-tested code, faster code, etc.

> All users should benefit from the nice features which are currently hidden in the Solr code base. FunctionQuery is a tiny one IMO and the frustration it caused on my side was immense. I don't even wanna try to suggest to make replication, faceting or even the cloud feature decoupled from Solr (I don't want to argue about the feasibility or the amount of work this would be here! Just lemme say one thing: we can always branch and there is a large workforce out there that is willing to work on stuff like that). I can only agree with Robert that if this is a one way street then the merge makes no sense anymore.

I guess the question people with Solr-only hats on have (if there are such people) is: which way is that street going? It seems like most people want to pull stuff out of Solr, but they don't seem to want to put into it. That's probably where some of the resistance comes from. If you want to modularize everything so that you can consume it outside of Solr, it usually means you don't use Solr, which sometimes comes across as though you don't care whether the modularization actually has a negative effect on Solr. I'm all for modularization and enabling everyone, but not at the cost of a loss of performance in Solr. As tightly coupled as Solr is, it's pretty damn fast and resilient. Show me that you keep that whole and I'll be +1 on everything.

You also have to keep in mind that some of these things, replication for instance, rely on Solr things. Are you really going to move the HTTP protocols to just Lucene? What does that even mean? Lucene is a Java API. It doesn't assume containers, etc. Solr is the thing that delivers Lucene in a container. To me, lower
[jira] Commented: (LUCENE-2308) Separately specify a field's type
[ https://issues.apache.org/jira/browse/LUCENE-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13006213#comment-13006213 ] Yonik Seeley commented on LUCENE-2308: --

bq. I think there is a lot of overlap but Solr's FieldTypes also integrate with its schema through SchemaField so maybe its an option to port the overlap and then let Solr extend whatever is created, to provide its schema integration/Solr specific functions?

Yeah, that seems reasonable.

Separately specify a field's type - Key: LUCENE-2308 URL: https://issues.apache.org/jira/browse/LUCENE-2308 Project: Lucene - Java Issue Type: Improvement Components: Index Reporter: Michael McCandless Labels: gsoc2011, lucene-gsoc-11 Fix For: 4.0

This came up from discussions on IRC. I'm summarizing here... Today when you make a Field to add to a document, you can set things like indexed or not, stored or not, analyzed or not, details like omitTfAP, omitNorms, index term vectors (separately controlling offsets/positions), etc. I think we should factor these out into a new class (FieldType?). Then you could re-use this FieldType instance across multiple fields. The Field instance would still hold the actual value. We could then do per-field analyzers by adding a setAnalyzer on the FieldType, instead of the separate PerFieldAnalyzerWrapper (likewise for per-field codecs (with flex), where we now have PerFieldCodecWrapper). This would NOT be a schema! It's just refactoring what we already specify today. EG it's not serialized into the index. This has been discussed before, and I know Michael Busch opened a more ambitious (I think?) issue. I think this is a good first baby step. We could consider a hierarchy of FieldType (NumericFieldType, etc.) but maybe hold off on that for starters...

-- This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
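The factoring proposed in LUCENE-2308 is easy to picture with a small sketch. The class and field names below are illustrative only (drawn from the discussion, not from the API that was eventually committed): a FieldType bundles the per-field indexing options, and a Field holds just its name, its value, and a shared FieldType.

```java
// Illustrative sketch of the proposed FieldType refactoring.
// Names are hypothetical; this is not Lucene's committed API.
public class FieldTypeSketch {

    /** Reusable bundle of per-field indexing options. */
    static class FieldType {
        boolean indexed;
        boolean stored;
        boolean tokenized;
        boolean omitNorms;
        boolean storeTermVectors;
    }

    /** A field now holds only its name, its value, and a shared FieldType. */
    static class Field {
        final String name;
        final String value;
        final FieldType type;
        Field(String name, String value, FieldType type) {
            this.name = name;
            this.value = value;
            this.type = type;
        }
    }

    public static void main(String[] args) {
        FieldType bodyType = new FieldType();
        bodyType.indexed = true;
        bodyType.tokenized = true;
        bodyType.stored = false;

        // The same FieldType instance is reused across many Field instances,
        // rather than each Field carrying its own copy of the flags.
        Field f1 = new Field("body", "hello world", bodyType);
        Field f2 = new Field("body", "another doc", bodyType);
        System.out.println(f1.type == f2.type); // shared, not copied
    }
}
```

A per-field analyzer would then hang off the FieldType (a setAnalyzer method) instead of requiring a separate PerFieldAnalyzerWrapper.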
[jira] Commented: (LUCENE-2960) Allow (or bring back) the ability to setRAMBufferSizeMB on an open IndexWriter
[ https://issues.apache.org/jira/browse/LUCENE-2960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13006214#comment-13006214 ] Shay Banon commented on LUCENE-2960:

Just a note regarding the IWC and being able to consult it for live changes: it feels strange to me that setting something on the config will affect the IW in real time. Maybe it's just me, but it feels nicer to have the live setters on IW compared to IWC. I also like the ability to decouple construction-time configuration (through IWC) from live settings (through setters on IW). It is then very clear what can be set at construction time, and what can be set on a live IW. It also allows for compile-time/static checking of what can be changed at which lifecycle phase.

Allow (or bring back) the ability to setRAMBufferSizeMB on an open IndexWriter -- Key: LUCENE-2960 URL: https://issues.apache.org/jira/browse/LUCENE-2960 Project: Lucene - Java Issue Type: Improvement Components: Index Reporter: Shay Banon Priority: Blocker Fix For: 3.1, 4.0

In 3.1 the ability to setRAMBufferSizeMB is deprecated, and removed in trunk. It would be great to be able to control that on a live IndexWriter. Two other methods that would be great to bring back are setTermIndexInterval and setReaderTermsIndexDivisor. Most of the other setters can actually be set on the MergePolicy itself, so no need for setters for those (I think).

-- This message is automatically generated by JIRA.
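The split Shay describes can be sketched in miniature: construction-time settings live on an immutable config object, while runtime-tunable knobs are explicit setters on the writer itself, so the type system documents which is which. All names here are illustrative, not Lucene's actual API.

```java
// Hypothetical miniature of the IWC-vs-live-setter design discussed above.
public class WriterConfigSketch {

    /** Immutable: everything here is fixed at construction time. */
    static final class Config {
        final int termIndexInterval;
        Config(int termIndexInterval) { this.termIndexInterval = termIndexInterval; }
    }

    /** Runtime-tunable knobs get explicit setters on the live writer. */
    static final class Writer {
        private final Config config;
        private volatile double ramBufferSizeMB = 16.0; // volatile: changed while open
        Writer(Config config) { this.config = config; }

        void setRAMBufferSizeMB(double mb) { ramBufferSizeMB = mb; } // legal on a live writer
        double getRAMBufferSizeMB() { return ramBufferSizeMB; }
        int getTermIndexInterval() { return config.termIndexInterval; } // fixed at construction
    }

    public static void main(String[] args) {
        Writer w = new Writer(new Config(128));
        w.setRAMBufferSizeMB(64.0); // a live change, clearly allowed by the API shape
        System.out.println(w.getRAMBufferSizeMB());
    }
}
```

Because there is no setter for termIndexInterval on Writer, attempting a live change of it simply doesn't compile, which is the "static check" benefit mentioned in the comment.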
[jira] Commented: (LUCENE-2308) Separately specify a field's type
[ https://issues.apache.org/jira/browse/LUCENE-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13006215#comment-13006215 ] Simon Willnauer commented on LUCENE-2308: -

{quote} I think there is a lot of overlap but Solr's FieldTypes also integrate with its schema through SchemaField so maybe its an option to port the overlap and then let Solr extend whatever is created, to provide its schema integration/Solr specific functions? {quote}

+1

Separately specify a field's type - Key: LUCENE-2308 URL: https://issues.apache.org/jira/browse/LUCENE-2308 Project: Lucene - Java Issue Type: Improvement Components: Index Reporter: Michael McCandless Labels: gsoc2011, lucene-gsoc-11 Fix For: 4.0

-- This message is automatically generated by JIRA.
Re: Lucene Solr a one way street?
On Sun, Mar 13, 2011 at 11:47 AM, Grant Ingersoll gsing...@apache.org wrote:

I guess the question people w/ Solr only hats on have (if there are such people), is which way is that street going? It seems like most people want to pull stuff out of Solr, but they don't seem to want to put into it. That's probably where some of the resistance comes from. If you want to modularize everything so that you can consume it outside of Solr, it usually means you don't use Solr, which sometimes comes across that you don't care if the modularization actually has a negative effect on Solr. I'm all for modularization and enabling everyone, but not at the cost of loss of performance in Solr. As tightly coupled as Solr is, it's pretty damn fast and resilient. Show me that you keep that whole and I'll be +1 on everything.

Do you have any facts to back up these baseless accusations? Because I'll tell you how it seems to me: Lucene committers are going well beyond what's required (fixing Solr) to commit changes to Lucene. Take a look at the commits list; we are the ones doing Solr's dirty work:

* Like Uwe Schindler fixing up tons of XML-related bugs in Solr, fixing analysis.jsp and the related request handlers.
* Like Simon Willnauer doing the necessary improvements to IndexReader such that SolrIndexReader need not exist, and trying to add good codec support to Solr so it can take advantage of flexible indexing.

And I guess I didn't put any effort into Solr when I spent a huge chunk of this weekend tracking down JRE crashes and test bugs in a Solr cloud test?!

As far as modularization having a negative performance effect on Solr, how is this the case? Again, do you have any concrete examples, or is this just more baseless accusations? Do you have specific benchmarks where Solr's analysis is now somehow slower due to the refactoring (since this is the only modularization that's happened from Solr)?!
Doesn't look slower to me: http://www.lucidimagination.com/search/document/46a8351089a98aec/protwords_txt_support_in_stemmers#46a8351089a98aec
[jira] Commented: (LUCENE-2749) Co-occurrence filter
[ https://issues.apache.org/jira/browse/LUCENE-2749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13006222#comment-13006222 ] Elmar Pitschke commented on LUCENE-2749:

Hi, I am fairly new to Lucene development, but I have plenty of experience using it :). I would like to make some contribution, and I think this would be a good task for me to start with, as I am fairly interested in the analysis part. Can I work on this task, or has there been any work done on this yet? Regards, Elmar

Co-occurrence filter Key: LUCENE-2749 URL: https://issues.apache.org/jira/browse/LUCENE-2749 Project: Lucene - Java Issue Type: New Feature Components: Analysis Affects Versions: 3.1, 4.0 Reporter: Steven Rowe Priority: Minor Fix For: 4.0

The co-occurrence filter to be developed here will output sets of tokens that co-occur within a given window onto a token stream. These token sets can be ordered either lexically (to allow order-independent matching/counting) or positionally (e.g. sliding windows of positionally ordered co-occurring terms that include all terms in the window are called n-grams or shingles). The parameters to this filter will be:

* window size: this can be a fixed sequence length, sentence/paragraph context (these will require sentence/paragraph segmentation, which is not in Lucene yet), or the entire token stream (full field width)
* minimum number of co-occurring terms: >= 2
* maximum number of co-occurring terms: <= window size
* token set ordering (lexical or positional)

One use case for co-occurring token sets is as candidates for collocations.

-- This message is automatically generated by JIRA.
[jira] Commented: (LUCENE-2621) Extend Codec to handle also stored fields and term vectors
[ https://issues.apache.org/jira/browse/LUCENE-2621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13006223#comment-13006223 ] Zhijie Shen commented on LUCENE-2621:

I have some questions here, to make sure I'm looking into the correct code: When you mentioned the Codec API, do you mean the abstract class org.apache.lucene.index.codecs.Codec? Term vectors refer to org.apache.lucene.index.TermFreqVector, and they are processed by TermVectorsWriter now, correct? But what are the stored fields? I cannot find them immediately. BTW, is there any design document for Lucene in the Wiki?

Extend Codec to handle also stored fields and term vectors -- Key: LUCENE-2621 URL: https://issues.apache.org/jira/browse/LUCENE-2621 Project: Lucene - Java Issue Type: Improvement Components: Index Affects Versions: 4.0 Reporter: Andrzej Bialecki Labels: gsoc2011, lucene-gsoc-11

Currently the Codec API handles only writing/reading of term-related data, while stored fields data and term frequency vector data writing/reading is handled elsewhere. I propose to extend the Codec API to handle this data as well.

-- This message is automatically generated by JIRA.
[jira] Commented: (LUCENE-2960) Allow (or bring back) the ability to setRAMBufferSizeMB on an open IndexWriter
[ https://issues.apache.org/jira/browse/LUCENE-2960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13006227#comment-13006227 ] Earwin Burrfoot commented on LUCENE-2960: -

{quote} Why such purity? What do we gain? I'm all for purity, but only if it doesn't interfere w/ functionality. Here, it's taking away freedom... {quote}

We gain consistency and predictability. And there are a lot of freedoms dangerous for developers.

{quote} In fact it should be fine to share an IWC across multiple writers; you can change the RAM buffer for all of them at once. {quote}

You've brought up a purrfect example of how NOT to do things. This is called 'action at a distance' and is a damn bug. A very annoying one. I thoroughly experienced it with the previous major version of Apache HttpClient - they had an API that suggested you could set per-request timeouts, while these were actually global for a single Client instance. I fried my brain trying to understand why the hell random user requests timed out at a hundred times their intended duration. Oh! It was an occasional admin request changing the global.

<irony> You know, you can actually instantiate some DateRangeFilter with a couple of Dates, and then change these Dates (they are writeable) before each request. Isn't it an exciting kind of programming freedom? Or, back to our current discussion - we can pass RAMBufferSizeMB as an AtomicDouble instead of the current double, then use .set() on the instance we passed, and have our live reconfigurability. What's more - AtomicDouble protects us from word tearing! </irony>

bq. I doubt there's any JVM out there where our lack-of-volatile infoStream causes any problems.

Er.. While I have never personally witnessed unsynchronized long/double tearing, I've seen the consequence of unsafely publishing a HashMap - an endless loop on get(). It happened on your run-of-the-mill Sun 1.6 JVM. So the bug is there, lying in wait.
Maybe nobody ever actually used the freedom to change infoStream in-flight, or the guy was lucky, or in his particular situation the field was guarded by some unrelated sync. While I see banishing live reconfiguration from IW as a lost cause, I ask to make IWC immutable at the very least. As Shay said - this will provide a clear barrier between mutable and immutable properties.

Allow (or bring back) the ability to setRAMBufferSizeMB on an open IndexWriter -- Key: LUCENE-2960 URL: https://issues.apache.org/jira/browse/LUCENE-2960 Project: Lucene - Java Issue Type: Improvement Components: Index Reporter: Shay Banon Priority: Blocker Fix For: 3.1, 4.0

-- This message is automatically generated by JIRA.
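The 'action at a distance' trap Earwin describes is easy to demonstrate in miniature: two writers sharing one mutable config object silently see each other's changes. All names below are illustrative; this is not Lucene code.

```java
// Minimal demonstration of the shared-mutable-config trap ("action at a
// distance"): a change intended for one writer leaks into another.
public class SharedConfigTrap {

    static class MutableConfig {
        double ramBufferSizeMB = 16.0;
    }

    static class Writer {
        private final MutableConfig config;
        Writer(MutableConfig config) { this.config = config; }
        double ramBudget() { return config.ramBufferSizeMB; }
    }

    public static void main(String[] args) {
        MutableConfig shared = new MutableConfig();
        Writer a = new Writer(shared);
        Writer b = new Writer(shared);

        shared.ramBufferSizeMB = 256.0; // intended only for writer a...
        // ...but writer b's budget silently changed too:
        System.out.println(a.ramBudget() + " " + b.ramBudget());
    }
}
```

Making the config immutable (final fields, values copied at construction) rules this out by construction, which is exactly the barrier between mutable and immutable properties asked for above.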
[jira] Commented: (LUCENE-2958) WriteLineDocTask improvements
[ https://issues.apache.org/jira/browse/LUCENE-2958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13006251#comment-13006251 ] Shai Erera commented on LUCENE-2958:

A thought -- how about we do the following:

# LineDocSource remains basically as it is today, with a getDocData(String line) extendable method for whoever wants it
# Instead of introducing those Fillers, you create a HeaderLineDocSource which assumes the first line read is the header line, and parses the following ones as appropriate. It will extend LDS, overriding getDocData.

This will not introduce a Filler in LDS, and those who don't care about it don't need to know about it at all. Also, it will showcase the 'extendability' of LDS. Will that be simpler?

WriteLineDocTask improvements - Key: LUCENE-2958 URL: https://issues.apache.org/jira/browse/LUCENE-2958 Project: Lucene - Java Issue Type: Improvement Components: contrib/benchmark Reporter: Doron Cohen Assignee: Doron Cohen Priority: Minor Fix For: 3.2, 4.0 Attachments: LUCENE-2958.patch, LUCENE-2958.patch, LUCENE-2958.patch

Make WriteLineDocTask and LineDocSource more flexible/extendable:
* allow emitting lines also for empty docs (keep current behavior as default)
* allow more/less/other fields

-- This message is automatically generated by JIRA.
[jira] Commented: (LUCENE-2749) Co-occurrence filter
[ https://issues.apache.org/jira/browse/LUCENE-2749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13006254#comment-13006254 ] Steven Rowe commented on LUCENE-2749: - Hi Elmar, I haven't had a chance to do more than an hour or two of work on this, and that was a while back, so please feel free to run with it. You should know, though, that Robert Muir and Yonik Seeley (both Lucene/Solr developers) expressed skepticism (on #lucene IRC) about whether this filter belongs in Lucene itself, because in their experience, collocations are used by non-search software, and they believe that Lucene should remain focused exclusively on search. Robert Muir also thinks that components that support Boolean search (i.e., not ranked search) should go elsewhere. I personally disagree with these restrictions in general, and I think that a co-occurrence filter could directly support search. See this solr-u...@lucene.apache.org mailing list discussion for an example I gave (and one of the reasons I made this issue): http://www.lucidimagination.com/search/document/f69f877e0fa05d17/how_do_i_this_in_solr#d9d5932e7074d356 . In this thread, I described a way to solve the original poster's problem using a co-occurrence filter exactly like the one proposed here. I mention all this to caution you that work you put in here may never be committed to Lucene itself. The mailing list thread I mentioned above describes the main limitations a filter like this will have: combinatoric explosion of generated terms. I haven't figured out how to manage this, but it occurs to me that the two-term-collocation case is less problematic in this regard than the generalized case (whole-field window, all possible combinations). I had a vague implementation conception of incrementing a fixed-width integer to iterate over the combinations, using the integer's bits to include/exclude input terms in the output termset tokens. 
Using a 32-bit integer to track combinations would limit the length of an input token stream to 32 tokens, but in the generalized case of all combinations, I'm pretty sure that the number of bits available would not be the limiting factor, but rather the number of generated terms. I guess the question is how to handle cases that produce fewer terms than all combinations of terms from an input token stream, e.g. the two-term-collocation case, without imposing the restrictions necessary in the generalized case.

Here are a couple of recent information retrieval papers using "termset" to mean an indexed token containing multiple input terms:

* TSS: Efficient Term Set Search in Large Peer-to-Peer Textual Collections http://www.cs.ust.hk/~liu/TSS-TC.pdf
* Termset-based Indexing and Query Processing in P2P Search http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5384831 (Sorry, I couldn't find a free public location for the second paper.)

Co-occurrence filter Key: LUCENE-2749 URL: https://issues.apache.org/jira/browse/LUCENE-2749 Project: Lucene - Java Issue Type: New Feature Components: Analysis Affects Versions: 3.1, 4.0 Reporter: Steven Rowe Priority: Minor Fix For: 4.0

-- This message is automatically generated by JIRA.
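The bitmask-iteration idea sketched in Steven's comment (incrementing an integer and using its bits to include/exclude input terms) can be written down directly. This is purely illustrative, not the filter implementation itself; it just enumerates the term sets a whole-window co-occurrence filter would emit.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of combination enumeration via an incrementing integer mask:
// bit i of the mask decides whether term i of the window is included.
public class TermSetCombinations {

    /** All term sets with size in [minTerms, maxTerms] drawn from the window. */
    static List<List<String>> combinations(List<String> window, int minTerms, int maxTerms) {
        if (window.size() > 31) {
            throw new IllegalArgumentException("window too wide for an int mask");
        }
        List<List<String>> result = new ArrayList<>();
        for (int mask = 1; mask < (1 << window.size()); mask++) {
            int size = Integer.bitCount(mask);
            if (size < minTerms || size > maxTerms) continue;
            List<String> termSet = new ArrayList<>(size);
            for (int i = 0; i < window.size(); i++) {
                if ((mask & (1 << i)) != 0) termSet.add(window.get(i)); // bit i set => include term i
            }
            result.add(termSet); // already in positional order; sort for lexical ordering
        }
        return result;
    }

    public static void main(String[] args) {
        // 2-term co-occurrences within a 3-token window: {a,b}, {a,c}, {b,c}
        System.out.println(combinations(List.of("a", "b", "c"), 2, 2));
    }
}
```

The combinatoric explosion mentioned above is visible in the loop bound: a full-field window of n tokens yields up to 2^n - 1 masks, which is why the generalized case needs limits that the two-term case does not.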
[jira] Updated: (LUCENE-2958) WriteLineDocTask improvements
[ https://issues.apache.org/jira/browse/LUCENE-2958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doron Cohen updated LUCENE-2958: Attachment: LUCENE-2958.patch

bq. Will that be simpler?

It will be simpler, I admit, but it will be harder to manage:

* when re-reading the input file (with repeat=true), special treatment of the header line is needed
* we cannot assume that the header line exists, because there are line files out there without it, which I would not like to force recreating
* the simple LDS as today handles no header line. As such, if there is one, it will wrongly treat it as a regular line. But I would like it to be able to handle both old files (with no header) and new files, with the header.

Mmmm... we could write the header only if it differs from the default header. Perhaps this will work. I'll take a look at that again; meanwhile attaching an updated patch with the two inner DocDataLineReaders.

WriteLineDocTask improvements - Key: LUCENE-2958 URL: https://issues.apache.org/jira/browse/LUCENE-2958 Project: Lucene - Java Issue Type: Improvement Components: contrib/benchmark Reporter: Doron Cohen Assignee: Doron Cohen Priority: Minor Fix For: 3.2, 4.0 Attachments: LUCENE-2958.patch, LUCENE-2958.patch, LUCENE-2958.patch, LUCENE-2958.patch

-- This message is automatically generated by JIRA.
[jira] Commented: (LUCENE-2958) WriteLineDocTask improvements
[ https://issues.apache.org/jira/browse/LUCENE-2958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13006269#comment-13006269 ] Doron Cohen commented on LUCENE-2958:

Rethinking this suggestion, I am afraid it will easily lead users to errors/mistakes - users would have to be aware: did I create that file with a header? Mmm... so I must use the source which handles the header; and that file is with the default settings, so it needs the simple reader; but man, did I set it to create the header anyhow... I don't remember, I'll recreate the file... Maybe some users will remember such things, but I know that I will not remember, and a line reader that correctly handles all inputs out of the box is much more convenient... which is what I liked in the header suggestion.

WriteLineDocTask improvements - Key: LUCENE-2958 URL: https://issues.apache.org/jira/browse/LUCENE-2958 Project: Lucene - Java Issue Type: Improvement Components: contrib/benchmark Reporter: Doron Cohen Assignee: Doron Cohen Priority: Minor Fix For: 3.2, 4.0 Attachments: LUCENE-2958.patch, LUCENE-2958.patch, LUCENE-2958.patch, LUCENE-2958.patch

-- This message is automatically generated by JIRA.
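The "handle both old and new files out of the box" requirement boils down to peeking at the first line: treat it as a field header only when it carries an explicit marker, and otherwise fall back to the default field order. A tiny sketch, with a made-up marker and made-up default fields (the real patch's format may differ):

```java
import java.util.Arrays;
import java.util.List;

// Sketch of header auto-detection for a line-doc file. The HEADER_MARKER
// prefix and DEFAULT_FIELDS layout are hypothetical, for illustration only.
public class LineDocHeaderSketch {

    static final String HEADER_MARKER = "#FIELDS\t";
    static final List<String> DEFAULT_FIELDS = Arrays.asList("title", "date", "body");

    /** Decide the field layout from the first line of the file. */
    static List<String> fieldsFor(String firstLine) {
        if (firstLine.startsWith(HEADER_MARKER)) {
            // New-format file: the header names the fields explicitly.
            return Arrays.asList(firstLine.substring(HEADER_MARKER.length()).split("\t"));
        }
        // Old-format file: the first line is a regular doc line, use defaults.
        return DEFAULT_FIELDS;
    }

    public static void main(String[] args) {
        System.out.println(fieldsFor("#FIELDS\ttitle\tbody"));        // explicit header
        System.out.println(fieldsFor("Some title\t2011\tSome body")); // default layout
    }
}
```

With a reserved marker like this, a single reader handles headerless legacy files and new headed files transparently, which addresses the "did I create that file with a header?" worry.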
Re: Lucene Solr a one way street?
It's good to see this discussion finally happen. Some things are in Solr (e.g. faceting, function queries, Yonik's recent join patch, ...) that probably belong in Lucene. As someone contributing functionality to Lucene/Solr in SOLR-2155 (Geospatial search via geohash prefix techniques), anecdotally I find it most convenient for the code to span both Lucene and Solr, particularly Solr on the testing side. (Of course each patch is different, but this is my experience.) The testing infrastructure on the Solr side is excellent. As I look to implement sorting, it's going to be difficult for a non-Solr user to take the code, since the sorting capability is wrapped up in Solr concepts like function queries. ~ David Smiley - Author: https://www.packtpub.com/solr-1-4-enterprise-search-server/book -- View this message in context: http://lucene.472066.n3.nabble.com/Lucene-Solr-a-one-way-street-tp2672821p2674116.html Sent from the Lucene - Java Developer mailing list archive at Nabble.com.
Re: GPU acceleration
Vectorizable code would be the major target, yes. Do you know if the guys from Yandex were using CUDA or OpenCL approaches? or the old kind of opengl hacks. On Sun, Mar 13, 2011 at 8:31 AM, Earwin Burrfoot ear...@gmail.com wrote: On Sun, Mar 13, 2011 at 00:15, Ken O'Brien k...@kenobrien.org wrote: To clarify, I've not yet written any code. I aim to bring a large speedup to any functionality that is computationally expensive. I'm wondering which components are candidates for this. I'll be looking through the code but if anyone is aware of parallelizable code, I'll start with that. More like 'vectorizable' code, huh? Guys from Yandex use modified group varint encoding plus handcrafted SSE magic to decode/intersect posting lists and claim tremendous speedups over original group varint. They also use SSE to run the decision trees used in ranking. There were experiments with moving both pieces of code to the GPU, and GPU did well in terms of speed, but they say getting data in and out of GPU made the approach unfeasible. I'll basically replicate existing functionality to run on the gpu. On 12/03/11 21:08, Simon Willnauer wrote: On Sat, Mar 12, 2011 at 9:21 PM, Ken O'Brienk...@kenobrien.org wrote: Hi, Is anyone looking at GPU acceleration for Solr? If not, I'd like to contribute code which adds this functionality. As I'm not familiar with the codebase, does anyone know which areas of functionality could benefit from high degrees of parallelism. Very interesting can you elaborate a little more what kind of functionality you exposed / try to expose to the GPU? 
simon Regards, Ken -- Kirill Zakharenko/Кирилл Захаренко E-Mail/Jabber: ear...@gmail.com Phone: +7 (495) 683-567-4 ICQ: 104465785
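For readers unfamiliar with the group varint baseline mentioned in this thread (the scalar scheme the SSE tricks accelerate): one tag byte stores four 2-bit length descriptors, followed by the four values in 1-4 little-endian bytes each. A plain-Java illustration of the format, not Lucene's or Yandex's implementation:

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

// Classic group varint coding of four ints: tag byte + 4..16 value bytes.
public class GroupVarint {

    static byte[] encode(int[] four) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int tag = 0;
        byte[] body = new byte[16];
        int bodyLen = 0;
        for (int i = 0; i < 4; i++) {
            int v = four[i];
            int len = 1;                      // minimal byte length of v
            if ((v >>> 8)  != 0) len = 2;
            if ((v >>> 16) != 0) len = 3;
            if ((v >>> 24) != 0) len = 4;
            tag |= (len - 1) << (i * 2);      // 2 bits per value in the tag byte
            for (int b = 0; b < len; b++) {
                body[bodyLen++] = (byte) (v >>> (8 * b)); // little-endian bytes
            }
        }
        out.write(tag);
        out.write(body, 0, bodyLen);
        return out.toByteArray();
    }

    static int[] decode(byte[] block) {
        int tag = block[0] & 0xFF;
        int[] four = new int[4];
        int pos = 1;
        for (int i = 0; i < 4; i++) {
            int len = ((tag >>> (i * 2)) & 3) + 1; // recover each 2-bit length
            int v = 0;
            for (int b = 0; b < len; b++) {
                v |= (block[pos++] & 0xFF) << (8 * b);
            }
            four[i] = v;
        }
        return four;
    }

    public static void main(String[] args) {
        int[] values = {5, 300, 70000, 1 << 25};
        System.out.println(Arrays.toString(decode(encode(values))));
    }
}
```

The SSE variants discussed above win because the tag byte makes all four value lengths known up front, so a shuffle table indexed by the tag can decode the whole group branch-free; the scalar inner loops here are exactly what gets vectorized away.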
[jira] Commented: (SOLR-1725) Script based UpdateRequestProcessorFactory
[ https://issues.apache.org/jira/browse/SOLR-1725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13006284#comment-13006284 ] David Smiley commented on SOLR-1725:

This capability is awesome, just as its DIH equivalent is. It's been over a year since any activity here. As time passes, the case for moving to Java 6 increases. Short of that, I don't see a problem with this working like the DIH ScriptTransformer does, by using the Java 6 features via reflection.

Script based UpdateRequestProcessorFactory -- Key: SOLR-1725 URL: https://issues.apache.org/jira/browse/SOLR-1725 Project: Solr Issue Type: New Feature Components: update Affects Versions: 1.4 Reporter: Uri Boness Attachments: SOLR-1725.patch, SOLR-1725.patch, SOLR-1725.patch, SOLR-1725.patch, SOLR-1725.patch

A script based UpdateRequestProcessorFactory (uses JDK 6 script engine support). The main goal of this plugin is to be able to configure/write update processors without the need to write and package Java code. The update request processor factory enables writing update processors in scripts located in the {{solr.solr.home}} directory. The factory accepts one (mandatory) configuration parameter named {{scripts}}, which accepts a comma-separated list of file names. It will look for these files under the {{conf}} directory in solr home. When multiple scripts are defined, their execution order is defined by the lexicographical order of the script file names (so {{scriptA.js}} will be executed before {{scriptB.js}}). The script language is resolved based on the script file extension (that is, a *.js file will be treated as a JavaScript script), therefore an extension is mandatory. Each script file is expected to have one or more methods with the same signature as the methods in the {{UpdateRequestProcessor}} interface. It is *not* required to define all methods, only those that are required by the processing logic.
The following variables are defined as global variables for each script:

* {{req}} - the SolrQueryRequest
* {{rsp}} - the SolrQueryResponse
* {{logger}} - a logger that can be used for logging purposes in the script

-- This message is automatically generated by JIRA.
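The JDK 6 script-engine support this factory builds on is the standard javax.script API. A quick illustration of looking up a JavaScript engine and evaluating an expression; note that whether an engine is actually present depends on the JDK (Rhino in 6, Nashorn in 8-14, none bundled in newer JDKs), so the sketch handles its absence.

```java
import javax.script.ScriptEngine;
import javax.script.ScriptEngineManager;

// Minimal javax.script usage sketch; not Solr's factory code.
public class ScriptEngineDemo {

    /** Evaluate a JS expression, or return fallback if no engine is available. */
    static Object evalOrDefault(String expr, Object fallback) {
        // Engine lookup by name; returns null when no JS engine is on this JDK.
        ScriptEngine js = new ScriptEngineManager().getEngineByName("javascript");
        if (js == null) return fallback;
        try {
            return js.eval(expr);
        } catch (Exception e) {
            return fallback;
        }
    }

    public static void main(String[] args) {
        System.out.println(evalOrDefault("1 + 2", "no js engine on this JDK"));
    }
}
```

The factory described above does the same lookup, but resolves the engine from the script file's extension (e.g. `.js`) rather than a hard-coded name.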
[jira] Updated: (SOLR-2347) Use InputStream and not Reader for XML parsing
[ https://issues.apache.org/jira/browse/SOLR-2347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated SOLR-2347: Component/s: contrib - DataImportHandler Fix Version/s: 4.0, 3.2

Use InputStream and not Reader for XML parsing -- Key: SOLR-2347 URL: https://issues.apache.org/jira/browse/SOLR-2347 Project: Solr Issue Type: Bug Components: contrib - DataImportHandler Reporter: Uwe Schindler Assignee: Uwe Schindler Fix For: 3.2, 4.0

Followup to SOLR-96: Solr mostly uses java.io.Reader and passes this Reader to the XML parser. According to the XML spec, an XML file should initially be seen as a binary stream with a default charset of UTF-8, or another charset given by the network protocol (like the Content-Type header in HTTP). But very importantly, this default charset is only a hint to the parser - mandatory is the charset from the XML header processing instruction. Because of this, the parser must be able to change the charset when reading the XML header (possibly also when seeing BOM markers). This is not possible if the XML parser gets a java.io.Reader instead of a java.io.InputStream. SOLR-96 already fixed this for the XmlUpdateRequestHandler and the DocumentAnalysisRequestHandler. This issue should fix the rest to be conforming to the XML spec (open schema.xml and config.xml as InputStream not Reader, and others). This change would not break anything in Solr (perhaps only backwards compatibility in the API), as the default used by XML parsers is UTF-8.

-- This message is automatically generated by JIRA.
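The point of the issue is easy to demonstrate with the standard JAXP API: when the parser is handed raw bytes, it reads the encoding declared in the XML header itself and decodes accordingly; a Reader would have already committed to some charset before the parser ever saw that declaration. A small self-contained demonstration (not Solr code):

```java
import java.io.ByteArrayInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Parsing from an InputStream lets the parser honor the charset declared
// in the XML header, here ISO-8859-1 rather than the UTF-8 default.
public class XmlEncodingDemo {

    static String parseValue(byte[] xmlBytes) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xmlBytes)); // bytes, not a Reader
        return doc.getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        // Declared (and actually used) charset is ISO-8859-1, not UTF-8:
        // the 0xE9 byte for 'é' would be malformed if decoded as UTF-8.
        String xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><v>caf\u00e9</v>";
        byte[] bytes = xml.getBytes("ISO-8859-1");
        System.out.println(parseValue(bytes)); // parser reads the header, decodes correctly
    }
}
```

Had the same bytes been wrapped in an InputStreamReader with a fixed UTF-8 charset, the 'é' would be corrupted before the parser could act on the encoding declaration, which is exactly the class of bug this issue removes from DIH.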
[jira] Commented: (SOLR-2347) Use InputStream and not Reader for XML parsing
[ https://issues.apache.org/jira/browse/SOLR-2347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13006285#comment-13006285 ] Uwe Schindler commented on SOLR-2347:

This now only affects DIH. DIH has to be changed so that all the base classes that open files or read from the network don't use Readers but InputStreams. This is easy to do, but breaks backwards compatibility (which is no problem, as DIH is contrib and experimental).

Use InputStream and not Reader for XML parsing -- Key: SOLR-2347 URL: https://issues.apache.org/jira/browse/SOLR-2347 Project: Solr Issue Type: Bug Components: contrib - DataImportHandler Reporter: Uwe Schindler Assignee: Uwe Schindler Fix For: 3.2, 4.0

-- This message is automatically generated by JIRA.
[jira] Commented: (SOLR-2347) Use InputStream and not Reader for XML parsing
[ https://issues.apache.org/jira/browse/SOLR-2347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13006286#comment-13006286 ] Mark Miller commented on SOLR-2347: --- bq. which is no problem as DIH is contrib and experimental +1 Use InputStream and not Reader for XML parsing -- Key: SOLR-2347 URL: https://issues.apache.org/jira/browse/SOLR-2347 Project: Solr Issue Type: Bug Components: contrib - DataImportHandler Reporter: Uwe Schindler Assignee: Uwe Schindler Fix For: 3.2, 4.0
[jira] Created: (SOLR-2425) firstSearcher Listener of SpellChecker can be never called
firstSearcher Listener of SpellChecker can be never called -- Key: SOLR-2425 URL: https://issues.apache.org/jira/browse/SOLR-2425 Project: Solr Issue Type: Bug Components: spellchecker Affects Versions: 1.4.1, 3.1, 4.0 Reporter: Koji Sekiguchi Priority: Minor mail thread: http://www.lucidimagination.com/search/document/65e73468958faf09/known_problem_firstsearcher_event_of_spellchecker_is_never_called The firstSearcher listener of SpellChecker is never called when there is no <listener event="firstSearcher"/> registered in solrconfig.xml. The reason is the sequence of procedures in the SolrCore constructor:
# initListeners();
# getSearcher(false,false,null); => registers the (general) firstSearcher listener if it exists
# call SolrCoreAware.inform(); => registers SpellChecker's firstSearcher listener
After that, Callable.call() is called to execute the firstSearcher event:
{code}
if (currSearcher==null && firstSearcherListeners.size() > 0) {
  future = searcherExecutor.submit(
      new Callable() {
        public Object call() throws Exception {
          try {
            for (SolrEventListener listener : firstSearcherListeners) {
              listener.newSearcher(newSearcher,null);
            }
          } catch (Throwable e) {
            SolrException.logOnce(log,null,e);
          }
          return null;
        }
      }
  );
}
{code}
At that time, firstSearcherListeners includes SpellChecker's firstSearcherListener, registered by procedure 3 above. But if you have no <listener event="firstSearcher"/> registered in solrconfig.xml, then at procedure 2, searcherExecutor.submit() is never called, because firstSearcherListeners.size() is zero at that moment.
[jira] Commented: (SOLR-2425) firstSearcher Listener of SpellChecker can be never called
[ https://issues.apache.org/jira/browse/SOLR-2425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13006313#comment-13006313 ] Koji Sekiguchi commented on SOLR-2425: -- bq. At that time, firstSearcherListeners includes SpellChecker's firstSearcherListener, registered by procedure 3 above. But if you have no <listener event="firstSearcher"/> registered in solrconfig.xml, then at procedure 2, searcherExecutor.submit() is never called, because firstSearcherListeners.size() is zero at that moment. This was a bit misleading. I think there is a timing issue. Regardless of the existence of <listener event="firstSearcher"/> in solrconfig.xml, SpellChecker's firstSearcher listener can never be called, because Callable.call() can be called before SolrCoreAware.inform() is executed. firstSearcher Listener of SpellChecker can be never called -- Key: SOLR-2425 URL: https://issues.apache.org/jira/browse/SOLR-2425 Project: Solr Issue Type: Bug Components: spellchecker Affects Versions: 1.4.1, 3.1, 4.0 Reporter: Koji Sekiguchi Priority: Minor
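The ordering problem described in the issue can be reduced to a minimal sketch (the class, list, and listener names here are invented stand-ins for SolrCore's internals, not the real Solr code): the guard that schedules the firstSearcher event checks the listener list *before* SolrCoreAware.inform() has had a chance to register SpellChecker's listener.

```java
import java.util.ArrayList;
import java.util.List;

public class FirstSearcherRaceDemo {

    // Returns the events that actually fired under the described ordering.
    public static List<String> runScenario() {
        List<String> fired = new ArrayList<>();
        List<Runnable> firstSearcherListeners = new ArrayList<>();

        // Step 1: initListeners() -- no <listener event="firstSearcher"/>
        // configured in solrconfig.xml, so the list stays empty.

        // Step 2: getSearcher() -- the guard checks the list size *now*;
        // the list is empty, so the firstSearcher event is never scheduled.
        if (firstSearcherListeners.size() > 0) {
            for (Runnable l : firstSearcherListeners) l.run();
        }

        // Step 3: SolrCoreAware.inform() -- SpellChecker registers its
        // listener only now, after the scheduling decision was already made.
        firstSearcherListeners.add(() -> fired.add("spellchecker"));

        return fired; // empty: the SpellChecker listener never fires
    }

    public static void main(String[] args) {
        System.out.println(runScenario()); // prints []
    }
}
```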
Re: GPU acceleration
There were experiments with moving both pieces of code to the GPU, and GPU did well in terms of speed, but they say getting data in and out of GPU made the approach unfeasible. That would be the problem with GPUs :) Co-processors always have this problem. GPUs were designed for 3D video games: the data flow into the GPU is low-volume (textures, triangles) and the data flow out is (I assume) pretty minimal. 2011/3/13 k...@kenobrien.org k...@kenobrien.org: Vectorizable code would be the major target, yes. Do you know if the guys from Yandex were using CUDA or OpenCL approaches, or the old kind of OpenGL hacks? On Sun, Mar 13, 2011 at 8:31 AM, Earwin Burrfoot ear...@gmail.com wrote: On Sun, Mar 13, 2011 at 00:15, Ken O'Brien k...@kenobrien.org wrote: To clarify, I've not yet written any code. I aim to bring a large speedup to any functionality that is computationally expensive. I'm wondering which components are candidates for this. I'll be looking through the code, but if anyone is aware of parallelizable code, I'll start with that. More like 'vectorizable' code, huh? Guys from Yandex use modified group varint encoding plus handcrafted SSE magic to decode/intersect posting lists and claim tremendous speedups over original group varint. They also use SSE to run the decision trees used in ranking. There were experiments with moving both pieces of code to the GPU, and GPU did well in terms of speed, but they say getting data in and out of GPU made the approach unfeasible. I'll basically replicate existing functionality to run on the GPU. On 12/03/11 21:08, Simon Willnauer wrote: On Sat, Mar 12, 2011 at 9:21 PM, Ken O'Brien k...@kenobrien.org wrote: Hi, Is anyone looking at GPU acceleration for Solr? If not, I'd like to contribute code which adds this functionality. As I'm not familiar with the codebase, does anyone know which areas of functionality could benefit from high degrees of parallelism? Very interesting, can you elaborate a little more on what kind of functionality you try to expose to the GPU? simon Regards, Ken -- Kirill Zakharenko/Кирилл Захаренко E-Mail/Jabber: ear...@gmail.com Phone: +7 (495) 683-567-4 ICQ: 104465785 -- Lance Norskog goks...@gmail.com
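For readers unfamiliar with the "group varint" encoding Earwin mentions, here is a plain-Java sketch of the standard scheme (this is the unmodified textbook variant; Yandex's modifications and the SSE-accelerated decoding are not reproduced here). Four integers share one tag byte, with two bits per value giving its byte length minus one, followed by each value's bytes little-endian:

```java
public class GroupVarint {

    // Encode 4 ints at out[pos..]: one tag byte (2 bits per value =
    // byte length - 1), then each value's bytes, little-endian.
    // Returns the position just past the encoded group.
    public static int encode(int[] vals, byte[] out, int pos) {
        int tagPos = pos++;
        int tag = 0;
        for (int i = 0; i < 4; i++) {
            int v = vals[i];
            // Bytes needed to hold v (1..4); v|1 avoids nlz(0) == 32.
            int len = (32 - Integer.numberOfLeadingZeros(v | 1) + 7) / 8;
            tag |= (len - 1) << (2 * i);
            for (int b = 0; b < len; b++) {
                out[pos++] = (byte) v;
                v >>>= 8;
            }
        }
        out[tagPos] = (byte) tag;
        return pos;
    }

    // Decode 4 ints from in[pos..] into vals; returns the end position.
    // A branch-free SSE shuffle can do this in a few instructions, which
    // is where the reported speedups over plain varint come from.
    public static int decode(byte[] in, int pos, int[] vals) {
        int tag = in[pos++] & 0xFF;
        for (int i = 0; i < 4; i++) {
            int len = ((tag >>> (2 * i)) & 3) + 1;
            int v = 0;
            for (int b = 0; b < len; b++) {
                v |= (in[pos++] & 0xFF) << (8 * b);
            }
            vals[i] = v;
        }
        return pos;
    }
}
```

The single tag byte is what makes the format vectorizable: unlike classic varint, the lengths of all four values are known before any value byte is touched, so there are no data-dependent branches in the decode loop.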
Re: GPU acceleration
I see. I will conduct my own experiments with more recent GPU libraries. Thanks for everyone's input. On Mar 14, 2011 2:30 AM, Lance Norskog goks...@gmail.com wrote: There were experiments with moving both pieces of code to the GPU, and GPU did well in terms of speed, but they say getting data in and out of GPU made the approach unfeasible. That would be the problem with GPUs :) Co-processors always have this problem. GPUs were designed for 3D video games: the data flow into the GPU is low-volume (textures, triangles) and the data flow out is (I assume) pretty minimal. -- Lance Norskog goks...@gmail.com
[jira] Commented: (SOLR-2337) Solr needs hits= added to the log when using grouping
[ https://issues.apache.org/jira/browse/SOLR-2337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13006330#comment-13006330 ] Bill Bell commented on SOLR-2337: - The only downfall with this patch is what hits= means: 1. if there are 20 rows that match and are grouped, and rows=10, then hits will be 10 (since the short circuit is still obeyed). Otherwise there could be a performance issue with looping until the end of the results. 2. if there are 4 rows that match and are grouped, and you say rows=10, hits will be 4 (normal behavior). Solr needs hits= added to the log when using grouping -- Key: SOLR-2337 URL: https://issues.apache.org/jira/browse/SOLR-2337 Project: Solr Issue Type: Bug Components: SearchComponents - other Affects Versions: 4.0 Reporter: Bill Bell Fix For: 4.0 Attachments: SOLR.2337.patch We monitor the Solr logs to try to review queries that have hits=0. This enables us to improve relevancy, since such queries are easy to find and review. When using group=true, hits= does not show up: {code} 2011-01-27 01:10:16,117 INFO core.SolrCore - [collection1] webapp= path=/select params={group=true&group.field=gender&group.field=id&q=*:*} status=0 QTime=15 {code} The code in QueryComponent.java needs to add up the matches() totals after calling grouping.execute(). It does return hits= in the log for mainResult, but not for standard grouping. This should be easy to add since matches are defined...
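The fix the issue asks for amounts to summing the per-command match counts after grouping runs. A minimal sketch of that idea (the GroupCommand interface and method names here are hypothetical stand-ins, not the actual QueryComponent/Grouping API):

```java
import java.util.List;

public class GroupedHitsDemo {

    // Hypothetical stand-in for one grouping command's result; the real
    // classes inside Solr's grouping code may be named differently.
    public interface GroupCommand {
        int getMatches();
    }

    // Sum matches across all grouping commands to produce the hits=
    // value for the request log line.
    public static long totalHits(List<GroupCommand> commands) {
        long total = 0;
        for (GroupCommand c : commands) {
            total += c.getMatches();
        }
        return total;
    }
}
```

Per Bill Bell's comment above, when the collector short-circuits at rows, this sum reflects the short-circuited count rather than the full match count, trading exactness for avoiding a loop to the end of the results.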