[ANNOUNCE] Apache PyLucene 2.9.4 and 3.0.3
I am pleased to announce the availability of the Apache PyLucene 2.9.4 and 3.0.3 releases. Apache PyLucene, a subproject of Apache Lucene, is a Python extension for accessing Java Lucene. Its goal is to allow you to use Lucene's text indexing and searching capabilities from Python. It is API compatible with the latest versions of Java Lucene, 2.9.4 and 3.0.3. This release contains a number of bug fixes and improvements. Details can be found in the changes files:

http://svn.apache.org/repos/asf/lucene/pylucene/tags/pylucene_2_9_4/CHANGES
http://svn.apache.org/repos/asf/lucene/pylucene/tags/pylucene_3_0_3/CHANGES
http://svn.apache.org/repos/asf/lucene/pylucene/trunk/jcc/CHANGES

Apache PyLucene is available from the following download pages:

http://www.apache.org/dyn/closer.cgi/lucene/pylucene/pylucene-2.9.4-1-src.tar.gz
http://www.apache.org/dyn/closer.cgi/lucene/pylucene/pylucene-3.0.3-1-src.tar.gz

When downloading from a mirror site, please remember to verify the downloads using signatures found on the Apache site:

http://www.apache.org/dist/lucene/pylucene/KEYS

For more information on Apache PyLucene, visit the project home page:

http://lucene.apache.org/pylucene

Andi..
Solr-trunk - Build # 1344 - Failure
Build: https://hudson.apache.org/hudson/job/Solr-trunk/1344/ All tests passed Build Log (for compile errors): [...truncated 20199 lines...] - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
LogMergePolicy.setUseCompoundFile/DocStore
Hi I find it very annoying that I need to set true/false on both of these methods whenever I want to control compound file creation. Is it really necessary to allow writing the doc stores as non-compound files while the other index files go into a compound file? Does somebody know if this feature is used anywhere? If it's crucial to keep the two methods, then how about introducing a setCompoundMode(true/false) to turn both on and off at once? IndexWriter used to have that, before we switched to IndexWriterConfig, and I think it was very useful. Shai
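A sketch of the proposed convenience as a caller-side helper (setCompoundMode itself is hypothetical; the two setters it wraps do exist on LogMergePolicy in 3.x/trunk):
{code}
// hedged sketch: wrap both flags until/unless a real setCompoundMode() is added
static void setCompoundMode(LogMergePolicy policy, boolean compound) {
  policy.setUseCompoundFile(compound);      // compound file for the "regular" index files
  policy.setUseCompoundDocStore(compound);  // compound file for stored fields / term vectors
}
{code}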
Re: LogMergePolicy.setUseCompoundFile/DocStore
Incoming LUCENE-2814 drops setUseCompoundDocStore() On Thu, Dec 16, 2010 at 12:04, Shai Erera ser...@gmail.com wrote: Hi I find it very annoying that I need to set true/false on these methods whenever I want to control compound files creation. Is it really necessary to allow writing doc stores in non compound files vs. the other index files in a compound file? Does somebody know if this feature is used somewhere? If it's crucial to keep the two methods, then how about introducing a setCompoundMode(true/false) to turn on/off both at once? IndexWriter used to have it, before we switched to IndexWriterConfig and I think it was very useful. Shai -- Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com) Phone: +7 (495) 683-567-4 ICQ: 104465785 - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: strange problem of PForDelta decoder
hi Michael, lucene 4 has so many changes that I don't know how to index and search with a specified codec. Could you please give me some code snippets that use the PFor codec so I can trace the code? In your blog http://chbits.blogspot.com/2010/08/lucene-performance-with-pfordelta-codec.html you said "The AND query, curiously, got faster; I think this is because I modified its scoring to first try seeking within the block of decoded ints." I am also curious about that result, because VINT only needs to decode part of the doc list while PFor needs to decode the whole block. But I think with conjunction queries the main time is spent searching the skip list. I haven't read your code yet, but I guess the skip list for VINT and the skip list for PFor are different. E.g. lucene 2.9's default skipInterval is 16, so it looks like

level 1: 256
level 0: 16 32 48 64 80 96 112 128 ... 256

When we need skipTo(60) we have to read 0 16 32 48 64 in level 0. But when using blocks, e.g. a block size of 128, my implementation of the skip list is

level 1: 256
level 0: 128 256

so when we skipTo(60) we only read 2 items in level 0 and decode the first block, which contains 128 docIDs. How do you implement bulk read? I did it like this: I decode a block and cache it in an int array. I think I can buffer up to 100K docIDs and tfs for disjunction queries (it costs less than 1MB of memory per term).

SegmentTermDocs.read(final int[] docs, final int[] freqs)
...
while (i < length && count < df) {
  if (curBlockIdx >= curBlockSize) {
    // this condition is often false, we may optimize it; but the JVM hotspot will cache hot code, so ...
    int idBlockBytes = freqStream.readVInt();
    curBlockIdx = 0;
    for (int k = 0; k < idBlockBytes; k++) {
      buffer[k] = freqStream.readInt();
    }
    blockIds = code.decode(buffer, idBlockBytes);
    curBlockSize = blockIds.length;
    int tfBlockBytes = freqStream.readVInt();
    for (int k = 0; k < tfBlockBytes; k++) {
      buffer[k] = freqStream.readInt();
    }
    blockTfs = code.decode(buffer, tfBlockBytes);
    assert curBlockSize == blockTfs.length;
  }
  freq = blockTfs[curBlockIdx];
  doc += blockIds[curBlockIdx++];
  count++;
  if (deletedDocs == null || !deletedDocs.get(doc)) {
    docs[i] = doc;
    freqs[i] = freq;
    ++i;
  }
}

2010/12/15 Michael McCandless luc...@mikemccandless.com: Hi Li Li, That issue has such a big patch, and enough of us are now iterating on it, that we cut a dedicated branch for it. But note that this branch is off of trunk (to be 4.0). You should be able to do this: svn checkout https://svn.apache.org/repos/asf/lucene/dev/branches/bulkpostings And then run things in there. I just committed FOR/PFOR prototype codecs from LUCENE-1410 onto that branch, so eg you can run unit tests using those codecs by running ant test -Dtests.codec=PatchedFrameOfRef. Please post patches back if you improve things! We need all the help we can get :) Mike On Wed, Dec 15, 2010 at 5:54 AM, Li Li fancye...@gmail.com wrote: hi Michael you posted a patch here https://issues.apache.org/jira/browse/LUCENE-2723 I am not familiar with patches. Do I need to download LUCENE-2723.patch (there are many patches with this name - do I need the latest one?) and LUCENE-2723_termscorer.patch and apply them (patch -p1 < LUCENE-2723.patch)? I just checked out the latest source code from http://svn.apache.org/repos/asf/lucene/dev/trunk/lucene 2010/12/14 Michael McCandless luc...@mikemccandless.com: Likely you are seeing the startup cost of hotspot compiling the PFOR code? Ie, does your test first warmup the JRE and then do the real test?
I've also found that running -Xbatch produces more consistent results from run to run, however, those results may not be as fast as running w/o -Xbatch. Also, it's better to test on actual data (ie a Lucene index's postings), and in the full context of searching, because then we get a sense of what speedups a real app will see... micro-benching is nearly
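To make the warmup point concrete, a minimal timing skeleton (runAllQueries, searcher and queries are illustrative names, not from any patch):
{code}
// warm up so hotspot compiles the decode loops before anything is timed
for (int warm = 0; warm < 3; warm++) {
  runAllQueries(searcher, queries);   // hypothetical helper that runs the full query set
}
long t0 = System.nanoTime();
runAllQueries(searcher, queries);     // timed run
long elapsedMs = (System.nanoTime() - t0) / 1000000;
System.out.println("elapsed: " + elapsedMs + " ms");
{code}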
Re: LogMergePolicy.setUseCompoundFile/DocStore
Ok perfect ! Shai On Thu, Dec 16, 2010 at 11:23 AM, Earwin Burrfoot ear...@gmail.com wrote: Incoming LUCENE-2814 drops setUseCompoundDocStore() On Thu, Dec 16, 2010 at 12:04, Shai Erera ser...@gmail.com wrote: Hi I find it very annoying that I need to set true/false on these methods whenever I want to control compound files creation. Is it really necessary to allow writing doc stores in non compound files vs. the other index files in a compound file? Does somebody know if this feature is used somewhere? If it's crucial to keep the two methods, then how about introducing a setCompoundMode(true/false) to turn on/off both at once? IndexWriter used to have it, before we switched to IndexWriterConfig and I think it was very useful. Shai -- Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com) Phone: +7 (495) 683-567-4 ICQ: 104465785 - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Updated: (SOLR-1993) SolrJ binary update erro when commitWithin is set.
[ https://issues.apache.org/jira/browse/SOLR-1993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Valyanskiy updated SOLR-1993: --- Attachment: SOLR-1993-1.4.patch Patch for Solr 1.4 branch SolrJ binary update erro when commitWithin is set. -- Key: SOLR-1993 URL: https://issues.apache.org/jira/browse/SOLR-1993 Project: Solr Issue Type: Bug Components: clients - java Affects Versions: 1.4, 1.4.1 Reporter: Phil Bingley Priority: Minor Fix For: 3.1, 4.0 Attachments: SOLR-1993-1.4.patch, SOLR-1993.patch, SOLR-1993.patch, SolrExampleBinaryTest.java Solr server is unable to unmarshall a binary update request where the commitWithin property is set on the UpdateRequest class. The client marshalls the request with the following code if (updateRequest.getCommitWithin() != -1) { params.add(commitWithin, updateRequest.getCommitWithin()); } The property is an int and when the server unmarshalls, the following error happens (can't cast to ListString due to an Integer element) SEVERE: java.lang.ClassCastException: java.lang.Integer cannot be cast to java.util.List at org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec.namedListToSolrParams(JavaBinUpdateRequestCodec.java:213) at org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec.access$100(JavaBinUpdateRequestCodec.java:40) at org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$2.readOuterMostDocIterator(JavaBinUpdateRequestCodec.java:131) at org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$2.readIterator(JavaBinUpdateRequestCodec.java:126) at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:210) at org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$2.readNamedList(JavaBinUpdateRequestCodec.java:112) at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:175) at org.apache.solr.common.util.JavaBinCodec.unmarshal(JavaBinCodec.java:101) at org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec.unmarshal(JavaBinUpdateRequestCodec.java:141) at org.apache.solr.handler.BinaryUpdateRequestHandler.parseAndLoadDocs(BinaryUpdateRequestHandler.java:68) at org.apache.solr.handler.BinaryUpdateRequestHandler.access$000(BinaryUpdateRequestHandler.java:46) at org.apache.solr.handler.BinaryUpdateRequestHandler$1.load(BinaryUpdateRequestHandler.java:55) at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54) at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131) at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316) at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241) at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235) at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206) at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233) at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191) at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:128) at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102) at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:567) at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109) at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:293) at 
org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:849) at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:583) at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:454) at java.lang.Thread.run(Thread.java:619) Workaround is to set the parameter manually as a string value instead of setting using the property on the UpdateRequest class. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
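A hedged sketch of the workaround mentioned above, on the client side (setParam is assumed to be available on the SolrJ UpdateRequest/AbstractUpdateRequest of the affected versions; doc and solrServer are placeholders):
{code}
UpdateRequest req = new UpdateRequest();
req.add(doc);
// workaround: pass commitWithin as a plain string parameter instead of
// calling setCommitWithin(int), which triggers the unmarshalling error above
req.setParam("commitWithin", "5000");
req.process(solrServer);
{code}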
[jira] Commented: (SOLR-1993) SolrJ binary update erro when commitWithin is set.
[ https://issues.apache.org/jira/browse/SOLR-1993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12972036#action_12972036 ] Maxim Valyanskiy commented on SOLR-1993: I ported this patch to 1.4 branch and test it in my application. 5 min test passed without any problems, commitWithin parametes works as excepted. Is it possible to include this patch in 1.4.2? SolrJ binary update erro when commitWithin is set. -- Key: SOLR-1993 URL: https://issues.apache.org/jira/browse/SOLR-1993 Project: Solr Issue Type: Bug Components: clients - java Affects Versions: 1.4, 1.4.1 Reporter: Phil Bingley Priority: Minor Fix For: 3.1, 4.0 Attachments: SOLR-1993-1.4.patch, SOLR-1993.patch, SOLR-1993.patch, SolrExampleBinaryTest.java Solr server is unable to unmarshall a binary update request where the commitWithin property is set on the UpdateRequest class. The client marshalls the request with the following code if (updateRequest.getCommitWithin() != -1) { params.add(commitWithin, updateRequest.getCommitWithin()); } The property is an int and when the server unmarshalls, the following error happens (can't cast to ListString due to an Integer element) SEVERE: java.lang.ClassCastException: java.lang.Integer cannot be cast to java.util.List at org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec.namedListToSolrParams(JavaBinUpdateRequestCodec.java:213) at org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec.access$100(JavaBinUpdateRequestCodec.java:40) at org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$2.readOuterMostDocIterator(JavaBinUpdateRequestCodec.java:131) at org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$2.readIterator(JavaBinUpdateRequestCodec.java:126) at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:210) at org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$2.readNamedList(JavaBinUpdateRequestCodec.java:112) at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:175) at org.apache.solr.common.util.JavaBinCodec.unmarshal(JavaBinCodec.java:101) at org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec.unmarshal(JavaBinUpdateRequestCodec.java:141) at org.apache.solr.handler.BinaryUpdateRequestHandler.parseAndLoadDocs(BinaryUpdateRequestHandler.java:68) at org.apache.solr.handler.BinaryUpdateRequestHandler.access$000(BinaryUpdateRequestHandler.java:46) at org.apache.solr.handler.BinaryUpdateRequestHandler$1.load(BinaryUpdateRequestHandler.java:55) at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54) at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131) at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316) at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241) at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235) at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206) at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233) at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191) at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:128) at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102) at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:567) at 
org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109) at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:293) at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:849) at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:583) at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:454) at java.lang.Thread.run(Thread.java:619) Workaround is to set the parameter manually as a string value instead of setting using the property on the UpdateRequest class. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail:
Re: strange problem of PForDelta decoder
On the bulkpostings branch you can do something like this: CodecProvider cp = new CodecProvider(); cp.register(new PatchedFrameOfRefCodec()); cp.setDefaultFieldCodec(PatchedFrameOfRef); Then whenever you create an IW or IR, use the advanced method that accepts a CodecProvider. Then the index will always use PForDelta to write/read. I suspect conjunction queries got faster because we no longer skip if the docID we seek is already in the current buffer (currently sized 64). Ie, skip is very costly when the target isn't far. This was sort of an accidental byproduct of forcing even conjunction queries using Standard (vInt) codec to work on block buffers, but I think it's an important opto that we should more generally apply. Skipping for block codecs and Standard/vInt are done w/ the same class now. It's just that the block codec must store the long filePointer where the block starts *and* the int offset into the block, vs Standard codec that just stores a filePointer. On how do we implement bulk read this is the core change on the bulkpostings branch -- we have a new API to separately bulk-read docDeltas, freqs, positionDeltas. But we are rapidly iterating on improving this (and getting to a clean PFor/For impl) now... Mike On Thu, Dec 16, 2010 at 4:29 AM, Li Li fancye...@gmail.com wrote: hi Michael, lucene 4 has so much changes that I don't know how to index and search with specified codec. could you please give me some code snipplets that using PFor codec so I can trace the codes. in you blog http://chbits.blogspot.com/2010/08/lucene-performance-with-pfordelta-codec.html you said The AND query, curiously, got faster; I think this is because I modified its scoring to first try seeking within the block of decoded ints. I am also curious about the result because VINT only need decode part of the doclist while PFor need decode the whole block. But I think with conjuction queries, the main time is used for searching in skiplist. I haven't read your codes yet. But I guess the skiplist for VINT and the skiplist for PFor is different. e.g. lucene 2.9's default skipInterval is 16, so it like level 1 256 level 0 16 32 48 64 80 96 112 128 ... 256 when need skipTo(60) we need read 0 16 32 48 64 in level0 but when use block, e.g. block size is 128, my implementation of skiplist is level 1 256 level 0 128 256 when skipTo(60) we only read 2 item in level0 and decode the first block which contains 128 docIDs How do you implement bulk read? I did like this: I decode a block and cache it in a int array. I think I can buffer up to 100K docIDs and tfs for disjuction queries(it cost less than 1MB memory for each term) SegmentTermDocs.read(final int[] docs, final int[] freqs) ... while (i length count df) { if (curBlockIdx = curBlockSize) { //this condition is often false, we may optimize it. but JVM hotspots will cache hot codes. So ... 
int idBlockBytes = freqStream.readVInt(); curBlockIdx = 0; for (int k = 0; k idBlockBytes; k++) { buffer[k] = freqStream.readInt(); } blockIds = code.decode(buffer,idBlockBytes); curBlockSize = blockIds.length; int tfBlockBytes = freqStream.readVInt(); for (int k = 0; k tfBlockBytes; k++) { buffer[k] = freqStream.readInt(); } blockTfs = code.decode(buffer, tfBlockBytes); assert curBlockSize == decoded.length; } freq = blockTfs[curBlockIdx]; doc += blockIds[curBlockIdx++]; count++; if (deletedDocs == null || !deletedDocs.get(doc)) { docs[i] = doc; freqs[i] = freq; ++i; } } 2010/12/15 Michael McCandless luc...@mikemccandless.com: Hi Li Li, That issue has such a big patch, and enough of us are now iterating on it, that we cut a dedicated branch for it. But note that this branch is off of trunk (to be 4.0). You should be able to do this: svn checkout https://svn.apache.org/repos/asf/lucene/dev/branches/bulkpostings And then run things in there. I just committed FOR/PFOR prototype
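Putting Mike's snippet together into a fuller sketch (assumes the bulkpostings branch as of this thread; note the codec name is passed as a String, and the setCodecProvider hook on IndexWriterConfig is my assumption of how the provider is wired in):
{code}
CodecProvider cp = new CodecProvider();
cp.register(new PatchedFrameOfRefCodec());
cp.setDefaultFieldCodec("PatchedFrameOfRef");

IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_40, analyzer);
iwc.setCodecProvider(cp);                      // assumed setter on the branch
IndexWriter writer = new IndexWriter(dir, iwc);
// ... add documents, commit, close ...

// readers must then be opened with the IndexReader.open(...) overload that
// accepts the same CodecProvider (exact signature varies on the branch)
{code}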
[jira] Commented: (LUCENE-2815) MultiFields not thread safe
[ https://issues.apache.org/jira/browse/LUCENE-2815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12972057#action_12972057 ] Michael McCandless commented on LUCENE-2815: Ugh, nice finds Yonik! We should fix these. Maybe MultiFields should just pre-build its MapString,Term on init? You're right, we do reuse MultiFields today (we stuff the instance of MultiFields onto the IndexReader with IndexReader.store/retrieveFields), but I wonder whether we really should? (In fact I thought at one point we decided to stop doing that... yet, we still are... can't remember the details; maybe perf hit was too high eg for MTQs/Solr facets/etc.). What do we need to do to make the publication safe? Is making IR.store/retrieveFields sync'd sufficient? Aside: Java concurrency is a *mess*. I understand why JMM is needed, to get good perf on modern CPUs, but allowing the low level CPU cache coherency requirements to bubble all the way up to complex requirements in the language itself, is a disaster. MultiFields not thread safe --- Key: LUCENE-2815 URL: https://issues.apache.org/jira/browse/LUCENE-2815 Project: Lucene - Java Issue Type: Bug Affects Versions: 4.0 Reporter: Yonik Seeley MultiFields looks like it has thread safety issues -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-2814) stop writing shared doc stores across segments
[ https://issues.apache.org/jira/browse/LUCENE-2814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Earwin Burrfoot updated LUCENE-2814: Attachment: LUCENE-2814.patch First iteration. Passes all tests except TestNRTThreads. Something to do with numDocsInStore and numDocsInRam merged together? Lots of non-critical nocommits (just markers for places I'd like to recheck). DW.docStoreEnabled and *.closeDocStore() have to go, before committing stop writing shared doc stores across segments -- Key: LUCENE-2814 URL: https://issues.apache.org/jira/browse/LUCENE-2814 Project: Lucene - Java Issue Type: Improvement Components: Index Affects Versions: 3.1, 4.0 Reporter: Michael McCandless Assignee: Michael McCandless Attachments: LUCENE-2814.patch Shared doc stores enables the files for stored fields and term vectors to be shared across multiple segments. We've had this optimization since 2.1 I think. It works best against a new index, where you open an IW, add lots of docs, and then close it. In that case all of the written segments will reference slices a single shared doc store segment. This was a good optimization because it means we never need to merge these files. But, when you open another IW on that index, it writes a new set of doc stores, and then whenever merges take place across doc stores, they must now be merged. However, since we switched to shared doc stores, there have been two optimizations for merging the stores. First, we now bulk-copy the bytes in these files if the field name/number assignment is congruent. Second, we now force congruent field name/number mapping in IndexWriter. This means this optimization is much less potent than it used to be. Furthermore, the optimization adds *a lot* of hair to IndexWriter/DocumentsWriter; this has been the source of sneaky bugs over time, and causes odd behavior like a merge possibly forcing a flush when it starts. Finally, with DWPT (LUCENE-2324), which gets us truly concurrent flushing, we can no longer share doc stores. So, I think we should turn off the write-side of shared doc stores to pave the path for DWPT to land on trunk and simplify IW/DW. We still must support reading them (until 5.0), but the read side is far less hairy. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Updated: (SOLR-2259) Improve analyzer/version handling in Solr
[ https://issues.apache.org/jira/browse/SOLR-2259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated SOLR-2259: -- Attachment: SOLR-2259part2.patch here is a patch for branch_3x for part 2. it warns if you are missing the luceneMatchVersion param in your config, informing you that its emulating Lucene 2.4 and that this emulation is deprecated, and that this parameter will be mandatory in 4.0 Improve analyzer/version handling in Solr - Key: SOLR-2259 URL: https://issues.apache.org/jira/browse/SOLR-2259 Project: Solr Issue Type: Task Reporter: Robert Muir Fix For: 3.1, 4.0 Attachments: SOLR-2259.patch, SOLR-2259.patch, SOLR-2259part2.patch We added Version for backwards compatibility support in Lucene. We use this to fire deprecated code to emulate old version to ensure index backwards compat. Related: we deprecate old analysis components and eventually remove them. To hook into Solr, at first it defaulted to Version 2.4 emulation everywhere, with the example having the latest. if you don't specify a version in your solrconfig, it defaults to 2.4 though. However, as of LUCENE-2781 2.4 is removed: but users with old configs that don't specify a version should not be silently upgraded to the Version 3.0 emulation... this is bad. Additionally, when users are using deprecated emulation or using deprecated factories they might not know it, and it might come as a surprise if they upgrade, especially if they arent looking at java apis or java code. I propose: # in trunk: we make the solrconfig luceneMatchVersion mandatory. This is simple: Uwe already has a method that will error out if its not present, we just use that. # in 3.x: we warn if you don't specify luceneMatchVersion in solrconfig: telling you that its going to be required in 4.0 and that you are defaulting to 2.4 emulation. For example: Warning: luceneMatchVersion is not specified in solrconfig.xml. Defaulting to 2.4 emulation. You should at some point declare and reindex to at least 3.0, because 2.4 emulation is deprecated in 3.x and will be removed in 4.0. This parameter will be mandatory in 4.0. # in 3.x,trunk: we warn if you are using a deprecated matchVersion constant somewhere in general, even for a specific tokenizer, telling you that you need to at some point reindex with a current version before you can move to the next release. For example: Warning: you are using 2.4 emulation, at some point you need to bump and reindex to at least 3.0, because 2.4 emulation is deprecated in 3.x and will be removed in 4.0 # in 3.x,trunk: we warn if you are using a deprecated TokenStreamFactory so that you know its going to be removed. For example: Warning: the ISOLatin1FilterFactory is deprecated and will be removed in the next release. You should migrate to ASCIIFoldingFilterFactory. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
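A rough sketch of the proposed warning, just to illustrate the idea (logger and config-field names are assumed, not taken from the actual patch):
{code}
// warn when the configured luceneMatchVersion still maps to deprecated emulation
if (!luceneMatchVersion.onOrAfter(Version.LUCENE_30)) {
  log.warn("You are using " + luceneMatchVersion + " emulation; it is deprecated in 3.x"
      + " and will be removed in 4.0. Bump luceneMatchVersion and reindex.");
}
{code}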
[jira] Updated: (SOLR-2288) clean up compiler warnings
[ https://issues.apache.org/jira/browse/SOLR-2288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated SOLR-2288: -- Attachment: SOLR-2288_namedlist.patch Hi Hoss Man, thanks for starting this issue. I looked at your patch, and personally I think NamedList should really be type-safe. If users want to use it in a type-unsafe way, thats fine, but the container itself shouldn't be ListObject. Here's an initial patch (all tests pass)... it also removes the deprecated methods. clean up compiler warnings -- Key: SOLR-2288 URL: https://issues.apache.org/jira/browse/SOLR-2288 Project: Solr Issue Type: Improvement Reporter: Hoss Man Attachments: SOLR-2288_namedlist.patch, warning.cleanup.patch there's a ton of compiler warning in the solr tree, and it's high time we cleaned them up, or annotate them to be suppressed so we can start making a bigger stink when/if code is added to the tree thta produces warnings (we'll never do a good job of noticing new warnings when we have ~175 existing ones) Using this issue to track related commits The goal of this issue should not be to change any functionality or APIs, just deal with each warning in the most appropriate way; * fix generic declarations * add SuppressWarning anotation if it's safe in context -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2815) MultiFields not thread safe
[ https://issues.apache.org/jira/browse/LUCENE-2815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12972096#action_12972096 ] Yonik Seeley commented on LUCENE-2815: -- bq. but I wonder whether we really should? (In fact I thought at one point we decided to stop doing that... yet, we still are... can't remember the details; maybe perf hit was too high eg for MTQs/Solr facets/etc.). It wouldn't be solr facets... that code asks for fields() once up front (per facet request) and the rest of the work will dwarf that. I think there probably are a lot of random places that use it where the overhead could be significant. For example IndexReader.deleteDocuments(), ParallelReader, FuzzyLikeThisQuery, and anyone else that uses any of the static methods on Field on a non-segment reader. bq. What do we need to do to make the publication safe? Is making IR.store/retrieveFields sync'd sufficient? More than sufficient. A volatile would also work fine provided that a race shouldn't matter (i.e. more than one MultiFields object could be constructed). bq. Maybe MultiFields should just pre-build its MapString,Term on init? Ouch... those folks with 1000s of fields wouldn't be happy about that. MultiFields not thread safe --- Key: LUCENE-2815 URL: https://issues.apache.org/jira/browse/LUCENE-2815 Project: Lucene - Java Issue Type: Bug Affects Versions: 4.0 Reporter: Yonik Seeley MultiFields looks like it has thread safety issues -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
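A minimal sketch of the two publication options being discussed (class and method names are illustrative; this is not the actual IndexReader code):
{code}
class CachedMultiFields {
  // option 1: volatile publish - a race can build MultiFields twice, which is
  // harmless as long as duplicate construction is acceptable
  private volatile Fields multiFields;

  Fields getVolatile(IndexReader r) throws IOException {
    Fields f = multiFields;
    if (f == null) {
      f = buildMultiFields(r);   // hypothetical builder
      multiFields = f;           // safe publication via the volatile write
    }
    return f;
  }

  // option 2: synchronized - stronger, and also prevents duplicate construction
  private Fields multiFieldsSync;

  synchronized Fields getSynchronized(IndexReader r) throws IOException {
    if (multiFieldsSync == null) {
      multiFieldsSync = buildMultiFields(r);
    }
    return multiFieldsSync;
  }
}
{code}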
[jira] Commented: (SOLR-2288) clean up compiler warnings
[ https://issues.apache.org/jira/browse/SOLR-2288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12972107#action_12972107 ] Hoss Man commented on SOLR-2288: Robert: as mentioned, i'm trying to keep a narrow focus on this issue: dealing with warnings that can be cleaned up w/o changing functionality... bq. The goal of this issue should not be to change any functionality or APIs, just deal with each warning ...can we please confine discusions of changing the implementation of NamedList (or any other classes) to distinct issues? like SOLR-912? clean up compiler warnings -- Key: SOLR-2288 URL: https://issues.apache.org/jira/browse/SOLR-2288 Project: Solr Issue Type: Improvement Reporter: Hoss Man Attachments: SOLR-2288_namedlist.patch, warning.cleanup.patch there's a ton of compiler warning in the solr tree, and it's high time we cleaned them up, or annotate them to be suppressed so we can start making a bigger stink when/if code is added to the tree thta produces warnings (we'll never do a good job of noticing new warnings when we have ~175 existing ones) Using this issue to track related commits The goal of this issue should not be to change any functionality or APIs, just deal with each warning in the most appropriate way; * fix generic declarations * add SuppressWarning anotation if it's safe in context -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (SOLR-2288) clean up compiler warnings
[ https://issues.apache.org/jira/browse/SOLR-2288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12972108#action_12972108 ] Robert Muir commented on SOLR-2288: --- bq. Robert: as mentioned, i'm trying to keep a narrow focus on this issue: dealing with warnings that can be cleaned up w/o changing functionality... Ok but i didnt change the functionality? the functionality is the same, just the implementation is different. This is the root cause of most of the compiler warnings, let's not dodge the issue. clean up compiler warnings -- Key: SOLR-2288 URL: https://issues.apache.org/jira/browse/SOLR-2288 Project: Solr Issue Type: Improvement Reporter: Hoss Man Attachments: SOLR-2288_namedlist.patch, warning.cleanup.patch there's a ton of compiler warning in the solr tree, and it's high time we cleaned them up, or annotate them to be suppressed so we can start making a bigger stink when/if code is added to the tree thta produces warnings (we'll never do a good job of noticing new warnings when we have ~175 existing ones) Using this issue to track related commits The goal of this issue should not be to change any functionality or APIs, just deal with each warning in the most appropriate way; * fix generic declarations * add SuppressWarning anotation if it's safe in context -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (SOLR-2288) clean up compiler warnings
[ https://issues.apache.org/jira/browse/SOLR-2288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12972114#action_12972114 ] Hoss Man commented on SOLR-2288: bq. just the implementation is different. fair enough -- i ment i was trying to avoid changes to either the APIs or the internals, just focusing on the quick wins that were easy to review at a glance and shouldn't affect the bytecode (CollectionObject instead of Collection; etc...) I don't expect that *all* compiler warnings can be dealt with using trivial patches, but that's what i was trying to focus on in this issue. changes to the internals of specific classes seem like they should be tracked in distinct issues with more visibility clean up compiler warnings -- Key: SOLR-2288 URL: https://issues.apache.org/jira/browse/SOLR-2288 Project: Solr Issue Type: Improvement Reporter: Hoss Man Attachments: SOLR-2288_namedlist.patch, warning.cleanup.patch there's a ton of compiler warning in the solr tree, and it's high time we cleaned them up, or annotate them to be suppressed so we can start making a bigger stink when/if code is added to the tree thta produces warnings (we'll never do a good job of noticing new warnings when we have ~175 existing ones) Using this issue to track related commits The goal of this issue should not be to change any functionality or APIs, just deal with each warning in the most appropriate way; * fix generic declarations * add SuppressWarning anotation if it's safe in context -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (SOLR-2288) clean up compiler warnings
[ https://issues.apache.org/jira/browse/SOLR-2288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12972116#action_12972116 ] Ryan McKinley commented on SOLR-2288: - For compiler warnings... without changing the API, can we just use NamedList<?> rather than bind it explicitly to Object? clean up compiler warnings -- Key: SOLR-2288 URL: https://issues.apache.org/jira/browse/SOLR-2288 Project: Solr Issue Type: Improvement Reporter: Hoss Man Attachments: SOLR-2288_namedlist.patch, warning.cleanup.patch there's a ton of compiler warning in the solr tree, and it's high time we cleaned them up, or annotate them to be suppressed so we can start making a bigger stink when/if code is added to the tree thta produces warnings (we'll never do a good job of noticing new warnings when we have ~175 existing ones) Using this issue to track related commits The goal of this issue should not be to change any functionality or APIs, just deal with each warning in the most appropriate way; * fix generic declarations * add SuppressWarning anotation if it's safe in context -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2723) Speed up Lucene's low level bulk postings read API
[ https://issues.apache.org/jira/browse/LUCENE-2723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12972119#action_12972119 ] Yonik Seeley commented on LUCENE-2723: -- I tested the optimized index with mike's latest patches (since that's per segment on both branch and trunk). Things are much more in line now... with the branch being anywhere from 2.3% to 5.4% slower, depending on the exact field tested. Speed up Lucene's low level bulk postings read API -- Key: LUCENE-2723 URL: https://issues.apache.org/jira/browse/LUCENE-2723 Project: Lucene - Java Issue Type: Improvement Components: Index Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 4.0 Attachments: LUCENE-2723-termscorer.patch, LUCENE-2723-termscorer.patch, LUCENE-2723-termscorer.patch, LUCENE-2723.patch, LUCENE-2723.patch, LUCENE-2723.patch, LUCENE-2723.patch, LUCENE-2723.patch, LUCENE-2723_termscorer.patch Spinoff from LUCENE-1410. The flex DocsEnum has a simple bulk-read API that reads the next chunk of docs/freqs. But it's a poor fit for intblock codecs like FOR/PFOR (from LUCENE-1410). This is not unlike sucking coffee through those tiny plastic coffee stirrers they hand out airplanes that, surprisingly, also happen to function as a straw. As a result we see no perf gain from using FOR/PFOR. I had hacked up a fix for this, described at in my blog post at http://chbits.blogspot.com/2010/08/lucene-performance-with-pfordelta-codec.html I'm opening this issue to get that work to a committable point. So... I've worked out a new bulk-read API to address performance bottleneck. It has some big changes over the current bulk-read API: * You can now also bulk-read positions (but not payloads), but, I have yet to cutover positional queries. * The buffer contains doc deltas, not absolute values, for docIDs and positions (freqs are absolute). * Deleted docs are not filtered out. * The doc freq buffers need not be aligned. For fixed intblock codecs (FOR/PFOR) they will be, but for varint codecs (Simple9/16, Group varint, etc.) they won't be. It's still a work in progress... -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
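To make the buffer semantics described in the issue concrete, a small consumer-side sketch (field and method names are illustrative): with the new API, doc deltas must be summed and deleted docs filtered by the caller.
{code}
int doc = 0;
for (int i = 0; i < count; i++) {
  doc += docDeltas[i];                        // the buffer holds deltas, not absolute IDs
  if (deletedDocs == null || !deletedDocs.get(doc)) {
    collect(doc, freqs[i]);                   // freqs are absolute; positions not shown
  }
}
{code}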
[jira] Commented: (SOLR-2288) clean up compiler warnings
[ https://issues.apache.org/jira/browse/SOLR-2288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12972121#action_12972121 ] Robert Muir commented on SOLR-2288: --- Separately, i just want to say the following about NamedList: All uses of this API should really be reviewed. I'm quite aware that it warns you about the fact that its slow for certain operations, but in my opinion these slow operations such as get(String, int) should be deprecated and removed. Any users that are using NamedList in this way, especially in loops, are very likely using the wrong datastructure. clean up compiler warnings -- Key: SOLR-2288 URL: https://issues.apache.org/jira/browse/SOLR-2288 Project: Solr Issue Type: Improvement Reporter: Hoss Man Attachments: SOLR-2288_namedlist.patch, warning.cleanup.patch there's a ton of compiler warning in the solr tree, and it's high time we cleaned them up, or annotate them to be suppressed so we can start making a bigger stink when/if code is added to the tree thta produces warnings (we'll never do a good job of noticing new warnings when we have ~175 existing ones) Using this issue to track related commits The goal of this issue should not be to change any functionality or APIs, just deal with each warning in the most appropriate way; * fix generic declarations * add SuppressWarning anotation if it's safe in context -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
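To illustrate the point about lookups in loops: NamedList is ordered and allows repeated names, so name-based get() is a linear scan. A sketch of the usual alternative when many lookups are needed (list is an assumed NamedList<Object> variable; later duplicate names overwrite earlier ones here):
{code}
// copy once into a Map, then do O(1) lookups instead of repeated linear scans
Map<String, Object> byName = new HashMap<String, Object>();
for (int i = 0; i < list.size(); i++) {
  byName.put(list.getName(i), list.getVal(i));
}
Object defaults = byName.get("defaults");
{code}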
[jira] Updated: (LUCENE-2694) MTQ rewrite + weight/scorer init should be single pass
[ https://issues.apache.org/jira/browse/LUCENE-2694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Simon Willnauer updated LUCENE-2694: Attachment: LUCENE-2694.patch Attaching current state - all tests pass for me and luceneutil brings consistent results with trunk.
{code}
Query                               QPS trunk   QPS termstate   Pct diff
unit~2.0                                14.70           14.39      -2.1%
united~2.0                               6.91            6.83      -1.1%
united~1.0                               7.42            7.38      -0.6%
unit state                              12.31           12.37       0.5%
unit~1.0                                15.41           15.49       0.5%
uni*                                     7.18            7.22       0.6%
un*d                                     7.97            8.04       0.9%
unit*                                   12.89           13.09       1.6%
+unit +state                            28.16           28.64       1.7%
+nebraska +state                        81.26           82.67       1.7%
spanNear([unit, state], 10, true)       11.60           11.83       2.0%
state                                   40.50           41.47       2.4%
spanFirst(unit, 5)                      47.65           48.84       2.5%
unit state                              17.72           18.19       2.7%
u*d                                      4.27            4.48       5.0%
{code}
Those are the results I have for now. Fuzzy only expands to 50 terms so that might not be very meaningful. I re-added the TermCache for this patch though... Will attach more info tomorrow. MTQ rewrite + weight/scorer init should be single pass -- Key: LUCENE-2694 URL: https://issues.apache.org/jira/browse/LUCENE-2694 Project: Lucene - Java Issue Type: Improvement Components: Search Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 4.0 Attachments: LUCENE-2694.patch, LUCENE-2694.patch, LUCENE-2694.patch Spinoff of LUCENE-2690 (see the hacked patch on that issue)... Once we fix MTQ rewrite to be per-segment, we should take it further and make weight/scorer init also run in the same single pass as rewrite. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2694) MTQ rewrite + weight/scorer init should be single pass
[ https://issues.apache.org/jira/browse/LUCENE-2694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12972127#action_12972127 ] Robert Muir commented on LUCENE-2694: - We shouldn't lose the clone() optimization in StandardPostingsReader... the class is final so it should use 'copy' instead of calling super.clone() This is important for -client. MTQ rewrite + weight/scorer init should be single pass -- Key: LUCENE-2694 URL: https://issues.apache.org/jira/browse/LUCENE-2694 Project: Lucene - Java Issue Type: Improvement Components: Search Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 4.0 Attachments: LUCENE-2694.patch, LUCENE-2694.patch, LUCENE-2694.patch Spinoff of LUCENE-2690 (see the hacked patch on that issue)... Once we fix MTQ rewrite to be per-segment, we should take it further and make weight/scorer init also run in the same single pass as rewrite. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
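A sketch of the kind of optimization being referred to (class and helper names here are hypothetical, not the actual StandardPostingsReader code): a final class can hand back an explicit copy instead of going through Object.clone(), which tends to be cheaper, notably under -client.
{code}
// inside a final TermState implementation (names hypothetical):
@Override
public StandardTermState clone() {
  StandardTermState copy = new StandardTermState();
  copy.copyFrom(this);        // explicit field-by-field copy instead of super.clone()
  return copy;
}
{code}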
Example documents and geospatial
The example docs we distribute have a bunch of stores that have the exact same location. That led to some head scratching about why changing the distance in the example queries seemed to make no difference in the number of returned results, and then all of a sudden it reduced the number of hits drastically. Any objections to a patch that adds an arbitrary distance (say 1/4 mile or so) to all of the stores in the example docs that have the same location? If not, I'll put up a JIRA and attach a patch. Erick
[jira] Commented: (SOLR-2259) Improve analyzer/version handling in Solr
[ https://issues.apache.org/jira/browse/SOLR-2259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12972150#action_12972150 ] Robert Muir commented on SOLR-2259: --- I committed part 2 in revision 1050064. Improve analyzer/version handling in Solr - Key: SOLR-2259 URL: https://issues.apache.org/jira/browse/SOLR-2259 Project: Solr Issue Type: Task Reporter: Robert Muir Fix For: 3.1, 4.0 Attachments: SOLR-2259.patch, SOLR-2259.patch, SOLR-2259part2.patch We added Version for backwards compatibility support in Lucene. We use this to fire deprecated code to emulate old version to ensure index backwards compat. Related: we deprecate old analysis components and eventually remove them. To hook into Solr, at first it defaulted to Version 2.4 emulation everywhere, with the example having the latest. if you don't specify a version in your solrconfig, it defaults to 2.4 though. However, as of LUCENE-2781 2.4 is removed: but users with old configs that don't specify a version should not be silently upgraded to the Version 3.0 emulation... this is bad. Additionally, when users are using deprecated emulation or using deprecated factories they might not know it, and it might come as a surprise if they upgrade, especially if they arent looking at java apis or java code. I propose: # in trunk: we make the solrconfig luceneMatchVersion mandatory. This is simple: Uwe already has a method that will error out if its not present, we just use that. # in 3.x: we warn if you don't specify luceneMatchVersion in solrconfig: telling you that its going to be required in 4.0 and that you are defaulting to 2.4 emulation. For example: Warning: luceneMatchVersion is not specified in solrconfig.xml. Defaulting to 2.4 emulation. You should at some point declare and reindex to at least 3.0, because 2.4 emulation is deprecated in 3.x and will be removed in 4.0. This parameter will be mandatory in 4.0. # in 3.x,trunk: we warn if you are using a deprecated matchVersion constant somewhere in general, even for a specific tokenizer, telling you that you need to at some point reindex with a current version before you can move to the next release. For example: Warning: you are using 2.4 emulation, at some point you need to bump and reindex to at least 3.0, because 2.4 emulation is deprecated in 3.x and will be removed in 4.0 # in 3.x,trunk: we warn if you are using a deprecated TokenStreamFactory so that you know its going to be removed. For example: Warning: the ISOLatin1FilterFactory is deprecated and will be removed in the next release. You should migrate to ASCIIFoldingFilterFactory. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-2815) MultiFields not thread safe
[ https://issues.apache.org/jira/browse/LUCENE-2815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated LUCENE-2815: --- Fix Version/s: 4.0 MultiFields not thread safe --- Key: LUCENE-2815 URL: https://issues.apache.org/jira/browse/LUCENE-2815 Project: Lucene - Java Issue Type: Bug Affects Versions: 4.0 Reporter: Yonik Seeley Fix For: 4.0 MultiFields looks like it has thread safety issues -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2814) stop writing shared doc stores across segments
[ https://issues.apache.org/jira/browse/LUCENE-2814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12972173#action_12972173 ] Michael McCandless commented on LUCENE-2814: OK I dug here... the reason why TestNRTThreads fails is because you moved the numDocsInRAM++ out of DW.getThreadState into WaitQueue.writeDocument. When we buffer del terms in DW.deleteTerm/Terms/Query/Queries, we grab the current numDocsInRAM as the docID upto, to record when it comes time to apply the delete which docID we must stop at. But with your change, this value is now an undercount, since numDocsInRAM is now acting like numDocsInStore. One way to fix this would be to change the delete methods to use nextDocID instead of numDocsInRAM? But I think I'd prefer to put back numDocsInRAM++ in getThreadState... stop writing shared doc stores across segments -- Key: LUCENE-2814 URL: https://issues.apache.org/jira/browse/LUCENE-2814 Project: Lucene - Java Issue Type: Improvement Components: Index Affects Versions: 3.1, 4.0 Reporter: Michael McCandless Assignee: Michael McCandless Attachments: LUCENE-2814.patch, LUCENE-2814.patch Shared doc stores enables the files for stored fields and term vectors to be shared across multiple segments. We've had this optimization since 2.1 I think. It works best against a new index, where you open an IW, add lots of docs, and then close it. In that case all of the written segments will reference slices a single shared doc store segment. This was a good optimization because it means we never need to merge these files. But, when you open another IW on that index, it writes a new set of doc stores, and then whenever merges take place across doc stores, they must now be merged. However, since we switched to shared doc stores, there have been two optimizations for merging the stores. First, we now bulk-copy the bytes in these files if the field name/number assignment is congruent. Second, we now force congruent field name/number mapping in IndexWriter. This means this optimization is much less potent than it used to be. Furthermore, the optimization adds *a lot* of hair to IndexWriter/DocumentsWriter; this has been the source of sneaky bugs over time, and causes odd behavior like a merge possibly forcing a flush when it starts. Finally, with DWPT (LUCENE-2324), which gets us truly concurrent flushing, we can no longer share doc stores. So, I think we should turn off the write-side of shared doc stores to pave the path for DWPT to land on trunk and simplify IW/DW. We still must support reading them (until 5.0), but the read side is far less hairy. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2611) IntelliJ IDEA setup
[ https://issues.apache.org/jira/browse/LUCENE-2611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12972176#action_12972176 ] David Smiley commented on LUCENE-2611: -- It turns out that IntelliJ was rewriting my $MODULE_DIR$/../../ paths to paths relative to a path variable I defined on my system, and that is intended behavior according to JetBrains. I removed the path variable... I can live without it after all, and that problem doesn't exist anymore. IntelliJ IDEA setup --- Key: LUCENE-2611 URL: https://issues.apache.org/jira/browse/LUCENE-2611 Project: Lucene - Java Issue Type: New Feature Components: Build Affects Versions: 3.1, 4.0 Reporter: Steven Rowe Priority: Minor Fix For: 3.1, 4.0 Attachments: LUCENE-2611-branch-3x.patch, LUCENE-2611-branch-3x.patch, LUCENE-2611-branch-3x.patch, LUCENE-2611-branch-3x.patch, LUCENE-2611.patch, LUCENE-2611.patch, LUCENE-2611.patch, LUCENE-2611.patch, LUCENE-2611.patch, LUCENE-2611.patch, LUCENE-2611.patch, LUCENE-2611_mkdir.patch, LUCENE-2611_test.patch, LUCENE-2611_test.patch, LUCENE-2611_test.patch, LUCENE-2611_test.patch, LUCENE-2611_test_2.patch Setting up Lucene/Solr in IntelliJ IDEA can be time-consuming. The attached patch adds a new top level directory {{dev-tools/}} with sub-dir {{idea/}} containing basic setup files for trunk, as well as a top-level ant target named idea that copies these files into the proper locations. This arrangement avoids the messiness attendant to in-place project configuration files directly checked into source control. The IDEA configuration includes modules for Lucene and Solr, each Lucene and Solr contrib, and each analysis module. A JUnit test run per module is included. Once {{ant idea}} has been run, the only configuration that must be performed manually is configuring the project-level JDK. If this patch is committed, Subversion svn:ignore properties should be added/modified to ignore the destination module files (*.iml) in each module's directory. Iam Jambour has written up on the Lucene wiki a detailed set of instructions for applying the 3.X branch patch: http://wiki.apache.org/lucene-java/HowtoConfigureIntelliJ -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: Example documents and geospatial
I don't think they are all the same, at least not in trunk. I believe there are a few near San Fran, some near Buffalo, MN (my hometown ;-) ), and some in Oklahoma. You can see this when you hit the /browse url. On Dec 16, 2010, at 11:32 AM, Erick Erickson wrote: The example docs we distribute have a bunch of stores that have the exact same location. That lead to some head scratching about why changing the distance in the example queries seemed to make no difference in the number of returned results then all of a sudden it reduced the number of hits drastically. Any objections to a patch that adds an arbitrary distance (say 1/4 mile or so) to all of the stores in the example docs that have the same location? If not, I'll put up a JIRA and attach a patch. Erick - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: Example documents and geospatial
On Thu, Dec 16, 2010 at 11:32 AM, Erick Erickson erickerick...@gmail.com wrote: The example docs we distribute have a bunch of stores that have the exact same location. That lead to some head scratching about why changing the distance in the example queries seemed to make no difference in the number of returned results then all of a sudden it reduced the number of hits drastically. Any objections to a patch that adds an arbitrary distance (say 1/4 mile or so) to all of the stores in the example docs that have the same location? If not, I'll put up a JIRA and attach a patch. Erick +1 Try not to put 'em in a lake or something though ;-) -Yonik http://www.lucidimagination.com - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-1410) PFOR implementation
[ https://issues.apache.org/jira/browse/LUCENE-1410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hao yan updated LUCENE-1410: Attachment: LUCENE-1410.patch This patch is to add codec support for the PForDelta compression algorithm. Changes by Hao Yan (hyan2...@gmail.com) In summary, I added five files to support and test the codec.

In src:
1. org.apache.lucene.index.codecs.pfordelta.PForDelta.java
2. org.apache.lucene.index.codecs.pfordelta.Simple16.java
3. org.apache.lucene.index.codecs.PForDeltaFixedBlockCodec.java
4. org.apache.lucene.index.codecs.intblock.FixedIntBlockIndexOutputWithGetElementNum.java

In test:
5. org.apache.lucene.index.codecs.intblock.TestPForDeltaFixedIntBlockCodec.java

1) In particular, the first class, PForDelta, is the core implementation of the PForDelta algorithm; it compresses exceptions using Simple16, which is implemented in the second class, Simple16. 2) The third class, PForDeltaFixedBlockCodec, is similar to org.apache.lucene.index.codecs.mockintblock.MockFixedIntBlockCodec in the tests, except that it uses PForDelta to encode the data in the buffer. 3) The fourth class is almost the same as org.apache.lucene.index.codecs.intblock.FixedIntBlockIndexOutput, except that it provides an additional public method to retrieve the value of the upto field, which is a private field in FixedIntBlockIndexOutput. The reason I added this method is that the number of elements in the block that hold meaningful values is not always equal to the blockSize or the buffer size, since the last block/buffer of a stream usually contains fewer entries. In that case, I fill all elements after the meaningful ones with 0s, so we always compress one entire block. 4) The last class is the unit test for PForDeltaFixedIntBlockCodec, which is very similar to org.apache.lucene.index.codecs.intblock.TestIntBlockCodec. I also changed the LuceneTestCase class to add the new PForDeltaFixedIntBlockCodec. The unit tests and all Lucene tests have passed. PFOR implementation --- Key: LUCENE-1410 URL: https://issues.apache.org/jira/browse/LUCENE-1410 Project: Lucene - Java Issue Type: New Feature Components: Index Reporter: Paul Elschot Priority: Minor Fix For: Bulk Postings branch Attachments: autogen.tgz, for-summary.txt, LUCENE-1410-codecs.tar.bz2, LUCENE-1410.patch, LUCENE-1410.patch, LUCENE-1410.patch, LUCENE-1410b.patch, LUCENE-1410c.patch, LUCENE-1410d.patch, LUCENE-1410e.patch, TermQueryTests.tgz, TestPFor2.java, TestPFor2.java, TestPFor2.java Original Estimate: 21840h Remaining Estimate: 21840h Implementation of Patched Frame of Reference. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
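A small sketch of the zero-padding described in point 3 (the compressOneBlock call is an assumed API from the patch description, not verified against it):
{code}
// pad the tail block with zeros so the encoder always sees a full block
int[] block = new int[blockSize];
System.arraycopy(buffer, 0, block, 0, numValidElements);         // remaining slots stay 0
int[] compressed = PForDelta.compressOneBlock(block, blockSize); // assumed API
{code}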
[jira] Commented: (LUCENE-2814) stop writing shared doc stores across segments
[ https://issues.apache.org/jira/browse/LUCENE-2814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12972227#action_12972227 ] Michael Busch commented on LUCENE-2814: --- The shared doc stores are actually already completely removed in the realtime branch (part of LUCENE-2324). Does someone want to help with the merge? Then we can land the realtime branch (which is pretty much only DWPT and removing doc stores) in trunk sometime soon. stop writing shared doc stores across segments -- Key: LUCENE-2814 URL: https://issues.apache.org/jira/browse/LUCENE-2814 Project: Lucene - Java Issue Type: Improvement Components: Index Affects Versions: 3.1, 4.0 Reporter: Michael McCandless Assignee: Michael McCandless Attachments: LUCENE-2814.patch, LUCENE-2814.patch Shared doc stores enable the files for stored fields and term vectors to be shared across multiple segments. We've had this optimization since 2.1, I think. It works best against a new index, where you open an IW, add lots of docs, and then close it. In that case all of the written segments will reference slices of a single shared doc store segment. This was a good optimization because it means we never need to merge these files. But, when you open another IW on that index, it writes a new set of doc stores, and then whenever merges take place across doc stores, they must now be merged. However, since we switched to shared doc stores, there have been two optimizations for merging the stores. First, we now bulk-copy the bytes in these files if the field name/number assignment is congruent. Second, we now force congruent field name/number mapping in IndexWriter. This means this optimization is much less potent than it used to be. Furthermore, the optimization adds *a lot* of hair to IndexWriter/DocumentsWriter; this has been the source of sneaky bugs over time, and causes odd behavior like a merge possibly forcing a flush when it starts. Finally, with DWPT (LUCENE-2324), which gets us truly concurrent flushing, we can no longer share doc stores. So, I think we should turn off the write side of shared doc stores to pave the path for DWPT to land on trunk and simplify IW/DW. We still must support reading them (until 5.0), but the read side is far less hairy. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2815) MultiFields not thread safe
[ https://issues.apache.org/jira/browse/LUCENE-2815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12972258#action_12972258 ] Yonik Seeley commented on LUCENE-2815: -- bq. It looks like MultiReaderBits also has issues with safe object publication. Actually, it looks like this one is OK with most of our current code. SegmentReader.getDeletedDocs() returns an object stored in a volatile, so that counts as a safe publish. Other implementations seem to either throw an exception or directly call a segment reader. One exception is InstantiatedIndex (I think). We can't call getDeletedDocs() just once up-front, because an IndexReader may still be used to delete documents. MultiFields not thread safe --- Key: LUCENE-2815 URL: https://issues.apache.org/jira/browse/LUCENE-2815 Project: Lucene - Java Issue Type: Bug Affects Versions: 4.0 Reporter: Yonik Seeley Fix For: 4.0 MultiFields looks like it has thread safety issues -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
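As a side note on the safe-publication point above: the pattern being relied on is that an object fully constructed and then written to a volatile field is safely visible to other threads. A tiny sketch, with hypothetical names rather than Lucene's actual classes:

public class SafePublicationSketch {
    static final class Bits {
        final long[] words;
        Bits(int numBits) { words = new long[(numBits + 63) >>> 6]; }
        boolean get(int i) { return (words[i >>> 6] & (1L << (i & 63))) != 0; }
    }

    // Writing a fully constructed Bits to a volatile field establishes a
    // happens-before edge, so readers never observe a half-built object.
    private volatile Bits deletedDocs;

    void markDeletionsChanged(int numDocs) {
        deletedDocs = new Bits(numDocs);   // safe publication via a volatile write
    }

    Bits getDeletedDocs() {
        return deletedDocs;                // may be null if nothing was deleted yet
    }

    public static void main(String[] args) {
        SafePublicationSketch r = new SafePublicationSketch();
        r.markDeletionsChanged(1000);
        System.out.println(r.getDeletedDocs().get(5));
    }
}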
[jira] Commented: (LUCENE-2814) stop writing shared doc stores across segments
[ https://issues.apache.org/jira/browse/LUCENE-2814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12972275#action_12972275 ] Michael Busch commented on LUCENE-2814: --- Well I need to merge with the recent changes in trunk (especially LUCENE-2680). The merge is pretty hard, but I'm planning to spend most of my weekend on it. If I can get most tests to pass again (most were passing before the merge), then I think the only outstanding thing is LUCENE-2573 before we could land it in trunk. stop writing shared doc stores across segments -- Key: LUCENE-2814 URL: https://issues.apache.org/jira/browse/LUCENE-2814 Project: Lucene - Java Issue Type: Improvement Components: Index Affects Versions: 3.1, 4.0 Reporter: Michael McCandless Assignee: Michael McCandless Attachments: LUCENE-2814.patch, LUCENE-2814.patch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2815) MultiFields not thread safe
[ https://issues.apache.org/jira/browse/LUCENE-2815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12972281#action_12972281 ] Yonik Seeley commented on LUCENE-2815: -- I was going to fix InstantiatedIndex, but while I was in there, I saw a lot of non-threadsafe code. I think that really deserves its own issue. What range of docs is InstantiatedIndex faster for, and is it something we want to continue to maintain? MultiFields not thread safe --- Key: LUCENE-2815 URL: https://issues.apache.org/jira/browse/LUCENE-2815 Project: Lucene - Java Issue Type: Bug Affects Versions: 4.0 Reporter: Yonik Seeley Fix For: 4.0 MultiFields looks like it has thread safety issues -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2814) stop writing shared doc stores across segments
[ https://issues.apache.org/jira/browse/LUCENE-2814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12972286#action_12972286 ] Michael McCandless commented on LUCENE-2814: I think taking things one step at a time would be good here? Ie remove doc stores from trunk, let that bake on trunk for a while, then merge to RT? So that what then remains on RT is DWPT / tiered flushing? Else RT is a monster change? stop writing shared doc stores across segments -- Key: LUCENE-2814 URL: https://issues.apache.org/jira/browse/LUCENE-2814 Project: Lucene - Java Issue Type: Improvement Components: Index Affects Versions: 3.1, 4.0 Reporter: Michael McCandless Assignee: Michael McCandless Attachments: LUCENE-2814.patch, LUCENE-2814.patch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: Example documents and geospatial
Grant: Yep, there were fewer than I remembered, but still half a dozen or so ... but I remember way more than that (somehow 16 comes to mind)... so obviously some gnome has been in there already... Not all of the ones I remember were the same at all, but enough were that it was puzzling. Erik: Lake? Why should I care about a lake? Actually, I never even thought about it, glad you pointed it out. OK, I'll put up a patch today or tomorrow. Anybody want to apply the patch for 2275 (whitespace in mm parameter causes parse exception)? Erick On Thu, Dec 16, 2010 at 4:37 PM, Yonik Seeley yo...@lucidimagination.com wrote: On Thu, Dec 16, 2010 at 11:32 AM, Erick Erickson erickerick...@gmail.com wrote: The example docs we distribute have a bunch of stores that have the exact same location. That led to some head scratching about why changing the distance in the example queries seemed to make no difference in the number of returned results, then all of a sudden it reduced the number of hits drastically. Any objections to a patch that adds an arbitrary distance (say 1/4 mile or so) to all of the stores in the example docs that have the same location? If not, I'll put up a JIRA and attach a patch. Erick +1 Try not to put 'em in a lake or something though ;-) -Yonik http://www.lucidimagination.com - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
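For illustration only, a sketch of the kind of tweak being proposed for the example docs; the coordinates here are made up, and a quarter mile is taken as roughly 0.25/69 of a degree of latitude:

public class JitterLocationsSketch {
    public static void main(String[] args) {
        // made-up sample coordinates: three "stores" sharing one location
        double[][] stores = {
            {45.17614, -93.87341},
            {45.17614, -93.87341},
            {45.17614, -93.87341},
        };
        double quarterMileLatDeg = 0.25 / 69.0;          // ~1/4 mile expressed in degrees of latitude
        for (int i = 0; i < stores.length; i++) {
            stores[i][0] += i * quarterMileLatDeg;       // nudge each duplicate a bit further north
            System.out.printf("store %d: %.5f,%.5f%n", i, stores[i][0], stores[i][1]);
        }
    }
}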
[jira] Commented: (LUCENE-2814) stop writing shared doc stores across segments
[ https://issues.apache.org/jira/browse/LUCENE-2814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12972288#action_12972288 ] Michael Busch commented on LUCENE-2814: --- bq. I think taking things one step at a time would be good here? Probably still a smaller change than flex indexing ;) But yeah in general I agree that we should do things more incrementally. I think that's a mistake I've made with the RT branch so far. In this particular case it's just a bit sad to redo all this work now, because I think I got the removal of doc stores right in RT and all related tests to pass. stop writing shared doc stores across segments -- Key: LUCENE-2814 URL: https://issues.apache.org/jira/browse/LUCENE-2814 Project: Lucene - Java Issue Type: Improvement Components: Index Affects Versions: 3.1, 4.0 Reporter: Michael McCandless Assignee: Michael McCandless Attachments: LUCENE-2814.patch, LUCENE-2814.patch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2814) stop writing shared doc stores across segments
[ https://issues.apache.org/jira/browse/LUCENE-2814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12972298#action_12972298 ] Earwin Burrfoot commented on LUCENE-2814: - So, what's the plan? stop writing shared doc stores across segments -- Key: LUCENE-2814 URL: https://issues.apache.org/jira/browse/LUCENE-2814 Project: Lucene - Java Issue Type: Improvement Components: Index Affects Versions: 3.1, 4.0 Reporter: Michael McCandless Assignee: Michael McCandless Attachments: LUCENE-2814.patch, LUCENE-2814.patch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: [VOTE] [Take 3] Release PyLucene 2.9.4-1 and 3.0.3-1
On Sun, 12 Dec 2010, Andi Vajda wrote: A patch that improves the finding of jni.h on Mac OS X was integrated. It made it worth blocking this release and preparing new release artifacts. No one voted on the [Take 2] artifacts, and I hope this is not inconveniencing anyone. I also hope that this is it for PyLucene 2.9.4/3.0.3 :-) So please vote to release the artifacts available from http://people.apache.org/~vajda/staging_area/ as PyLucene 2.9.4 and PyLucene 3.0.3. Here is my +1. This vote has now passed. Thank you to all who voted! The releases should be announced shortly. Andi..
[jira] Commented: (LUCENE-2814) stop writing shared doc stores across segments
[ https://issues.apache.org/jira/browse/LUCENE-2814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12972302#action_12972302 ] Michael Busch commented on LUCENE-2814: --- bq. So, what's the plan? I can't really work on this much before Saturday. But during the weekend I can work on the RT merge and maybe try to pull out the docstore removal changes and create a separate patch. Have to see how hard that is. If it's not too difficult I'll post a separate patch, otherwise I'll commit the merge to RT and maybe convince you guys to help a bit with getting the RT branch ready for landing in trunk? :) stop writing shared doc stores across segments -- Key: LUCENE-2814 URL: https://issues.apache.org/jira/browse/LUCENE-2814 Project: Lucene - Java Issue Type: Improvement Components: Index Affects Versions: 3.1, 4.0 Reporter: Michael McCandless Assignee: Michael McCandless Attachments: LUCENE-2814.patch, LUCENE-2814.patch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2814) stop writing shared doc stores across segments
[ https://issues.apache.org/jira/browse/LUCENE-2814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12972316#action_12972316 ] Earwin Burrfoot commented on LUCENE-2814: - Instead of you pulling out docstore removal, I can finish that patch. But then merging's gonna be even greater bitch. Probably. But maybe not. Do you do IRC? It can be faster to discuss in realtime, and you could also tell what help you need with the branch. stop writing shared doc stores across segments -- Key: LUCENE-2814 URL: https://issues.apache.org/jira/browse/LUCENE-2814 Project: Lucene - Java Issue Type: Improvement Components: Index Affects Versions: 3.1, 4.0 Reporter: Michael McCandless Assignee: Michael McCandless Attachments: LUCENE-2814.patch, LUCENE-2814.patch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-2815) MultiFields not thread safe
[ https://issues.apache.org/jira/browse/LUCENE-2815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yonik Seeley updated LUCENE-2815: - Attachment: LUCENE-2815.patch Here's a patch that uses a ConcurrentHashMap for the Terms cache, and makes IndexReader.fields volatile. That IndexReader.fields variable is just the type of thing that could be stored in a generic cache on the IndexReader, if/when we get something like that. MultiFields not thread safe --- Key: LUCENE-2815 URL: https://issues.apache.org/jira/browse/LUCENE-2815 Project: Lucene - Java Issue Type: Bug Affects Versions: 4.0 Reporter: Yonik Seeley Fix For: 4.0 Attachments: LUCENE-2815.patch MultiFields looks like it has thread safety issues -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
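For context, here is a small sketch of the two techniques the patch comment mentions, using hypothetical names rather than the actual MultiFields sources: a ConcurrentHashMap with putIfAbsent for the cache, and a volatile field for safe publication of a lazily built value.

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class TermsCacheSketch {
    // Thread-safe cache: a concurrent map plus putIfAbsent instead of an
    // unsynchronized check-then-put against a plain HashMap.
    private final ConcurrentMap<String, String> termsCache = new ConcurrentHashMap<String, String>();

    // volatile so a lazily built value is safely published to other threads
    private volatile String fields;

    String terms(String field) {
        String t = termsCache.get(field);
        if (t == null) {
            String built = buildTerms(field);
            String prev = termsCache.putIfAbsent(field, built);
            t = (prev != null) ? prev : built;           // keep whichever instance won the race
        }
        return t;
    }

    private String buildTerms(String field) {
        return "terms(" + field + ")";                   // stand-in for the real construction
    }

    void setFields(String f) { fields = f; }             // volatile write: safe publication
    String getFields() { return fields; }                // volatile read

    public static void main(String[] args) {
        TermsCacheSketch s = new TermsCacheSketch();
        System.out.println(s.terms("body"));
    }
}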
[jira] Updated: (LUCENE-2814) stop writing shared doc stores across segments
[ https://issues.apache.org/jira/browse/LUCENE-2814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Earwin Burrfoot updated LUCENE-2814: Attachment: LUCENE-2814.patch Patch updated to trunk, no nocommits, no *.closeDocStore(), tests pass. SegmentWriteState vs DocumentsWriter bothers me: we track flushed files in both, and we inconsistently get the current segment from both of them. stop writing shared doc stores across segments -- Key: LUCENE-2814 URL: https://issues.apache.org/jira/browse/LUCENE-2814 Project: Lucene - Java Issue Type: Improvement Components: Index Affects Versions: 3.1, 4.0 Reporter: Michael McCandless Assignee: Michael McCandless Attachments: LUCENE-2814.patch, LUCENE-2814.patch, LUCENE-2814.patch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Lucene-trunk - Build # 1397 - Failure
Build: https://hudson.apache.org/hudson/job/Lucene-trunk/1397/ All tests passed Build Log (for compile errors): [...truncated 18399 lines...] - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org