[jira] Resolved: (LUCENE-486) Core Test should not have dependencies on the Demo code
[ https://issues.apache.org/jira/browse/LUCENE-486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Busch resolved LUCENE-486.
----------------------------------
    Resolution: Fixed

Committed revision 822139.

> Core Test should not have dependencies on the Demo code
> --------------------------------------------------------
>
>                 Key: LUCENE-486
>                 URL: https://issues.apache.org/jira/browse/LUCENE-486
>             Project: Lucene - Java
>          Issue Type: Test
>          Components: Build
>    Affects Versions: 1.4
>            Reporter: Grant Ingersoll
>            Assignee: Michael Busch
>            Priority: Trivial
>             Fix For: 3.0
>         Attachments: FileDocument.java, lucene-486.patch, testdoc.patch
>
> The TestDoc.java test file has a dependency on the Demo FileDocument code. Some of us don't keep the Demo code around after downloading, so this breaks the build. Patch will be along shortly.
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12762573#action_12762573 ]

Michael McCandless commented on LUCENE-1458:
--------------------------------------------

Whoa, thanks for the sudden sprint Mark!

bq. Come on old man, stop clinging to emacs

Hey! I'm not so old :) But yeah, I still cling to emacs. Hey, I know people who still cling to vi!

{quote}
I didn't really look at the code, but some stuff I noticed:

java 6 in pfor (Arrays.copy)

skiplist stuff in codecs still have package of index - not sure what is going on there - changed them

in IndexWriter:
+ // Mark: read twice?
segmentInfos.read(directory);
+ segmentInfos.read(directory, codecs);
{quote}

Excellent catches! All of these are not right.

bq. (since you don't include contrib in the tar)

Gak, sorry. I have a bunch of mods there, cutting over to the flex API.

bq. You left getEnum(IndexReader reader) in the MultiTerm queries, but not in PrefixQuery - just checkin'.

Woops, for back-compat I think we need to leave it in (it's a protected method), deprecated. I'll put it back if you haven't.

bq. I guess TestBackwardsCompatibility.java has been removed from trunk or something? kept it here for now.

Eek, it shouldn't be -- indeed it is. When did that happen? We should fix this (separately from this issue!).

Do you have more fixes coming? If so, I'll let you sprint some more; else, I'll merge in, add the contrib back-compat branch, and post a new patch! Thanks :)

> Further steps towards flexible indexing
> ----------------------------------------
>
>                 Key: LUCENE-1458
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1458
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Index
>    Affects Versions: 2.9
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>         Attachments: LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2
>
> I attached a very rough checkpoint of my current patch, to get early feedback. All tests pass, though back-compat tests don't pass due to changes to package-private APIs plus certain bugs in tests that happened to work (eg call TermPositions.nextPosition() too many times, which the new API asserts against).
>
> [Aside: I think, when we commit changes to package-private APIs such that back-compat tests don't pass, we could go back, make a branch on the back-compat tag, commit changes to the tests to use the new package-private APIs on that branch, then fix the nightly build to use the tip of that branch?]
>
> There's still plenty to do before this is committable! This is a rather large change:
>
> * Switches to a new, more efficient terms dict format. This still uses tii/tis files, but the tii only stores term & long offset (not a TermInfo). At seek points, tis encodes term & freq/prox offsets absolutely instead of with deltas delta'd against the last term. Also, tis/tii are structured by field, so we don't have to record the field number in every term.
>   - On the first 1 M docs of Wikipedia, the tii file is 36% smaller (0.99 MB -> 0.64 MB) and the tis file is 9% smaller (75.5 MB -> 68.5 MB).
>   - RAM usage when loading the terms dict index is significantly less, since we only load an array of offsets and an array of String (no more TermInfo array). It should be faster to init too.
>   - This part is basically done.
> * Introduces a modular reader codec that strongly decouples the terms dict from the docs/positions readers. EG there is no more TermInfo used when reading the new format.
>   - There's nice symmetry now between reading & writing in the codec chain -- the current docs/prox format is captured in:
> {code}
> FormatPostingsTermsDictWriter/Reader
> FormatPostingsDocsWriter/Reader (.frq file)
> FormatPostingsPositionsWriter/Reader (.prx file)
> {code}
>   - This part is basically done.
> * Introduces a new flex API for iterating through the fields, terms, docs and positions:
> {code}
> FieldProducer -> TermsEnum -> DocsEnum -> PostingsEnum
> {code}
>   This replaces TermEnum/Docs/Positions. SegmentReader emulates the old API on top of the new API to keep back-compat.
>
> Next steps:
>
> * Plug in new codecs (pulsing, pfor) to exercise the modularity / fix any hidden assumptions.
> * Expose new API out of IndexReader, deprecate old API but emulate old API on top of new one, switch all core/contrib users to the new API.
> * Maybe switch to AttributeSources as the base class for TermsEnum, DocsEnum, PostingsEnum -- this would give readers API flexibility (not just index-file-format flexibility). EG if someone wanted to store payload at the term-doc level instead of term-doc-position level, you could just add a new attribute.
> * Test performance & iterate.
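[Editor's note: for orientation, the chain above replaces the old TermEnum/TermDocs/TermPositions walk level-for-level. A minimal, self-contained sketch of that old (pre-flex, still-current in 2.9) iteration -- this is not part of the patch, just the stable API it emulates:]

{code}
import java.io.IOException;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermEnum;
import org.apache.lucene.index.TermPositions;
import org.apache.lucene.store.Directory;

// Walk every term, matching doc, and position with the old API that the
// new FieldProducer -> TermsEnum -> DocsEnum -> PostingsEnum chain replaces.
void walkOldApi(Directory dir) throws IOException {
  IndexReader reader = IndexReader.open(dir, true); // read-only reader
  try {
    TermEnum terms = reader.terms();                // all terms, all fields
    while (terms.next()) {
      Term term = terms.term();
      TermPositions tp = reader.termPositions(term);
      while (tp.next()) {                           // each doc containing the term
        int doc = tp.doc();
        for (int i = 0; i < tp.freq(); i++) {
          int position = tp.nextPosition();         // each position within the doc
        }
      }
      tp.close();
    }
    terms.close();
  } finally {
    reader.close();
  }
}
{code}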
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12762576#action_12762576 ]

Michael McCandless commented on LUCENE-1458:
--------------------------------------------

bq. One of the common statistics one needs to implement some more advanced scoring approaches is the average document length. Is this patch far enough along that I could take a look at it and think about how one might do this?

Well, thinking through how you'd do this... likely you'd want to store the avg length (in tokens), eg as a single float per field per segment, right? The natural place to store this would be in the FieldInfos, I think. Unfortunately, this patch doesn't yet add extensibility to FieldInfos.

And you'd need a small customization to the indexing chain to compute this when indexing new docs, which is already doable today (though, package-private). But then on merging segments, you'd need an extension point, which we don't have today, to recompute the avg. Hmm: how would you handle deleted docs? Would you want to go back to the field length for every doc & recompute the average? (Which'd mean you'd need to keep per-doc, per-field lengths, not just the averages.)

Unfortunately, this patch doesn't yet address things like customizing what's stored in FieldInfo or SegmentInfo, nor customizing what happens during merging (though it takes us a big step closer to this). I think we need both of these to finish flexible indexing, but I'm thinking at this point that these should really be tackled in follow-on issue(s). This issue is already ridiculously massive.
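[Editor's note: to make the bookkeeping concrete, a rough sketch of the per-segment stats described above. This is not a Lucene API -- the class and method names here are hypothetical -- it only shows the single float per field per segment and the weighted re-average a merge would need (ignoring the open question of deleted docs):]

{code}
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of Lucene: accumulates token counts per
// field while indexing one segment.
public class FieldLengthStats {
  // field name -> {sum of token counts, number of docs seen}
  private final Map<String, long[]> perField = new HashMap<String, long[]>();

  /** Call once per (document, field) with that field's token count. */
  public void addDoc(String field, int numTokens) {
    long[] stats = perField.get(field);
    if (stats == null) {
      stats = new long[2];
      perField.put(field, stats);
    }
    stats[0] += numTokens;
    stats[1]++;
  }

  /** The single float per field per segment that could live in FieldInfos. */
  public float averageLength(String field) {
    long[] stats = perField.get(field);
    return (stats == null || stats[1] == 0) ? 0f : (float) stats[0] / stats[1];
  }

  /** Merging two segments: a weighted mean of their stored averages. */
  public static float mergedAverage(float avgA, long docsA, float avgB, long docsB) {
    return (float) ((avgA * (double) docsA + avgB * (double) docsB) / (docsA + docsB));
  }
}
{code}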
Re: [jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
On Tue, Oct 6, 2009 at 5:54 AM, Michael McCandless (JIRA) j...@apache.org wrote:

> bq. I guess TestBackwardsCompatibility.java has been removed from trunk or something? kept it here for now.
>
> Eek, it shouldn't be -- indeed it is. When did that happen? We should fix this (separately from this issue!).

I'm working on restoring TestBackCompat on trunk...

Mike
RE: svn commit: r822203 - /lucene/java/trunk/src/test/org/apache/lucene/index/TestBackwardsCompatibility.java
Sorry, I think this was one test too much to remove :-)

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

-----Original Message-----
From: mikemcc...@apache.org [mailto:mikemcc...@apache.org]
Sent: Tuesday, October 06, 2009 12:33 PM
To: java-comm...@lucene.apache.org
Subject: svn commit: r822203 - /lucene/java/trunk/src/test/org/apache/lucene/index/TestBackwardsCompatibility.java

Author: mikemccand
Date: Tue Oct 6 10:32:43 2009
New Revision: 822203

URL: http://svn.apache.org/viewvc?rev=822203&view=rev
Log: restore TestBackwardsCompatibility

Added: lucene/java/trunk/src/test/org/apache/lucene/index/TestBackwardsCompatibility.java (with props)

Added: lucene/java/trunk/src/test/org/apache/lucene/index/TestBackwardsCompatibility.java
URL: http://svn.apache.org/viewvc/lucene/java/trunk/src/test/org/apache/lucene/index/TestBackwardsCompatibility.java?rev=822203&view=auto
==============================================================================
--- lucene/java/trunk/src/test/org/apache/lucene/index/TestBackwardsCompatibility.java (added)
+++ lucene/java/trunk/src/test/org/apache/lucene/index/TestBackwardsCompatibility.java Tue Oct 6 10:32:43 2009
@@ -0,0 +1,530 @@
+package org.apache.lucene.index;
+
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+import java.io.BufferedOutputStream;
+import java.io.File;
+import java.io.FileOutputStream;
+import java.io.IOException;
+import java.io.InputStream;
+import java.io.OutputStream;
+import java.util.Arrays;
+import java.util.Enumeration;
+import java.util.List;
+import java.util.zip.ZipEntry;
+import java.util.zip.ZipFile;
+
+import org.apache.lucene.analysis.WhitespaceAnalyzer;
+import org.apache.lucene.document.Document;
+import org.apache.lucene.document.Field;
+import org.apache.lucene.search.IndexSearcher;
+import org.apache.lucene.search.ScoreDoc;
+import org.apache.lucene.search.TermQuery;
+import org.apache.lucene.store.Directory;
+import org.apache.lucene.store.FSDirectory;
+import org.apache.lucene.util.LuceneTestCase;
+import org.apache.lucene.util._TestUtil;
+
+/*
+  Verify we can read the pre-2.1 file format, do searches
+  against it, and add documents to it.
+*/
+
+public class TestBackwardsCompatibility extends LuceneTestCase
+{
+
+  // Uncomment these cases & run them on an older Lucene
+  // version, to generate an index to test backwards
+  // compatibility. Then, cd to build/test/index.cfs and
+  // run "zip index.VERSION.cfs.zip *"; cd to
+  // build/test/index.nocfs and run "zip
+  // index.VERSION.nocfs.zip *". Then move those 2 zip
+  // files to your trunk checkout and add them to the
+  // oldNames array.
+
+  /*
+  public void testCreatePreLocklessCFS() throws IOException {
+    createIndex("index.cfs", true);
+  }
+
+  public void testCreatePreLocklessNoCFS() throws IOException {
+    createIndex("index.nocfs", false);
+  }
+  */
+
+  /* Unzips dirName + ".zip" --> dirName, removing dirName
+     first */
+  public void unzip(String zipName, String destDirName) throws IOException {
+
+    Enumeration entries;
+    ZipFile zipFile;
+    zipFile = new ZipFile(zipName + ".zip");
+
+    entries = zipFile.entries();
+
+    String dirName = fullDir(destDirName);
+
+    File fileDir = new File(dirName);
+    rmDir(destDirName);
+
+    fileDir.mkdir();
+
+    while (entries.hasMoreElements()) {
+      ZipEntry entry = (ZipEntry) entries.nextElement();
+
+      InputStream in = zipFile.getInputStream(entry);
+      OutputStream out = new BufferedOutputStream(new FileOutputStream(new File(fileDir, entry.getName())));
+
+      byte[] buffer = new byte[8192];
+      int len;
+      while ((len = in.read(buffer)) >= 0) {
+        out.write(buffer, 0, len);
+      }
+
+      in.close();
+      out.close();
+    }
+
+    zipFile.close();
+  }
+
+  public void testCreateCFS() throws IOException {
+    String dirName = "testindex.cfs";
+    createIndex(dirName, true);
+    rmDir(dirName);
+  }
+
+  public void testCreateNoCFS() throws IOException {
+    String dirName = "testindex.nocfs";
Re: [jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
Merge away - still sleeping over here. Would love to look more again but don't know when, so no use waiting on me.

- Mark

http://www.lucidimagination.com (mobile)
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12762590#action_12762590 ]

Uwe Schindler commented on LUCENE-1458:
---------------------------------------

{quote}
bq. I guess TestBackwardsCompatibility.java has been removed from trunk or something? kept it here for now.

Eek, it shouldn't be - indeed it is. When did that happen? We should fix this (separately from this issue!).
{quote}

My fault, I removed it during the "remove backwards tests" on Saturday. If we do not remove DateTools/DateField for 3.0 (we may need to leave them in for index compatibility), I will restore these tests, too. It's easy with TortoiseSVN, and you can also preserve the history (using the svn:mergeinfo prop). I have this on my list when going forward with removing the old TokenStream API.
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12762592#action_12762592 ]

Michael McCandless commented on LUCENE-1458:
--------------------------------------------

bq. It's easy with TortoiseSVN and you can also preserve the history (using svn:mergeinfo prop).

Ahh -- can you do this for TestBackwardsCompatibility? I restored it, but lost all history. Thanks.
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12762600#action_12762600 ]

Uwe Schindler commented on LUCENE-1458:
---------------------------------------

Done. I also did it for the BW branch, but didn't create a tag yet. The next tag creation, for the next bigger patch, is enough (no need to do it now). What I have done: svn copy from the older revision to the same path :-)
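[Editor's note: the restore-with-history trick amounts to an svn copy from the last revision that still contained the file back onto the same path. A hypothetical invocation -- the revision number here is illustrative, not the one actually used:]

{code}
# Copy the file as it existed at (illustrative) revision 822202 back to
# its own path; svn records the copy source, so "svn log" history survives.
svn copy -m "restore TestBackwardsCompatibility with history" \
  https://svn.apache.org/repos/asf/lucene/java/trunk/src/test/org/apache/lucene/index/TestBackwardsCompatibility.java@822202 \
  https://svn.apache.org/repos/asf/lucene/java/trunk/src/test/org/apache/lucene/index/TestBackwardsCompatibility.java
{code}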
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12762633#action_12762633 ]

Michael McCandless commented on LUCENE-1458:
--------------------------------------------

bq. What I have done: svn copy from the older revision to the same path

Excellent, thanks! It had a few problems (it was still trying to use deprecated APIs, some of which were gone) -- I just committed fixes.
RE: svn commit: r822284 - /lucene/java/trunk/src/test/org/apache/lucene/index/TestBackwardsCompatibility.java
Can you add this patch to backwards, too? I forgot that some of the backwards-changes also applied to BW; for completeness, not sure if a tag is also needed.

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

-----Original Message-----
From: mikemcc...@apache.org [mailto:mikemcc...@apache.org]
Sent: Tuesday, October 06, 2009 4:13 PM
To: java-comm...@lucene.apache.org
Subject: svn commit: r822284 - /lucene/java/trunk/src/test/org/apache/lucene/index/TestBackwardsCompatibility.java

Author: mikemccand
Date: Tue Oct 6 14:12:46 2009
New Revision: 822284

URL: http://svn.apache.org/viewvc?rev=822284&view=rev
Log: fix TestBackwardsCompatibility to not use deprecated APIs

Modified: lucene/java/trunk/src/test/org/apache/lucene/index/TestBackwardsCompatibility.java
URL: http://svn.apache.org/viewvc/lucene/java/trunk/src/test/org/apache/lucene/index/TestBackwardsCompatibility.java?rev=822284&r1=822283&r2=822284&view=diff
==============================================================================
--- lucene/java/trunk/src/test/org/apache/lucene/index/TestBackwardsCompatibility.java (original)
+++ lucene/java/trunk/src/test/org/apache/lucene/index/TestBackwardsCompatibility.java Tue Oct 6 14:12:46 2009
@@ -158,11 +158,7 @@
     for(int i=0;i<oldNames.length;i++) {
       String dirName = "src/test/org/apache/lucene/index/index." + oldNames[i];
       unzip(dirName, oldNames[i]);
-      changeIndexNoAdds(oldNames[i], true);
-      rmDir(oldNames[i]);
-
-      unzip(dirName, oldNames[i]);
-      changeIndexNoAdds(oldNames[i], false);
+      changeIndexNoAdds(oldNames[i]);
       rmDir(oldNames[i]);
     }
   }
@@ -171,11 +167,7 @@
     for(int i=0;i<oldNames.length;i++) {
       String dirName = "src/test/org/apache/lucene/index/index." + oldNames[i];
       unzip(dirName, oldNames[i]);
-      changeIndexWithAdds(oldNames[i], true);
-      rmDir(oldNames[i]);
-
-      unzip(dirName, oldNames[i]);
-      changeIndexWithAdds(oldNames[i], false);
+      changeIndexWithAdds(oldNames[i]);
       rmDir(oldNames[i]);
     }
   }
@@ -196,7 +188,7 @@
     dirName = fullDir(dirName);
     Directory dir = FSDirectory.open(new File(dirName));
-    IndexSearcher searcher = new IndexSearcher(dir);
+    IndexSearcher searcher = new IndexSearcher(dir, true);
     IndexReader reader = searcher.getIndexReader();
     _TestUtil.checkIndex(dir);
@@ -267,14 +259,14 @@
   /* Open pre-lockless index, add docs, do a delete &
    * setNorm, and search */
-  public void changeIndexWithAdds(String dirName, boolean autoCommit) throws IOException {
+  public void changeIndexWithAdds(String dirName) throws IOException {
     String origDirName = dirName;
     dirName = fullDir(dirName);
     Directory dir = FSDirectory.open(new File(dirName));
     // open writer
-    IndexWriter writer = new IndexWriter(dir, autoCommit, new WhitespaceAnalyzer(), false);
+    IndexWriter writer = new IndexWriter(dir, new WhitespaceAnalyzer(), false, IndexWriter.MaxFieldLength.UNLIMITED);
     // add 10 docs
     for(int i=0;i<10;i++) {
@@ -288,11 +280,11 @@
     } else {
       expected = 46;
     }
-    assertEquals("wrong doc count", expected, writer.docCount());
+    assertEquals("wrong doc count", expected, writer.maxDoc());
     writer.close();
     // make sure searching sees right # hits
-    IndexSearcher searcher = new IndexSearcher(dir);
+    IndexSearcher searcher = new IndexSearcher(dir, true);
     ScoreDoc[] hits = searcher.search(new TermQuery(new Term("content", "aaa")), null, 1000).scoreDocs;
     Document d = searcher.doc(hits[0].doc);
     assertEquals("wrong first document", "21", d.get("id"));
@@ -301,7 +293,7 @@
     // make sure we can do delete & setNorm against this
     // pre-lockless segment:
-    IndexReader reader = IndexReader.open(dir);
+    IndexReader reader = IndexReader.open(dir, false);
     Term searchTerm = new Term("id", "6");
     int delCount = reader.deleteDocuments(searchTerm);
     assertEquals("wrong delete count", 1, delCount);
@@ -309,7 +301,7 @@
     reader.close();
     // make sure they took:
-    searcher = new IndexSearcher(dir);
+    searcher = new IndexSearcher(dir, true);
     hits = searcher.search(new TermQuery(new Term("content", "aaa")), null, 1000).scoreDocs;
     assertEquals("wrong number of hits", 43, hits.length);
     d = searcher.doc(hits[0].doc);
@@ -318,11 +310,11 @@
     searcher.close();
     // optimize
-    writer = new IndexWriter(dir, autoCommit, new WhitespaceAnalyzer(), false);
+    writer = new IndexWriter(dir, new WhitespaceAnalyzer(), false, IndexWriter.MaxFieldLength.UNLIMITED);
     writer.optimize();
     writer.close();
-    searcher = new IndexSearcher(dir);
+    searcher = new IndexSearcher(dir, true);
Re: svn commit: r822284 - /lucene/java/trunk/src/test/org/apache/lucene/index/TestBackwardsCompatibility.java
OK will do.

Mike

On Tue, Oct 6, 2009 at 10:23 AM, Uwe Schindler u...@thetaphi.de wrote:
> Can you add this patch to backwards, too? I forgot that some of the backwards-changes also applied to BW; for completeness, not sure if a tag is also needed.
RE: svn commit: r822284 - /lucene/java/trunk/src/test/org/apache/lucene/index/TestBackwardsCompatibility.java
Thanks, sorry for extra work! I missed to do this after the svn copy :(

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de
Re: svn commit: r822284 - /lucene/java/trunk/src/test/org/apache/lucene/index/TestBackwardsCompatibility.java
No problem! It's exciting :)

Mike

On Tue, Oct 6, 2009 at 10:40 AM, Uwe Schindler u...@thetaphi.de wrote:
> Thanks, sorry for extra work! I missed to do this after the svn copy :(
[jira] Updated: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-1458:
---------------------------------------
    Attachment: LUCENE-1458.patch

Uber-patch attached: started from Mark's patch (thanks!), added my contrib back-compat branch changes. All tests pass.

Also, I removed pfor from this issue. I'll attach the pfor codec to LUCENE-1410.

Note that I didn't use svn move in generating the patch, so that the patch can be applied cleanly. When it [finally] comes time to commit for real, I'll svn move so we preserve history.
Next steps: * Plug in new codecs (pulsing, pfor) to exercise the modularity / fix any hidden assumptions. * Expose new API out of IndexReader, deprecate old API but emulate old API on top of new one, switch all core/contrib users to the new API. * Maybe switch to AttributeSources as the base class for TermsEnum, DocsEnum, PostingsEnum -- this would give readers API flexibility (not just index-file-format flexibility). EG if someone wanted to store payload at the term-doc level instead of term-doc-position level, you could just add a new attribute. * Test performance iterate. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: [jira] Updated: (LUCENE-1458) Further steps towards flexible indexing
Michael McCandless (JIRA) wrote:
> Uber-patch attached: started from Mark's patch (thanks!), [...]

Anytime! Grunt work and I go together like Michael Bay and Uwe Boll. Hope I can actually make a meaningful contribution to flexible indexing at some point.

--
- Mark

http://www.lucidimagination.com
De-basing / re-basing docIDs, or how to effectively pass calculated values from a Scorer or Filter up to (Solr's) QueryComponent.process
In the code I'm working with, I generate a cache of calculated values as a by-product within a Filter.getDocIdSet implementation (and within a Query-ized version of the filter and its Scorer method). These values are keyed off the IndexReader's docID values, since that's all that's accessible at that level. Ultimately, however, I need to be able to access these values much higher up in the stack (Solr's QueryComponent.process method), so that I can inject the dynamic values into the response as a fake field. The IDs available here, however, are for the entire index and not just relative to the current IndexReader. I'm still fairly new to Lucene and I've been scratching my head a bit trying to find a reliable way to map these values into the same space, without having to hack up too many base classes. I noticed that there was a related discussion at: http://issues.apache.org/jira/browse/LUCENE-1821?focusedCommentId=12745041#action_12745041 ... but also a bit of disagreement on the suggested strategies. Ideally, I'm also hoping there's a strategy that won't require me to hack up too much of the core product; subclassing IndexSearcher in the way suggested would basically require me to change all of the various SearchComponents I use in Solr, and that sounds like it'd end up a real maintenance nightmare. I was looking at the Collector class as a possible solution, since it has knowledge of the docBase, but it looks like I'd then need to change every derived collector that the code ultimately uses, including the various anonymous Collectors in Solr, and that also looks like it'd be a fairly ghoulish solution. I suppose I'm being wishful, or lazy, but is there a reasonable and reliable way to do this without having to fork the core code? If not, any suggestions on the best strategy to accomplish this, without adding too much overhead every time I want to up-rev the core Lucene and/or Solr code to the latest version? Thanks a ton, Aaron
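For concreteness, here is a minimal sketch of the re-basing arithmetic involved, assuming Lucene 2.9's IndexReader.getSequentialSubReaders() (which returns null for atomic readers); the class and method names here are illustrative only, not part of any committed API. Each sub-reader's documents occupy a contiguous range of the top-level docID space, so a per-segment docID re-bases by adding the sub-reader's start offset:

{code}
import org.apache.lucene.index.IndexReader;

// Sketch: compute each sub-reader's docBase so per-segment docIDs gathered in
// a Filter/Scorer can be re-based into the top-level docID space that
// QueryComponent.process sees.
public class DocBases {
  /** Returns the global docID offset (docBase) of each sub-reader. */
  public static int[] docStarts(IndexReader top) {
    IndexReader[] subs = top.getSequentialSubReaders();
    if (subs == null) {                 // atomic reader: single base of 0
      return new int[] { 0 };
    }
    int[] starts = new int[subs.length];
    int base = 0;
    for (int i = 0; i < subs.length; i++) {
      starts[i] = base;                 // globalDoc = segmentDoc + starts[i]
      base += subs[i].maxDoc();
    }
    return starts;
  }
}
{code}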
Re: De-basing / re-basing docIDs, or how to effectively pass calculated values from a Scorer or Filter up to (Solr's) QueryComponent.process
Aaron, could you move this to solr-user?

-Yonik
http://www.lucidimagination.com

On Tue, Oct 6, 2009 at 11:22 AM, Aaron McKee ucbmc...@gmail.com wrote:
> In the code I'm working with, I generate a cache of calculated values as a by-product within a Filter.getDocIdSet implementation [...]
Re: De-basing / re-basing docIDs, or how to effectively pass calculated values from a Scorer or Filter up to (Solr's) QueryComponent.process
Might still be a lucene-ish issue. We already have getSequentialSubReaders() on IR; in my patched version I augmented this with a public readerIndex() and getSubReaderStarts(). Pretty much impossible to do some postprocessing on gathered hits without at least one of these.

On Tue, Oct 6, 2009 at 19:50, Yonik Seeley yo...@lucidimagination.com wrote:
> Aaron, could you move this to solr-user? [...]

--
Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
ICQ: 104465785
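A hedged sketch of what a readerIndex() helper like the one described could look like (readerIndex and getSubReaderStarts are names from the patched version above, not stock 2.9): binary-search the docStarts array from the earlier sketch to find which sub-reader owns a global docID.

{code}
// Hypothetical helper mirroring the described readerIndex(): given the
// ascending docStarts array (first element 0), find the sub-reader whose
// range [docStarts[i], docStarts[i+1]) contains the global docID.
public static int readerIndex(int globalDoc, int[] docStarts) {
  int lo = 0, hi = docStarts.length - 1;
  while (lo <= hi) {
    int mid = (lo + hi) >>> 1;
    if (globalDoc < docStarts[mid]) {
      hi = mid - 1;                     // strictly left of this range
    } else if (mid < docStarts.length - 1 && globalDoc >= docStarts[mid + 1]) {
      lo = mid + 1;                     // strictly right of this range
    } else {
      return mid;                       // docStarts[mid] <= globalDoc < next
    }
  }
  throw new IllegalArgumentException("docID out of range: " + globalDoc);
}
{code}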
[jira] Updated: (LUCENE-1410) PFOR implementation
[ https://issues.apache.org/jira/browse/LUCENE-1410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated LUCENE-1410: --- Attachment: LUCENE-1410-codecs.tar.bz2

Attaching sep, intblock and pfordelta codecs, spun out of the last patch on LUCENE-1458. Once LUCENE-1458 is in, we should finish the pfordelta codec to make it a real choice. I actually think some combination of pulsing, standard, pfordelta and simple bit packing (in order of increasing term docFreq), within a single codec, may be best. Ie, rare terms (only in a doc or two) could be inlined into the terms dict. Slightly more common terms can use the more CPU-intensive standard codec. Common terms can use CPU-friendly-yet-still-decent-compression pfordelta. Obscenely common terms can use bit packing for the fastest decode.

PFOR implementation
---
Key: LUCENE-1410
URL: https://issues.apache.org/jira/browse/LUCENE-1410
Project: Lucene - Java
Issue Type: New Feature
Components: Other
Reporter: Paul Elschot
Priority: Minor
Attachments: autogen.tgz, LUCENE-1410-codecs.tar.bz2, LUCENE-1410b.patch, LUCENE-1410c.patch, LUCENE-1410d.patch, LUCENE-1410e.patch, TermQueryTests.tgz, TestPFor2.java, TestPFor2.java, TestPFor2.java
Original Estimate: 21840h
Remaining Estimate: 21840h

Implementation of Patched Frame of Reference.
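To make the hybrid-codec idea above concrete, a hedged sketch of per-term format selection keyed on docFreq; the enum, method, and thresholds are hypothetical illustrations, not anything in the attached codecs:

{code}
// Hypothetical per-term format selection for a hybrid codec; the crossover
// thresholds are illustrative guesses, not measured values.
public class CodecChooser {
  enum PostingsFormat { PULSED, STANDARD, PFOR_DELTA, BIT_PACKED }

  static PostingsFormat choose(int docFreq, int maxDoc) {
    if (docFreq <= 2) {
      return PostingsFormat.PULSED;       // inline postings into the terms dict
    } else if (docFreq < 128) {
      return PostingsFormat.STANDARD;     // vInt deltas, CPU-heavier decode
    } else if (docFreq < maxDoc / 10) {
      return PostingsFormat.PFOR_DELTA;   // block-decoded, CPU-friendly
    } else {
      return PostingsFormat.BIT_PACKED;   // fastest decode for very dense terms
    }
  }
}
{code}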
[jira] Created: (LUCENE-1949) FilterManager uses unsafe keys for its filter cache
FilterManager uses unsafe keys for its filter cache
---
Key: LUCENE-1949
URL: https://issues.apache.org/jira/browse/LUCENE-1949
Project: Lucene - Java
Issue Type: Bug
Components: Search
Affects Versions: 2.9
Reporter: Aaron McKee
Priority: Minor

re: FilterManager.getFilter(Filter filter)

FilterManager is using the filter's hash code as the key to its filter cache; however, hash codes are intrinsically not guaranteed to be distinct, and different filters may hash to the same value. Although the chance of a conflict is hopefully low, given reasonable implementations of hashCode, it's certainly not impossible. When a conflict does occur, an unintended filter may be returned. I'm unaware to what extent this class is actively being used, but I noticed the issue during a code browse and thought I'd at least mention it.
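A small self-contained demonstration of why hashCode keys are unsafe (this uses Strings rather than Filters, since "Aa" and "BB" are well-known distinct Java Strings with identical hashCode 2112; the fix sketched here, keying on the object itself, assumes sensible equals()/hashCode() implementations):

{code}
import java.util.HashMap;
import java.util.Map;

public class UnsafeKeyDemo {
  public static void main(String[] args) {
    // Cache keyed on hashCode: the second put silently evicts the first.
    Map<Integer, String> byHash = new HashMap<Integer, String>();
    byHash.put("Aa".hashCode(), "filter for Aa");
    byHash.put("BB".hashCode(), "filter for BB");    // same key: 2112
    System.out.println(byHash.get("Aa".hashCode())); // prints: filter for BB

    // Cache keyed on the object itself: distinct keys unless truly equal.
    Map<String, String> byKey = new HashMap<String, String>();
    byKey.put("Aa", "filter for Aa");
    byKey.put("BB", "filter for BB");
    System.out.println(byKey.get("Aa"));             // prints: filter for Aa
  }
}
{code}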
[jira] Commented: (LUCENE-1410) PFOR implementation
[ https://issues.apache.org/jira/browse/LUCENE-1410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12762742#action_12762742 ] Eks Dev commented on LUCENE-1410: -

Mike, that is definitely the way to go: distribution-dependent encoding, where every term gets individual treatment. Take, for example, a simple but not all that rare case where the index gets sorted on some of the indexed fields (we use it really extensively, e.g. a presorted doc collection on user_rights/zip/city, all indexed). There you get perfectly compressible postings by simply managing intervals of set bits. Updates distort this picture, but we rebuild the index periodically and all gets good again. At the moment we load them into RAM as Filters in IntervalSets. If that were possible in Lucene, we wouldn't bother with Filters (VInt decoding on such super-dense fields was killing us, even in RAMDirectory)...

Thinking about your comments, isn't pulsing somewhat orthogonal to the packing method? For example, if you load an index into RAMDirectory, one could avoid one indirection level and inline all postings.

Flex indexing rocks; that is going to be the most important addition to Lucene since it started (imo)... I would even bet on double search speed in a first attempt for average queries :)

Cheers, eks
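To illustrate the "intervals of set bits" point: once the index is sorted on the field, a term's posting list collapses into a handful of dense runs, and each run needs only a start and a length. A minimal sketch (the class and method are illustrative, not a Lucene API):

{code}
import java.util.ArrayList;
import java.util.List;

// Sketch of interval encoding for a posting list whose set docIDs form dense
// runs after index sorting: each maximal run becomes one (start, length) pair.
public class IntervalEncoder {
  /** Encode a sorted docID list as [start, length] runs. */
  public static List<int[]> encode(int[] sortedDocs) {
    List<int[]> runs = new ArrayList<int[]>();
    int i = 0;
    while (i < sortedDocs.length) {
      int start = sortedDocs[i], len = 1;
      while (i + len < sortedDocs.length && sortedDocs[i + len] == start + len) {
        len++;                       // extend the current consecutive run
      }
      runs.add(new int[] { start, len });
      i += len;
    }
    return runs;                     // a 40M-doc contiguous run is one pair
  }
}
{code}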
Re: [jira] Commented: (LUCENE-1410) PFOR implementation
Eks,

> That is definitely the way to go, distribution-dependent encoding, where every term gets individual treatment. [...] At the moment we load them into RAM as Filters in IntervalSets. If that were possible in Lucene, we wouldn't bother with Filters (VInt decoding on such super-dense fields was killing us, even in RAMDirectory)...

You could try switching the Filter to OpenBitSet when that takes fewer bytes than SortedVIntList.

Regards,
Paul Elschot
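A rough sketch of the size comparison behind that suggestion; the per-entry cost of SortedVIntList varies from 1 to 5 bytes per set bit depending on the deltas, so the threshold below is only a lower bound, not a precise crossover:

{code}
// Back-of-envelope chooser: OpenBitSet costs ~maxDoc/8 bytes regardless of
// density; SortedVIntList costs at least 1 byte per set bit (vInt deltas).
static boolean openBitSetIsSmaller(int numSetBits, int maxDoc) {
  long bitSetBytes = (maxDoc + 7L) / 8;   // fixed cost, density-independent
  long vIntBytesMin = numSetBits;         // >= 1 byte per doc, often more
  return bitSetBytes <= vIntBytesMin;     // dense filters favor the bit set
}
{code}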
Re: [jira] Commented: (LUCENE-1410) PFOR implementation
Paul, the point I was trying to make with this example was extreme, but realistic. Imagine 100Mio docs, sorted on field user_rights, where a term user_rights:XX selects 40Mio of them (user rights...). To encode this, you need a format with just two integers (for more of such intervals you would need slightly more, but nevertheless much less than for OpenBitSet, VInts, PFor...). Strictly speaking this term is dense, but it is highly compressible and could be inlined with the pulsing trick...

cheers, eks

From: Paul Elschot paul.elsc...@xs4all.nl
To: java-dev@lucene.apache.org
Sent: Tuesday, 6 October, 2009 23:33:03
Subject: Re: [jira] Commented: (LUCENE-1410) PFOR implementation
> You could try switching the Filter to OpenBitSet when that takes fewer bytes than SortedVIntList. [...]
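Working the numbers above through makes the gap vivid (a sketch; the vInt figure assumes the best case of 1 byte per delta of 1):

{code}
// Worked example for 100M docs with one term matching a contiguous 40M-doc
// run after index sorting:
//   OpenBitSet:      100,000,000 bits  = 12.5 MB (fixed, density-independent)
//   vInt deltas:     ~40,000,000 bytes = ~40 MB  (>= 1 byte per posting)
//   (start, length): two ints          = 8 bytes
public class RangeCost {
  public static void main(String[] args) {
    long maxDoc = 100000000L, setBits = 40000000L;
    System.out.println("OpenBitSet bytes:   " + (maxDoc / 8));
    System.out.println("vInt-delta bytes:  ~" + setBits);
    System.out.println("(start,len) bytes:  " + (2 * 4));
  }
}
{code}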
Re: [jira] Commented: (LUCENE-1410) PFOR implementation
If you drive this example further, in combination with flex indexing permitting a per-term postings format, I could imagine some nice tools for optimizeHard(): normal index construction works with defaults as planned for the solid mixed-performance case, and at the end you run optimizeHard(), where postings get resorted on such fields (basically enabling RLE encoding to work) and at the same time all other terms get the optimal encoding format for their postings... perfect for read-only indexes where you want to max performance and reduce index size.

From: eks dev eks...@yahoo.co.uk
To: java-dev@lucene.apache.org
Sent: Tuesday, 6 October, 2009 23:59:12
Subject: Re: [jira] Commented: (LUCENE-1410) PFOR implementation
> Paul, the point I was trying to make with this example was extreme, but realistic. [...]
Re: [jira] Commented: (LUCENE-1410) PFOR implementation
On Tuesday 06 October 2009 23:59:12 eks dev wrote:
> Paul, the point I was trying to make with this example was extreme, but realistic. Imagine 100Mio docs, sorted on field user_rights, where a term user_rights:XX selects 40Mio of them. [...]

Well, I've been considering adding compressed consecutive ranges to SortedVIntList, but I did not get further than considering. This sounds like the perfect use case for that.

Regards,
Paul Elschot
[jira] Updated: (LUCENE-1856) Remove Hits
[ https://issues.apache.org/jira/browse/LUCENE-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Busch updated LUCENE-1856: -- Attachment: lucene-1856.patch

Removes Hits from core and all contribs. All core and contrib tests pass. I'll have to commit some changes also to the bw-compat branch.

Remove Hits
---
Key: LUCENE-1856
URL: https://issues.apache.org/jira/browse/LUCENE-1856
Project: Lucene - Java
Issue Type: Task
Components: Search
Reporter: Michael Busch
Assignee: Michael Busch
Priority: Minor
Fix For: 3.0
Attachments: lucene-1856.patch

LUCENE-1290 removed all references to Hits from core. Most work to be done here is to remove all references from the contrib modules and some new ones that crept into core after 1290.
Removing deprecated classes
Hi all, I've attached a patch to LUCENE-1856, which removes Hits. I'm not sure if someone has uncommitted big 3.0 patches that I'll mess up if I commit 1856? While working on 1856 I realized how tedious this stuff is! So Uwe, Mark & Co, let me know if you want me to wait with committing my patch!

Michael
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12762828#action_12762828 ] Mark Miller commented on LUCENE-1458: -

bq. Hey! I'm not so old :) But yeah I still cling to emacs.

Can you say both of those things in the same breath? Just how long did it take to get that PhD... I'd look it up and guesstimate your age, but I think MIT still has my IP blocked from back when I was applying to colleges. So I'm going with the "uses emacs" guesstimate.

bq. Hey, I know people who still cling to vi!

vi is the only one I can halfway use - I know 3 commands - edit mode, leave edit mode, and save. And every now and then I accidentally delete a whole line. When I make a change that I don't want to save, I have to kill the power.

The patch is in a bit of an unpatchable state ;) I think I know what editor to blame... Pico! Our old friend, the $id, is messing up WildcardTermEnum - no problem, I can fix that... But also, NumericUtils is unpatched, and Codec is missing, along with most of the classes from the codecs packages! This looks like my work :) My only conclusion is that you're one of those guys that can write the whole program once without even running it - and then it works perfectly on the first go. That's the only way I can explain those classes in the wrong package previously as well :) No bug hunting tonight :(
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12762831#action_12762831 ] Mark Miller commented on LUCENE-1458: -

Nope - something else - looking through the patch I see the files I want - a second attempt at patching has gone over better. A couple errors still, but stuff I think I can fix so that I can at least look it over. False alarm. My patcher wonked out or something. I can resolve the few errors that popped up this time. Sweet.
[jira] Issue Comment Edited: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12762831#action_12762831 ] Mark Miller edited comment on LUCENE-1458 at 10/6/09 6:23 PM: --

Nope - something else - looking through the patch I see the files I want - a second attempt at patching has gone over better. A couple errors still, but stuff I think I can fix so that I can at least look it over. False alarm. My patcher wonked out or something. I can resolve the few errors that popped up this time. Sweet.

*edit* Just for reference - not sure what happened the first time - my patch preview looked the same both times (it was only complaining about the $id), but it completely failed on attempt one and worked on attempt two. The only issue now appears to be that you have switched deletedDocs from BitVector to Bits - but only halfway, so it's broken in a dozen places. Not sure what you are doing about size() and whatnot, so I'm just gonna read around.

was (Author: markrmil...@gmail.com): [previous revision of this comment]
[jira] Issue Comment Edited: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12762831#action_12762831 ] Mark Miller edited comment on LUCENE-1458 at 10/6/09 6:34 PM: --

Nope - something else - looking through the patch I see the files I want - a second attempt at patching has gone over better. A couple errors still, but stuff I think I can fix so that I can at least look it over. False alarm. My patcher wonked out or something. I can resolve the few errors that popped up this time. Sweet.

*edit* Just for reference - not sure what happened the first time - my patch preview looked the same both times (it was only complaining about the $id), but it completely failed on attempt one and worked on attempt two. The only issue now appears to be that you have switched deletedDocs from BitVector to Bits - but only halfway, so it's broken in a dozen places. Not sure what you are doing about size() and whatnot, so I'm just gonna read around.

*edit* Yes - I found it - BitVector was supposed to implement Bits - which was in the patch ... this patch just did not want to apply. I guess it was right, but Eclipse just did not want it to take ...

was (Author: markrmil...@gmail.com): [previous revision of this comment]
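For readers following along, a minimal sketch of the abstraction being discussed; the exact flex-branch signatures may differ, so the shapes below are assumed from context rather than taken from the patch:

{code}
// Assumed shape of the flex deletedDocs abstraction: a minimal random-access
// bits interface that BitVector implements, so readers can consume deleted
// docs through Bits without depending on BitVector directly.
public interface Bits {
  boolean get(int index);
}

// Simplified BitVector exposing its packed bytes through Bits:
class BitVector implements Bits {
  private final byte[] bits;
  BitVector(int n) { bits = new byte[(n >> 3) + 1]; }
  public void set(int bit) { bits[bit >> 3] |= 1 << (bit & 7); }
  public boolean get(int bit) { return (bits[bit >> 3] & (1 << (bit & 7))) != 0; }
}
{code}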
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12762842#action_12762842 ] Mark Miller commented on LUCENE-1458: -

Bah - all this huffing and puffing over the patch and I'm too sick to stay up late anyway. Have you started benching at all? I'm seeing like a 40-50% drop in same-reader search benches with standard, sep, and pulsing. Like 80% with intblock.
RE: Removing deprecated classes
Hi Mark, no problem, go forward. I am on a trip until Saturday evening, so no problems.

Uwe
-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

-----Original Message-----
From: Michael Busch [mailto:busch...@gmail.com]
Sent: Wednesday, October 07, 2009 2:17 AM
To: java-dev@lucene.apache.org
Subject: Removing deprecated classes

> Hi all, I've attached a patch to LUCENE-1856, which removes Hits. [...]
wildcardquery rewrite()
Someone asked this question on the user list: http://www.lucidimagination.com/search/document/6f38de391b242102/prefixquery_vs_wildcardquery

It made me look at the wildcard rewrite(), where I see this:

if (!termContainsWildcard) return new TermQuery(getTerm());

Is it a problem that the boost is not preserved in this special case? Is it also a problem that if the user sets the default MultiTermQuery rewriteMethod to, say, CONSTANT_SCORE_FILTER_REWRITE, this rewritten TermQuery isn't wrapped with a constant score? Sorry if it seems a bit nitpicky; really the issue is that I want to do the right thing for a more complex query I am working on, but I don't want to overkill either.

--
Robert Muir rcm...@gmail.com
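A hedged sketch of what a boost-preserving version of that special case could look like; this is a fragment meant to live inside WildcardQuery (termContainsWildcard, getTerm(), getBoost() are its existing members), not the committed fix:

{code}
// Inside WildcardQuery (sketch): preserve the user's boost when the pattern
// degenerates to a plain term. Whether this case should also respect the
// configured rewrite method (e.g. constant-score wrapping) is the open
// question raised above.
public Query rewrite(IndexReader reader) throws IOException {
  if (!termContainsWildcard) {
    Query result = new TermQuery(getTerm());
    result.setBoost(getBoost());   // carry the boost over
    return result;
  }
  return super.rewrite(reader);
}
{code}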
Re: wildcardquery rewrite()
Separately, perhaps we should consider doing the prefixquery rewrite here for wildcardquery. For example, SolrQueryParser will emit these 'wildcardqueries that should be prefixqueries' if you are using the new reverse stuff for leading wildcards: WildcardQuery(*foobar) -> WildcardQuery(U+0001raboof*). I don't think the prefix enumeration is really that much faster than the wildcard one, but still thought I would mention it.

On Tue, Oct 6, 2009 at 10:22 PM, Robert Muir rcm...@gmail.com wrote:
> Someone asked this question on the user list: [...]

--
Robert Muir rcm...@gmail.com
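A sketch of the detection this suggests (the helper class and method are illustrative, not committed code): a pattern whose only wildcard is a single trailing '*' is really a prefix, so it can be handed to PrefixQuery's enumeration instead.

{code}
import org.apache.lucene.index.Term;
import org.apache.lucene.search.PrefixQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.WildcardQuery;

public class WildcardRewriteUtil {
  /** Rewrite a wildcard term to a PrefixQuery when only a trailing '*' is present. */
  public static Query maybePrefix(Term t) {
    String text = t.text();
    int star = text.indexOf('*');
    // first '*' is the last char and there is no '?': it's really a prefix
    if (text.length() > 0 && star == text.length() - 1 && text.indexOf('?') == -1) {
      return new PrefixQuery(new Term(t.field(), text.substring(0, star)));
    }
    return new WildcardQuery(t);
  }
}
{code}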
Re: Removing deprecated classes
Cool, I'll commit 1856 soon then. Thanks!

Michael

On 10/6/09 7:12 PM, Uwe Schindler wrote:
> Hi Mark, no problem, go forward. I am on a trip until Saturday evening, so no problems. [...]
[jira] Updated: (LUCENE-1085) search.function should support all capabilities of Solr's search.function
[ https://issues.apache.org/jira/browse/LUCENE-1085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Busch updated LUCENE-1085: -- Fix Version/s: (was: 3.0) 3.1

search.function should support all capabilities of Solr's search.function
-
Key: LUCENE-1085
URL: https://issues.apache.org/jira/browse/LUCENE-1085
Project: Lucene - Java
Issue Type: Improvement
Components: Search
Reporter: Doron Cohen
Priority: Minor
Fix For: 3.1

Lucene's search.function currently lacks capabilities that Solr needs, so Solr maintains its own version of this package. Enhance Lucene's search.function so that Solr can move to use it, and avoid this redundancy.
[jira] Resolved: (LUCENE-1856) Remove Hits
[ https://issues.apache.org/jira/browse/LUCENE-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Busch resolved LUCENE-1856. --- Resolution: Fixed

Committed revision 822587.