[jira] Resolved: (LUCENE-486) Core Test should not have dependencies on the Demo code

2009-10-06 Thread Michael Busch (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Busch resolved LUCENE-486.
--

Resolution: Fixed

Committed revision 822139.

 Core Test should not have dependencies on the Demo code
 ---

 Key: LUCENE-486
 URL: https://issues.apache.org/jira/browse/LUCENE-486
 Project: Lucene - Java
  Issue Type: Test
  Components: Build
Affects Versions: 1.4
Reporter: Grant Ingersoll
Assignee: Michael Busch
Priority: Trivial
 Fix For: 3.0

 Attachments: FileDocument.java, lucene-486.patch, testdoc.patch


 The TestDoc.java test file has a dependency on the demo FileDocument code.
 Some of us don't keep the demo code around after downloading, so this breaks
 the build.
 A patch will be along shortly.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing

2009-10-06 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12762573#action_12762573
 ] 

Michael McCandless commented on LUCENE-1458:


Whoa, thanks for the sudden sprint, Mark!

bq. Come on old man, stop clinging to emacs

Hey!  I'm not so old :) But yeah I still cling to emacs.  Hey, I know
people who still cling to vi!

{quote}
I didn't really look at the code, but some stuff I noticed:

java 6 in pfor Arrays.copy

skiplist stuff in codecs still have package of index - not sure what is going 
on there - changed them

in IndexWriter: 
+ // Mark: read twice?
segmentInfos.read(directory);
+ segmentInfos.read(directory, codecs);
{quote}

Excellent catches!  None of these is right.

bq. (since you don't include contrib in the tar)

Gak, sorry.  I have a bunch of mods there, cutting over to flex API.

bq. You left getEnum(IndexReader reader) in the MultiTerm queries, but not in 
PrefixQuery - just checkin'.

Woops, for back compat I think we need to leave it in (it's a
protected method), deprecated.  I'll put it back if you haven't.

bq. I guess TestBackwardsCompatibility.java has been removed from trunk or 
something? kept it here for now.

Eek, it shouldn't be -- indeed it is.  When did that happen?  We
should fix this (separately from this issue!).

Do you have more fixes coming?  If so, I'll let you sprint some more; else, 
I'll merge in, add contrib & back-compat branch, and post a new patch!  Thanks :)


 Further steps towards flexible indexing
 ---

 Key: LUCENE-1458
 URL: https://issues.apache.org/jira/browse/LUCENE-1458
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Index
Affects Versions: 2.9
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
 Attachments: LUCENE-1458-back-compat.patch, 
 LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, 
 LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, 
 LUCENE-1458-back-compat.patch, LUCENE-1458.patch, LUCENE-1458.patch, 
 LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, 
 LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.tar.bz2, 
 LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, 
 LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2


 I attached a very rough checkpoint of my current patch, to get early
 feedback.  All tests pass, though back-compat tests don't pass due to
 changes to package-private APIs plus certain bugs in tests that
 happened to work (eg calling TermPositions.nextPosition() too many times,
 which the new API asserts against).
 [Aside: I think, when we commit changes to package-private APIs such
 that back-compat tests don't pass, we could go back, make a branch on
 the back-compat tag, commit changes to the tests to use the new
 package-private APIs on that branch, then fix the nightly build to use the
 tip of that branch?]
 There's still plenty to do before this is committable! This is a
 rather large change:
   * Switches to a new, more efficient terms dict format.  This still
 uses tii/tis files, but the tii only stores term & long offset
 (not a TermInfo).  At seek points, tis encodes term & freq/prox
 offsets absolutely instead of with deltas.  Also, tis/tii
 are structured by field, so we don't have to record the field number
 in every term.
 .
 On the first 1 M docs of Wikipedia, the tii file is 36% smaller (0.99 MB
 -> 0.64 MB) and the tis file is 9% smaller (75.5 MB -> 68.5 MB).
 .
 RAM usage when loading the terms dict index is significantly less,
 since we only load an array of offsets and an array of String (no
 more TermInfo array).  It should be faster to init too.
 .
 This part is basically done.
   * Introduces a modular reader codec that strongly decouples the terms dict
 from the docs/positions readers.  EG there is no more TermInfo used
 when reading the new format.
 .
 There's nice symmetry now between reading & writing in the codec
 chain -- the current docs/prox format is captured in:
 {code}
 FormatPostingsTermsDictWriter/Reader
 FormatPostingsDocsWriter/Reader (.frq file) and
 FormatPostingsPositionsWriter/Reader (.prx file).
 {code}
 This part is basically done.
   * Introduces a new flex API for iterating through the fields,
 terms, docs and positions:
 {code}
 FieldProducer -> TermsEnum -> DocsEnum -> PostingsEnum
 {code}
 This replaces TermEnum/Docs/Positions.  SegmentReader emulates the
 old API on top of the new API to keep back-compat.
 
 Next steps:
   * Plug in new codecs (pulsing, pfor) to exercise the modularity /
 fix any hidden assumptions.
   * Expose the new API out of IndexReader, deprecate the old API but
 emulate the old API on top of the new one, switch all core/contrib
 users to the new API.
   * Maybe switch to AttributeSources as the base class for TermsEnum,
 DocsEnum, PostingsEnum -- this would give readers API flexibility
 (not just index-file-format flexibility).  EG if someone wanted
 to store payload at the term-doc level instead of the
 term-doc-position level, you could just add a new attribute.
   * Test performance & iterate.
 
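The slimmer terms-dict index described above (just term & long offset per indexed term, rather than a full TermInfo) amounts to a pair of parallel arrays plus a binary search for the seek point. A rough illustrative sketch; the class and method names here are invented, not Lucene's actual API:

```java
// Illustrative sketch (not Lucene's real classes) of a terms-dict index that
// keeps only a String[] / long[] pair in RAM instead of a TermInfo per entry.
import java.util.Arrays;

public class TermsIndexSketch {
    final String[] indexTerms;   // every Nth term, in sorted order
    final long[] offsets;        // tis-file offset of each indexed term

    TermsIndexSketch(String[] indexTerms, long[] offsets) {
        this.indexTerms = indexTerms;
        this.offsets = offsets;
    }

    // Offset of the greatest indexed term <= target: the seek point from
    // which a reader would scan forward in the tis file.
    long seekOffset(String target) {
        int i = Arrays.binarySearch(indexTerms, target);
        if (i < 0) i = -i - 2;   // not found: insertion point minus one
        return i < 0 ? offsets[0] : offsets[i];
    }
}
```

A reader would seek to the returned offset in the tis file and scan forward; since only a String and a long survive per indexed term, the in-RAM index is far smaller than an array of TermInfo objects.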

[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing

2009-10-06 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12762576#action_12762576
 ] 

Michael McCandless commented on LUCENE-1458:


bq.  One of the common statistics one needs to implement some more advanced 
scoring approaches is the average document length. Is this patch far enough 
along that I could take a look at it and think about how one might do this?

Well, thinking through how you'd do this... likely you'd want to store
the avg length (in tokens), eg as a single float per field per
segment, right?  The natural place to store this would be in the
FieldInfos, I think.  Unfortunately, this patch doesn't yet add
extensibility to FieldInfos.

And you'd need a small customization to the indexing chain to
compute this when indexing new docs, which is already doable today
(though, package private).

But then on merging segments, you'd need an extension point, which we
don't have today, to recompute the avg.  Hmm: how would you handle
deleted docs?  Would you want to go back to the field length for every
doc & recompute the average?  (Which'd mean you'd need the per-doc,
per-field length, not just the averages.)

Unfortunately, this patch doesn't yet address things like customizing
what's stored in FieldInfo or SegmentInfo, nor customizing what
happens during merging (though it takes us a big step closer to this).
I think we need both of these to finish flexible indexing, but I'm
thinking at this point that these should really be tackled in followon
issue(s).  This issue is already ridiculously massive.
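To make the discussion above concrete, here is a minimal sketch of what such per-segment statistics could look like, assuming FieldInfos (or some future extension point) could carry them. All names are hypothetical; the patch does not yet provide this extensibility:

```java
// Hypothetical per-segment field-length statistics: a token total and doc
// count per field, from which an average field length can be derived.
import java.util.HashMap;
import java.util.Map;

public class FieldLengthStats {
    static class Stats { long totalTokens; int docCount; }

    final Map<String, Stats> byField = new HashMap<>();

    // Called once per (doc, field) at index time with that field's token count.
    void addDoc(String field, int tokenCount) {
        Stats s = byField.computeIfAbsent(field, f -> new Stats());
        s.totalTokens += tokenCount;
        s.docCount++;
    }

    double averageLength(String field) {
        Stats s = byField.get(field);
        return (s == null || s.docCount == 0) ? 0.0
            : (double) s.totalTokens / s.docCount;
    }

    // Segment merge: averages can't be combined directly, but the raw
    // (totalTokens, docCount) pairs can simply be summed.
    void merge(FieldLengthStats other) {
        other.byField.forEach((f, o) -> {
            Stats s = byField.computeIfAbsent(f, k -> new Stats());
            s.totalTokens += o.totalTokens;
            s.docCount += o.docCount;
        });
    }
}
```

Because a mean cannot be rebuilt from other means alone, keeping the raw pair per field makes merging a simple sum; handling deleted docs exactly would still require per-doc field lengths, as noted above.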



Re: [jira] Commented: (LUCENE-1458) Further steps towards flexible indexing

2009-10-06 Thread Michael McCandless
On Tue, Oct 6, 2009 at 5:54 AM, Michael McCandless (JIRA)
j...@apache.org wrote:
 bq. I guess TestBackwardsCompatibility.java has been removed from trunk or 
 something? kept it here for now.

 Eek, it shouldn't be -- indeed it is.  When did that happen?  We
 should fix this (separately from this issue!).

I'm working on restoring TestBackCompat on trunk...

Mike




RE: svn commit: r822203 - /lucene/java/trunk/src/test/org/apache/lucene/index/TestBackwardsCompatibility.java

2009-10-06 Thread Uwe Schindler
Sorry, I think this was one test too many to remove :-)

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de


 -Original Message-
 From: mikemcc...@apache.org [mailto:mikemcc...@apache.org]
 Sent: Tuesday, October 06, 2009 12:33 PM
 To: java-comm...@lucene.apache.org
 Subject: svn commit: r822203 -
 /lucene/java/trunk/src/test/org/apache/lucene/index/TestBackwardsCompatibi
 lity.java
 
 Author: mikemccand
 Date: Tue Oct  6 10:32:43 2009
 New Revision: 822203
 
 URL: http://svn.apache.org/viewvc?rev=822203&view=rev
 Log:
 restore TestBackwardsCompatibility
 
 Added:
 
 lucene/java/trunk/src/test/org/apache/lucene/index/TestBackwardsCompatibil
 ity.java   (with props)
 
 Added:
 lucene/java/trunk/src/test/org/apache/lucene/index/TestBackwardsCompatibil
 ity.java
 URL:
 http://svn.apache.org/viewvc/lucene/java/trunk/src/test/org/apache/lucene/
 index/TestBackwardsCompatibility.java?rev=822203&view=auto
 ==
 
 ---
 lucene/java/trunk/src/test/org/apache/lucene/index/TestBackwardsCompatibil
 ity.java (added)
 +++
 lucene/java/trunk/src/test/org/apache/lucene/index/TestBackwardsCompatibil
 ity.java Tue Oct  6 10:32:43 2009
 @@ -0,0 +1,530 @@
 +package org.apache.lucene.index;
 +
 +/**
 + * Licensed to the Apache Software Foundation (ASF) under one or more
 + * contributor license agreements.  See the NOTICE file distributed with
 + * this work for additional information regarding copyright ownership.
 + * The ASF licenses this file to You under the Apache License, Version 2.0
 + * (the "License"); you may not use this file except in compliance with
 + * the License.  You may obtain a copy of the License at
 + *
 + * http://www.apache.org/licenses/LICENSE-2.0
 + *
 + * Unless required by applicable law or agreed to in writing, software
 + * distributed under the License is distributed on an "AS IS" BASIS,
 + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 + * See the License for the specific language governing permissions and
 + * limitations under the License.
 + */
 +
 +import java.io.BufferedOutputStream;
 +import java.io.File;
 +import java.io.FileOutputStream;
 +import java.io.IOException;
 +import java.io.InputStream;
 +import java.io.OutputStream;
 +import java.util.Arrays;
 +import java.util.Enumeration;
 +import java.util.List;
 +import java.util.zip.ZipEntry;
 +import java.util.zip.ZipFile;
 +
 +import org.apache.lucene.analysis.WhitespaceAnalyzer;
 +import org.apache.lucene.document.Document;
 +import org.apache.lucene.document.Field;
 +import org.apache.lucene.search.IndexSearcher;
 +import org.apache.lucene.search.ScoreDoc;
 +import org.apache.lucene.search.TermQuery;
 +import org.apache.lucene.store.Directory;
 +import org.apache.lucene.store.FSDirectory;
 +import org.apache.lucene.util.LuceneTestCase;
 +import org.apache.lucene.util._TestUtil;
 +
 +/*
 +  Verify we can read the pre-2.1 file format, do searches
 +  against it, and add documents to it.
 +*/
 +
 +public class TestBackwardsCompatibility extends LuceneTestCase
 +{
 +
 +  // Uncomment these cases & run them on an older Lucene
 +  // version, to generate an index to test backwards
 +  // compatibility.  Then, cd to build/test/index.cfs and
 +  // run "zip index.VERSION.cfs.zip *"; cd to
 +  // build/test/index.nocfs and run "zip
 +  // index.VERSION.nocfs.zip *".  Then move those 2 zip
 +  // files to your trunk checkout and add them to the
 +  // oldNames array.
 +
 +  /*
 +  public void testCreatePreLocklessCFS() throws IOException {
 +    createIndex("index.cfs", true);
 +  }
 +
 +  public void testCreatePreLocklessNoCFS() throws IOException {
 +    createIndex("index.nocfs", false);
 +  }
 +  */
 +
 +  /* Unzips dirName + ".zip" -> dirName, removing dirName
 +     first */
 +  public void unzip(String zipName, String destDirName) throws IOException {
 +
 +    Enumeration entries;
 +    ZipFile zipFile;
 +    zipFile = new ZipFile(zipName + ".zip");
 +
 +    entries = zipFile.entries();
 +
 +    String dirName = fullDir(destDirName);
 +
 +    File fileDir = new File(dirName);
 +    rmDir(destDirName);
 +
 +    fileDir.mkdir();
 +
 +    while (entries.hasMoreElements()) {
 +      ZipEntry entry = (ZipEntry) entries.nextElement();
 +
 +      InputStream in = zipFile.getInputStream(entry);
 +      OutputStream out = new BufferedOutputStream(new
 +        FileOutputStream(new File(fileDir, entry.getName())));
 +
 +      byte[] buffer = new byte[8192];
 +      int len;
 +      while ((len = in.read(buffer)) >= 0) {
 +        out.write(buffer, 0, len);
 +      }
 +
 +      in.close();
 +      out.close();
 +    }
 +
 +    zipFile.close();
 +  }
 +
 +  public void testCreateCFS() throws IOException {
 +    String dirName = "testindex.cfs";
 +    createIndex(dirName, true);
 +    rmDir(dirName);
 +  }
 +
 +  public void testCreateNoCFS() throws IOException {
 +    String dirName = "testindex.nocfs";
 +

Re: [jira] Commented: (LUCENE-1458) Further steps towards flexible indexing

2009-10-06 Thread Mark Miller
Merge away - still sleeping over here. Would love to look more again  
but don't know when, so no use waiting on me.


- Mark

http://www.lucidimagination.com (mobile)

On Oct 6, 2009, at 5:54 AM, Michael McCandless (JIRA)  
j...@apache.org wrote:





[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing

2009-10-06 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12762590#action_12762590
 ] 

Uwe Schindler commented on LUCENE-1458:
---

{quote}
bq. I guess TestBackwardsCompatibility.java has been removed from trunk or 
something? kept it here for now.

Eek, it shouldn't be - indeed it is. When did that happen? We
should fix this (separately from this issue!).
{quote}

My fault, I removed it during the backwards-tests removal on Saturday. If we do 
not remove DateTools/DateField for 3.0 (we may need to leave it in for index 
compatibility), I will restore these tests, too. It's easy with TortoiseSVN, 
and you can also preserve the history (using the svn:mergeinfo prop).

I have this on my list when going forward with removing the old TokenStream API.



[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing

2009-10-06 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12762592#action_12762592
 ] 

Michael McCandless commented on LUCENE-1458:


bq. It's easy with TortoiseSVN and you can also preserve the history (using 
svn:mergeinfo prop).

Ahh -- can you do this for TestBackwardsCompatibility?  I restored it but 
lost all history.  Thanks.





[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing

2009-10-06 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12762600#action_12762600
 ] 

Uwe Schindler commented on LUCENE-1458:
---

Done. I also did it for the BW branch, but didn't create a tag yet; creating 
the tag with the next bigger patch is enough (no need to do it now).

What I have done: an svn copy from the older revision to the same path :-)

 Further steps towards flexible indexing
 ---

 Key: LUCENE-1458
 URL: https://issues.apache.org/jira/browse/LUCENE-1458
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Index
Affects Versions: 2.9
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
 Attachments: LUCENE-1458-back-compat.patch, 
 LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, 
 LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, 
 LUCENE-1458-back-compat.patch, LUCENE-1458.patch, LUCENE-1458.patch, 
 LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, 
 LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.tar.bz2, 
 LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, 
 LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2


 I attached a very rough checkpoint of my current patch, to get early
 feedback.  All tests pass, though back compat tests don't pass due to
 changes to package-private APIs plus certain bugs in tests that
 happened to work (eg call TermPostions.nextPosition() too many times,
 which the new API asserts against).
 [Aside: I think, when we commit changes to package-private APIs such
 that back-compat tests don't pass, we could go back, make a branch on
 the back-compat tag, commit changes to the tests to use the new
 package private APIs on that branch, then fix nightly build to use the
 tip of that branch?]
 There's still plenty to do before this is committable! This is a
 rather large change:
   * Switches to a new more efficient terms dict format.  This still
 uses tii/tis files, but the tii only stores term & long offset
 (not a TermInfo).  At seek points, tis encodes term & freq/prox
 offsets absolutely instead of with deltas.  Also, tis/tii
 are structured by field, so we don't have to record field number
 in every term.
 .
 On first 1 M docs of Wikipedia, tii file is 36% smaller (0.99 MB
 -> 0.64 MB) and tis file is 9% smaller (75.5 MB -> 68.5 MB).
 .
 RAM usage when loading terms dict index is significantly less
 since we only load an array of offsets and an array of String (no
 more TermInfo array).  It should be faster to init too.
 .
 This part is basically done.
   * Introduces modular reader codec that strongly decouples terms dict
 from docs/positions readers.  EG there is no more TermInfo used
 when reading the new format.
 .
 There's nice symmetry now between reading & writing in the codec
 chain -- the current docs/prox format is captured in:
 {code}
 FormatPostingsTermsDictWriter/Reader
 FormatPostingsDocsWriter/Reader (.frq file) and
 FormatPostingsPositionsWriter/Reader (.prx file).
 {code}
 This part is basically done.
   * Introduces a new flex API for iterating through the fields,
 terms, docs and positions:
 {code}
 FieldProducer -> TermsEnum -> DocsEnum -> PostingsEnum
 {code}
 This replaces TermEnum/Docs/Positions.  SegmentReader emulates the
 old API on top of the new API to keep back-compat.
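The enumerator chain just described can be sketched as a toy, in-memory illustration. This is not Lucene's real API -- the class name, interfaces, and the backing TreeMap are assumptions made purely so the example is self-contained; it only shows the iteration pattern (fields hand out terms, terms hand out docs):

```java
import java.util.Iterator;
import java.util.Map;
import java.util.TreeMap;

/**
 * Toy sketch of a flex-style iteration chain (terms enum -> docs enum).
 * Names and the in-memory index are illustrative, not Lucene classes.
 */
public class FlexEnumSketch {
    // field -> term -> sorted docIDs
    final TreeMap<String, TreeMap<String, int[]>> index = new TreeMap<>();

    void add(String field, String term, int... docs) {
        index.computeIfAbsent(field, f -> new TreeMap<>()).put(term, docs);
    }

    interface TermsEnum { String next(); DocsEnum docs(); }
    interface DocsEnum { int nextDoc(); } // returns -1 when exhausted

    TermsEnum terms(String field) {
        Iterator<Map.Entry<String, int[]>> it =
            index.get(field).entrySet().iterator();
        return new TermsEnum() {
            Map.Entry<String, int[]> cur;
            public String next() {
                cur = it.hasNext() ? it.next() : null;
                return cur == null ? null : cur.getKey();
            }
            public DocsEnum docs() {
                final int[] d = cur.getValue();
                return new DocsEnum() {
                    int i = 0;
                    public int nextDoc() { return i < d.length ? d[i++] : -1; }
                };
            }
        };
    }

    public static void main(String[] args) {
        FlexEnumSketch idx = new FlexEnumSketch();
        idx.add("content", "aaa", 0, 2, 5);
        idx.add("content", "bbb", 1);
        TermsEnum te = idx.terms("content");
        for (String t = te.next(); t != null; t = te.next()) {
            DocsEnum de = te.docs();
            StringBuilder sb = new StringBuilder(t + ":");
            for (int d = de.nextDoc(); d != -1; d = de.nextDoc())
                sb.append(' ').append(d);
            System.out.println(sb);
        }
    }
}
```

The point of the decoupling is visible even in the toy: a different codec only has to supply its own TermsEnum/DocsEnum implementations, and callers iterate the same way.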
 
 Next steps:
   * Plug in new codecs (pulsing, pfor) to exercise the modularity /
 fix any hidden assumptions.
   * Expose new API out of IndexReader, deprecate old API but emulate
 old API on top of new one, switch all core/contrib users to the
 new API.
   * Maybe switch to AttributeSources as the base class for TermsEnum,
 DocsEnum, PostingsEnum -- this would give readers API flexibility
 (not just index-file-format flexibility).  EG if someone wanted
 to store payload at the term-doc level instead of
 term-doc-position level, you could just add a new attribute.
   * Test performance & iterate.
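The "absolute at seek points, deltas elsewhere" encoding the terms-dict change describes can be sketched in a self-contained way. This is not Lucene's on-disk format -- the class name, the INTERVAL value, and the use of a plain List are assumptions for illustration; it only shows why absolute values at seek points let a reader jump in without scanning from the start:

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Illustrative sketch (not Lucene's actual tis/tii format): a sorted list
 * of file offsets is written as deltas, except that every INTERVAL-th
 * entry (a "seek point") is written absolutely, so decoding can begin at
 * any seek point with no earlier state.
 */
public class SeekPointEncoding {
    static final int INTERVAL = 4; // hypothetical index interval

    static List<Long> encode(long[] offsets) {
        List<Long> out = new ArrayList<>();
        long prev = 0;
        for (int i = 0; i < offsets.length; i++) {
            if (i % INTERVAL == 0) {
                out.add(offsets[i]);        // absolute at seek point
            } else {
                out.add(offsets[i] - prev); // delta otherwise
            }
            prev = offsets[i];
        }
        return out;
    }

    // Decode starting at a seek point; needs no state from earlier entries.
    static long[] decodeFrom(List<Long> encoded, int seekPoint, int count) {
        long[] out = new long[count];
        long prev = 0;
        for (int i = 0; i < count; i++) {
            int pos = seekPoint + i;
            long v = encoded.get(pos);
            out[i] = (pos % INTERVAL == 0) ? v : prev + v;
            prev = out[i];
        }
        return out;
    }

    public static void main(String[] args) {
        long[] offsets = {10, 25, 31, 60, 100, 140, 141, 200, 260};
        List<Long> enc = encode(offsets);
        // Jump straight to the seek point at index 4, no prior state needed:
        long[] mid = decodeFrom(enc, 4, 4);
        System.out.println(mid[0] + "," + mid[3]); // 100,200
    }
}
```

Deltas keep most entries small (friendly to VInt-style compression), while the periodic absolute values are what make a seekable index over the file possible.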




[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing

2009-10-06 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12762633#action_12762633
 ] 

Michael McCandless commented on LUCENE-1458:


bq. What I have done: svn copy from the older revision to the same path

Excellent, thanks!  It had a few problems (it was still trying to use 
deprecated APIs, some of which were gone) -- I just committed fixes.




RE: svn commit: r822284 - /lucene/java/trunk/src/test/org/apache/lucene/index/TestBackwardsCompatibility.java

2009-10-06 Thread Uwe Schindler
Can you add this patch to backwards, too? I forgot that some of the
backwards changes also applied to BW. For completeness, though, I'm not sure
whether a tag is also needed.

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de


 -Original Message-
 From: mikemcc...@apache.org [mailto:mikemcc...@apache.org]
 Sent: Tuesday, October 06, 2009 4:13 PM
 To: java-comm...@lucene.apache.org
 Subject: svn commit: r822284 -
 /lucene/java/trunk/src/test/org/apache/lucene/index/TestBackwardsCompatibi
 lity.java
 
 Author: mikemccand
 Date: Tue Oct  6 14:12:46 2009
 New Revision: 822284
 
 URL: http://svn.apache.org/viewvc?rev=822284&view=rev
 Log:
 fix TestBackwardsCompability to not use deprecated APIs
 
 Modified:
 
 lucene/java/trunk/src/test/org/apache/lucene/index/TestBackwardsCompatibil
 ity.java
 
 Modified:
 lucene/java/trunk/src/test/org/apache/lucene/index/TestBackwardsCompatibil
 ity.java
 URL:
 http://svn.apache.org/viewvc/lucene/java/trunk/src/test/org/apache/lucene/
 index/TestBackwardsCompatibility.java?rev=822284&r1=822283&r2=822284&view=
 diff
 ==============================================================================
 
 ---
 lucene/java/trunk/src/test/org/apache/lucene/index/TestBackwardsCompatibil
 ity.java (original)
 +++
 lucene/java/trunk/src/test/org/apache/lucene/index/TestBackwardsCompatibil
 ity.java Tue Oct  6 14:12:46 2009
 @@ -158,11 +158,7 @@
  for(int i=0;i<oldNames.length;i++) {
    String dirName = "src/test/org/apache/lucene/index/index." +
 oldNames[i];
unzip(dirName, oldNames[i]);
 -  changeIndexNoAdds(oldNames[i], true);
 -  rmDir(oldNames[i]);
 -
 -  unzip(dirName, oldNames[i]);
 -  changeIndexNoAdds(oldNames[i], false);
 +  changeIndexNoAdds(oldNames[i]);
rmDir(oldNames[i]);
  }
}
 @@ -171,11 +167,7 @@
  for(int i=0;i<oldNames.length;i++) {
    String dirName = "src/test/org/apache/lucene/index/index." +
 oldNames[i];
unzip(dirName, oldNames[i]);
 -  changeIndexWithAdds(oldNames[i], true);
 -  rmDir(oldNames[i]);
 -
 -  unzip(dirName, oldNames[i]);
 -  changeIndexWithAdds(oldNames[i], false);
 +  changeIndexWithAdds(oldNames[i]);
rmDir(oldNames[i]);
  }
}
 @@ -196,7 +188,7 @@
  dirName = fullDir(dirName);
 
  Directory dir = FSDirectory.open(new File(dirName));
 -IndexSearcher searcher = new IndexSearcher(dir);
 +IndexSearcher searcher = new IndexSearcher(dir, true);
  IndexReader reader = searcher.getIndexReader();
 
  _TestUtil.checkIndex(dir);
 @@ -267,14 +259,14 @@
 
   /* Open pre-lockless index, add docs, do a delete &
 * setNorm, and search */
 -  public void changeIndexWithAdds(String dirName, boolean autoCommit)
 throws IOException {
 +  public void changeIndexWithAdds(String dirName) throws IOException {
  String origDirName = dirName;
  dirName = fullDir(dirName);
 
  Directory dir = FSDirectory.open(new File(dirName));
 
  // open writer
 -IndexWriter writer = new IndexWriter(dir, autoCommit, new
 WhitespaceAnalyzer(), false);
 +IndexWriter writer = new IndexWriter(dir, new WhitespaceAnalyzer(),
 false, IndexWriter.MaxFieldLength.UNLIMITED);
 
  // add 10 docs
  for(int i=0;i<10;i++) {
 @@ -288,11 +280,11 @@
  } else {
expected = 46;
  }
 -assertEquals("wrong doc count", expected, writer.docCount());
 +assertEquals("wrong doc count", expected, writer.maxDoc());
  writer.close();
 
  // make sure searching sees right # hits
 -IndexSearcher searcher = new IndexSearcher(dir);
 +IndexSearcher searcher = new IndexSearcher(dir, true);
  ScoreDoc[] hits = searcher.search(new TermQuery(new Term("content",
 "aaa")), null, 1000).scoreDocs;
  Document d = searcher.doc(hits[0].doc);
  assertEquals("wrong first document", "21", d.get("id"));
 @@ -301,7 +293,7 @@
 
  // make sure we can do delete & setNorm against this
  // pre-lockless segment:
 -IndexReader reader = IndexReader.open(dir);
 +IndexReader reader = IndexReader.open(dir, false);
  Term searchTerm = new Term("id", "6");
  int delCount = reader.deleteDocuments(searchTerm);
  assertEquals("wrong delete count", 1, delCount);
 @@ -309,7 +301,7 @@
  reader.close();
 
  // make sure they took:
 -searcher = new IndexSearcher(dir);
 +searcher = new IndexSearcher(dir, true);
  hits = searcher.search(new TermQuery(new Term("content", "aaa")),
 null, 1000).scoreDocs;
  assertEquals("wrong number of hits", 43, hits.length);
  d = searcher.doc(hits[0].doc);
 @@ -318,11 +310,11 @@
  searcher.close();
 
  // optimize
 -writer = new IndexWriter(dir, autoCommit, new WhitespaceAnalyzer(),
 false);
 +writer = new IndexWriter(dir, new WhitespaceAnalyzer(), false,
 IndexWriter.MaxFieldLength.UNLIMITED);
  writer.optimize();
  writer.close();
 
 -searcher = new IndexSearcher(dir);
 +searcher = new IndexSearcher(dir, true);
   

Re: svn commit: r822284 - /lucene/java/trunk/src/test/org/apache/lucene/index/TestBackwardsCompatibility.java

2009-10-06 Thread Michael McCandless
OK will do.

Mike

On Tue, Oct 6, 2009 at 10:23 AM, Uwe Schindler u...@thetaphi.de wrote:
 Can you add this patch to backwards, too? I forgot, that some of the
 backwards-changes also applied to BW, but for completeness, not sure, if a
 tag is also needed.

 -
 Uwe Schindler
 H.-H.-Meier-Allee 63, D-28213 Bremen
 http://www.thetaphi.de
 eMail: u...@thetaphi.de



RE: svn commit: r822284 - /lucene/java/trunk/src/test/org/apache/lucene/index/TestBackwardsCompatibility.java

2009-10-06 Thread Uwe Schindler
Thanks, and sorry for the extra work! I missed doing this after the svn copy :(

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de


 -Original Message-
 From: Michael McCandless [mailto:luc...@mikemccandless.com]
 Sent: Tuesday, October 06, 2009 4:37 PM
 To: java-dev@lucene.apache.org
 Subject: Re: svn commit: r822284 -
 /lucene/java/trunk/src/test/org/apache/lucene/index/TestBackwardsCompatibi
 lity.java
 
 OK will do.
 
 Mike
 
 On Tue, Oct 6, 2009 at 10:23 AM, Uwe Schindler u...@thetaphi.de wrote:
  Can you add this patch to backwards, too? I forgot, that some of the
  backwards-changes also applied to BW, but for completeness, not sure, if
 a
  tag is also needed.
 
  -
  Uwe Schindler
  H.-H.-Meier-Allee 63, D-28213 Bremen
  http://www.thetaphi.de
  eMail: u...@thetaphi.de
 
 

Re: svn commit: r822284 - /lucene/java/trunk/src/test/org/apache/lucene/index/TestBackwardsCompatibility.java

2009-10-06 Thread Michael McCandless
No problem!  It's exciting :)

Mike

On Tue, Oct 6, 2009 at 10:40 AM, Uwe Schindler u...@thetaphi.de wrote:
 Thanks, sorry for extra work! I missed to do this after the svn copy :(

 -
 Uwe Schindler
 H.-H.-Meier-Allee 63, D-28213 Bremen
 http://www.thetaphi.de
 eMail: u...@thetaphi.de


 -Original Message-
 From: Michael McCandless [mailto:luc...@mikemccandless.com]
 Sent: Tuesday, October 06, 2009 4:37 PM
 To: java-dev@lucene.apache.org
 Subject: Re: svn commit: r822284 -
 /lucene/java/trunk/src/test/org/apache/lucene/index/TestBackwardsCompatibi
 lity.java

 OK will do.

 Mike

 On Tue, Oct 6, 2009 at 10:23 AM, Uwe Schindler u...@thetaphi.de wrote:
  Can you add this patch to backwards, too? I forgot, that some of the
  backwards-changes also applied to BW, but for completeness, not sure, if
 a
  tag is also needed.
 
  -
  Uwe Schindler
  H.-H.-Meier-Allee 63, D-28213 Bremen
  http://www.thetaphi.de
  eMail: u...@thetaphi.de
 
 

[jira] Updated: (LUCENE-1458) Further steps towards flexible indexing

2009-10-06 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-1458:
---

Attachment: LUCENE-1458.patch

Uber-patch attached: started from Mark's patch (thanks!), added my contrib & 
back-compat branch changes.  All tests pass.

Also, I removed pfor from this issue.  I'll attach the pfor codec to 
LUCENE-1410.

Note that I didn't use svn move in generating the patch, so that the patch 
can be applied cleanly.  When it [finally] comes time to commit for real, I'll 
svn move so we preserve history.




Re: [jira] Updated: (LUCENE-1458) Further steps towards flexible indexing

2009-10-06 Thread Mark Miller
Michael McCandless (JIRA) wrote:
  [ 
 https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
  ]

 Michael McCandless updated LUCENE-1458:
 ---

 Attachment: LUCENE-1458.patch

 Uber-patch attached: started from Mark's patch (thanks!),
Anytime! Grunt work and I go together like Michael Bay and Uwe Boll.

Hope I can actually make a meaningful contribution to flexible indexing
at some point.


-- 
- Mark

http://www.lucidimagination.com







De-basing / re-basing docIDs, or how to effectively pass calculated values from a Scorer or Filter up to (Solr's) QueryComponent.process

2009-10-06 Thread Aaron McKee


In the code I'm working with, I generate a cache of calculated values as 
a by-product within a Filter.getDocIdSet implementation (and within a 
Query-ized version of the filter and its Scorer method). These values 
are keyed off the IndexReader's docID values, since that's all that's 
accessible at that level. Ultimately, however, I need to be able to 
access these values much higher up in the stack (Solr's 
QueryComponent.process method), so that I can inject the dynamic values 
into the response as a fake field. The IDs available here, however, are 
for the entire index and not just relative to the current IndexReader. 
I'm still fairly new to Lucene and I've been scratching my head a bit 
trying to find a reliable way to map these values into the same space, 
without having to hack up too many base classes. I noticed that there 
was a related discussion at:


http://issues.apache.org/jira/browse/LUCENE-1821?focusedCommentId=12745041#action_12745041

... but also a bit of disagreement on the suggested strategies. Ideally, 
I'm also hoping there's a strategy that won't require me to hack up too 
much of the core product; subclassing IndexSearcher in the way suggested 
would basically require me to change all of the various SearchComponents 
I use in Solr, and that sounds like it'd end up a real maintenance 
nightmare. I was looking at the Collector class as a possible solution, 
since it has knowledge of the docbase, but it looks like I'd then need 
to change every derived collector that the code ultimately uses, 
including the various anonymous Collectors in Solr; that also looks like 
it'd be a fairly ghoulish solution. I suppose I'm being wishful, or 
lazy, but is there a reasonable and reliable way to do this, without 
having to fork the core code? If not, any suggestion on the best 
strategy to accomplish this, without adding too much overhead every time 
I wanted to up-rev the core Lucene and/or Solr code to the latest version?


Thanks a ton,
Aaron





Re: De-basing / re-basing docIDs, or how to effectively pass calculated values from a Scorer or Filter up to (Solr's) QueryComponent.process

2009-10-06 Thread Yonik Seeley
Aaron, could you move this to solr-user?

-Yonik
http://www.lucidimagination.com


On Tue, Oct 6, 2009 at 11:22 AM, Aaron McKee ucbmc...@gmail.com wrote:





Re: De-basing / re-basing docIDs, or how to effectively pass calculated values from a Scorer or Filter up to (Solr's) QueryComponent.process

2009-10-06 Thread Earwin Burrfoot
Might still be a lucene-ish issue.
We already have getSequentialSubReaders() on IR; in my patched version
I augmented this with a public readerIndex() and getSubReaderStarts().
It's pretty much impossible to do any postprocessing on gathered hits
without at least one of these.
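
A minimal standalone sketch of the arithmetic Earwin describes: given each sequential sub-reader's maxDoc(), a top-level docID maps to a (sub-reader, local docID) pair via cumulative starts plus a binary search. The names computeStarts and readerIndex here are illustrative only, not actual Lucene API.

```java
// Sketch only: re-base a top-level docID onto the sub-reader it came from.
public class DocIdRebase {

    // starts[i] = sum of maxDocs[0..i-1]; the docID base of sub-reader i
    static int[] computeStarts(int[] maxDocs) {
        int[] starts = new int[maxDocs.length];
        int base = 0;
        for (int i = 0; i < maxDocs.length; i++) {
            starts[i] = base;
            base += maxDocs[i];
        }
        return starts;
    }

    // largest i such that starts[i] <= globalDoc (binary search)
    static int readerIndex(int globalDoc, int[] starts) {
        int lo = 0, hi = starts.length - 1;
        while (lo < hi) {
            int mid = (lo + hi + 1) >>> 1;
            if (starts[mid] <= globalDoc) lo = mid; else hi = mid - 1;
        }
        return lo;
    }

    public static void main(String[] args) {
        int[] starts = computeStarts(new int[] {10, 5, 20}); // {0, 10, 15}
        int idx = readerIndex(12, starts);   // doc 12 lives in sub-reader 1
        int local = 12 - starts[idx];        // local docID 2
        System.out.println(idx + ":" + local);
    }
}
```

Going the other way (Aaron's case) is just the addition: localDoc + starts[readerIndex] gives back the top-level ID.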

On Tue, Oct 6, 2009 at 19:50, Yonik Seeley yo...@lucidimagination.com wrote:
 Aaron, could you move this to solr-user?

 -Yonik
 http://www.lucidimagination.com


 On Tue, Oct 6, 2009 at 11:22 AM, Aaron McKee ucbmc...@gmail.com wrote:






-- 
Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
ICQ: 104465785




[jira] Updated: (LUCENE-1410) PFOR implementation

2009-10-06 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-1410:
---

Attachment: LUCENE-1410-codecs.tar.bz2

Attaching sep, intblock and pfordelta codecs, spun out of the last patch on 
LUCENE-1458.

Once LUCENE-1458 is in, we should finish the pfordelta codec to make it a real 
choice.

I actually think some combination of pulsing, standard, pfordelta and simple 
bit packing (in order of increasing term docFreq), within a single codec, may 
be best.

Ie, rare terms (only in a doc or two) could be inlined into the terms dict. 
Slightly more common terms can use the more CPU-intensive standard codec.  
Common terms can use cpu-friendly-yet-still-decent-compression pfordelta.  
Obscenely common terms can use bit packing for the fastest decode.
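
The docFreq-tiered choice Mike describes could look roughly like this sketch; the thresholds, the enum, and the method name are all invented here for illustration and are not part of any attached patch.

```java
// Illustrative only: pick a per-term postings format by docFreq, following
// the pulsing -> standard -> pfordelta -> bit-packing progression above.
public class CodecChooser {

    enum PostingsFormat { PULSED, STANDARD, PFOR_DELTA, BIT_PACKED }

    // thresholds are made up; a real codec would tune these empirically
    static PostingsFormat choose(int docFreq, int maxDoc) {
        if (docFreq <= 2)          return PostingsFormat.PULSED;     // inline in terms dict
        if (docFreq < 128)         return PostingsFormat.STANDARD;   // VInt deltas
        if (docFreq < maxDoc / 10) return PostingsFormat.PFOR_DELTA; // CPU-friendly blocks
        return PostingsFormat.BIT_PACKED;                            // fastest decode
    }

    public static void main(String[] args) {
        System.out.println(choose(1, 1000000));       // PULSED
        System.out.println(choose(50000, 1000000));   // PFOR_DELTA
        System.out.println(choose(400000, 1000000));  // BIT_PACKED
    }
}
```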

 PFOR implementation
 ---

 Key: LUCENE-1410
 URL: https://issues.apache.org/jira/browse/LUCENE-1410
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Other
Reporter: Paul Elschot
Priority: Minor
 Attachments: autogen.tgz, LUCENE-1410-codecs.tar.bz2, 
 LUCENE-1410b.patch, LUCENE-1410c.patch, LUCENE-1410d.patch, 
 LUCENE-1410e.patch, TermQueryTests.tgz, TestPFor2.java, TestPFor2.java, 
 TestPFor2.java

   Original Estimate: 21840h
  Remaining Estimate: 21840h

 Implementation of Patched Frame of Reference.




[jira] Created: (LUCENE-1949) FilterManager uses unsafe keys for its filter cache

2009-10-06 Thread Aaron McKee (JIRA)
FilterManager uses unsafe keys for its filter cache
---

 Key: LUCENE-1949
 URL: https://issues.apache.org/jira/browse/LUCENE-1949
 Project: Lucene - Java
  Issue Type: Bug
  Components: Search
Affects Versions: 2.9
Reporter: Aaron McKee
Priority: Minor


re: FilterManager.getFilter(Filter filter)

FilterManager is using the filter's hash code as the key to its filter cache, 
however hash codes are intrinsically not guaranteed to be distinct; different 
filters may hash to the same value. Although the chance of a conflict is 
hopefully low, given reasonable implementations of hashCode, it's certainly not 
impossible. When a conflict does occur, an unintended filter may be returned.

I'm unaware of the extent to which this class is actively used, but I noticed 
the issue during a code browse and thought I'd at least mention it. 
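
The collision risk is easy to demonstrate: in Java the distinct strings "Aa" and "BB" share hash code 2112, so any cache keyed on hashCode alone conflates them. The sketch below stands in for FilterManager's cache and is not its actual code.

```java
import java.util.HashMap;
import java.util.Map;

// Demonstrates why hashCode() alone is an unsafe cache key: "Aa" and "BB"
// are different strings with the same hash code (2112).
public class HashKeyCollision {
    public static void main(String[] args) {
        Map<Integer, String> cache = new HashMap<Integer, String>();
        cache.put("Aa".hashCode(), "cached filter for Aa");

        // A lookup for the unrelated key "BB" silently hits the wrong entry.
        String hit = cache.get("BB".hashCode());
        System.out.println(hit); // prints: cached filter for Aa
    }
}
```

Keying the map on the filter itself (relying on equals/hashCode together, as HashMap does) avoids the false hit.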





[jira] Commented: (LUCENE-1410) PFOR implementation

2009-10-06 Thread Eks Dev (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12762742#action_12762742
 ] 

Eks Dev commented on LUCENE-1410:
-

Mike, 
That is definitely the way to go, distribution dependent encoding, where every 
Term gets individual treatment.
  
Take for example a simple, but not all that rare, case where the index gets 
sorted on some of the indexed fields (we use it really extensively, e.g. a 
presorted doc collection on user_rights/zip/city, all indexed). There you get 
perfectly compressible postings by simply managing intervals of set bits. 
Updates distort this picture, but we rebuild the index periodically and all 
gets good again.  At the moment we load them into RAM as Filters in 
IntervalSets. If that were possible in Lucene, we wouldn't bother with Filters 
(VInt decoding on such super-dense fields was killing us, even in 
RAMDirectory) ...  

Thinking about your comments, isn't pulsing somewhat orthogonal to the packing 
method? For example, if you load the index into a RAMDirectory, one could avoid 
one indirection level and inline all postings.

Flex indexing rocks; that is going to be the most important addition to Lucene 
since it started (imo)... I would even bet on double search speed in a first 
attempt for average queries :)

Cheers, 
eks 




Re: [jira] Commented: (LUCENE-1410) PFOR implementation

2009-10-06 Thread Paul Elschot
Eks,

 

You could try switching the Filter to OpenBitSet when that takes fewer bytes 
than SortedVIntList.

Regards,
Paul Elschot


Re: [jira] Commented: (LUCENE-1410) PFOR implementation

2009-10-06 Thread eks dev
Paul,
the point I was trying to make with this example was extreme, but realistic. 
Imagine 100Mio docs, sorted on field user_rights; a term user_rights:XX 
selects 40Mio of them (user rights...). To encode this, you need a format with 
two integers (for more such intervals you would need slightly more, but 
nevertheless much less than for OpenBitSet, VInts, PFor...). Strictly 
speaking this term is dense, but highly compressible, and could be inlined with 
the pulsing trick...

cheers, eks  
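
A back-of-the-envelope version of that example (the scenario is eks's; the byte counts are my own arithmetic): one [start, end) interval costs 8 bytes, versus megabytes for a bit set or VInt deltas over the 40Mio matching docs.

```java
// Size comparison for one dense, contiguous posting run in a sorted index:
// docs [0, 40M) of a 100M-doc index all match the term.
public class IntervalSize {
    public static void main(String[] args) {
        long intervalBytes = 2 * 4L;         // two ints: start, end
        long bitSetBytes = 100000000L / 8;   // one bit per doc over 100M docs
        long vIntBytes = 40000000L;          // at least 1 byte per delta-of-1
        System.out.println(intervalBytes);   // 8
        System.out.println(bitSetBytes);     // 12500000
        System.out.println(vIntBytes);       // 40000000
    }
}
```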






Re: [jira] Commented: (LUCENE-1410) PFOR implementation

2009-10-06 Thread eks dev
If you would drive this example further, in combination with flex indexing 
permitting per-term postings formats, I could imagine some nice tools for 
optimizeHard(), where normal index construction works with defaults as planned 
for the solid mixed-performance case, and at the end you run optimizeHard(): 
postings get resorted on such fields (basically enabling RLE encoding to work) 
and at the same time all other terms get an optimal encoding format for their 
postings... perfect for read-only indexes where you want to max performance and 
reduce index size




Re: [jira] Commented: (LUCENE-1410) PFOR implementation

2009-10-06 Thread Paul Elschot
On Tuesday 06 October 2009 23:59:12 eks dev wrote:
 Paul,
 the point I was trying to make with this example was extreme,  but realistic. 
 Imagine 100Mio docs, sorted on field user_rights,  a term user_rights:XX 
 selects 40Mio of them (user rights...). To encode this, you need format with  
 two integers (for more of such intervals you would need slightly more, but 
 nevertheless, much less than for OpenBitSet, VInts, PFor...  ). Strictly 
 speaking this term is dense, but highly compressible and could be inlined 
 with pulsing trick...

Well, I've considered adding compressed consecutive ranges to SortedVIntList, 
but I did not get further than considering. This sounds like the perfect use 
case for that.

Regards,
Paul Elschot


 



[jira] Updated: (LUCENE-1856) Remove Hits

2009-10-06 Thread Michael Busch (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Busch updated LUCENE-1856:
--

Attachment: lucene-1856.patch

Removes Hits from core and all contribs.

All core and contrib tests pass. I'll have to commit some changes also to the 
bw-compat branch.

 Remove Hits
 ---

 Key: LUCENE-1856
 URL: https://issues.apache.org/jira/browse/LUCENE-1856
 Project: Lucene - Java
  Issue Type: Task
  Components: Search
Reporter: Michael Busch
Assignee: Michael Busch
Priority: Minor
 Fix For: 3.0

 Attachments: lucene-1856.patch


 LUCENE-1290 removed all references to Hits from core.
 Most work to be done here is to remove all references from the contrib 
 modules and some new ones that crept into core after 1290.




Removing deprecated classes

2009-10-06 Thread Michael Busch

Hi all,

I've attached a patch to LUCENE-1856, which removes Hits. I'm not sure 
if someone has uncommitted big 3.0 patches that I'll mess up if I commit 
1856?


While working on 1856 I realized how tedious this stuff is! So Uwe, Mark 
& Co, let me know if you want me to wait with committing my patch!


 Michael




[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing

2009-10-06 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12762828#action_12762828
 ] 

Mark Miller commented on LUCENE-1458:
-

bq. Hey! I'm not so old :) But yeah I still cling to emacs. 

Can you say both of those things in the same breath? Just how long did it take 
to get that PhD...

I'd look it up and guesstimate your age, but I think MIT still has my IP blocked 
from back when I was applying to colleges. So I'm going with the uses emacs 
guesstimate.

bq. Hey, I know people who still cling to vi!

vi is the only one I can halfway use - I know 3 commands - edit mode, leave 
edit mode, and save. And every now and then I accidentally delete a whole line. 
When I make a change that I don't want to save, I have to kill the power.

The patch is in a bit of an unpatchable state ;) I think I know what editor to 
blame...Pico!

Our old friend, the $id is messing up WildcardTermEnum - no problem, I can fix 
that...

But also, NumericUtils is unpatched, Codec is missing, along with most of the 
classes from the codecs packages! This looks like my work :)

My only conclusion is that you're one of those guys that can write the whole 
program once without even running it - and then it works perfectly on the first 
go. That's the only way I can explain those classes in the wrong package 
previously as well :) No bug hunting tonight :(


[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing

2009-10-06 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12762831#action_12762831
 ] 

Mark Miller commented on LUCENE-1458:
-

nope - something else - looking through the patch I see the files I want - a 
second attempt at patching has gone over better.

A couple errors still, but stuff I think I can fix so that I can at least look 
over. False alarm. My patcher wonked out or something. I can resolve the few 
errors that popped up this time. Sweet.




[jira] Issue Comment Edited: (LUCENE-1458) Further steps towards flexible indexing

2009-10-06 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12762831#action_12762831
 ] 

Mark Miller edited comment on LUCENE-1458 at 10/6/09 6:23 PM:
--

nope - something else - looking through the patch I see the files I want - a 
second attempt at patching has gone over better.

A couple errors still, but stuff I think I can fix so that I can at least look 
over. False alarm. My patcher wonked out or something. I can resolve the few 
errors that popped up this time. Sweet.

*edit*

Just for reference - not sure what happened the first time - my patch preview 
looked the same both times (was only complaining about the $id), but completely 
failed on attempt one and worked on attempt two - the only issue now appears to 
be you have half switched deletedDocs to Bits from BitVector - but only half way, 
so it's broken in a dozen places. Not sure what you are doing about size() and 
what not, so I'm just gonna read around.

  was (Author: markrmil...@gmail.com):
nope - something else - looking through the patch I see the files I want - 
a second attempt at patching has gone over better.

A couple errors still, but stuff I think I can fix so that I can at least look 
over. False alarm. My patcher wonked out or something. I can resolve the few 
errors that popped up this time. Sweet.
  
 Further steps towards flexible indexing
 ---

 Key: LUCENE-1458
 URL: https://issues.apache.org/jira/browse/LUCENE-1458
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Index
Affects Versions: 2.9
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
 Attachments: LUCENE-1458-back-compat.patch, 
 LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, 
 LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, 
 LUCENE-1458-back-compat.patch, LUCENE-1458.patch, LUCENE-1458.patch, 
 LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, 
 LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.tar.bz2, 
 LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, 
 LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2


 I attached a very rough checkpoint of my current patch, to get early
 feedback.  All tests pass, though back compat tests don't pass due to
 changes to package-private APIs plus certain bugs in tests that
 happened to work (eg call TermPositions.nextPosition() too many times,
 which the new API asserts against).
 [Aside: I think, when we commit changes to package-private APIs such
 that back-compat tests don't pass, we could go back, make a branch on
 the back-compat tag, commit changes to the tests to use the new
 package private APIs on that branch, then fix nightly build to use the
 tip of that branch?]
 There's still plenty to do before this is committable! This is a
 rather large change:
   * Switches to a new more efficient terms dict format.  This still
 uses tii/tis files, but the tii only stores term & long offset
 (not a TermInfo).  At seek points, tis encodes term & freq/prox
 offsets absolutely instead of with deltas.  Also, tis/tii
 are structured by field, so we don't have to record field number
 in every term.
 .
 On first 1 M docs of Wikipedia, tii file is 36% smaller (0.99 MB
 -> 0.64 MB) and tis file is 9% smaller (75.5 MB -> 68.5 MB).
 .
 RAM usage when loading terms dict index is significantly less
 since we only load an array of offsets and an array of String (no
 more TermInfo array).  It should be faster to init too.
 .
 This part is basically done.
   * Introduces modular reader codec that strongly decouples terms dict
 from docs/positions readers.  EG there is no more TermInfo used
 when reading the new format.
 .
 There's nice symmetry now between reading & writing in the codec
 chain -- the current docs/prox format is captured in:
 {code}
 FormatPostingsTermsDictWriter/Reader
 FormatPostingsDocsWriter/Reader (.frq file) and
 FormatPostingsPositionsWriter/Reader (.prx file).
 {code}
 This part is basically done.
   * Introduces a new flex API for iterating through the fields,
 terms, docs and positions:
 {code}
 FieldProducer -> TermsEnum -> DocsEnum -> PostingsEnum
 {code}
 This replaces TermEnum/Docs/Positions.  SegmentReader emulates the
 old API on top of the new API to keep back-compat.
 
 Next steps:
   * Plug in new codecs (pulsing, pfor) to exercise the modularity /
 fix any hidden assumptions.
   * Expose new API out of IndexReader, deprecate old API but emulate
 old API on top of new one, switch all core/contrib users to the
 new API.
   * Maybe switch to AttributeSources as the base class for TermsEnum,
 

[jira] Issue Comment Edited: (LUCENE-1458) Further steps towards flexible indexing

2009-10-06 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12762831#action_12762831
 ] 

Mark Miller edited comment on LUCENE-1458 at 10/6/09 6:34 PM:
--

nope - something else - looking through the patch I see the files I want - a 
second attempt at patching has gone over better.

A couple errors still, but stuff I think I can fix so that I can at least look 
over. False alarm. My patcher wonked out or something. I can resolve the few 
errors that popped up this time. Sweet.

*edit*

Just for reference - not sure what happened the first time - my patch preview 
looked the same both times (was only complaining about the $id), but completely 
failed on attempt one and worked on attempt two - the only issue now appears to 
be you have half switched deletedDocs to Bits from BitVector - but only half way, 
so it's broken in a dozen places. Not sure what you are doing about size() and 
what not, so I'm just gonna read around.

*edit*

Yes - I found it - BitVector was supposed to implement Bits - which was in the 
patch ... this patch just did not want to apply. I guess it was right, but 
Eclipse just did not want it to take ...
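A minimal sketch of the relationship being described, assuming Bits is a simple read-only bit-access interface that BitVector implements (invented, simplified code -- not the actual patch):

```java
public class BitsSketch {
    // Read-only view of a set of bits; readers can depend on this alone.
    interface Bits {
        boolean get(int index);
        int length();
    }

    // A writable bit set that also exposes itself as read-only Bits,
    // so deletedDocs can be handed out without the mutable API.
    static final class BitVector implements Bits {
        private final long[] words;
        private final int size;

        BitVector(int size) {
            this.size = size;
            this.words = new long[(size + 63) >> 6];
        }

        void set(int index) { words[index >> 6] |= 1L << (index & 63); }

        public boolean get(int index) {
            return (words[index >> 6] & (1L << (index & 63))) != 0;
        }

        public int length() { return size; }
    }

    public static void main(String[] args) {
        BitVector deletedDocs = new BitVector(100);
        deletedDocs.set(42);
        Bits bits = deletedDocs;  // upcast: readers see only Bits
        System.out.println(bits.get(42) + " " + bits.get(7));
    }
}
```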

  was (Author: markrmil...@gmail.com):
nope - something else - looking through the patch I see the files I want - 
a second attempt at patching has gone over better.

A couple errors still, but stuff I think I can fix so that I can at least look 
over. False alarm. My patcher wonked out or something. I can resolve the few 
errors that popped up this time. Sweet.

*edit*

Just for reference - not sure what happened the first time - my patch preview 
looked the same both times (was only complaining about the $id), but completely 
failed on attempt one and worked on attempt two - the only issue now appears to 
be you have half switched deletedDocs to Bits from BitVector - but only half way, 
so it's broken in a dozen places. Not sure what you are doing about size() and 
what not, so I'm just gonna read around.
  
 Further steps towards flexible indexing
 ---

 Key: LUCENE-1458
 URL: https://issues.apache.org/jira/browse/LUCENE-1458
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Index
Affects Versions: 2.9
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
 Attachments: LUCENE-1458-back-compat.patch, 
 LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, 
 LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, 
 LUCENE-1458-back-compat.patch, LUCENE-1458.patch, LUCENE-1458.patch, 
 LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, 
 LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.tar.bz2, 
 LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, 
 LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2


 I attached a very rough checkpoint of my current patch, to get early
 feedback.  All tests pass, though back compat tests don't pass due to
 changes to package-private APIs plus certain bugs in tests that
 happened to work (eg call TermPositions.nextPosition() too many times,
 which the new API asserts against).
 [Aside: I think, when we commit changes to package-private APIs such
 that back-compat tests don't pass, we could go back, make a branch on
 the back-compat tag, commit changes to the tests to use the new
 package private APIs on that branch, then fix nightly build to use the
 tip of that branch?]
 There's still plenty to do before this is committable! This is a
 rather large change:
   * Switches to a new more efficient terms dict format.  This still
 uses tii/tis files, but the tii only stores term & long offset
 (not a TermInfo).  At seek points, tis encodes term & freq/prox
 offsets absolutely instead of with deltas.  Also, tis/tii
 are structured by field, so we don't have to record field number
 in every term.
 .
 On first 1 M docs of Wikipedia, tii file is 36% smaller (0.99 MB
 -> 0.64 MB) and tis file is 9% smaller (75.5 MB -> 68.5 MB).
 .
 RAM usage when loading terms dict index is significantly less
 since we only load an array of offsets and an array of String (no
 more TermInfo array).  It should be faster to init too.
 .
 This part is basically done.
   * Introduces modular reader codec that strongly decouples terms dict
 from docs/positions readers.  EG there is no more TermInfo used
 when reading the new format.
 .
 There's nice symmetry now between reading & writing in the codec
 chain -- the current docs/prox format is captured in:
 {code}
 FormatPostingsTermsDictWriter/Reader
 FormatPostingsDocsWriter/Reader (.frq file) and
 FormatPostingsPositionsWriter/Reader (.prx file).
 {code}
 This part is 

[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing

2009-10-06 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12762842#action_12762842
 ] 

Mark Miller commented on LUCENE-1458:
-

Bah - all this huffing and puffing over the patch and I'm too sick to stay up 
late anyway.

Have you started benching at all? I'm seeing like a 40-50% drop in same reader 
search benches with standard, sep, and pulsing. Like 80% with intblock.

 Further steps towards flexible indexing
 ---

 Key: LUCENE-1458
 URL: https://issues.apache.org/jira/browse/LUCENE-1458
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Index
Affects Versions: 2.9
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
 Attachments: LUCENE-1458-back-compat.patch, 
 LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, 
 LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, 
 LUCENE-1458-back-compat.patch, LUCENE-1458.patch, LUCENE-1458.patch, 
 LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, 
 LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.tar.bz2, 
 LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, 
 LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2


 I attached a very rough checkpoint of my current patch, to get early
 feedback.  All tests pass, though back compat tests don't pass due to
 changes to package-private APIs plus certain bugs in tests that
 happened to work (eg call TermPositions.nextPosition() too many times,
 which the new API asserts against).
 [Aside: I think, when we commit changes to package-private APIs such
 that back-compat tests don't pass, we could go back, make a branch on
 the back-compat tag, commit changes to the tests to use the new
 package private APIs on that branch, then fix nightly build to use the
 tip of that branch?]
 There's still plenty to do before this is committable! This is a
 rather large change:
   * Switches to a new more efficient terms dict format.  This still
 uses tii/tis files, but the tii only stores term & long offset
 (not a TermInfo).  At seek points, tis encodes term & freq/prox
 offsets absolutely instead of with deltas.  Also, tis/tii
 are structured by field, so we don't have to record field number
 in every term.
 .
 On first 1 M docs of Wikipedia, tii file is 36% smaller (0.99 MB
 -> 0.64 MB) and tis file is 9% smaller (75.5 MB -> 68.5 MB).
 .
 RAM usage when loading terms dict index is significantly less
 since we only load an array of offsets and an array of String (no
 more TermInfo array).  It should be faster to init too.
 .
 This part is basically done.
   * Introduces modular reader codec that strongly decouples terms dict
 from docs/positions readers.  EG there is no more TermInfo used
 when reading the new format.
 .
 There's nice symmetry now between reading & writing in the codec
 chain -- the current docs/prox format is captured in:
 {code}
 FormatPostingsTermsDictWriter/Reader
 FormatPostingsDocsWriter/Reader (.frq file) and
 FormatPostingsPositionsWriter/Reader (.prx file).
 {code}
 This part is basically done.
   * Introduces a new flex API for iterating through the fields,
 terms, docs and positions:
 {code}
 FieldProducer -> TermsEnum -> DocsEnum -> PostingsEnum
 {code}
 This replaces TermEnum/Docs/Positions.  SegmentReader emulates the
 old API on top of the new API to keep back-compat.
 
 Next steps:
   * Plug in new codecs (pulsing, pfor) to exercise the modularity /
 fix any hidden assumptions.
   * Expose new API out of IndexReader, deprecate old API but emulate
 old API on top of new one, switch all core/contrib users to the
 new API.
   * Maybe switch to AttributeSources as the base class for TermsEnum,
 DocsEnum, PostingsEnum -- this would give readers API flexibility
 (not just index-file-format flexibility).  EG if someone wanted
 to store payload at the term-doc level instead of
 term-doc-position level, you could just add a new attribute.
  * Test performance & iterate.




RE: Removing deprecated classes

2009-10-06 Thread Uwe Schindler
Hi Mark,

no problem, go forward. I am on a trip until Saturday evening, so no
problems.

Uwe

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de


 -Original Message-
 From: Michael Busch [mailto:busch...@gmail.com]
 Sent: Wednesday, October 07, 2009 2:17 AM
 To: java-dev@lucene.apache.org
 Subject: Removing deprecated classes
 
 Hi all,
 
 I've attached a patch to LUCENE-1856, which removes Hits. I'm not sure
 if someone has uncommitted big 3.0 patches that I'll mess up if I commit
 1856?
 
 While working on 1856 I realized how tedious this stuff is! So Uwe, Mark
  & Co, let me know if you want me to wait with committing my patch!
 
   Michael
 






wildcardquery rewrite()

2009-10-06 Thread Robert Muir
someone asked this question on the user list:
http://www.lucidimagination.com/search/document/6f38de391b242102/prefixquery_vs_wildcardquery

it made me look at the wildcard rewrite(), where i see this:
if (!termContainsWildcard)
  return new TermQuery(getTerm());

is it a problem the boost is not preserved in this special case?

is it also a problem that if the user sets the default MultiTermQuery
rewriteMethod to say, CONSTANT_SCORE_FILTER_REWRITE,
that this rewritten TermQuery isn't wrapped with a constant score?

Sorry if it seems a bit nitpicky, really the issue is that I want to
do the right thing for a more complex query I am working on, but don't
want to overkill either.
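A self-contained sketch of the boost question, using stand-in Query classes (not Lucene's real ones) to show what copying the boost onto the rewritten TermQuery would look like:

```java
public class RewriteSketch {
    // Minimal stand-ins for Lucene's query classes, just enough to
    // demonstrate where the boost is lost or preserved.
    static class Query { float boost = 1.0f; }

    static class TermQuery extends Query {
        final String term;
        TermQuery(String term) { this.term = term; }
    }

    static class WildcardQuery extends Query {
        final String term;
        WildcardQuery(String term) { this.term = term; }

        boolean containsWildcard() {
            return term.indexOf('*') >= 0 || term.indexOf('?') >= 0;
        }

        Query rewrite() {
            if (!containsWildcard()) {
                TermQuery q = new TermQuery(term);
                q.boost = this.boost;  // without this line the boost is dropped
                return q;
            }
            return this;
        }
    }

    public static void main(String[] args) {
        WildcardQuery wq = new WildcardQuery("foobar");  // no wildcard chars
        wq.boost = 2.0f;
        System.out.println(wq.rewrite().boost);
    }
}
```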
-- 
Robert Muir
rcm...@gmail.com




Re: wildcardquery rewrite()

2009-10-06 Thread Robert Muir
separately, perhaps we should consider doing the prefixquery rewrite
here for wildcardquery.

for example, SolrQueryParser will emit these 'wildcardqueries that
should be prefixqueries' if you are using the new reverse stuff for
leading wildcards: WildcardQuery(*foobar) ->
WildcardQuery(U+0001raboof*)

I don't think the prefix enumeration is really that much faster than
the wildcard one, but still thought I would mention it.
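A small sketch of the reversed-wildcard idea: reversing a leading-wildcard pattern yields a trailing-only '*', which a rewrite could then recognize as a prefix pattern (helper names are hypothetical, and this ignores the U+0001 reverse marker):

```java
public class ReverseWildcardSketch {
    // A pattern whose only wildcard is a single trailing '*' could be
    // rewritten to a prefix enumeration instead of a wildcard one.
    static boolean isPrefixPattern(String term) {
        int star = term.indexOf('*');
        return star == term.length() - 1 && term.indexOf('?') < 0;
    }

    static String reverse(String s) {
        return new StringBuilder(s).reverse().toString();
    }

    public static void main(String[] args) {
        String leadingWildcard = "*foobar";
        // Reverse the literal part and move the '*' to the end.
        String reversed = reverse(leadingWildcard.substring(1)) + "*";
        System.out.println(reversed + " " + isPrefixPattern(reversed));
    }
}
```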

On Tue, Oct 6, 2009 at 10:22 PM, Robert Muir rcm...@gmail.com wrote:
 someone asked this question on the user list:
 http://www.lucidimagination.com/search/document/6f38de391b242102/prefixquery_vs_wildcardquery

 it made me look at the wildcard rewrite(), where i see this:
    if (!termContainsWildcard)
      return new TermQuery(getTerm());

 is it a problem the boost is not preserved in this special case?

 is it also a problem that if the user sets the default MultiTermQuery
 rewriteMethod to say, CONSTANT_SCORE_FILTER_REWRITE,
 that this rewritten TermQuery isn't wrapped with a constant score?

 Sorry if it seems a bit nitpicky, really the issue is that I want to
 do the right thing for a more complex query I am working on, but don't
 want to overkill either.
 --
 Robert Muir
 rcm...@gmail.com




-- 
Robert Muir
rcm...@gmail.com




Re: Removing deprecated classes

2009-10-06 Thread Michael Busch

Cool, I'll commit 1856 soon then. Thanks!
 Michael

On 10/6/09 7:12 PM, Uwe Schindler wrote:

Hi Mark,

no problem, go forward. I am on a trip until Saturday evening, so no
problems.

Uwe

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de


   

-Original Message-
From: Michael Busch [mailto:busch...@gmail.com]
Sent: Wednesday, October 07, 2009 2:17 AM
To: java-dev@lucene.apache.org
Subject: Removing deprecated classes

Hi all,

I've attached a patch to LUCENE-1856, which removes Hits. I'm not sure
if someone has uncommitted big 3.0 patches that I'll mess up if I commit
1856?

While working on 1856 I realized how tedious this stuff is! So Uwe, Mark
 & Co, let me know if you want me to wait with committing my patch!

   Michael

 





   






[jira] Updated: (LUCENE-1085) search.function should support all capabilities of Solr's search.function

2009-10-06 Thread Michael Busch (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Busch updated LUCENE-1085:
--

Fix Version/s: (was: 3.0)
   3.1

 search.function should support all capabilities of Solr's search.function
 -

 Key: LUCENE-1085
 URL: https://issues.apache.org/jira/browse/LUCENE-1085
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Reporter: Doron Cohen
Priority: Minor
 Fix For: 3.1


 Lucene search.function does not allow Solr to move to use it, and so Solr 
 currently maintains its own version of this package.
 Enhance Lucene's search.function so that Solr can move to use it, and avoid 
 this redundancy. 




[jira] Resolved: (LUCENE-1856) Remove Hits

2009-10-06 Thread Michael Busch (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Busch resolved LUCENE-1856.
---

Resolution: Fixed

Committed revision 822587.

 Remove Hits
 ---

 Key: LUCENE-1856
 URL: https://issues.apache.org/jira/browse/LUCENE-1856
 Project: Lucene - Java
  Issue Type: Task
  Components: Search
Reporter: Michael Busch
Assignee: Michael Busch
Priority: Minor
 Fix For: 3.0

 Attachments: lucene-1856.patch


 LUCENE-1290 removed all references to Hits from core.
 Most work to be done here is to remove all references from the contrib 
 modules and some new ones that crept into core after 1290.
