Re: Proposal about Version API relaxation
I'd be for this plan if I really thought the stable branch would get similar attention to the experimental branch - but I have some doubts about that. It's a fairly small dev community in comparison to other projects that do this ... Dev on the experimental, latest-greatest fun branch, or the living-in-the-past, back-compat-hassle stable branch? Port most patches to two somewhat diverging code bases? If that was actually how things worked out, I'd be +1. I just wonder ... with the right framing I do think it's possible though.

On 04/16/2010 11:45 AM, Michael McCandless wrote:
Getting back to the stable/experimental branches... I think, with separate stable and experimental branches, development would/should be active on both branches. It'd depend on the feature... E.g. today we'd have the 3.x stable branch and the experimental branch (= trunk). Small features and bug fixes would be ported to both branches. I think features that deprecate some APIs would still be fine on the stable branch. Major changes (e.g. flex) would only be done on the experimental branch. This empowers us, on a feature-by-feature basis, to decide whether it'll be in the stable release or not. The stable branch releases would be 3.0, 3.1, etc., but we could still do the .Z releases (3.0.1, 3.0.2) for bug fixes, if we need to. And we could do alpha releases off the experimental branch as we think we're getting close to cutting a new stable release (4.0). Mike

On Thu, Apr 15, 2010 at 6:58 PM, Robert Muir rcm...@gmail.com wrote:
On Thu, Apr 15, 2010 at 6:50 PM, DM Smith dmsmith...@gmail.com wrote:
Robert has already started one. (1488 I think).

And it could work with this new scheme... because then you could use an older ICU jar file with an older lucene-analyzer-icu.jar or whatever, and you have it more under control.
Under the existing scheme you can't really improve back compat with ICU, because they make API changes and backwards breaks themselves, so you can't make one Tokenizer, say, that does anything meaningful and works with all versions of it... but it would be cool to say: here is lucene-analyzer-icu-4.0.jar that works with ICU 4.4, and you could keep using that as long as you have to (meanwhile trunk could start using ICU 4.6).

-- Robert Muir rcm...@gmail.com

- To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org

-- - Mark http://www.lucidimagination.com
Re: Proposal about Version API relaxation
On 04/16/2010 12:16 PM, Robert Muir wrote:
On Fri, Apr 16, 2010 at 12:12 PM, Mark Miller markrmil...@gmail.com wrote:
I'd be for this plan if I really thought the stable branch would get similar attention to the experimental branch - but I have some doubts about that. It's a fairly small dev community in comparison to other projects that do this ... Dev on the experimental, latest-greatest fun branch, or the living-in-the-past, back-compat-hassle stable branch? Port most patches to two somewhat diverging code bases? If that was actually how things worked out, I'd be +1. I just wonder ... with the right framing I do think it's possible though.

But this is an open source project still, right? So if you want more attention paid to the stable branch, put your patches where your mouth is (no offense).

I don't think that's how things should work. The project should be framed to guide devs toward what's best for everybody. Right now all devs work on a stable branch because we have policies that encourage that. We could also make policies that encourage every-dev-for-himself crap development.

If no one wants to put new features in the back-compat hassle branch, well, then that's a sign that no one cares about it.

It's not a sign that users don't care about it. Lately I think you have taken the stance: users be damned, Lucene dev should just be geared toward devs. I'm not a fan of that kind of attitude when it comes to Lucene dev myself.

-- Robert Muir rcm...@gmail.com

-- - Mark http://www.lucidimagination.com
[jira] Commented: (LUCENE-2287) Unexpected terms are highlighted within nested SpanQuery instances
[ https://issues.apache.org/jira/browse/LUCENE-2287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12857904#action_12857904 ] Mark Miller commented on LUCENE-2287: - Hey Michael - it looks like there is a lot of reformatting in this patch - if it's not too much of a hassle, is it possible to get a patch without the formatting changes?

Unexpected terms are highlighted within nested SpanQuery instances
Key: LUCENE-2287 URL: https://issues.apache.org/jira/browse/LUCENE-2287 Project: Lucene - Java Issue Type: Improvement Components: contrib/highlighter Affects Versions: 2.9.1 Environment: Linux, Solaris, Windows Reporter: Michael Goddard Priority: Minor Attachments: LUCENE-2287.patch, LUCENE-2287.patch, LUCENE-2287.patch, LUCENE-2287.patch, LUCENE-2287.patch, LUCENE-2287.patch Original Estimate: 336h Remaining Estimate: 336h

I haven't yet been able to resolve why I'm seeing spurious highlighting in nested SpanQuery instances. Briefly, the issue is illustrated by the second instance of "Lucene" being highlighted in the test below, when it doesn't satisfy the inner span. There's been some discussion about this on the java-dev list, and I'm opening this issue now because I have made some initial progress on this.
This new test, added to the HighlighterTest class in lucene_2_9_1, illustrates this:

/*
 * Ref: http://www.lucidimagination.com/blog/2009/07/18/the-spanquery/
 */
public void testHighlightingNestedSpans2() throws Exception {
  String theText = "The Lucene was made by Doug Cutting and Lucene great Hadoop was"; // Problem
  //String theText = "The Lucene was made by Doug Cutting and the great Hadoop was"; // Works okay
  String fieldName = SOME_FIELD_NAME;
  SpanNearQuery spanNear = new SpanNearQuery(new SpanQuery[] {
      new SpanTermQuery(new Term(fieldName, "lucene")),
      new SpanTermQuery(new Term(fieldName, "doug")) }, 5, true);
  Query query = new SpanNearQuery(new SpanQuery[] { spanNear,
      new SpanTermQuery(new Term(fieldName, "hadoop")) }, 4, true);
  String expected = "The <B>Lucene</B> was made by <B>Doug</B> Cutting and Lucene great <B>Hadoop</B> was";
  //String expected = "The <B>Lucene</B> was made by <B>Doug</B> Cutting and the great <B>Hadoop</B> was";
  String observed = highlightField(query, fieldName, theText);
  System.out.println("Expected: \"" + expected + "\"\n" + "Observed: \"" + observed + "\"");
  assertEquals("Why is that second instance of the term \"Lucene\" highlighted?", expected, observed);
}

Is this an issue that's arisen before? I've been reading through the source to QueryScorer, WeightedSpanTerm, WeightedSpanTermExtractor, Spans, and NearSpansOrdered, but haven't found the solution yet. Initially, I thought that the extractWeightedSpanTerms method in WeightedSpanTermExtractor should be called on each clause of a SpanNearQuery or SpanOrQuery, but that didn't get me too far.

-- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
Re: Proposal about Version API relaxation
If you absolutely cannot re-index, and you have *no* access to the data again - you are one ballsy mofo to upgrade to a new version of Lucene for features. It means you likely BASE jump in your free time?

On 04/15/2010 10:14 AM, Erick Erickson wrote:
Coming in late to the discussion, and without really understanding the underlying Lucene issues, but...

The size of the problem of reindexing is under-appreciated, I think. Somewhere in my company is the original data I indexed. But the effort it would take to resurrect it is O(unknown). An unfortunate reality of commercial products is that they often receive very little love for extended periods of time until, all of a sudden, more work is required. There ensues an extended period of re-orientation, even if the people who originally worked on the project are still around.

*Assuming* the data is available to reindex (and there are many reasons besides poor practice on the part of the company that it may not be), remembering/finding out exactly which of the various backups you made of the original data is the one that's actually in your product can be highly non-trivial. Compounded by the fact that the product manager will be adamant about "Do NOT surprise our customers." So I can be in a spot of saying I *think* I have the original data set, and I *think* I have the original code used to index it, and if I get a new version of Lucene I *think* I can recreate the index, and after all that effort is completed I *think* the user will see the expected change - but we won't know until we try it. That puts me in a very precarious position.

This assumes that I have a reasonable chance of getting the original data. But say I've been indexing data from a live feed. Sure as hell hope I stored the data somewhere, because going back to the source and saying "please resend me 10 years' worth of data that I have in my index" is...er...hard.
Or say that the original provider has gone out of business, or the licensing arrangement specifies a one-time transmission of data that may not be retained in its original form, or...

The point of this long diatribe is that there are many reasons why reindexing is impossible and/or impractical. Making any decision that requires reindexing for a new version is locking a user into a version, potentially forever. We should not underestimate how painful that can be, and should never think that "just reindex" is acceptable in all situations. It's not. Period. Be very clear that some number of Lucene users will absolutely not be able to reindex. We may still make a decision that requires this, but let's make it without deluding ourselves that it's a possible solution for everyone.

So an upgrade tool seems like a reasonable compromise. I agree that being hampered in what we can develop in Lucene by having to accommodate reading old indexes slows new features etc. It's always nice to be able to work without dealing with pesky legacy issues <g>. Perhaps splitting out the index upgrades into a separate program lets us accommodate both concerns.

FWIW
Erick

On Thu, Apr 15, 2010 at 9:42 AM, Danil ŢORIN torin...@gmail.com wrote:
True. Just need the tool.

On Thu, Apr 15, 2010 at 16:39, Earwin Burrfoot ear...@gmail.com wrote:
On Thu, Apr 15, 2010 at 17:17, Yonik Seeley yo...@lucidimagination.com wrote:
Seamless online upgrades have their place too... say you are upgrading one server at a time in a cluster.

Nothing here that can't be solved with an upgrade tool. Down one server, upgrade index, upgrade software, up.
-- Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com) Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423 ICQ: 104465785

-- - Mark http://www.lucidimagination.com
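The "down one server, upgrade index, upgrade software, up" step Earwin describes could be little more than the following sketch (my illustration, not an existing tool; assuming the trunk-era IndexWriterConfig API, where optimize() merges all live segments and thereby rewrites them in the current index format - the analyzer choice is irrelevant since no documents are added):

```java
import java.io.File;

import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

// Hypothetical offline upgrade step: open the old index with the new
// Lucene version and optimize, which rewrites every segment in the
// current file format before the upgraded server comes back up.
public class UpgradeIndexTool {
  public static void main(String[] args) throws Exception {
    FSDirectory dir = FSDirectory.open(new File(args[0]));
    IndexWriterConfig conf = new IndexWriterConfig(
        Version.LUCENE_30, new WhitespaceAnalyzer(Version.LUCENE_30));
    IndexWriter writer = new IndexWriter(dir, conf);
    writer.optimize(); // merge all segments -> rewritten in current format
    writer.close();
  }
}
```

Whether optimize() is an acceptable cost (it rewrites the whole index) is exactly the kind of trade-off an explicit upgrade tool makes visible, versus paying it invisibly at search time with a back-compat layer.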
[jira] Commented: (LUCENE-2159) Tool to expand the index for perf/stress testing.
[ https://issues.apache.org/jira/browse/LUCENE-2159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12856916#action_12856916 ] Mark Miller commented on LUCENE-2159: - There is an excellent section on it in LIA2 :)

Tool to expand the index for perf/stress testing.
Key: LUCENE-2159 URL: https://issues.apache.org/jira/browse/LUCENE-2159 Project: Lucene - Java Issue Type: New Feature Components: contrib/* Affects Versions: 3.0 Reporter: John Wang Attachments: ExpandIndex.java

Sometimes it is useful to take a small-ish index and expand it into a large index with K segments for perf/stress testing. This tool does that. See attached class.
Re: Proposal about Version API relaxation
On 04/14/2010 12:29 PM, Marvin Humphrey wrote:
On Wed, Apr 14, 2010 at 08:30:14AM -0400, Grant Ingersoll wrote:
The thing I keep going back to is that somehow Lucene has managed for years (and I mean lots of years) w/o stuff like Version and all this massive back compatibility checking.

Non-constant global variables are an anti-pattern.

I think clinging to such rules in the face of all situations is an anti-pattern :) I take it as a rule of thumb.

Regarding this discussion: I agree that the Version stuff is a bit of a mess. I also agree that many users will want to just use one version across their app that is easy to change. I disagree that we should allow that behavior by just using a constructor without the Version param - or that you should be forced to set the static Version setting by running your app and seeing an exception happen. That is all a bit ugly. Too many users will not understand Version, or care to, if they see they can skip passing it. IMO, you should have to specify that you are looking for this behavior. In which case, why not just specify it using the Version param itself :) E.g. if a user wants this kind of static behavior, they can just choose to do it on their own, and pass their *own* static Version constant to all the constructors. I don't think we need to go through this hassle and introduce a less-than-ideal solution just so that users can pass one less param - especially when I think you should explicitly choose this behavior rather than get it by ignoring the Version param.

-- - Mark http://www.lucidimagination.com
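The "pass your *own* static Version constant" approach Mark suggests could look like this sketch (AppVersion is a hypothetical application class of my own invention, not a Lucene API):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.util.Version;

// Hypothetical app-level constant: one place to bump the match version
// for the whole application, while every Lucene constructor still
// receives Version explicitly - no static mutable state in Lucene itself.
public final class AppVersion {
  public static final Version MATCH = Version.LUCENE_30;

  private AppVersion() {}
}

// Usage elsewhere in the app:
// StandardAnalyzer analyzer = new StandardAnalyzer(AppVersion.MATCH);
```

The upgrade story is then a one-line change in AppVersion, and the choice to share one version app-wide is explicit in the application's own code rather than implicit in a skipped parameter.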
[jira] Commented: (LUCENE-2393) Utility to output total term frequency and df from a lucene index
[ https://issues.apache.org/jira/browse/LUCENE-2393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12857155#action_12857155 ] Mark Miller commented on LUCENE-2393: - Perhaps this should be combined with the high-freq terms tool ... we could make a ton of these little guys, so it's probably best to consolidate them.

Utility to output total term frequency and df from a lucene index
Key: LUCENE-2393 URL: https://issues.apache.org/jira/browse/LUCENE-2393 Project: Lucene - Java Issue Type: New Feature Components: contrib/* Reporter: Tom Burton-West Priority: Trivial Attachments: LUCENE-2393.patch

This is a command line utility that takes a field name, term, and index directory and outputs the document frequency for the term and the total number of occurrences of the term in the index (i.e. the sum of the tf of the term for each document). It is useful for estimating the size of the term's entry in the *prx files and consequent disk I/O demands.
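The core of such a utility might be sketched like this against the 3.x TermDocs API (my illustration of the described behavior, not the actual LUCENE-2393 patch): df counts documents containing the term, and totalTf sums the within-document frequencies.

```java
import java.io.IOException;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;

// Hypothetical helper: returns { df, totalTf } for one term, where
// totalTf is the sum of freq() over all documents containing the term.
public class TermStats {
  public static long[] stats(IndexReader reader, String field, String text)
      throws IOException {
    long df = 0, totalTf = 0;
    TermDocs td = reader.termDocs(new Term(field, text));
    try {
      while (td.next()) {
        df++;              // one more document contains the term
        totalTf += td.freq(); // occurrences within this document
      }
    } finally {
      td.close();
    }
    return new long[] { df, totalTf };
  }
}
```

Consolidating this with the existing high-frequency-terms tool, as Mark suggests, would just mean iterating terms and applying the same loop per term.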
Re: Proposal about Version API relaxation
I don't read what you wrote and what Mike wrote as even close to the same.

- Mark http://www.lucidimagination.com (mobile)

On Apr 15, 2010, at 12:05 AM, Shai Erera ser...@gmail.com wrote:
Ahh ... a dream finally comes true ... what a great way to start a day :). +1 !!! I have some questions/comments though:

* Index back compat should be maintained between major releases, like it is today, STRUCTURE-wise. So apps get a chance to incrementally upgrade their segments when they move from 2.x to 3.x, before 4.0 lands and they'll need to call optimize() to ensure 4.0 still works on their index. I hope that will still be the case? Otherwise I don't see how we can prevent reindexing by apps.
** Index behavioral/runtime changes, like those of Analyzers, are ok to require a reindex, as proposed.

So after 3.1 is out, trunk can break the API and 3.2 will have a new set of APIs? Cool and convenient. For how long do we keep the 3.1 branch around? Also, it used to only fix bugs, but from now on it'll be allowed to introduce new features, if they maintain back-compat? So 3.1.1 can have 'flex' (going for the extreme on purpose) if someone maintains back-compat?

I think the back-compat on branches should be only for index runtime changes. There's no point, in my opinion, to maintain API back-compat anymore for jar drop-ins - if apps will need to upgrade from 3.1 to 3.1.1 just to get a new feature, why keep it API back-supported? As soon as they upgrade to 3.2, that means a new set of APIs, right? Major releases will just change the index structure format then? Or move to Java 1.6? Well ... not even that, because as I understand it, 3.2 can move to Java 1.6 ... no API back-compat, right :).

That's definitely a great step forward!

Shai

On Thu, Apr 15, 2010 at 1:34 AM, Andi Vajda va...@osafoundation.org wrote:
On Thu, 15 Apr 2010, Earwin Burrfoot wrote:
Can't believe my eyes. +1

Likewise. +1 ! Andi..
On Thu, Apr 15, 2010 at 01:22, Michael McCandless luc...@mikemccandless.com wrote:
On Wed, Apr 14, 2010 at 12:06 AM, Marvin Humphrey mar...@rectangular.com wrote:
Essentially, we're free to break back compat within Lucy at any time, but we're not able to break back compat within a stable fork like Lucy1, Lucy2, etc. So what we'll probably do during normal development with Analyzers is just change them and note the break in the Changes file.

So... what if we change up how we develop and release Lucene:

* A major release always bumps the major release number (2.x - 3.0), and starts a new branch for all minor (3.1, 3.2, 3.3) releases along that branch
* There is no back compat across major releases (index nor APIs), but full back compat within branches.

This would match how many other projects work (KS/Lucy, as Marvin describes above; Apache Tomcat; Hibernate; log4j; FreeBSD; etc.). The 'stable' branch (say 3.x now for Lucene) would get bug fixes, and, if any devs have the itch, they could freely back-port improvements from trunk as long as they kept back-compat within the branch.

I think in such a future world, we could:

* Remove Version entirely!
* Not worry at all about back-compat when developing on trunk
* Give proper names to new improved classes, instead of the StandardAnalyzer2 or SmartStandardAnalyzer we end up doing today; rename existing classes.
* Let analyzers freely, incrementally improve
* Use interfaces without fear
* Stop spending the truly substantial time (look @ Uwe's awesome back-compat layer for analyzers!) that we now must spend on back-compat when adding new features
* Be more free to introduce very new, not-fully-baked features/APIs, marked as experimental, on the expectation that once they are used (in trunk) they will iterate/change/improve, vs trying so hard to get things right on the first go for fear of future back-compat horrors.

Thoughts...?
Mike

-- Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com) Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423 ICQ: 104465785
[jira] Commented: (LUCENE-2386) IndexWriter commits unnecessarily on fresh Directory
[ https://issues.apache.org/jira/browse/LUCENE-2386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12855732#action_12855732 ] Mark Miller commented on LUCENE-2386: - Is this change worth it, with all of its repercussions? What are the upsides? There do appear to be downsides...

IndexWriter commits unnecessarily on fresh Directory
Key: LUCENE-2386 URL: https://issues.apache.org/jira/browse/LUCENE-2386 Project: Lucene - Java Issue Type: Bug Components: Index Reporter: Shai Erera Assignee: Shai Erera Fix For: 3.1 Attachments: LUCENE-2386.patch, LUCENE-2386.patch, LUCENE-2386.patch

I've noticed IndexWriter's ctor makes a first (empty) commit if a fresh Directory is passed, w/ OpenMode.CREATE or CREATE_OR_APPEND. This seems unnecessary, and kind of brings back an autoCommit mode, in a strange way ... why do we need that commit? Do we really expect people to open an IndexReader on an empty Directory which they just passed to an IW w/ create=true? If they want, they can simply call commit() right away on the IW they created. I ran into this when writing a test which committed N times, then compared the number of commits (via IndexReader.listCommits) and was surprised to see N+1 commits. I tried to change doCommit to false in the IW ctor, but IndexFileDeleter jumped on me .. so the change might not be that simple. But I think it's manageable, so I'll try to attack it (and IFD specifically !) :).
[jira] Commented: (LUCENE-2386) IndexWriter commits unnecessarily on fresh Directory
[ https://issues.apache.org/jira/browse/LUCENE-2386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12855740#action_12855740 ] Mark Miller commented on LUCENE-2386: - {quote}I do think this is a good change - IW was previously inconsistent: first, it would even make a commit when we no longer have an autoCommit=true, and second, it would not make the commit for a directory that already had an index (we fixed this case a while back). So I like that this fix makes IW's init behavior more consistent / simpler.{quote}

That's not a very strong argument for a back-compat break on a minor release though...
[jira] Commented: (LUCENE-2386) IndexWriter commits unnecessarily on fresh Directory
[ https://issues.apache.org/jira/browse/LUCENE-2386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12855748#action_12855748 ] Mark Miller commented on LUCENE-2386: - bq. Hmmm... I think the back compat break is very minor

Yes - it is - but so was the argument for it, IMO. Your extended argument is more compelling though.
[jira] Created: (LUCENE-2391) Spellchecker uses default IW mergefactor/ramMB settings of 300/10
Spellchecker uses default IW mergeFactor/ramMB settings of 300/10
Key: LUCENE-2391 URL: https://issues.apache.org/jira/browse/LUCENE-2391 Project: Lucene - Java Issue Type: Improvement Components: contrib/spellchecker Reporter: Mark Miller Priority: Trivial

These settings seem odd - I'd like to investigate what makes the most sense here.
Re: Controlling the maximum size of a segment during indexing
Setting maxMergeMB does not limit the size of segments you will see - it simply limits which segments will be merged: segments over maxMergeMB will not be merged with other segments. You can still buffer up a ton of docs in RAM and flush a segment larger than maxMergeMB, or merge n segments smaller than maxMergeMB that create a segment larger than maxMergeMB.

-- - Mark http://www.lucidimagination.com

On 04/09/2010 01:01 AM, Lance Norskog wrote:
Here is a Java unit test that uses the LogByteSizeMergePolicy to control the maximum size of segment files during indexing. That is, it tries. It does not succeed. Will someone who truly understands the merge policy code please examine it? There is probably one tiny parameter missing. It adds 20 documents that are each 100k in size. It creates an index in a RAMDirectory which should have one segment that's a tad over 1mb, and then a set of segments that are a tad over 500k. Instead, the data does not flush until it commits, writing one 5m segment.

- org.apache.lucene.index.TestIndexWriterMergeMB ---

package org.apache.lucene.index;

/**
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements. See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License. You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

import java.io.IOException;

import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldSelectorResult;
import org.apache.lucene.document.Field.Index;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.LuceneTestCase;

/*
 * Verify that segment sizes are limited to # of bytes.
 *
 * Sizing:
 * Max MB is 0.5m. Verify against this plus 100k slop. (1.2x)
 * Min MB is 10k.
 * Each document is 100k.
 * mergeSegments=2
 * MaxRAMBuffer=1m. Verify against this plus 200k slop. (1.2x)
 *
 * This test should cause the ram buffer to flush after 10 documents, and create a CFS a little over 1meg.
 * The later documents should be flushed to disk every 5-6 documents, and create CFS files a little over 0.5meg.
 */
public class TestIndexWriterMergeMB extends LuceneTestCase {
  private static final int MERGE_FACTOR = 2;
  private static final double RAMBUFFER_MB = 1.0;
  static final double MIN_MB = 0.01d;
  static final double MAX_MB = 0.5d;
  static final double SLOP_FACTOR = 1.2d;
  static final double MB = 1000 * 1000;
  static String VALUE_100k = null;

  // Test controlling the merge policy for max segment size in bytes
  public void testMaxMergeMB() throws IOException {
    Directory dir = new RAMDirectory();
    IndexWriterConfig config = new IndexWriterConfig(
        TEST_VERSION_CURRENT, new WhitespaceAnalyzer(TEST_VERSION_CURRENT));
    LogByteSizeMergePolicy mergeMB = new LogByteSizeMergePolicy();
    config.setMergePolicy(mergeMB);
    mergeMB.setMinMergeMB(MIN_MB);
    mergeMB.setMaxMergeMB(MAX_MB);
    mergeMB.setUseCompoundFile(true);
    mergeMB.setMergeFactor(MERGE_FACTOR);
    config.setMaxBufferedDocs(100); // irrelevant, but the next line fails without this.
    config.setRAMBufferSizeMB(IndexWriterConfig.DISABLE_AUTO_FLUSH);
    MergeScheduler scheduler = new SerialMergeScheduler();
    config.setMergeScheduler(scheduler);
    IndexWriter writer = new IndexWriter(dir, config);

    System.out.println("Start indexing");
    for (int i = 0; i < 50; i++) {
      addDoc(writer, i);
      printSegmentSizes(dir);
    }
    checkSegmentSizes(dir);
    System.out.println("Commit");
    writer.commit();
    printSegmentSizes(dir);
    checkSegmentSizes(dir);
    writer.close();
  }

  // document that takes ~100k of RAM
  private void addDoc(IndexWriter writer, int i) throws IOException {
    if (VALUE_100k == null) {
      StringBuilder value = new StringBuilder(100000);
      for (int fill = 0; fill < 100000; fill++) {
        value.append('a');
      }
      VALUE_100k = value.toString();
    }
    Document doc = new Document();
    doc.add(new Field("id", i + "", Field.Store.YES, Field.Index.NOT_ANALYZED));
    doc.add(new Field("content",
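One detail of the test above may explain the behavior Lance reports (an observation on my part, not a verified fix): the config disables RAM-based flushing and sets maxBufferedDocs to 100, but only 50 documents are added, so no flush trigger ever fires before commit() - consistent with Mark's point that maxMergeMB only constrains merging, never the size of a flushed segment. Letting the RAM buffer drive flushing instead might look like:

```java
// Sketch, assuming the trunk-era IndexWriterConfig API: let RAM usage
// trigger flushes instead of disabling auto-flush entirely, so segments
// are written out at roughly RAMBUFFER_MB rather than only at commit().
config.setMaxBufferedDocs(IndexWriterConfig.DISABLE_AUTO_FLUSH);
config.setRAMBufferSizeMB(RAMBUFFER_MB); // flush roughly every ~1 MB of buffered docs
```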
[jira] Commented: (LUCENE-2372) Replace deprecated TermAttribute by new CharTermAttribute
[ https://issues.apache.org/jira/browse/LUCENE-2372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12855489#action_12855489 ] Mark Miller commented on LUCENE-2372: - bq.If I make it final and +1 - let's just remember to add these breaks to the CHANGES BW break section... Replace deprecated TermAttribute by new CharTermAttribute - Key: LUCENE-2372 URL: https://issues.apache.org/jira/browse/LUCENE-2372 Project: Lucene - Java Issue Type: Improvement Affects Versions: 3.1 Reporter: Uwe Schindler Fix For: 3.1 Attachments: LUCENE-2372.patch, LUCENE-2372.patch, LUCENE-2372.patch After LUCENE-2302 is merged to trunk with flex, we need to carry over all tokenizers and consumers of the TokenStreams to the new CharTermAttribute. We should also think about adding an AttributeFactory that creates a subclass of CharTermAttributeImpl that returns collation keys in the toBytesRef() accessor. CollationKeyFilter is then obsolete; instead you can simply convert every TokenStream to indexing only CollationKeys by changing the attribute implementation. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2074) Use a separate JFlex generated Unicode 4 by Java 5 compatible StandardTokenizer
[ https://issues.apache.org/jira/browse/LUCENE-2074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12854899#action_12854899 ] Mark Miller commented on LUCENE-2074: - {quote}Uwe, must this be coupled with that issue? This one waits for a long time (why? for the JFlex 1.5 release?) and protecting against a huge buffer allocation can be a real quick and tiny fix. And this one also focuses on getting Unicode 5 to work, which is unrelated to the buffer size. But the buffer size is not a critical issue either that we need to move fast with it ... so it's your call. Just thought they are two unrelated problems.{quote} Agreed. Whether it's fixed as part of this commit or not, it really deserves its own issue anyway, for changes and tracking. It has nothing to do with this issue other than convenience. Use a separate JFlex generated Unicode 4 by Java 5 compatible StandardTokenizer --- Key: LUCENE-2074 URL: https://issues.apache.org/jira/browse/LUCENE-2074 Project: Lucene - Java Issue Type: Bug Affects Versions: 3.0 Reporter: Uwe Schindler Assignee: Uwe Schindler Fix For: 3.1 Attachments: jflex-1.4.1-vs-1.5-snapshot.diff, jflexwarning.patch, LUCENE-2074-lucene30.patch, LUCENE-2074.patch, LUCENE-2074.patch, LUCENE-2074.patch, LUCENE-2074.patch, LUCENE-2074.patch, LUCENE-2074.patch, LUCENE-2074.patch, LUCENE-2074.patch, LUCENE-2074.patch The current trunk version of StandardTokenizerImpl was generated for Java 1.4 (according to the warning). In Lucene 3.0 we switched to Java 1.5, so we should regenerate the file. After regeneration the Tokenizer behaves differently for some characters. Because of that we should only use the new TokenizerImpl when Version.LUCENE_30 or LUCENE_31 is used as matchVersion. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1895) Point2D defines equals by comparing double types with ==
[ https://issues.apache.org/jira/browse/LUCENE-1895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12853402#action_12853402 ] Mark Miller commented on LUCENE-1895: - I put this up not knowing really anything about the specific use case(s) of the Point2D class - I have never used Spatial - so close if it makes sense to do so. My generic worry is that you can come to the *same* double value in two different ways, but == will not find them to be equal. Point2D defines equals by comparing double types with == Key: LUCENE-1895 URL: https://issues.apache.org/jira/browse/LUCENE-1895 Project: Lucene - Java Issue Type: Bug Components: contrib/spatial Reporter: Mark Miller Assignee: Chris Male Priority: Trivial Ideally, this should allow for a margin of error right? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
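The worry above is easy to reproduce with plain Java. A minimal sketch (not the actual contrib/spatial Point2D code; the class name, `almostEqual` helper, and epsilon value are all assumptions for illustration) of why `==` on doubles is fragile, and the margin-of-error alternative the issue suggests:

```java
// Two computations of the "same" value can produce different doubles,
// so == fails where an epsilon comparison succeeds.
public class EpsilonEqualsDemo {
    static final double EPSILON = 1e-9; // hypothetical margin of error

    static boolean almostEqual(double a, double b) {
        return Math.abs(a - b) <= EPSILON;
    }

    public static void main(String[] args) {
        double x = 0.1 + 0.2; // 0.30000000000000004 in IEEE 754 doubles
        double y = 0.3;
        System.out.println(x == y);            // false
        System.out.println(almostEqual(x, y)); // true
    }
}
```

Picking the epsilon is the hard part; for coordinates it would presumably be tied to the precision the spatial use case actually needs.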
Re: Welcome Shai Erera as Lucene/Solr committer
On 03/26/2010 09:07 AM, Michael McCandless wrote: I'm happy to announce that the PMC has accepted Shai Erera as Lucene/Solr committer! Welcome aboard Shai, Mike PS: it's customary to introduce yourself with a brief bio :) Congrats Shai! -- - Mark http://www.lucidimagination.com - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Modules
I also like the idea of a very basic analyzer set - I think you should still be able to do things with just the core jar - even if it's only very basic things. On 03/26/2010 11:56 AM, Uwe Schindler wrote: That will also mean heavy ANT build refactoring (oh no...). But I am also for a basic analyzer set (without Standard!!!). Uwe -Original Message- From: Shai Erera [mailto:ser...@gmail.com] Sent: Friday, March 26, 2010 4:16 PM To: java-dev@lucene.apache.org Subject: Re: Modules +1 for moving modules up one level. As for analyzers, I also prefer if lucene won't depend on modules even if just for the tests. That way one who doesn't use any module can check out lucene only. We can keep in lucene some basic analyzers (Whitespace, Simple) as well as a best out-of-the-box choice - Standard for new users. Shai On Friday, March 26, 2010, Earwin Burrfootear...@gmail.com wrote: Sounds good to me. I guess one thing to think about is the analyzers in core (should they move to this module, too?). If so, perhaps we could make 'ant test' of lucene depend on this module, since core tests use analyzers. But you could use lucene without an analyzers module, it wouldn't be a real dependency. You can leave only the most basic analyzers (one of them?) in core, switching over the tests that currently use more advanced ones. Then you move everything else to the module, along with analyzer-specific tests.
-- Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com) Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423 ICQ: 104465785 -- - Mark http://www.lucidimagination.com - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: New LuSolr trunk
This looks good to me. +1 on landing flex now. On 03/22/2010 08:27 AM, Uwe Schindler wrote: Hi all, the discussion about where to do development after the merge is now becoming concrete: Currently a lusolr test-trunk is done as a branch inside solr (https://svn.apache.org/repos/asf/lucene/solr/branches/newtrunk). The question is where to put the main development and how to switch, so non-developers that have checkouts of solr and/or lucene will notice the change and not send us outdated patches. I propose to do the following:
- Start a new top-level project folder inside the /lucene root svn folder: https://svn.apache.org/repos/asf/lucene/lusolr (please treat lusolr as a placeholder name) and add branches, tags subfolders to it. Do not create trunk; do this together with the next step.
- Move the branch from https://svn.apache.org/repos/asf/lucene/solr/branches/newtrunk to this new directory as trunk
- For lucene flexible indexing, create a corresponding flex branch there and svn copy it from the current new trunk. Merge the lucene flex changes into it. Alternatively, land flex now. Or simply do an svn copy of the current flex branch instead of merging (may be less work).
- Do the same for possible solr branches in development
- Create a tag in the lucene tags folder and in the solr tags folder with the current state of each trunk. After that delete all contents from the old trunk in solr and lucene and place a readme file pointing developers to the new merged trunk folder (for both old trunks).
This last step is important, else people who checkout the old trunk will soon see a very outdated view and may send us outdated patches in JIRA. When the contents of old-trunk disappear it's obvious to them what happened. If they already had some changes in their checkout, the svn client will keep the changed files as unversioned (after upgrade). The history stays available, so it's also possible to checkout an older version from trunk using @rev or -r rev.
I did a similar step with some backwards compatibility changes in lucene (add a README). Uwe - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de -Original Message- From: Michael McCandless [mailto:luc...@mikemccandless.com] Sent: Monday, March 22, 2010 11:37 AM To: java-dev@lucene.apache.org Subject: Re: (LUCENE-2297) IndexWriter should let you optionally enable reader pooling I think we should. It (newtrunk) was created to test Hoss's side-by-side proposal, and that approach looks to be working very well. Up until now we've been committing to the old trunk and then systematically merging over to newtrunk. I think we should now flip that, i.e., commit to newtrunk and only merge back to the old trunk if for some strange reason it's needed. Mike On Mon, Mar 22, 2010 at 6:32 AM, Uwe Schindleru...@thetaphi.de wrote: Are we now only working on newtrunk? - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de -Original Message- From: Michael McCandless (JIRA) [mailto:j...@apache.org] Sent: Monday, March 22, 2010 11:22 AM To: java-dev@lucene.apache.org Subject: [jira] Resolved: (LUCENE-2297) IndexWriter should let you optionally enable reader pooling [ https://issues.apache.org/jira/browse/LUCENE- 2297?page=com.atlassian.jira.plugin.system.issuetabpanels:all- tabpanel ] Michael McCandless resolved LUCENE-2297. Resolution: Fixed Fixed on newtrunk. IndexWriter should let you optionally enable reader pooling --- Key: LUCENE-2297 URL: https://issues.apache.org/jira/browse/LUCENE- 2297 Project: Lucene - Java Issue Type: Improvement Reporter: Michael McCandless Priority: Minor Fix For: 3.1 Attachments: LUCENE-2297.patch For apps that use a large index and frequently need to commit and resolve deletes, the cost of opening the SegmentReaders on demand for every commit can be prohibitive. We can already pool readers (NRT does so), but we only turn it on if NRT readers are in use.
We should allow separate control. We should do this after LUCENE-2294. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail:
[jira] Commented: (LUCENE-1709) Parallelize Tests
[ https://issues.apache.org/jira/browse/LUCENE-1709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12848712#action_12848712 ] Mark Miller commented on LUCENE-1709: - +1 on removing those flags - personally I find them unnecessary - and they complicate the build. And I would love to see Lucene run its tests in parallel like Solr does now. Parallelize Tests - Key: LUCENE-1709 URL: https://issues.apache.org/jira/browse/LUCENE-1709 Project: Lucene - Java Issue Type: Improvement Components: Index Affects Versions: 2.4.1 Reporter: Jason Rutherglen Fix For: 3.1 Attachments: LUCENE-1709.patch, runLuceneTests.py Original Estimate: 48h Remaining Estimate: 48h The Lucene tests can be parallelized to make for a faster testing system. This task from ANT can be used: http://ant.apache.org/manual/CoreTasks/parallel.html Previous discussion: http://www.gossamer-threads.com/lists/lucene/java-dev/69669 Notes from Mike M.: {quote} I'd love to see a clean solution here (the tests are embarrassingly parallelizable, and we all have machines with good concurrency these days)... I have a rather hacked-up solution now, that uses -Dtestpackage=XXX to split the tests up. Ideally I would be able to say use N threads and it'd do the right thing... like the -j flag to make. {quote} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
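The "use N threads and it'd do the right thing" idea Mike describes can be sketched in plain Java with an executor. This is a toy stand-in, not the Lucene build or the Ant `<parallel>` task; the class name and the dummy "test" tasks are assumptions for illustration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Farm independent "test" tasks out to N worker threads, like make's -j flag.
public class ParallelRunner {
    public static void main(String[] args) throws Exception {
        int threads = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        List<Future<String>> results = new ArrayList<Future<String>>();
        for (int i = 0; i < 8; i++) {
            final int id = i;
            results.add(pool.submit(new Callable<String>() {
                public String call() {
                    // stand-in for running one test class
                    return "test-" + id + " passed";
                }
            }));
        }
        // collect in submit order; a Future that threw would fail the "build" here
        for (Future<String> f : results) {
            System.out.println(f.get());
        }
        pool.shutdown();
    }
}
```

The tests really are embarrassingly parallel in this model: each task is independent, so the only coordination point is collecting the results.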
Re: Running the Solr/Lucene tests failed
Robert very recently committed some stuff that parallelizes the solr tests that may need to be worked out in all cases still (if that is indeed the problem here). A variety of devs have tested it, but there may be a lingering issue? No helpful errors printed above BUILD FAILED? The line the errors you pasted gives is simply the line that fails the build if tests failed. There is still a way to run them sequentially (as Hudson should be doing) that Robert should be able to let you in on as well. But it would be nice to get to the bottom of this. - Mark On 03/23/2010 03:36 PM, Michael Busch wrote: Hi all, I wanted to commit LUCENE-2329. I just checked out the new combined trunk https://svn.apache.org/repos/asf/lucene/dev/trunk and ran ant test. After 20 mins the build failed on the unmodified code (see below). I hadn't applied my patch yet. What's the status of the combined trunk? Should the tests pass? As far as I can tell all lucene tests were successful (core, contrib, bw), but the Solr tests failed. Is there more setup for the Solr part necessary after 'svn checkout'? Michael BUILD FAILED /Users/michael/Documents/workspace/lucene-solr-trunk/trunk/build.xml:28: The following error occurred while executing this line: /Users/michael/Documents/workspace/lucene-solr-trunk/trunk/solr/build.xml:393: The following error occurred while executing this line: /Users/michael/Documents/workspace/lucene-solr-trunk/trunk/solr/build.xml:472: Tests failed! The following error occurred while executing this line: /Users/michael/Documents/workspace/lucene-solr-trunk/trunk/solr/build.xml:472: Tests failed! The following error occurred while executing this line: /Users/michael/Documents/workspace/lucene-solr-trunk/trunk/solr/build.xml:472: Tests failed! The following error occurred while executing this line: /Users/michael/Documents/workspace/lucene-solr-trunk/trunk/solr/build.xml:472: Tests failed! 
[the same "Tests failed!" line from solr/build.xml:472 repeated for each remaining failed test run] Total time: 19 minutes 38 seconds -- - Mark http://www.lucidimagination.com - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Running the Solr/Lucene tests failed
If you do an update your issue should be resolved. This is something we ran into the other day as well, and have been solving it a bit at a time ;) - Mark On 03/23/2010 04:29 PM, Robert Muir wrote: Yeah, it's a bit confusing... before, exceptions happening in other threads were silently hidden. Uwe fixed this in Lucene I think, and right now the verbosity is cranked for Solr, too. Yonik is hacking away at these tests to quiet the ones that are truly expected exceptions... At least I think I got this right... On Tue, Mar 23, 2010 at 4:26 PM, Michael Buschbusch...@gmail.com wrote: I see. And all the other exceptions printed are expected? Michael On 3/23/10 1:20 PM, Robert Muir wrote: Thanks Michael, this isn't a parallel test problem at all, it's a sporadic problem with Solr's Jetty tests (the same problem I mentioned in the previous response). You might/will see this problem running the tests sequentially too. Test org.apache.solr.client.solrj.embedded.JettyWebappTest FAILED On Tue, Mar 23, 2010 at 4:15 PM, Michael Buschbusch...@gmail.comwrote: Sorry for the lack of details. Thought I had just not done an obvious step. Attached is the output from the Solr part. Btw: This machine is a Solr virgin, Solr never ran on it before. Michael On 3/23/10 1:00 PM, Mark Miller wrote: Robert very recently committed some stuff that parallelizes the solr tests that may need to be worked out in all cases still (if that is indeed the problem here). A variety of devs have tested it, but there may be a lingering issue? No helpful errors printed above BUILD FAILED? The line the errors you pasted give is simply the line that fails the build if tests failed. There is still a way to run them sequentially (as Hudson should be doing) that Robert should be able to let you in on as well. But it would be nice to get to the bottom of this. - Mark On 03/23/2010 03:36 PM, Michael Busch wrote: Hi all, I wanted to commit LUCENE-2329.
I just checked out the new combined trunk https://svn.apache.org/repos/asf/lucene/dev/trunk and ran ant test. After 20 mins the build failed on the unmodified code (see below). I hadn't applied my patch yet. What's the status of the combined trunk? Should the tests pass? As far as I can tell all lucene tests were successful (core, contrib, bw), but the Solr tests failed. Is there more setup for the Solr part necessary after 'svn checkout'? Michael BUILD FAILED /Users/michael/Documents/workspace/lucene-solr-trunk/trunk/build.xml:28: The following error occurred while executing this line: /Users/michael/Documents/workspace/lucene-solr-trunk/trunk/solr/build.xml:393: The following error occurred while executing this line: /Users/michael/Documents/workspace/lucene-solr-trunk/trunk/solr/build.xml:472: Tests failed! [the same "Tests failed!" line from solr/build.xml:472 repeated for each remaining failed test run]
[jira] Commented: (LUCENE-1814) Some Lucene tests try and use a Junit Assert in new threads
[ https://issues.apache.org/jira/browse/LUCENE-1814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12847931#action_12847931 ] Mark Miller commented on LUCENE-1814: - Chris Male mentioned to me that he thinks Uwe has fixed this? Some Lucene tests try and use a Junit Assert in new threads --- Key: LUCENE-1814 URL: https://issues.apache.org/jira/browse/LUCENE-1814 Project: Lucene - Java Issue Type: Bug Reporter: Mark Miller Priority: Minor There are a few cases in Lucene tests where JUnit Asserts are used inside a new threads run method - this won't work because Junit throws an exception when a call to Assert fails - that will kill the thread, but the exception will not propagate to JUnit - so unless a failure is caused later from the thread termination, the Asserts are invalid. TestThreadSafe TestStressIndexing2 TestStringIntern -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
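The failure mode LUCENE-1814 describes, and the usual fix, can be shown in a self-contained sketch (plain Java, no JUnit dependency; `AssertionError` stands in for a failed JUnit Assert, and the class and field names are assumptions): record any Throwable the worker thread produces, then surface it back on the main thread after join().

```java
// An assertion failing inside a spawned thread kills only that thread;
// the test passes unless the main thread captures and rethrows the failure.
public class ThreadAssertDemo {
    public static void main(String[] args) throws InterruptedException {
        final Throwable[] failure = new Throwable[1];
        Thread worker = new Thread(new Runnable() {
            public void run() {
                try {
                    // stand-in for a JUnit Assert that fails in the worker
                    throw new AssertionError("deliberate failure");
                } catch (Throwable t) {
                    failure[0] = t; // capture instead of dying silently
                }
            }
        });
        worker.start();
        worker.join();
        // back on the main (JUnit) thread, the captured failure can now fail the test
        if (failure[0] != null) {
            System.out.println("propagated: " + failure[0].getMessage());
        }
    }
}
```

Without the capture-and-rethrow step, the AssertionError unwinds the worker's stack and the test framework never sees it, which is exactly why the asserts in TestThreadSafe and friends were invalid.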
Re: lucene and solr trunk
Alright, so we have implemented Hoss' suggestion here on the lucene/solr merged dev branch at lucene/solr/branches/newtrunk. Feel free to check it out and give some feedback. We also roughly have Solr running on Lucene trunk - e.g. compiling Solr will first compile lucene and run off those compiled class files. Running dist or example in Solr will grab Lucene's jars and put them in the war. This still needs further love, but it works. There is also a top-level build.xml with two targets: clean, and test. Clean will clean both Lucene and Solr, and test will run tests for both Lucene and Solr. Thanks to everyone that contributed to getting all this working! -- - Mark http://www.lucidimagination.com On 03/17/2010 12:40 PM, Mark Miller wrote: Okay, so this looks good to me (a few others seemed to like it - though Lucene-Dev was somehow dropped earlier) - let's try this out on the branch? (then we can get rid of that horrible branch name ;) ) Anyone on the current branch object to having to do a quick svn switch? On 03/16/2010 06:46 PM, Chris Hostetter wrote: : Otis, yes, I think so, eventually. But that's gonna take much more discussion. : : I don't think this initial cutover should try to solve how modules : will be organized, yet... we'll get there, eventually. But we should at least consider it, and not move in a direction that's distinct from the ultimate goal of better refactoring (especially since that was one of the main goals of unifying development efforts) Here's my concrete suggestion that could be done today (for simplicity: $svn = https://svn.apache.org/repos/asf/lucene)...
svn mv $svn/java/trunk $svn/java/tmp-migration
svn mkdir $svn/java/trunk
svn mv $svn/solr/trunk $svn/java/trunk/solr
svn mv $svn/java/tmp-migration $svn/java/trunk/core
At which point: 0. People who want to work only on Lucene-Java can start checking out $svn/java/trunk/core (I'm pretty sure existing checkouts will continue to work w/o any changes, the svn info should just update itself) 1.
build files can be added to (the new) $svn/java/trunk to build ./core followed by ./solr 2. the build files in $svn/java/trunk/solr can be modified to look at ../core/ to find lucene jars 3. people who care about Solr (including all committers) should start checking out and building all of $svn/java/trunk 4. Long term, we could choose to branch all of $svn/java/trunk for releases ... AND/OR we could choose to branch specific modules (ie: solr) independently (with modifications to the build files on those branches to pull in their dependencies from alternate locations) 5. Long term, we can start refactoring additional modules out of $svn/java/trunk/solr and $svn/java/trunk/core (like $svn/java/trunk/core/contrib) into their own directory in $svn/java/trunk 6. Long term, people who want to work on more than just core but don't care about certain modules (like solr) can do a simple non-recursive checkout of $svn/java/trunk and then do full checkouts of whatever modules they care about (Please note: I'm just trying to list things we *could* do if we go this route, I'm not advocating that we *should* do any of these things) I can't think of any objections people have raised to any of the previous suggestions which apply to this suggestion. Is there anything people can think of that would be useful, but not possible, if we go this route? -Hoss - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2305) Introduce Version in more places long before 4.0
[ https://issues.apache.org/jira/browse/LUCENE-2305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12846542#action_12846542 ] Mark Miller commented on LUCENE-2305: - Ah, yes - I didn't remember your comment right: {quote} We could make the change under Version? (Change to true, starting in 3.1). Or maybe not make the change. If set to true, we use pct deletion on a segment to reduce its perceived size when selecting merges, which generally causes segments with pending deletions to be merged away sooner {quote} Sounds like a good move. Introduce Version in more places long before 4.0 Key: LUCENE-2305 URL: https://issues.apache.org/jira/browse/LUCENE-2305 Project: Lucene - Java Issue Type: Improvement Reporter: Shai Erera Fix For: 3.1 We need to introduce Version in as many places as we can (wherever it makes sense of course), and preferably long before 4.0 (or shall I say 3.9?) is out. That way, we can have a bunch of deprecated API now, that will be gone in 4.0, rather than doing it one class at a time and never finish :). The purpose is to introduce Version wherever it is mandatory now, and also in places where we think it might be useful in the future (like most of our Analyzers, configured classes and configuration classes). I marked this issue for 3.1, though I don't expect it to end in 3.1. I still think it will be done one step at a time, perhaps for clusters of classes together. But on the other hand I don't want to mark it for 4.0.0 because that needs to be resolved much sooner. So if I had a 3.9 version defined, I'd mark it for 3.9. We can do several commits in one issue, right? So this one can live for a while in JIRA, while we gradually convert more and more classes. The first candidate is InstantiatedIndexWriter which probably should take an IndexWriterConfig. While I converted the code to use IWC, I've noticed Instantiated defaults its maxFieldLength to the current default (10,000) which is deprecated.
I couldn't change it for back-compat reasons. But we can upgrade it to accept IWC, and set to unlimited if the version is onOrAfter 3.1, otherwise stay w/ the deprecated default. If it's acceptable to have several commits in one issue, I can start w/ Instantiated, post a patch and then we can continue to more classes. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
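The onOrAfter gate described above is easy to model. This is a simplified self-contained sketch, not org.apache.lucene.util.Version itself: the toy enum, `VersionGateDemo` class, and `maxFieldLength` helper are assumptions mirroring the Instantiated example in the issue.

```java
// Toy stand-in for org.apache.lucene.util.Version: ordinal order gives onOrAfter.
enum Version {
    LUCENE_30, LUCENE_31;
    boolean onOrAfter(Version other) { return compareTo(other) >= 0; }
}

public class VersionGateDemo {
    // Apps passing an old matchVersion keep the deprecated default;
    // apps opting into 3.1+ get the new unlimited behavior.
    static int maxFieldLength(Version matchVersion) {
        return matchVersion.onOrAfter(Version.LUCENE_31)
            ? Integer.MAX_VALUE // new default: unlimited
            : 10000;            // deprecated legacy default
    }

    public static void main(String[] args) {
        System.out.println(maxFieldLength(Version.LUCENE_30)); // 10000
        System.out.println(maxFieldLength(Version.LUCENE_31)); // 2147483647
    }
}
```

The point of the pattern is exactly this: one code path, with the behavior change visible and opt-in via the Version the application declares, instead of a silent back-compat break.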
[jira] Commented: (LUCENE-2320) Add MergePolicy to IndexWriterConfig
[ https://issues.apache.org/jira/browse/LUCENE-2320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12846622#action_12846622 ] Mark Miller commented on LUCENE-2320: - +1 - I've had to do this in the past too. Just dropping tests doesn't seem like the way to go in many cases. Add MergePolicy to IndexWriterConfig Key: LUCENE-2320 URL: https://issues.apache.org/jira/browse/LUCENE-2320 Project: Lucene - Java Issue Type: Improvement Components: Index Reporter: Shai Erera Assignee: Michael McCandless Fix For: 3.1 Attachments: LUCENE-2320.patch, LUCENE-2320.patch, LUCENE-2320.patch, LUCENE-2320.patch Now that IndexWriterConfig is in place, I'd like to move MergePolicy to it as well. The change is not straightforward and so I've kept it for a separate issue. MergePolicy requires an IndexWriter in its ctor, however none can be passed to it before an IndexWriter actually exists. And today IW may create an MP just for it to be overridden by the application one line afterwards. I don't want to make the iw member of MP non-final, or settable by extending classes, however it needs to remain protected so they can access it directly. So the proposed changes are:
* Add a SetOnce object (to o.a.l.util), or Immutable, which can only be set once (hence its name). It'll have the signature SetOnce<T> w/ *synchronized set(T)* and *T get()*. The wrapped object will be declared volatile, so that get() won't be synchronized.
* MP will define a *protected final SetOnce<IndexWriter> writer* instead of the current writer. *NOTE: this is a bw break*. Any suggestions are welcome.
* MP will offer a public default ctor, together with a set(IndexWriter).
* IndexWriter will set itself on MP using set(this). Note that if set is called more than once, it will throw an exception (AlreadySetException - or does someone have a better suggestion, preferably an already existing Java exception?).
That's the core idea.
I'd like to post a patch soon, so I'd appreciate your review and proposals. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
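The proposed SetOnce could look roughly like the following sketch. This is a reading of the proposal in the message above, not a committed implementation; AlreadySetException (the name floated in the issue) is modeled here as an IllegalStateException subclass:

```java
// Sketch of SetOnce<T>: set() is synchronized and may only succeed once;
// get() is unsynchronized, relying on the volatile field for visibility.
final class SetOnce<T> {

    // The exception name proposed in the issue; subclassing an existing
    // Java exception keeps it familiar to callers.
    static final class AlreadySetException extends IllegalStateException {
        AlreadySetException() {
            super("The object cannot be set twice");
        }
    }

    private volatile T obj;  // volatile so get() needs no lock
    private boolean isSet;   // guarded by synchronized set()

    public synchronized void set(T obj) {
        if (isSet) {
            throw new AlreadySetException();
        }
        this.obj = obj;
        this.isSet = true;
    }

    public T get() {
        return obj;
    }
}
```

With this, MergePolicy could hold a `protected final SetOnce<IndexWriter> writer`, and IndexWriter would call `writer.set(this)` exactly once.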
[jira] Commented: (LUCENE-2323) reorganize contrib modules
[ https://issues.apache.org/jira/browse/LUCENE-2323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12846711#action_12846711 ] Mark Miller commented on LUCENE-2323: - This reorg is a great step for contrib IMO! +1 reorganize contrib modules -- Key: LUCENE-2323 URL: https://issues.apache.org/jira/browse/LUCENE-2323 Project: Lucene - Java Issue Type: Improvement Components: contrib/* Reporter: Robert Muir it would be nice to reorganize contrib modules, so that they are bundled together by functionality. For example: * the wikipedia contrib is a tokenizer, I think it really belongs in contrib/analyzers * there are two highlighters, I think they could be one highlighters package. * there are many queryparsers and queries in different places in contrib -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: lucene and solr trunk
On 03/16/2010 03:43 AM, Simon Willnauer wrote: One more thing which I wonder about even more is that this whole merging happens so quickly for reasons I don't see right now. I don't want to keep anybody from making progress but it appears like a rush to me. Meh - I think you're just plain wrong about this. Anyone can work as fast as they want on anything. Nothing has happened faster than the community wants yet. You're too concerned. This is called discussion. Nothing has happened. In my opinion, the whole freak out over what goes where in svn was so overblown - it's so easy to move this stuff around at the drop of a hat. That's why it was suggested we put a branch there, and no one saw anything wrong with it for the moment - everyone said, well, we can just easily move it if someone has an issue - which we did. Didn't expect the freak out though. Frankly, we were just seeking a branch really, and didn't care where it went. Some of us are anxious to do some work - some of us are anxious to merge some code - no one is forcing this stuff on the others at a rapid pace - everyone gets their say as always. This is why we wanted a branch we could commit what we wanted to. SVN locations make starting the merge of code easier. They are easy to change. This is not like rushing index format changes. It's source code location - it can be moved at the drop of a hat. The sooner we resolve what we are going to do, the sooner we can start getting more of the work done that we hoped to get done with this merge. This thread starts that discussion. You can't start a discussion too early. Perhaps it leads to another discussion first, but there is no such thing as rushing the start of a discussion. It doesn't say figure it out by tomorrow, cause we are doing this tomorrow. It doesn't say figure this out by next week, because we are doing this next week. It says let's discuss where this is going to go.
I think some people just need to relax, discuss what they would like to see, and worry less about how fast others are working. Fast work is good. It means more work gets done. Nothing is going to happen until the community figures things out. BTW: I still have the impression that if I don't follow IRC constantly I'm missing important things. That's your impression then. Follow IRC if you want. People talk all over the place about Lucene/Solr - many times in places you can't follow - if it didn't happen on the list, it didn't happen. Michael Busch follows up saying, people say it was discussed thoroughly on IRC - so what? It doesn't count as a valid point of reference. I haven't seen that, but if someone does say it, you can just tell them so - they owe you an explanation. -- - Mark http://www.lucidimagination.com - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: lucene and solr trunk
On 03/16/2010 07:05 AM, Shalin Shekhar Mangar wrote: Wow, you guys are moving fast! That's a good thing. IRC is fine if you want to discuss something quickly. But it has its limitations. For example, I cannot follow IRC most of the time because I'm in a different time zone. But I don't want to stop anyone either. In fact, I can't do that. Nobody can. All I want to say is that once discussions have happened and a plan is agreed upon, it may be a good idea to let solr-dev/java-dev know the plan. In this case I didn't know a new branch was created until I saw a commit notification and then Yonik's email. Hi Shalin - I like your attitude ;) - Yonik's email was the notification of the plan :) Though we had no plan. When Robert and I made the branch we had no plan really - we just needed a place to put together our patches and do the final work. We were trying to do it with patches, but it was becoming difficult. But when we started we had no real plan - just to see if we could get Solr up and running on Lucene 3.0.1 and then trunk. Anything beyond that, we have not planned for - and before that was even completed, there were emails to java-dev about it. But we conceived nothing beyond seeing if we could get Solr running on the latest Lucene. From our perspective, we would have been just as happy with a branch on my local hard drive! That would have taken longer to set up though. -- - Mark http://www.lucidimagination.com - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: lucene and solr trunk
On 03/16/2010 09:05 AM, Andrzej Bialecki wrote: On 2010-03-16 12:29, Mark Miller wrote: From our perspective, we would have been just as happy with a branch on my local hard drive! That would have taken longer to set up though. You could have used git instead. There is good integration between git and svn, and it's much easier (a giant understatement...) to handle branching and merging in git, both between git branches and syncing with external svn. Yeah, we have actually discussed doing things like Git in the past - probably the main reason we didn't is the learning curve at the moment. I haven't used it yet. -- - Mark http://www.lucidimagination.com - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: lucene and solr trunk
On 03/16/2010 10:09 AM, Yonik Seeley wrote: On Tue, Mar 16, 2010 at 2:51 AM, Michael Buschbusch...@gmail.com wrote: Also, we're in review-and-commit process, not commit-and-review. Changes have to be proposed, discussed and ideally attached to JIRA as patches first. Correction, just for the sake of avoiding future confusion (i.e. I'm not making any point about this thread): Lucene and Solr have always officially been CTR. For trunk, we normally use a bit of informal lazy consensus for anything big, hard, or that might be controversial... but we are not officially RTC. -Yonik - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org In any case, this is a branch. People really want to enforce RTC on a branch??? Even if that was our official process on trunk (which I agree it has not been), that's not how the flex branch worked. That's not how the solr_cloud branch worked. That's not how other previous branches have worked. IMO - anyone should be able to create a branch for anything - to play around with whatever they want. We should encourage this. Branches are good. And they take up little space. Branch changes have to be proposed, discussed, and attached to JIRA? Ugh - I certainly hope not. Branches should be considered replacements for huge unwieldy patches. Do I have to propose and discuss before I put up a patch? -- - Mark http://www.lucidimagination.com - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: #lucene IRC log [was: RE: lucene and solr trunk]
On 03/16/2010 02:57 PM, Grant Ingersoll wrote: On Mar 16, 2010, at 2:47 PM, Steven A Rowe wrote: On 03/16/2010 at 6:06 AM, Michael McCandless wrote: Does anyone know how other projects fold in IRC...? I gather from the deafening silence that we'll have to figure it out as we go... I think some (not all) of the discomfort associated with IRC could be addressed with a permanent, searchable, linkable archive of #lucene. I went looking for IRC loggers and found http://colabti.org/. One of the things hosted there is a searchable, linkable permanent archive of several freenode channels. I posted on #irclogger asking about hosting a #lucene archive, and apparently all we have to do is ask, after first determining that nobody objects. Here's a link (not incidentally, this is exactly what we will have for #lucene once the service is switched on): http://colabti.org/irclogger/irclogger_log/irclogger?date=2010-03-16#l2 So, would anybody participating on #lucene object to a permanent archive? (I'm also going to provide a link to this thread on #lucene to make sure everybody there knows about the issue.) There's also a lot of chatter that happens on IRC, so logging is going to have a lot of noise. I'm still on the fence on what to do. I don't want to get in people's way, but we also need to have traceability about decisions, and we certainly can't have answers like "We discussed this on IRC and you missed it, too bad" happening (not saying that has happened, just saying I don't want to see it). -Grant - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org Even with logging, I'm against using IRC for making decisions, or as something people can point to. Even with searchable logging, I think we should stick with: if it didn't happen on the lists, it didn't happen.
It's the same as when some of us get together and talk about Lucene and Solr - that's great stuff - you can get a lot done that is a lot harder on the lists - you can hash a lot out. But I think people should always have the right to act like it didn't happen - the same as if we are at ApacheCon or something - we don't come back and say, sorry, you missed all the discussion, but we had one and this is what we are going to do. We summarize the discussion on the list (like Mike likes to do with IRC), and answer questions as people have them. I personally think it's great to come to mini agreements with real-time talk - then it just has to make its way through the list. This isn't a counter point to anything you said Grant, just a nice place for me to drop this. -- - Mark http://www.lucidimagination.com - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: [DISCUSS] Do away with Contrib Committers and make core committers
On 03/15/2010 08:33 AM, Grant Ingersoll wrote: Right, Mark. I think we would be effectively raising the bar to some extent for what it takes to be a committer. That's part of my point though - some are contrib committers with a lower bar - now they are core/solr committers with that lower bar, but someone else that came along would not get to the same position now? We'd also be making contrib a first class citizen (not that it ever wasn't, but some people have that perception). I think because it was kind of true. I could come along before and donate contrib x, and never show I worked well with the community or built up the merit needed to be a committer, and be made a contrib committer simply to maintain my module. That's happened plenty. Finally, I think we need to recognize that not everyone needs to be a McCandless in order to contribute in a helpful way. We obviously recognize that or else I wouldn't be here! I think it's more about fitting in - showing you get and follow the Apache way. Showing that ideas and changes you might push are in line with what the other committers think is appropriate for a core/solr committer. Talent is not key here - community is. The bar for this has been *much* higher for core than for contrib in the past. And contrib has had different bars over time - I think it was even lower in the past at points. I think sometimes we forget that you can do svn revert. I hate to have to do that. I don't think it's a great way to handle this - we could make everyone a committer at the drop of a hat and say we can just revert. I wouldn't call for a revert except in exceptional circumstances. I don't think that's the point. Obviously, we don't want to have to do it often, but it's not a huge deal if it happens. We've all been there.
-Grant - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org I also wouldn't personally cast my vote on this broadly - some people I might think should be core/solr committers now, others not. Merit at Apache is important - you never lose it. Seems weird to get something like that so easily when in the past you had to work your way to it from contrib committership and get voted on individually by the PMC. Personally I'd prefer we just stop adding them, and the current ones work their way up like normal if they are so inclined, or the ones that are not even around anymore can just stay as they are. -- - Mark http://www.lucidimagination.com - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: lucene and solr trunk
On 03/15/2010 11:28 PM, Yonik Seeley wrote: So, we have a few options on where to put Solr's new trunk: Solr moves to Lucene's trunk: /java/trunk, /java/trunk/sol +1. With the goal of merged dev, merged tests, this looks the best to me. Simple to do patches that span both, simple to setup Solr to use Lucene trunk rather than jars. Short paths. Simple. I like it. -- - Mark http://www.lucidimagination.com - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: [DISCUSS] Do away with Contrib Committers and make core committers
On 03/14/2010 06:37 PM, Grant Ingersoll wrote: On Mar 14, 2010, at 2:03 PM, Uwe Schindler wrote: This time a +1 without discuss :-) Yeah, but Uwe, the thread was DISCUSS, not VOTE! :-) I had a whole spiel about earning merit, and how some contrib committers were made contrib committers for just a single contrib, some long ago, and didn't necessarily have to show they understood/followed the Apache way - a lower bar (not necessarily from a talent perspective, but you might be made a contrib committer just to maintain the code module you contributed, whether you worked with the community or not), etc, etc. But ah, since everyone is into it without discussion, far be it from me to stand against. And I got my spiel in (super condensed) anyway now. With everyone else into it so far, I just look foolish trying to discuss :) - Mark - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de -Original Message- From: Grant Ingersoll [mailto:gsi...@gmail.com] On Behalf Of Grant Ingersoll Sent: Sunday, March 14, 2010 5:54 PM To: java-dev@lucene.apache.org Subject: [DISCUSS] Do away with Contrib Committers and make core committers Given the notion of one project, one set of committers, I think we should do away with the notion of contrib committers for java-dev and just have everyone be committers. Practically speaking, this would make all existing contrib committers core committers. I think the notion of contrib committers has added to the confusion about the status of contrib, as well as acted like a probation for new committers. To me, I don't think we should make that distinction; as has been evidenced time and time again, if we trust someone to commit to contrib, we can trust them to commit to core. And if we don't trust them to commit to core, then we probably shouldn't trust them to commit to contrib either. Much of being a committer is knowing what not to touch as much as what to touch, and I trust that all of our contrib committers know that. Thoughts?
-Grant - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org -- Grant Ingersoll http://www.lucidimagination.com/ Search the Lucene ecosystem using Solr/Lucene: http://www.lucidimagination.com/search - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org -- - Mark http://www.lucidimagination.com - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Welcome Chris Male as Contrib committer!
I am happy to announce the Lucene PMC has accepted Chris Male as a contrib committer! Chris has been making a lot of headway in cleaning up the spatial contrib lately, and hopefully now we can get more of those improvements into svn! Congrats Chris, and welcome! -- - Mark http://www.lucidimagination.com
[jira] Commented: (LUCENE-2309) Fully decouple IndexWriter from analyzers
[ https://issues.apache.org/jira/browse/LUCENE-2309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12844516#action_12844516 ] Mark Miller commented on LUCENE-2309: - bq. Also IRC is not logged/archived and searchable (I think?) which makes it impossible to trace back a discussion, and/or randomly stumble upon it in Google. Apache's rule is, if it didn't happen on the lists, it didn't happen. #IRC is a great way for people to communicate and hash stuff out, but it's not necessary that you follow it. If you have questions or want further elaboration, just ask. No one can expect you to follow IRC, nor is it a valid reference for where something was decided. IRC is great - I think it's really benefited us to have devs discuss there - but the official position is, if it didn't happen on the list, it didn't actually happen. Fully decouple IndexWriter from analyzers - Key: LUCENE-2309 URL: https://issues.apache.org/jira/browse/LUCENE-2309 Project: Lucene - Java Issue Type: Improvement Components: Index Reporter: Michael McCandless IndexWriter only needs an AttributeSource to do indexing. Yet, today, it interacts with Field instances, holds a private analyzer, invokes analyzer.reusableTokenStream, and has to deal with a wide variety of cases (it's not analyzed; it is analyzed but it's a Reader or String; it's pre-analyzed). I'd like to have IW only interact with attr sources that already arrived with the fields. This would be a powerful decoupling -- it means others are free to make their own attr sources. They need not even use any of Lucene's analysis impls; eg they can integrate to other things like [OpenPipeline|http://www.openpipeline.org]. Or make something completely custom. LUCENE-2302 is already a big step towards this: it makes IW agnostic about which attr is the term, and only requires that it provide a BytesRef (for flex).
Then I think LUCENE-2308 would get us most of the remaining way -- ie, if the FieldType knows the analyzer to use, then we could simply create a getAttrSource() method (say) on it and move all the logic IW has today onto there. (We'd still need existing IW code for back-compat). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: [jira] Commented: (LUCENE-2308) Separately specify a field's type
Committers are competent in different areas of the code. Even Mike wasn't big into the search side until per-segment search. Committers are trusted to mess with the pieces they know. I don't see anyone even remotely suggesting that users should have to understand all of the implications of posting format modifications. Just sounds like a nasty jab to me. - Mark http://www.lucidimagination.com On Mar 12, 2010, at 2:43 PM, Marvin Humphrey (JIRA) j...@apache.org wrote: [ https://issues.apache.org/jira/browse/LUCENE-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12844637#action_12844637 ] Marvin Humphrey commented on LUCENE-2308: - If you disable term freq, you also have to disable positions. The freq tells you how many positions there are. I think it's asking an awful lot of our users to require that they understand all the implications of posting format modifications when committers have difficulty mastering all the subtleties. Separately specify a field's type - Key: LUCENE-2308 URL: https://issues.apache.org/jira/browse/LUCENE-2308 Project: Lucene - Java Issue Type: Improvement Components: Index Reporter: Michael McCandless This came up from discussions on IRC. I'm summarizing here... Today when you make a Field to add to a document you can set things like indexed or not, stored or not, analyzed or not, details like omitTfAP, omitNorms, index term vectors (separately controlling offsets/positions), etc. I think we should factor these out into a new class (FieldType?). Then you could re-use this FieldType instance across multiple fields. The Field instance would still hold the actual value. We could then do per-field analyzers by adding a setAnalyzer on the FieldType, instead of the separate PerFieldAnalyzerWrapper (likewise for per-field codecs (with flex), where we now have PerFieldCodecWrapper). This would NOT be a schema! It's just refactoring what we already specify today. EG it's not serialized into the index.
This has been discussed before, and I know Michael Busch opened a more ambitious (I think?) issue. I think this is a good first baby step. We could consider a hierarchy of FieldType (NumericFieldType, etc.) but maybe hold off on that for starters... -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
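The FieldType factoring described in the issue above might look roughly like this sketch. All class shapes, field names, and options here are illustrative guesses at the proposal, not Lucene's final API:

```java
// Index-time options factored out of Field, so one FieldType instance
// can be shared across many fields (not serialized into the index).
final class FieldType {
    final boolean indexed;
    final boolean stored;
    final boolean analyzed;
    final boolean omitNorms;

    FieldType(boolean indexed, boolean stored, boolean analyzed, boolean omitNorms) {
        this.indexed = indexed;
        this.stored = stored;
        this.analyzed = analyzed;
        this.omitNorms = omitNorms;
    }
}

// The Field instance still holds the actual value; the type info is shared.
final class Field {
    final String name;
    final String value;
    final FieldType type;

    Field(String name, String value, FieldType type) {
        this.name = name;
        this.value = value;
        this.type = type;
    }
}
```

The point of the refactor is reuse: one `FieldType` built once, passed to every `Field` of that kind, instead of repeating the same flags per field.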
Re: svn commit: r921480 [1/8] - in /lucene/java/trunk: ./ contrib/analyzers/common/src/test/org/apache/lucene/analysis/query/ contrib/analyzers/common/src/test/org/apache/lucene/analysis/shingle/ cont
On 03/10/2010 01:48 PM, Robert Muir wrote: On Wed, Mar 10, 2010 at 1:40 PM, Shai Ereraser...@gmail.com wrote: I wrote that I defaulted to Whitespace for convenience reasons only. Now you don't need to specify anything if you don't care how the content is indexed, which is really the case for TONS of tests. The code became so much simpler. I guess I don't see it this way. It may be convenient for us, but it's inconvenient for new users, as they see it as 'lucene's default'. No one wants to do more work than is necessary: currently a lot of people use StandardAnalyzer for this reason, maybe without a lot of thought. but this is ok. StandardAnalyzer at least does things like lowercasing. Those who do care will pay attention to it anyway :). I see it as the inverse: I would rather our tests have new WhitespaceAnalyzer than see users complain on the java-user mailing list that lucene doesn't ignore case differences or punctuation, because they don't need to think about this. Whitespace is a shitty default for a search engine, it's only good for tests. +1. I don't think we should default the Analyzer. I agree that WhitespaceAnalyzer is not a good default. And I don't think Standard is a good default. I'm in agreement that you should have to specify one, to force thinking about it. -- - Mark http://www.lucidimagination.com - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2294) Create IndexWriterConfiguration and store all of IW configuration there
[ https://issues.apache.org/jira/browse/LUCENE-2294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12843717#action_12843717 ] Mark Miller commented on LUCENE-2294: - bq. If we say Analyzer is mandatory, what will stop us tomorrow from saying IndexDeletionPolicy is mandatory? Nothing ;) But I think Analyzer should be mandatory and that IndexDeletionPolicy should not be mandatory, looking at them case by case. Create IndexWriterConfiguration and store all of IW configuration there --- Key: LUCENE-2294 URL: https://issues.apache.org/jira/browse/LUCENE-2294 Project: Lucene - Java Issue Type: Improvement Components: Index Reporter: Shai Erera Assignee: Michael McCandless Fix For: 3.1 Attachments: LUCENE-2294.patch, LUCENE-2294.patch, LUCENE-2294.patch I would like to factor out of all IW configuration parameters into a single configuration class, which I propose to name IndexWriterConfiguration (or IndexWriterConfig). I want to store there almost everything besides the Directory, and to reduce all the ctors down to one: IndexWriter(Directory, IndexWriterConfiguration). What I was thinking of storing there are the following parameters: * All of ctors parameters, except for Directory. * The different setters where it makes sense. For example I still think infoStream should be set on IW directly. I'm thinking that IWC should expose everything in a setter/getter methods, and defaults to whatever IW defaults today. Except for Analyzer which will need to be defined in the ctor of IWC and won't have a setter. I am not sure why MaxFieldLength is required in all IW ctors, yet IW declares a DEFAULT (which is an int and not MaxFieldLength). Do we still think that 1 should be the default? Why not default to UNLIMITED and otherwise let the application decide what LIMITED means for it? I would like to make MFL optional on IWC and default to something, and I hope that default will be UNLIMITED. 
We can document that on IWC, so that if anyone chooses to move to the new API, he should be aware of that ... I plan to deprecate all the ctors and getters/setters and replace them by: * One ctor as described above * getIndexWriterConfiguration, or simply getConfig, which can then be queried for the setting of interest. * About the setters, I think maybe we can just introduce a setConfig method which will override everything that is overridable today, except for Analyzer. So someone could do iw.getConfig().setSomething(); iw.setConfig(newConfig); ** The setters on IWC can return an IWC to allow chaining set calls ... so the above will turn into iw.setConfig(iw.getConfig().setSomething1().setSomething2()); BTW, this is needed for Parallel Indexing (see LUCENE-1879), but I think it will greatly simplify IW's API. I'll start to work on a patch. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
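The chained-setter idea above (iw.setConfig(iw.getConfig().setSomething1().setSomething2())) can be sketched with a minimal, self-contained stand-in for the proposed IndexWriterConfig. The setter names and defaults here are illustrative only:

```java
// Each setter returns `this`, so configuration reads as one expression.
final class IndexWriterConfig {
    private int maxFieldLength = Integer.MAX_VALUE; // the hoped-for UNLIMITED default
    private int maxBufferedDocs = 1000;             // illustrative default

    IndexWriterConfig setMaxFieldLength(int maxFieldLength) {
        this.maxFieldLength = maxFieldLength;
        return this; // returning this is what enables chaining
    }

    IndexWriterConfig setMaxBufferedDocs(int maxBufferedDocs) {
        this.maxBufferedDocs = maxBufferedDocs;
        return this;
    }

    int getMaxFieldLength() {
        return maxFieldLength;
    }

    int getMaxBufferedDocs() {
        return maxBufferedDocs;
    }
}
```

Usage would then look like `iw.setConfig(iw.getConfig().setMaxFieldLength(5000).setMaxBufferedDocs(100))`, with a single IndexWriter(Directory, IndexWriterConfig) ctor replacing today's overloads.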
[jira] Commented: (LUCENE-2294) Create IndexWriterConfiguration and store all of IW configuration there
[ https://issues.apache.org/jira/browse/LUCENE-2294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12843729#action_12843729 ] Mark Miller commented on LUCENE-2294: - bq. Question - does SOLR require everyone to specify an Analyzer, or does it come w/ a default one? Hmm... SOLR doesn't really use Lucene analyzers directly. It comes with a default schema.xml that defines FieldTypes. Then field names can be assigned to FieldTypes. So technically speaking, no, Solr does not - but because most people build off the example, you could say that it does have defaults for the example FieldTypes and defaults for which field names map to those. But it also only accepts certain example fields with the example schema - you really have to go in and customize it to your needs - it's set up to basically show off what options are available and to work with some demo stuff. Solr comes with almost no defaults in a way - but it does ship with an example setup that is meant to show you how to set things up, and what is available. You could consider those defaults since most will build off it.
example of Solr analyzer declaration: {code}
<!-- A general unstemmed text field - good if one does not know the language of the field -->
<fieldType name="textgen" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
{code} Create IndexWriterConfiguration and store all of IW configuration there --- Key: LUCENE-2294 URL: https://issues.apache.org/jira/browse/LUCENE-2294 Project: Lucene - Java Issue Type: Improvement Components: Index Reporter: Shai Erera Assignee: Michael McCandless Fix For: 3.1 Attachments: LUCENE-2294.patch, LUCENE-2294.patch, LUCENE-2294.patch I would like to factor out of all IW configuration parameters into a single configuration class, which I propose to name IndexWriterConfiguration (or IndexWriterConfig). I want to store there almost everything besides the Directory, and to reduce all the ctors down to one: IndexWriter(Directory, IndexWriterConfiguration). What I was thinking of storing there are the following parameters: * All of ctors parameters, except for Directory. * The different setters where it makes sense. For example I still think infoStream should be set on IW directly.
I'm thinking that IWC should expose everything in a setter/getter methods, and defaults to whatever IW defaults today. Except for Analyzer which will need to be defined in the ctor of IWC and won't have a setter. I am not sure why MaxFieldLength is required in all IW ctors, yet IW declares a DEFAULT (which is an int and not MaxFieldLength). Do we still think that 1 should be the default? Why not default to UNLIMITED and otherwise let the application decide what LIMITED means for it? I would like to make MFL optional on IWC and default to something, and I hope that default will be UNLIMITED. We can document that on IWC, so that if anyone chooses to move to the new API, he should be aware of that ... I plan to deprecate all the ctors and getters/setters and replace them by: * One ctor as described above * getIndexWriterConfiguration, or simply getConfig, which can then be queried for the setting of interest. * About the setters, I think maybe we can just introduce a setConfig method which will override everything that is overridable today, except for Analyzer. So someone could do iw.getConfig().setSomething(); iw.setConfig(newConfig); ** The setters on IWC can return an IWC to allow chaining set calls ... so the above will turn into iw.setConfig(iw.getConfig().setSomething1().setSomething2()); BTW, this is needed for Parallel Indexing (see LUCENE-1879), but I think it will greatly simplify IW's API. I'll start
[jira] Commented: (LUCENE-2294) Create IndexWriterConfiguration and store all of IW configuration there
[ https://issues.apache.org/jira/browse/LUCENE-2294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12843756#action_12843756 ]

Mark Miller commented on LUCENE-2294:
-

I'm assuming you would set an Analyzer for the document - and then you could override per field - or something along those lines.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (LUCENE-2089) explore using automaton for fuzzyquery
[ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12843331#action_12843331 ]

Mark Miller commented on LUCENE-2089:
-

Sweet!

explore using automaton for fuzzyquery
--
Key: LUCENE-2089
URL: https://issues.apache.org/jira/browse/LUCENE-2089
Project: Lucene - Java
Issue Type: Improvement
Components: Search
Affects Versions: Flex Branch
Reporter: Robert Muir
Assignee: Robert Muir
Priority: Minor
Fix For: Flex Branch
Attachments: ContrivedFuzzyBenchmark.java, createLevAutomata.py, gen.py, gen.py, gen.py, gen.py, gen.py, gen.py, Lev2ParametricDescription.java, Lev2ParametricDescription.java, Lev2ParametricDescription.java, Lev2ParametricDescription.java, LUCENE-2089.patch, LUCENE-2089.patch, LUCENE-2089.patch, LUCENE-2089.patch, LUCENE-2089.patch, LUCENE-2089.patch, LUCENE-2089.patch, LUCENE-2089.patch, LUCENE-2089.patch, LUCENE-2089.patch, LUCENE-2089.patch, LUCENE-2089.patch, LUCENE-2089.patch, LUCENE-2089.patch, LUCENE-2089.patch, LUCENE-2089.patch, LUCENE-2089_concat.patch, Moman-0.2.1.tar.gz, moman-57f5dc9dd0e7.diff, TestFuzzy.java

we can optimize fuzzyquery by using AutomatonTermsEnum. The idea is to speed up the core FuzzyQuery in similar fashion to the Wildcard and Regex speedups, maintaining all backwards compatibility. The advantages are:
* we can seek to terms that are useful, instead of brute-forcing the entire terms dict
* we can determine matches faster, as true/false from a DFA is an array lookup; we don't even need to run levenshtein.

We build Levenshtein DFAs in linear time with respect to the length of the word: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.652

To implement support for 'prefix' length, we simply concatenate two DFAs, which doesn't require us to do NFA-DFA conversion, as the prefix portion is a singleton. The concatenation is also constant time with respect to the size of the fuzzy DFA; it only needs to examine its start state.
With this algorithm, parametric tables are precomputed so that DFAs can be constructed very quickly. If the required number of edits is too large (we don't have a table for it), we use dumb mode at first (no seeking, no DFA, just brute force like now). As the priority queue fills up during enumeration, the similarity score required to be a competitive term increases, so the enum gets faster and faster as this happens. This is because terms in core FuzzyQuery are sorted by boost value, then by term (in lexicographic order). For a large term dictionary with a low minimal similarity, you will fill the pq very quickly since you will match many terms. This not only provides a mechanism to switch to more efficient DFAs (edit distance of 2 -> edit distance of 1 -> edit distance of 0) during enumeration, but also to switch from dumb mode to smart mode. With this design, we can add more DFAs at any time by adding additional tables. The tradeoff is that the tables get rather large, so for very high K we would start to increase the size of Lucene's jar file. The idea is that we don't have to include large tables for very high K, by using the 'competitive boost' attribute of the priority queue. For more information, see http://en.wikipedia.org/wiki/Levenshtein_automaton
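For contrast with the DFA approach, the "dumb mode" fallback described above amounts to a per-term Levenshtein computation. Here is a minimal, self-contained sketch of that brute-force check (illustrative only, not the actual Lucene implementation):

```java
// Brute-force ("dumb mode") fuzzy matching: compute the classic
// dynamic-programming Levenshtein distance for every candidate term and
// compare it against a cutoff k. The parametric Levenshtein DFA replaces
// this O(|a|*|b|) per-term work with a per-character state lookup.
final class BruteForceFuzzy {
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] cur = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            cur[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                cur[j] = Math.min(Math.min(cur[j - 1] + 1, prev[j] + 1),
                                  prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = cur; cur = tmp; // reuse rows
        }
        return prev[b.length()];
    }

    static boolean matches(String query, String term, int k) {
        return editDistance(query, term) <= k;
    }
}
```

Every term in the dictionary pays the full DP cost here, which is why seeking with a DFA (a cheap accept/reject per term) is such a win on large term dictionaries.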
Vote on merging dev of Lucene and Solr
For those committers that don't follow the general mailing list, or don't follow it that closely: we are currently having a vote for committers: http://search.lucidimagination.com/search/document/4722d3144c2e3a8b/vote_merge_lucene_solr_development

--
- Mark
http://www.lucidimagination.com
[jira] Commented: (LUCENE-2294) Create IndexWriterConfiguration and store all of IW configuration there
[ https://issues.apache.org/jira/browse/LUCENE-2294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12841249#action_12841249 ]

Mark Miller commented on LUCENE-2294:
-

I can see the value in this - there are a bunch of IW constructors - but personally I still think I prefer them. Creating config classes to init another class is its own pain in the butt. Reminds me of windows C programming and structs. When I'm just coding away, its so much easier to just enter the params in the cnstr. And it seems like it would be more difficult to know whats *required* to set on the config class - without the same cstr business ...
[jira] Issue Comment Edited: (LUCENE-2294) Create IndexWriterConfiguration and store all of IW configuration there
[ https://issues.apache.org/jira/browse/LUCENE-2294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12841249#action_12841249 ]

Mark Miller edited comment on LUCENE-2294 at 3/4/10 1:45 PM:
-

I can see the value in this - there are a bunch of IW constructors - but personally I still think I prefer them. Creating config classes to init another class is its own pain in the butt. Reminds me of windows C programming and structs. When I'm just coding away, its so much easier to just enter the params in the cnstr. And it seems like it would be more difficult to know whats *required* to set on the config class - without the same cstr business ...

*edit* Though I suppose the chaining *does* make this more swallowable... new IW(new IWConfig(Analyzer).set().set().set()) isn't really so bad ...

was (Author: markrmil...@gmail.com):
I can see the value in this - there are a bunch of IW constructors - but personally I still think I prefer them. Creating config classes to init another class is its own pain in the butt. Reminds me of windows C programming and structs. When I'm just coding away, its so much easier to just enter the params in the cnstr. And it seems like it would be more difficult to know whats *required* to set on the config class - without the same cstr business ...
Re: Request for clarification on unordered SpanNearQuery
On 03/04/2010 11:34 AM, Goddard, Michael J. wrote:

// Question: why wouldn't this Span be found?
assertTrue("fourth range", spans.next());
assertEquals("fourth doc", 11, spans.doc());
assertEquals("fourth start", 2, spans.start());
assertEquals("fourth end", 6, spans.end());

Spans are funny beasts ;) No Spans ever start from the same position more than once. In effect, they are always marching forward. The third range starts at 2, and once it finds a match starting at 2, it moves on. So it won't find the other match that starts at 2. Spans are not exhaustive - exhaustive matching would be a different algorithm. So yes, you are wrong in your expectation :) Just how Spans were implemented.

--
- Mark
http://www.lucidimagination.com
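The "marching forward" behavior can be illustrated with a toy two-term near matcher (purely illustrative; Lucene's actual NearSpans code is considerably more involved): once a span starting at position p is emitted, no second span starting at p is ever reported.

```java
import java.util.ArrayList;
import java.util.List;

// Toy illustration of why a Spans enumeration is not exhaustive: for each
// start position it emits at most one match, then marches forward.
final class ToySpans {
    // posA/posB: sorted term positions; returns [start, end) windows in
    // which both terms co-occur within 'slop', at most one per start.
    static List<int[]> nearSpans(int[] posA, int[] posB, int slop) {
        List<int[]> out = new ArrayList<>();
        int lastStart = -1;
        for (int a : posA) {
            for (int b : posB) {
                if (Math.abs(b - a) <= slop) {
                    int start = Math.min(a, b);
                    int end = Math.max(a, b) + 1;
                    if (start > lastStart) { // never re-emit the same start
                        out.add(new int[] { start, end });
                        lastStart = start;
                    }
                    break; // first match for this occurrence only; move on
                }
            }
        }
        return out;
    }
}
```

With posA = {2}, posB = {4, 6}, and slop 4, only the [2, 5) window is reported; the equally valid match that also starts at 2 but ends at 6 is skipped, just as in the question above.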
[jira] Commented: (LUCENE-2287) Unexpected terms are highlighted within nested SpanQuery instances
[ https://issues.apache.org/jira/browse/LUCENE-2287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12839744#action_12839744 ]

Mark Miller commented on LUCENE-2287:
-

bq. Breaks backward compatibility, so need to find a way around that

Wouldn't be the end of the world depending on the break.

Unexpected terms are highlighted within nested SpanQuery instances
--
Key: LUCENE-2287
URL: https://issues.apache.org/jira/browse/LUCENE-2287
Project: Lucene - Java
Issue Type: Improvement
Components: contrib/highlighter
Affects Versions: 2.9.1
Environment: Linux, Solaris, Windows
Reporter: Michael Goddard
Priority: Minor
Attachments: LUCENE-2287.patch, LUCENE-2287.patch, LUCENE-2287.patch, LUCENE-2287.patch, LUCENE-2287.patch
Original Estimate: 336h
Remaining Estimate: 336h

I haven't yet been able to resolve why I'm seeing spurious highlighting in nested SpanQuery instances. Briefly, the issue is illustrated by the second instance of "Lucene" being highlighted in the test below, when it doesn't satisfy the inner span. There's been some discussion about this on the java-dev list, and I'm opening this issue now because I have made some initial progress on this.
This new test, added to the HighlighterTest class in lucene_2_9_1, illustrates this:

{code}
/*
 * Ref: http://www.lucidimagination.com/blog/2009/07/18/the-spanquery/
 */
public void testHighlightingNestedSpans2() throws Exception {
  String theText = "The Lucene was made by Doug Cutting and Lucene great Hadoop was"; // Problem
  //String theText = "The Lucene was made by Doug Cutting and the great Hadoop was"; // Works okay
  String fieldName = SOME_FIELD_NAME;
  SpanNearQuery spanNear = new SpanNearQuery(new SpanQuery[] {
      new SpanTermQuery(new Term(fieldName, "lucene")),
      new SpanTermQuery(new Term(fieldName, "doug")) }, 5, true);
  Query query = new SpanNearQuery(new SpanQuery[] { spanNear,
      new SpanTermQuery(new Term(fieldName, "hadoop")) }, 4, true);
  String expected = "The <B>Lucene</B> was made by <B>Doug</B> Cutting and Lucene great <B>Hadoop</B> was";
  //String expected = "The <B>Lucene</B> was made by <B>Doug</B> Cutting and the great <B>Hadoop</B> was";
  String observed = highlightField(query, fieldName, theText);
  System.out.println("Expected: \"" + expected + "\"\n" + "Observed: \"" + observed + "\"");
  assertEquals("Why is that second instance of the term \"Lucene\" highlighted?",
      expected, observed);
}
{code}

Is this an issue that's arisen before? I've been reading through the source to QueryScorer, WeightedSpanTerm, WeightedSpanTermExtractor, Spans, and NearSpansOrdered, but haven't found the solution yet. Initially, I thought that the extractWeightedSpanTerms method in WeightedSpanTermExtractor should be called on each clause of a SpanNearQuery or SpanOrQuery, but that didn't get me too far.
Re: Adding .classpath.tmpl
+1 - I'd prefer this stay out of svn as well - I'd rather it go on the wiki too - perhaps in the same place that you can find the formatting file for eclipse and intellij.

--
- Mark
http://www.lucidimagination.com

On 02/25/2010 11:10 AM, Grant Ingersoll wrote:
To me, this is stuff that can go on the wiki or somewhere else; otherwise, over time, there will be others to add in, etc. We could simply add a pointer to the wiki page in the README.

On Feb 24, 2010, at 11:55 PM, Shai Erera wrote:
Hi

I always find it annoying, when I checkout the code to a new project in eclipse, that I need to put everything that I care about in the classpath and add the dependent libraries. On another project I'm involved with, we did that process once, adding all the source code and the libraries to the classpath, and created a .classpath.tmpl. Now when people checkout the code, they can copy the content of that file to their .classpath file, and setting up the project goes from a couple of minutes down to a few seconds. I don't want to check in .classpath because not everyone wants all the code in their classpath. I attached such a file to the mail. Note that the only dependency which will break on other machines is the ant.jar dependency, which on my Windows machine is located under c:\ant. That jar is required to compile contrib/ant from eclipse. Not sure how to resolve that, except removing that line from the file and documenting separately that that's what you need to do if you want to add contrib/ant ... The file is sorted by name, putting the core stuff at the top - so it's easy for people to selectively add the interesting packages. I don't know if an issue is required; if so, I can create one and move the discussion there.
Shai

lucene.classpath.tmpl
Re: Question on highlighting of nested SpanQuery instances
Hey Michael - this is currently just a limitation of the Span highlighter. It does a bit of fudging when determining what a good position is - if a term from the text is found within the span of a spanquery it is in (no matter how deeply nested), the highlighter makes a guess that the term should be highlighted - this is because we don't have the actual positions of each term - just the positions of the start and end of the span. In almost all cases this works as you would expect - but when nesting spans like this, you can get spurious results within the overall span.

So your idea that we should recurse into the Span is on the right track - but it just gets fairly complicated quickly. Consider SpanNear(SpanNear("mark", "miller", 3), SpanTerm("lucene"), 4) - if we recurse in and grab the first SpanNear ("mark", "miller", 3), we can correctly highlight that - but then we will handle "lucene" by itself - so all "lucene" terms will be hit, rather than the one within 4 of the first span. So you have to deal with SpanOr, SpanNear, SpanNot recursively, but then also handle when they are linked, either with each other or with a SpanTerm - and uh - it gets hard real fast. Hence the fuzziness that goes on now. There may be something we can do to improve things in the future, but it's kind of an accepted limitation at the moment - prob something we should add some doc about.

- Mark

Goddard, Michael J. wrote:
Hello,

I initially posted a version of this question to java-user, but think it's more of a java-dev question. I haven't yet been able to resolve why I'm seeing spurious highlighting in nested SpanQuery instances.
To illustrate this, I added the code below to the HighlighterTest class in lucene_2_9_1:

/*
 * Ref: http://www.lucidimagination.com/blog/2009/07/18/the-spanquery/
 */
public void testHighlightingNestedSpans2() throws Exception {
  String theText = "The Lucene was made by Doug Cutting and Lucene great Hadoop was"; // Problem
  //String theText = "The Lucene was made by Doug Cutting and the great Hadoop was"; // Works okay
  String fieldName = SOME_FIELD_NAME;
  SpanNearQuery spanNear = new SpanNearQuery(new SpanQuery[] {
      new SpanTermQuery(new Term(fieldName, "lucene")),
      new SpanTermQuery(new Term(fieldName, "doug")) }, 5, true);
  Query query = new SpanNearQuery(new SpanQuery[] { spanNear,
      new SpanTermQuery(new Term(fieldName, "hadoop")) }, 4, true);
  String expected = "The <B>Lucene</B> was made by <B>Doug</B> Cutting and Lucene great <B>Hadoop</B> was";
  //String expected = "The <B>Lucene</B> was made by <B>Doug</B> Cutting and the great <B>Hadoop</B> was";
  String observed = highlightField(query, fieldName, theText);
  System.out.println("Expected: \"" + expected + "\"\n" + "Observed: \"" + observed + "\"");
  assertEquals("Why is that second instance of the term \"Lucene\" highlighted?",
      expected, observed);
}

Is this an issue that's arisen before? I've been reading through the source to QueryScorer, WeightedSpanTerm, WeightedSpanTermExtractor, Spans, and NearSpansOrdered, but haven't found the solution yet. Initially, I thought that the extractWeightedSpanTerms method in WeightedSpanTermExtractor should be called on each clause of a SpanNearQuery or SpanOrQuery, but that didn't get me too far. Any suggestions are welcome. Thanks.

Mike

--
- Mark
http://www.lucidimagination.com
Re: (LUCENE-1844) Speed up junit tests
On 02/20/2010 05:45 PM, Michael McCandless wrote:
Currently the tests run 1 jvm per test suite (eg, TestIndexWriter has its own jvm), I believe, and we haven't seen test failures... so I think for the most part tests are not interfering with each other (messing up global state). It should be less likely that we see interactions across test suites (but obviously still possible). I think we should commit this and then if there are somehow problems we can address them, then?

+1

Mike

On Sun, Feb 14, 2010 at 6:27 AM, Robert Muir <rcm...@gmail.com> wrote:
its not just statics, I think we should really look at ensuring files are closed etc, or eventually there will be a problem! I guess in general the tradeoff is, it requires us to have better test code.

On Sun, Feb 14, 2010 at 5:53 AM, Uwe Schindler <u...@thetaphi.de> wrote:
At least we should check all core tests to not set any static defaults without try...finally! Are there any possibilities inside Eclipse/other IDEs to check this?

Uwe
-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

-----Original Message-----
From: Michael McCandless [mailto:luc...@mikemccandless.com]
Sent: Sunday, February 14, 2010 11:43 AM
To: java-dev@lucene.apache.org
Subject: Re: (LUCENE-1844) Speed up junit tests

Wow -- this is MUCH faster! I think we should switch... It seems like we use a batchtest for all core tests, then for all back-compat tests, then once per contrib package? Ie, so ant test-core uses one jvm? I think we should simply fix any badly behaved tests (that don't restore statics). It's impressive we already have no test failures when we do this... I guess our tests are already cleaning things up (though also probably not often changing global state, or changing it in a way that'd lead other tests to fail).
Mike

On Sat, Feb 13, 2010 at 5:23 PM, Robert Muir <rcm...@gmail.com> wrote:
On Fri, Nov 27, 2009 at 1:27 PM, Michael McCandless <luc...@mikemccandless.com> wrote:
Also one thing I'd love to try is NOT forking the JVM for each test (fork=no in the junit task). I wonder how much time that'd buy...

it shaves off a good deal of time on my machine.
'ant test-core': 4 minutes, 39 seconds -> 3 minutes, 3 seconds
'ant test': 11 minutes, 8 seconds -> 7 minutes, 13 seconds

however, it makes me a little nervous because i'm not sure all the tests cleanup nicely if they change statics and stuff. anyway, here's the trivial patch (you don't want fork=no, because it turns off assertions)

Index: common-build.xml
===================================================================
--- common-build.xml (revision 909395)
+++ common-build.xml (working copy)
@@ -398,7 +398,7 @@
     </condition>
     <mkdir dir="@{junit.output.dir}"/>
     <junit printsummary="off" haltonfailure="no" maxmemory="512M"
-           errorProperty="tests.failed" failureProperty="tests.failed">
+           errorProperty="tests.failed" failureProperty="tests.failed" forkmode="perBatch">
       <classpath refid="@{junit.classpath}"/>
       <assertions>
         <enable package="org.apache.lucene"/>

--
Robert Muir
rcm...@gmail.com
Re: Question on highlighting of nested SpanQuery instances
I played with it sometime back, but I don't have any code left from that exercise. Its fairly tricky. Take your example:

SpanNearQuery spanNear = new SpanNearQuery(new SpanQuery[] {
    new SpanTermQuery(new Term(fieldName, "lucene")),
    new SpanTermQuery(new Term(fieldName, "doug")) }, 5, true);
Query query = new SpanNearQuery(new SpanQuery[] { spanNear,
    new SpanTermQuery(new Term(fieldName, "hadoop")) }, 4, true);

First you see the top level SpanNearQuery - you want to recurse in and just work with the "lucene within 5 of doug, ordered" part. But you can't actually work with that alone. That whole span also has to be within 4 of hadoop, ordered ... so how do you constrain the sub highlighting? Lets say you do it somehow. Now you recurse in and want to highlight hadoop - but again, not every hadoop - only the hadoops that are within 4, ordered, of the first Span. So that's really the issue - you want to break up the Span and highlight recursively - but you can't really break them up and maintain all of the positional restrictions required.

So another possible option that gets a little messier might be: when extracting the allowable positions for a term (which it does by checking the start and end of the span), you might also run each inner span that contains that term, and then intersect the positions you find that way with the positions found with the overall span, and use that list as the allowable positions. That could get kind of complicated though, especially taking into account the logic of the or and spannot spanqueries.

- Mark

On 02/22/2010 03:15 PM, Goddard, Michael J. wrote:
Mark, Thanks a lot for the insight. I'm working with this today, diving into the WeightedSpanTermExtractor class and fiddling with it. If you ever did have any code which attempted to recurse into these structures, I'd be happy to get my hands on it. Thanks again.
Mike -Original Message- From: Mark Miller [mailto:markrmil...@gmail.com] Sent: Mon 2/22/2010 9:15 AM To: java-dev@lucene.apache.org Cc: Goddard, Michael J. Subject: Re: Question on highlighting of nested SpanQuery instances Hey Michael - this is currently just a limitation of the Span highlighter. It does a bit of fudging when determining what a good position is - if a term from the text is found within the span of a spanquery it is in (no matter how deeply nested), the highlighter makes a guess that the term should be highlighted - this is because we don't have the actual positions of each term - just the positions of the start and end of the span. In almost all cases this works as you would expect - but when nesting spans like this, you can get spurious results within the overall span. So your idea that we should recurse into the Span is on the right track - but it just gets fairly complicated quick. Consider SpanNear(SpanNear(mark, miller,3), SpanTerm(lucene), 4) - if we recurse in an grab the first SpanNear (mark, miller, 3), we can correctly highlight that - but then we will handle lucene by itself - so all lucene terms will be hit rather than the one within 4 of the first span. So you have to deal with SpanOr, SpanNear, SpanNot recursively, but then also handle when they are linked, either with each other or with a SpanTerm - and uh - its gets hard real fast. Hence the fuzziness that goes on now. There may be something we can do to improve things in the future, but its kind of an accepted limitation at the moment - prob something we should add some doc about. - Mark Goddard, Michael J. wrote: Hello, I initially posted a version of this question to java-user, but think it's more of a java-dev question. I haven't yet been able to resolve why I'm seeing spurious highlighting in nested SpanQuery instances. 
To illustrate this, I added the code below to the HighlighterTest class in lucene_2_9_1:

/*
 * Ref: http://www.lucidimagination.com/blog/2009/07/18/the-spanquery/
 */
public void testHighlightingNestedSpans2() throws Exception {
  String theText = "The Lucene was made by Doug Cutting and Lucene great Hadoop was"; // Problem
  //String theText = "The Lucene was made by Doug Cutting and the great Hadoop was"; // Works okay
  String fieldName = SOME_FIELD_NAME;
  SpanNearQuery spanNear = new SpanNearQuery(new SpanQuery[] {
      new SpanTermQuery(new Term(fieldName, "lucene")),
      new SpanTermQuery(new Term(fieldName, "doug")) }, 5, true);
  Query query = new SpanNearQuery(new SpanQuery[] { spanNear,
      new SpanTermQuery(new Term(fieldName, "hadoop")) }, 4, true);
  String expected = "The <B>Lucene</B> was made by <B>Doug</B> Cutting and Lucene great <B>Hadoop</B> was";
  //String expected = "The <B>Lucene</B> was made by <B>Doug</B> Cutting and the great <B>Hadoop</B> was";
  String observed = highlightField(query, fieldName, theText);
  System.out.println("Expected: \"" + expected + "\"\n" + "Observed: \"" + observed + "\"");
  assertEquals("Why is that second instance of the term \"Lucene\" highlighted?", expected
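The "intersect the positions" option floated in this thread can be sketched as a plain position-set intersection. The helper below is hypothetical, not the Highlighter's actual code: a term occurrence is only highlightable if it lies inside both the overall span's window and the relevant inner span's window.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch of the position-intersection idea: keep only term
// positions that fall inside BOTH the outer span window and the inner
// span window (windows are [start, end) position ranges). Real code would
// have to combine windows per sub-span, including SpanOr/SpanNot logic,
// which is where it gets complicated.
final class PositionIntersection {
    static Set<Integer> allowablePositions(Set<Integer> termPositions,
                                           int outerStart, int outerEnd,
                                           int innerStart, int innerEnd) {
        Set<Integer> allowed = new TreeSet<>();
        for (int p : termPositions) {
            boolean inOuter = p >= outerStart && p < outerEnd;
            boolean inInner = p >= innerStart && p < innerEnd;
            if (inOuter && inInner) allowed.add(p);
        }
        return allowed;
    }
}
```

In the test text above, "Lucene" occurs at roughly positions 1 and 7; if the outer span covers [1, 10) but the inner (lucene near doug) span only covers [1, 6), the intersection rules out the spurious second occurrence.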
Looks like we missed a little change for 3.0 ...
/* TODO 3.0: change this default to true */
protected boolean calibrateSizeByDeletes = false;

Better to make these JIRA issues to avoid the miss? -- - Mark http://www.lucidimagination.com - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: [VOTE] Lucene Java 2.9.2 and 3.0.1 release artifacts
+1 from me - I've put both releases through their paces - though technically, there are a handful of files that look like they need Apache headers (css, html), as reported by rat. I don't think this is a new issue though, so I don't think it's something we need to be that picky about right now. A ref to the Apache header policy: With few exceptions #faq-exceptions, all human-readable Apache-developed files that are included within a distribution must include the header text #header-text. Documentation, including web site documentation distributed with the release, may include the header text within some form of metadata (such as HTML comments) or as a header or footer appearing in the visible documentation. A file without any degree of creativity in either its literal elements or its structure is not protected by copyright law; therefore, such a file does not require a license header. If in doubt about the extent of the file's creativity, add the license header to the file. -- - Mark http://www.lucidimagination.com Uwe Schindler wrote: Hi all, I tested the lucene-core-3.0.1.jar in production since Sunday afternoon, no problems. I also replaced it with the 2.9.2 file in my dev environment (without recompilation, because the locally added generics would break only the compilation, not the JVM, of my projects) and tested: works. I also downloaded the artifacts to a computer without my own trustdb, imported KEYS and verified the signatures - no problems (only the GPG warning that the imported KEYS are not yet trusted by me). Md5/sha1 are also ok. I also downloaded the source zips and built/tested using ANT - passed. So a +1 from myself as a non-PMC member.
- Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de -Original Message- From: Uwe Schindler [mailto:u...@thetaphi.de] Sent: Monday, February 15, 2010 12:46 AM To: gene...@lucene.apache.org; java-dev@lucene.apache.org Subject: [VOTE] Lucene Java 2.9.2 and 3.0.1 release artifacts Hello Folks, I have posted a release candidate for both Lucene Java 2.9.2 and 3.0.1 (which both have the same bug fix level, functionality, and release announcement), built from revision 910082 of the corresponding branches. Thanks for all your help! Please test them and give your votes until Thursday morning, as the scheduled release date for both versions is Friday, Feb 19th, 2010. Only votes from the Lucene PMC are binding, but everyone is welcome to check the release candidate and voice their approval or disapproval. The vote passes if at least three binding +1 votes are cast. We planned a parallel release with one announcement because of their parallel development / bug fix level, and to emphasize that they are identical except for the deprecation removal and the Java 5 requirement introduced with major version 3.
Please also read the attached release announcement (Open Document) and send it back corrected if you miss anything or want to improve my bad English :-) You find the artifacts here: http://people.apache.org/~uschindler/staging-area/lucene-292-301-take1-rev910082/ Maven repo: http://people.apache.org/~uschindler/staging-area/lucene-292-301-take1-rev910082/maven/ The changes are here: http://people.apache.org/~uschindler/staging-area/lucene-292-301-take1-rev910082/changes-2.9.2/Changes.html http://people.apache.org/~uschindler/staging-area/lucene-292-301-take1-rev910082/changes-2.9.2/Contrib-Changes.html http://people.apache.org/~uschindler/staging-area/lucene-292-301-take1-rev910082/changes-3.0.1/Changes.html http://people.apache.org/~uschindler/staging-area/lucene-292-301-take1-rev910082/changes-3.0.1/Contrib-Changes.html Uwe === Proposed Release Announcement === Hello Lucene users, On behalf of the Lucene development community I would like to announce the release of Lucene Java versions 3.0.1 and 2.9.2: Both releases fix bugs in the previous versions, where 2.9.2 is the last release working with Java 1.4, still providing all deprecated APIs of the Lucene Java 2.x series. 3.0.1 has the same bug fix level, but requires Java 5 and is no longer compatible with code using deprecated APIs. The API was cleaned up to make use of Java 5's generics, varargs, enums, and autoboxing. New users of Lucene are advised to use version 3.0.1 for new developments, because it has a clean, type-safe new API. Users upgrading from 2.9.x can now remove unnecessary casts and add generics to their code, too. Important improvements in these releases include an increased maximum number of unique terms per index segment. They also fix problems with IndexWriter's commit and lost document deletes in near real-time indexing. Lots of bugs in Contrib's Analyzers package were fixed as well. Additionally, the 3.0.1 release restores some public methods that were lost during deprecation removal.
If you are using Lucene in a web application environment, you will notice that
Re: [jira] Commented: (LUCENE-2262) QueryParser should now allow leading '?' wildcards
Nah, let's just make fuzzy not work in the qp by default :) And make that back compat while you're at it - while not abusing Version so that it's used for something subjective :) wouldn't want to rile up Hoss. I'm like 3/4 serious. - Mark http://www.lucidimagination.com (mobile) On Feb 13, 2010, at 10:22 PM, Robert Muir (JIRA) j...@apache.org wrote: [ https://issues.apache.org/jira/browse/LUCENE-2262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12833496#action_12833496 ] Robert Muir commented on LUCENE-2262: - bq. in my opinion disallowing these queries with leading wildcards, be it * or ? or whatever, is rather silly, since we allow even slower fuzzyqueries by default. bq. Agree. What do you think, should we skip this step then and simply deprecate the entire setAllowLeadingWildcard concept altogether, setting it to true for Version >= 3.1? QueryParser should now allow leading '?' wildcards -- Key: LUCENE-2262 URL: https://issues.apache.org/jira/browse/LUCENE-2262 Project: Lucene - Java Issue Type: Improvement Components: QueryParser Affects Versions: Flex Branch Reporter: Robert Muir Assignee: Robert Muir Priority: Minor Fix For: Flex Branch Attachments: LUCENE-2262.patch, LUCENE-2262_backwards.patch QueryParser currently throws an exception if a wildcard term begins with the '?' operator. The current documentation describes why this is: {noformat} When set, * or ? are allowed as the first character of a PrefixQuery and WildcardQuery. Note that this can produce very slow queries on big indexes. {noformat} In the flexible indexing branch, wildcard queries with a leading '?' operator are no longer slow on big indexes (they do not enumerate terms in linear fashion). Thus, it no longer makes sense to throw a ParseException for a leading '?'. So, users should be able to perform a query of ?foo and no longer get a ParseException from the QueryParser. For the flexible indexing branch, wildcard queries of 'foo?', '?foo', 'f?oo', etc. are all the same from a performance perspective. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Nasty NIO behavior makes NIOFSDirectory silently close channel
Perhaps - one of the things they are supposed to be addressing is extensibility. nio2 does have FileSystemProvider, which would actually allow you to create a custom channel! I have not dug in enough to know much more than that though. *But*, another really interesting thing is that in Java 7, FileDescriptors are ref counted! (though users can't inc/dec). But, FileInputStream and OutputStream have a new constructor that takes a FileDescriptor. So possibly, you could just make one that sits around to keep the FileDescriptor valid, and get your channel off FileInputStream/FileOutputStream? And then if it goes down, make a new one using the FileDescriptor, which was not actually closed because there was still a ref to it. Possibly ;) Michael McCandless wrote: Does anyone know if nio2 has improved this...? Mike On Fri, Jan 29, 2010 at 2:00 PM, Jason Rutherglen jason.rutherg...@gmail.com wrote: Defaulting NIOFSDir could account for some of the recent speed improvements users have been reporting in Lucene 2.9. So removing it as a default could reverse those and people could then report Lucene 3.X has slowed... On Thu, Jan 28, 2010 at 5:24 AM, Michael McCandless luc...@mikemccandless.com wrote: Bummer. So the only viable workarounds are 1) don't use Thread.interrupt (nor things like Future.cancel, which in turn use Thread.interrupt) with NIOFSDir, or 2) we fix NIOFSDir to reopen the channel AND the app must make a deletion policy that keeps a commit alive if any reader is using it. Or, 3) don't use NIOFSDir! Mike On Thu, Jan 28, 2010 at 7:29 AM, Simon Willnauer simon.willna...@googlemail.com wrote: On Thu, Jan 28, 2010 at 12:43 PM, Michael McCandless luc...@mikemccandless.com wrote: On Thu, Jan 28, 2010 at 6:38 AM, Uwe Schindler u...@thetaphi.de wrote: So I checked the code of NIOFSIndexInput; my last comment was not really correct: NIOFSIndexInput extends SimpleFSIndexInput, and that opens the RAF. In the ctor, RAF.getChannel() is called.
The RAF stays open until the file is closed (and so does the channel). So it's really simple to fix in my opinion: just call getChannel() again on this exception, because the RAF should still be open? Short answer:

public final FileChannel getChannel() {
  synchronized (this) {
    if (channel == null)
      channel = FileChannelImpl.open(fd, true, rw, this);
    return channel;
  }
}

this is not gonna work - I tried it before. The RandomAccessFile buffers the channel!! simon I think we need a definitive answer on what happens to the RAF when the FileChannel was closed by Thread.interrupt(). Simon, can you test this? Mike - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org -- - Mark http://www.lucidimagination.com
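Mark's FileDescriptor idea above could be sketched roughly as follows. This is purely speculative and the class and names are made up for illustration; in particular, whether the descriptor actually remains usable after an interrupt closes the channel is exactly the open question in this thread.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.nio.channels.FileChannel;

// Speculative sketch (hypothetical class, not Lucene code): keep a
// stream around so the FileDescriptor stays referenced, and if the
// channel gets closed (e.g. by Thread.interrupt()), wrap the same
// descriptor in a fresh FileInputStream and take its channel.
class DescriptorHeldChannel {
    private final FileInputStream keeper; // holds the descriptor open
    private FileChannel channel;

    DescriptorHeldChannel(File file) throws IOException {
        keeper = new FileInputStream(file);
        channel = keeper.getChannel();
    }

    FileChannel channel() throws IOException {
        if (!channel.isOpen()) {
            // Assumes the descriptor survived the channel close -
            // this is the unverified part of the idea.
            channel = new FileInputStream(keeper.getFD()).getChannel();
        }
        return channel;
    }
}
```

If the descriptor turns out to be invalidated along with the channel, the only fallback is reopening by path, which is what the deletion-policy caveat above is about.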
[jira] Commented: (LUCENE-2226) move contrib/snowball to contrib/analyzers
[ https://issues.apache.org/jira/browse/LUCENE-2226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12801972#action_12801972 ] Mark Miller commented on LUCENE-2226: - Contribs back compat policy is that there is no back compat policy unless that contrib specifically states one. move contrib/snowball to contrib/analyzers -- Key: LUCENE-2226 URL: https://issues.apache.org/jira/browse/LUCENE-2226 Project: Lucene - Java Issue Type: Task Components: contrib/analyzers Reporter: Robert Muir Assignee: Robert Muir Fix For: 3.1 Attachments: LUCENE-2226.patch to fix bugs in some duplicate, handcoded impls of these stemmers (nl, fr, ru, etc) we should simply merge snowball and analyzers, and replace the buggy impls with the proper snowball stemfilters. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2226) move contrib/snowball to contrib/analyzers
[ https://issues.apache.org/jira/browse/LUCENE-2226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12802011#action_12802011 ] Mark Miller commented on LUCENE-2226: - {quote}Mark, that is my understanding too. I wasn't commenting on the policy but on the fact of the possible breakage. I think it is a courtesy to notify users of a change to which they might need to pay attention. I don't know that's spelled out in the policy, but I think it should be. Not that a lack of notice is a guarantee of no breakage but that a notice is a guarantee of breakage (at least under some circumstances).{quote} Right - I was just pointing out that jar drop-in is far from a requirement in contrib. We do always try and play nice anyway. bq. Is there any contrib that specifically states one? I couldn't find it. Don't think so - meaning there is no back compat policy in contrib. I think as a contrib matures, it's up to those working on it to decide when it's reached a state that deserves a policy of some kind. The Highlighter could probably use one at this point, but at the same time, nothing has created too much of an outcry. bq. The analysis/common is not clear as it has the Version stuff. Right - just because there is no policy doesn't mean we shouldn't make any attempts at back compat - but the issue you brought up is not something easily addressed, nor, I think, large enough to worry about with the proper warning in Changes. Users should be wary of contrib on upgrading - unless it presents a strong back compat policy. bq. But after all the dust settles and this i18n stuff is solid, I think it might be reasonable to make a stronger bw compat statement. I agree - now that contrib has been getting some much needed love recently, I think it should start heading towards some back compat promises - especially concerning analyzers. We already do tend to bend over backwards when we can anyway.
I think we are on the same page - I'm just not very worried about the break you mention - I think it's a perfectly acceptable growing pain. And I think our back compat has been so weak because contrib has been a bit of a wasteland in the past - no one was willing to take ownership of a lot of this stuff - especially the language analyzers. That has changed recently. As the devs clean up and consolidate this stuff properly, I think we can work towards stronger promises in the future. move contrib/snowball to contrib/analyzers -- Key: LUCENE-2226 URL: https://issues.apache.org/jira/browse/LUCENE-2226 Project: Lucene - Java Issue Type: Task Components: contrib/analyzers Reporter: Robert Muir Assignee: Robert Muir Fix For: 3.1 Attachments: LUCENE-2226.patch to fix bugs in some duplicate, handcoded impls of these stemmers (nl, fr, ru, etc) we should simply merge snowball and analyzers, and replace the buggy impls with the proper snowball stemfilters. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Compound File Default
Otis Gospodnetic wrote: At the same time, seeing how some people benchmark systems without tuning them and then publish their results, cfs may be safer. Though at the same time you get nailed with a 10-15% indexing speed hit. -- - Mark http://www.lucidimagination.com - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
update doc by query
Any reason we don't offer update doc by query along with term? It's easy enough to implement in the same manner - is there some sort of gotcha with this, or is it just because there has been no demand yet? -- - Mark http://www.lucidimagination.com - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
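For what it's worth, the "same manner" implementation would presumably be delete-by-query followed by add, mirroring how updateDocument(Term, Document) is delete-by-term plus add. A toy in-memory sketch of those semantics (made-up names, not Lucene's API):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Toy illustration only: update-by-query as delete-by-query + add,
// analogous to IndexWriter.updateDocument(Term, Document).
class ToyIndex {
    private final List<Map<String, String>> docs = new ArrayList<>();

    void add(Map<String, String> doc) {
        docs.add(doc);
    }

    // Delete every document the "query" matches, then add the
    // replacement - the hypothetical updateDocument(Query, Document).
    void updateByQuery(Predicate<Map<String, String>> query,
                       Map<String, String> replacement) {
        docs.removeIf(query);
        docs.add(replacement);
    }

    int size() {
        return docs.size();
    }
}
```

One visible difference from the by-term form is that the number of deleted documents is unbounded and the query has to be evaluated at update time - perhaps part of why it was never offered.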
Re: Lucene Java 2.9.2
Other than what's left of the TokenStream issues, I think we just need a compression solution - which shouldn't be difficult. - Mark Robert Muir wrote: https://issues.apache.org/jira/browse/SOLR-1657 I just struck through the things that are done. Mark, Robert: How far are we with progress in solr? Were there any additional problems with 3.0.0? Uwe - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de -Original Message- From: Uwe Schindler [mailto:u...@thetaphi.de] Sent: Tuesday, January 05, 2010 1:34 PM To: java-dev@lucene.apache.org Subject: RE: Lucene Java 2.9.2 My plan was to release it together with 3.0.1. Both versions then will have the same bug fix status. I have the scripts here to build the artifacts (as I added fast vector highlighter poms), so I could do it for both and start the release. - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de -Original Message- From: Michael McCandless [mailto:luc...@mikemccandless.com] Sent: Tuesday, January 05, 2010 12:26 PM To: java-dev@lucene.apache.org Subject: Re: Lucene Java 2.9.2 Lucene 2.9.2 hasn't been released yet, but I think we should release it at some point soonish? It's accumulated some important bug fixes. Mike On Mon, Jan 4, 2010 at 10:59 PM, George Aroush geo...@aroush.net wrote: Hi Folks, Over at Lucene.Net, we have 2.9.1 ready for official release. This is a port of the current Lucene Java 2.9.1 release. When I raised the question about releasing Lucene.Net 2.9.1, a question was asked to port over LUCENE-2190, for which a patch was quickly made (see: https://issues.apache.org/jira/browse/LUCENENET-331). This begs the question: if Lucene.Net takes just this one patch, then Lucene.Net 2.9.1 is now 2.9.1.1 (which I personally don't like to see happening, as I prefer to see a 1-to-1 release match).
So, I examined the list of fixes made in 2.9.2 here: https://issues.apache.org/jira/browse/LUCENE/fixforversion/12314342 and found that this is a small task to port over. So far so good? Good. Now, as far as I know, Lucene Java never made an official 2.9.2 release, or is this in the works (I don't recall seeing any email about it)? If so, what's the time line? I think our decision on the Lucene.Net side will be based on the answer to this question. Thanks. -- George - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org -- - Mark http://www.lucidimagination.com - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Resolved: (LUCENE-2035) TokenSources.getTokenStream() does not assign positionIncrement
[ https://issues.apache.org/jira/browse/LUCENE-2035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Miller resolved LUCENE-2035. - Resolution: Fixed Thanks Christopher! TokenSources.getTokenStream() does not assign positionIncrement --- Key: LUCENE-2035 URL: https://issues.apache.org/jira/browse/LUCENE-2035 Project: Lucene - Java Issue Type: Bug Components: contrib/highlighter Affects Versions: 2.4, 2.4.1, 2.9 Reporter: Christopher Morris Assignee: Mark Miller Fix For: 3.1 Attachments: LUCENE-2035.patch, LUCENE-2035.patch, LUCENE-2305.patch Original Estimate: 24h Remaining Estimate: 24h TokenSources.StoredTokenStream does not assign positionIncrement information. This means that all tokens in the stream are considered adjacent. This has implications for the phrase highlighting in QueryScorer when using non-contiguous tokens. For example: Consider a token stream that creates tokens for both the stemmed and unstemmed version of each word - the fox (jump|jumped) When retrieved from the index using TokenSources.getTokenStream(tpv,false), the token stream will be - the fox jump jumped Now try a search and highlight for the phrase query fox jumped. The search will correctly find the document; the highlighter will fail to highlight the phrase because it thinks that there is an additional word between fox and jumped. If we use the original (from the analyzer) token stream then the highlighter works. Also, consider the converse - the fox did not jump not is a stop word and there is an option to increment the position to account for stop words - (the,0) (fox,1) (did,2) (jump,4) When retrieved from the index using TokenSources.getTokenStream(tpv,false), the token stream will be - (the,0) (fox,1) (did,2) (jump,3). So the phrase query did jump will cause the did and jump terms in the text did not jump to be highlighted. If we use the original (from the analyzer) token stream then the highlighter works correctly. 
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
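The positionIncrement arithmetic in the report above can be made concrete with a small worked example (hypothetical helper, not part of TokenSources): a token's absolute position is the running sum of the increments seen so far.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustration of positionIncrement semantics: each token's absolute
// position is the running sum of increments, so an increment of 0
// stacks a token on the previous position.
class TokenPositions {
    static Map<String, Integer> positions(String[] terms, int[] increments) {
        Map<String, Integer> result = new LinkedHashMap<>();
        int pos = -1; // a first increment of 1 lands on position 0
        for (int i = 0; i < terms.length; i++) {
            pos += increments[i];
            result.put(terms[i], pos);
        }
        return result;
    }
}
```

With increments {1, 1, 1, 0} for {the, fox, jump, jumped}, "fox" sits at position 1 and "jumped" at position 2 - adjacent, so the phrase "fox jumped" matches. With the buggy all-ones increments, "jumped" lands at position 3 and the phrase highlighter sees a phantom gap.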
[jira] Updated: (LUCENE-860) site should call project Lucene Java, not just Lucene
[ https://issues.apache.org/jira/browse/LUCENE-860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Miller updated LUCENE-860: --- Attachment: LUCENE-860-1.patch updated patch that also includes doc site level changes site should call project Lucene Java, not just Lucene - Key: LUCENE-860 URL: https://issues.apache.org/jira/browse/LUCENE-860 Project: Lucene - Java Issue Type: Improvement Components: Website Reporter: Doug Cutting Assignee: Mark Miller Priority: Minor Fix For: 3.1 Attachments: LUCENE-860-1.patch, LUCENE-860-2.patch, LUCENE-860.patch To avoid confusion with the top-level Lucene project, the Lucene Java website should refer to itself as Lucene Java. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-860) site should call project Lucene Java, not just Lucene
[ https://issues.apache.org/jira/browse/LUCENE-860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Miller updated LUCENE-860: --- Attachment: LUCENE-860-2.patch site should call project Lucene Java, not just Lucene - Key: LUCENE-860 URL: https://issues.apache.org/jira/browse/LUCENE-860 Project: Lucene - Java Issue Type: Improvement Components: Website Reporter: Doug Cutting Assignee: Mark Miller Priority: Minor Fix For: 3.1 Attachments: LUCENE-860-1.patch, LUCENE-860-2.patch, LUCENE-860.patch To avoid confusion with the top-level Lucene project, the Lucene Java website should refer to itself as Lucene Java. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2035) TokenSources.getTokenStream() does not assign positionIncrement
[ https://issues.apache.org/jira/browse/LUCENE-2035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12791939#action_12791939 ] Mark Miller commented on LUCENE-2035: - I'll commit this soon. TokenSources.getTokenStream() does not assign positionIncrement --- Key: LUCENE-2035 URL: https://issues.apache.org/jira/browse/LUCENE-2035 Project: Lucene - Java Issue Type: Bug Components: contrib/highlighter Affects Versions: 2.4, 2.4.1, 2.9 Reporter: Christopher Morris Assignee: Mark Miller Fix For: 3.1 Attachments: LUCENE-2035.patch, LUCENE-2035.patch, LUCENE-2305.patch Original Estimate: 24h Remaining Estimate: 24h TokenSources.StoredTokenStream does not assign positionIncrement information. This means that all tokens in the stream are considered adjacent. This has implications for the phrase highlighting in QueryScorer when using non-contiguous tokens. For example: Consider a token stream that creates tokens for both the stemmed and unstemmed version of each word - the fox (jump|jumped) When retrieved from the index using TokenSources.getTokenStream(tpv,false), the token stream will be - the fox jump jumped Now try a search and highlight for the phrase query fox jumped. The search will correctly find the document; the highlighter will fail to highlight the phrase because it thinks that there is an additional word between fox and jumped. If we use the original (from the analyzer) token stream then the highlighter works. Also, consider the converse - the fox did not jump not is a stop word and there is an option to increment the position to account for stop words - (the,0) (fox,1) (did,2) (jump,4) When retrieved from the index using TokenSources.getTokenStream(tpv,false), the token stream will be - (the,0) (fox,1) (did,2) (jump,3). So the phrase query did jump will cause the did and jump terms in the text did not jump to be highlighted. If we use the original (from the analyzer) token stream then the highlighter works correctly. 
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-1922) exposing the ability to get the number of unique term count per field
[ https://issues.apache.org/jira/browse/LUCENE-1922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Miller updated LUCENE-1922: Affects Version/s: (was: 2.4.1) Flex Branch exposing the ability to get the number of unique term count per field - Key: LUCENE-1922 URL: https://issues.apache.org/jira/browse/LUCENE-1922 Project: Lucene - Java Issue Type: New Feature Components: Index Affects Versions: Flex Branch Reporter: John Wang Add an api to get the number of unique term count given a field name, e.g.: IndexReader.getUniqueTermCount(String field) This issue has a dependency on LUCENE-1458 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2035) TokenSources.getTokenStream() does not assign positionIncrement
[ https://issues.apache.org/jira/browse/LUCENE-2035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12791680#action_12791680 ] Mark Miller commented on LUCENE-2035: - Hey Christopher, why are you going through the trouble of the custom collector to check that there are no hits? Why not just do a standard search? TokenSources.getTokenStream() does not assign positionIncrement --- Key: LUCENE-2035 URL: https://issues.apache.org/jira/browse/LUCENE-2035 Project: Lucene - Java Issue Type: Bug Components: contrib/highlighter Affects Versions: 2.4, 2.4.1, 2.9 Reporter: Christopher Morris Assignee: Mark Miller Fix For: 3.1 Attachments: LUCENE-2035.patch, LUCENE-2305.patch Original Estimate: 24h Remaining Estimate: 24h TokenSources.StoredTokenStream does not assign positionIncrement information. This means that all tokens in the stream are considered adjacent. This has implications for the phrase highlighting in QueryScorer when using non-contiguous tokens. For example: Consider a token stream that creates tokens for both the stemmed and unstemmed version of each word - the fox (jump|jumped) When retrieved from the index using TokenSources.getTokenStream(tpv,false), the token stream will be - the fox jump jumped Now try a search and highlight for the phrase query fox jumped. The search will correctly find the document; the highlighter will fail to highlight the phrase because it thinks that there is an additional word between fox and jumped. If we use the original (from the analyzer) token stream then the highlighter works. Also, consider the converse - the fox did not jump not is a stop word and there is an option to increment the position to account for stop words - (the,0) (fox,1) (did,2) (jump,4) When retrieved from the index using TokenSources.getTokenStream(tpv,false), the token stream will be - (the,0) (fox,1) (did,2) (jump,3). 
So the phrase query did jump will cause the did and jump terms in the text did not jump to be highlighted. If we use the original (from the analyzer) token stream then the highlighter works correctly. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-2035) TokenSources.getTokenStream() does not assign positionIncrement
[ https://issues.apache.org/jira/browse/LUCENE-2035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Miller updated LUCENE-2035: Attachment: LUCENE-2035.patch I've broken the new tests back out into their own file, changed the hit collector code to just search basically, and improved the test coverage of TokenSources a bit. TokenSources.getTokenStream() does not assign positionIncrement --- Key: LUCENE-2035 URL: https://issues.apache.org/jira/browse/LUCENE-2035 Project: Lucene - Java Issue Type: Bug Components: contrib/highlighter Affects Versions: 2.4, 2.4.1, 2.9 Reporter: Christopher Morris Assignee: Mark Miller Fix For: 3.1 Attachments: LUCENE-2035.patch, LUCENE-2035.patch, LUCENE-2305.patch Original Estimate: 24h Remaining Estimate: 24h TokenSources.StoredTokenStream does not assign positionIncrement information. This means that all tokens in the stream are considered adjacent. This has implications for the phrase highlighting in QueryScorer when using non-contiguous tokens. For example: Consider a token stream that creates tokens for both the stemmed and unstemmed version of each word - the fox (jump|jumped) When retrieved from the index using TokenSources.getTokenStream(tpv,false), the token stream will be - the fox jump jumped Now try a search and highlight for the phrase query fox jumped. The search will correctly find the document; the highlighter will fail to highlight the phrase because it thinks that there is an additional word between fox and jumped. If we use the original (from the analyzer) token stream then the highlighter works. Also, consider the converse - the fox did not jump not is a stop word and there is an option to increment the position to account for stop words - (the,0) (fox,1) (did,2) (jump,4) When retrieved from the index using TokenSources.getTokenStream(tpv,false), the token stream will be - (the,0) (fox,1) (did,2) (jump,3).
So the phrase query did jump will cause the did and jump terms in the text did not jump to be highlighted. If we use the original (from the analyzer) token stream then the highlighter works correctly. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2089) explore using automaton for fuzzyquery
[ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12790748#action_12790748 ]

Mark Miller commented on LUCENE-2089:

Sorry Earwin - to be clear, we don't actually use chapter 6 - AutomatonQuery needs the automata. You can get all the states just by taking the power set of the subsumption triangle for every base position, and then removing from each set any position that's subsumed by another. That's what I mean by brute force. But in the paper, they boil this down to nice little i-param tables, extracting some sort of pattern from that process. They give no hint on how they do this, or whether it's applicable to greater n's, though. No big deal I guess - the computer can do the brute force method - but I wouldn't be surprised if it starts to bog down at much higher n's.

explore using automaton for fuzzyquery
---
Key: LUCENE-2089
URL: https://issues.apache.org/jira/browse/LUCENE-2089
Project: Lucene - Java
Issue Type: Wish
Components: Search
Reporter: Robert Muir
Assignee: Mark Miller
Priority: Minor
Attachments: LUCENE-2089.patch, Moman-0.2.1.tar.gz, TestFuzzy.java

Mark brought this up on LUCENE-1606 (I will assign this to him, I know he is itching to write that nasty algorithm). We can optimize fuzzyquery by using AutomatonTermsEnum. Here is my idea:
* up front, calculate the maximum required K edits needed to match the user's supplied float threshold.
* for at least small common E up to some max K (1,2,3, etc.) we should create a DFA for each E. If the required E is above our supported max, we use dumb mode at first (no seeking, no DFA, just brute force like now). As the pq fills, we swap progressively lower DFAs into the enum, based upon the lowest score in the pq. This should work well on avg; at high E, you will typically fill the pq very quickly since you will match many terms.

This not only provides a mechanism to switch to more efficient DFAs during enumeration, but also to switch from dumb mode to smart mode. I modified my wildcard benchmark to generate random fuzzy queries.
* Pattern: 7N stands for NNN, etc.
* AvgMS_DFA: this is the time spent creating the automaton (constructor)

||Pattern||Iter||AvgHits||AvgMS(old)||AvgMS (new,total)||AvgMS_DFA||
|7N|10|64.0|4155.9|38.6|20.3|
|14N|10|0.0|2511.6|46.0|37.9|
|28N|10|0.0|2506.3|93.0|86.6|
|56N|10|0.0|2524.5|304.4|298.5|

As you can see, this prototype is no good yet, because it creates the DFA in a slow way. Right now it creates an NFA, and all this wasted time is in NFA-DFA conversion. So, for a very long string, it just gets worse and worse. This has nothing to do with lucene, and here you can see the TermEnum is fast (AvgMS - AvgMS_DFA); there is no problem there.

Instead we should just build a DFA to begin with, maybe with this paper: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.652 We can precompute the tables with that algorithm up to some reasonable K, and then I think we are ok.

The paper references using http://portal.acm.org/citation.cfm?id=135907 for linear minimization; if someone wants to implement this, they should not worry about minimization. In fact, we need to at some point determine whether AutomatonQuery should even minimize FSMs at all, or whether it is simply enough for them to be deterministic with no transitions to dead states. (The only code that actually assumes a minimal DFA is the Dumb vs Smart heuristic, and this can easily be rewritten as a summation.) We need to benchmark really complex DFAs (i.e. write a regex benchmark) to figure out if minimization is even helping right now.
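The first bullet above ("calculate the maximum required K edits") follows from how the classic FuzzyQuery scores terms. A hedged sketch, assuming the old similarity = 1 - editDistance / min(|query|, |term|) scoring (the method name is mine, not a Lucene API, and the real enum handles prefix lengths and shorter terms as well):

```java
// Hedged sketch: derive the maximum edit distance K that can still satisfy a
// FuzzyQuery-style similarity threshold, assuming the classic scoring
// similarity = 1 - editDistance / min(|query|, |term|). For a matching term at
// least as long as the query, the distance is bounded by
// (1 - threshold) * |query|. Method name is mine, not a Lucene API.
public class MaxEditsSketch {
    static int maxRequiredEdits(String queryTerm, float minSimilarity) {
        return (int) ((1.0f - minSimilarity) * queryTerm.length());
    }

    public static void main(String[] args) {
        System.out.println(maxRequiredEdits("lucene", 0.5f)); // -> 3
    }
}
```

Any term farther than K edits away cannot reach the threshold, which is what lets the enum cap the set of DFAs it needs up front.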
[jira] Updated: (LUCENE-2165) SnowballAnalyzer lacks a constructor that takes a Set of Stop Words
[ https://issues.apache.org/jira/browse/LUCENE-2165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Miller updated LUCENE-2165:
Fix Version/s: 3.1

SnowballAnalyzer lacks a constructor that takes a Set of Stop Words
---
Key: LUCENE-2165
URL: https://issues.apache.org/jira/browse/LUCENE-2165
Project: Lucene - Java
Issue Type: Bug
Components: contrib/analyzers
Affects Versions: 2.9.1, 3.0
Reporter: Nick Burch
Priority: Minor
Fix For: 3.1

As discussed on the java-user list, the SnowballAnalyzer has been updated to use a Set of stop words. However, there is no constructor which accepts a Set; there's only the original String[] one. This is an issue because most of the common sources of stop words (eg StopAnalyzer) have deprecated their String[] stop word lists and moved over to Sets (eg StopAnalyzer.ENGLISH_STOP_WORDS_SET). So, for now, you either have to use a deprecated field on StopAnalyzer, or manually turn the Set into an array so you can pass it to the SnowballAnalyzer.

I would suggest that a constructor is added to SnowballAnalyzer which accepts a Set. Not sure if the old String[] one should be deprecated or not. A sample patch against 2.9.1 to add the constructor is:

--- SnowballAnalyzer.java.orig 2009-12-15 11:14:08.0 +
+++ SnowballAnalyzer.java 2009-12-14 12:58:37.0 +
@@ -67,6 +67,12 @@
     stopSet = StopFilter.makeStopSet(stopWords);
   }
 
+  /** Builds the named analyzer with the given stop words. */
+  public SnowballAnalyzer(Version matchVersion, String name, Set stopWordsSet) {
+    this(matchVersion, name);
+    stopSet = stopWordsSet;
+  }
+
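Until such a constructor exists, the workaround the report mentions - turning a stop-word Set back into the String[] the current constructor accepts - is a one-liner. This sketch is illustrative only (the class and method names are mine; in real code the Set would be, e.g., StopAnalyzer.ENGLISH_STOP_WORDS_SET):

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

// Illustration of the interim workaround: turn a stop-word Set back into the
// String[] that the existing SnowballAnalyzer constructor accepts.
public class StopWordsWorkaround {
    static String[] toStringArray(Set<String> stopWords) {
        return stopWords.toArray(new String[0]);
    }

    public static void main(String[] args) {
        Set<String> stops = new LinkedHashSet<>(Arrays.asList("a", "an", "the"));
        System.out.println(Arrays.toString(toStringArray(stops))); // -> [a, an, the]
    }
}
```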
[jira] Commented: (LUCENE-1769) Fix wrong clover analysis because of backwards-tests, upgrade clover to 2.4.3 or better
[ https://issues.apache.org/jira/browse/LUCENE-1769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12791115#action_12791115 ]

Mark Miller commented on LUCENE-1769:

Would be cool to get this issue wrapped up ...

Fix wrong clover analysis because of backwards-tests, upgrade clover to 2.4.3 or better
---
Key: LUCENE-1769
URL: https://issues.apache.org/jira/browse/LUCENE-1769
Project: Lucene - Java
Issue Type: Bug
Components: Build
Affects Versions: 2.9
Reporter: Uwe Schindler
Attachments: clover.license, LUCENE-1769.patch, LUCENE-1769.patch, nicks-LUCENE-1769.patch

This is a followup for [http://www.lucidimagination.com/search/document/6248d6eafbe10ef4/build_failed_in_hudson_lucene_trunk_902]

The problem with clover running on hudson is that it does not instrument all tests that are run. The autodetection of clover 1.x is not able to find out which files are the correct tests, and it only instruments the backwards tests. Because of this, the current coverage report covers only the backwards tests running against the current Lucene JAR. You can see this if you install clover and start the tests: during test-core no clover data is added to the db; only when the backwards tests begin are new files created in the clover db folder.

Clover 2.x supports a new ant task, testsources, that can be used to specify which files are the tests. It works here locally with clover 2.4.3 and produces a really nice coverage report; linking with test files also works, so it tells which tests failed and so on. I will attach a patch that changes common-build.xml to the new clover version (other initialization resource) and tells clover where to find the tests (using the test folder include/exclude properties).

One problem with the current patch: it does *not* instrument the backwards branch, so you see only coverage of the core/contrib tests. Getting coverage from the backwards tests as well is not easily possible, for two reasons:
- the tag test dir is not easy to find out and add to the testsources element (there may be only one of them)
- the test names in the BW branch are identical to the trunk tests. This completely corrupts the linkage between tests and code in the coverage report.

In principle, the best approach would be to generate a second coverage report for the backwards branch with a separate clover DB. The attached patch does not instrument the BW branch; it only covers trunk tests.
[jira] Assigned: (LUCENE-2035) TokenSources.getTokenStream() does not assign positionIncrement
[ https://issues.apache.org/jira/browse/LUCENE-2035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Miller reassigned LUCENE-2035:
Assignee: Mark Miller
[jira] Updated: (LUCENE-2035) TokenSources.getTokenStream() does not assign positionIncrement
[ https://issues.apache.org/jira/browse/LUCENE-2035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Miller updated LUCENE-2035:
Fix Version/s: 3.1
[jira] Updated: (LUCENE-2035) TokenSources.getTokenStream() does not assign positionIncrement
[ https://issues.apache.org/jira/browse/LUCENE-2035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Miller updated LUCENE-2035:
Attachment: LUCENE-2035.patch
[jira] Commented: (LUCENE-2035) TokenSources.getTokenStream() does not assign positionIncrement
[ https://issues.apache.org/jira/browse/LUCENE-2035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12791152#action_12791152 ]

Mark Miller commented on LUCENE-2035:

Thanks for the tests and fix, Christopher! I've got one more patch coming and I'll commit in a few days. I'm going to break the tests back out into a separate file again (on second thought, I think how you had it is a good idea) and remove an author tag. Then, after one more review, I think this is good to go in.
[jira] Commented: (LUCENE-406) sort missing string fields last
[ https://issues.apache.org/jira/browse/LUCENE-406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12791153#action_12791153 ]

Mark Miller commented on LUCENE-406:

We should update this and incorporate it into Lucene.

sort missing string fields last
---
Key: LUCENE-406
URL: https://issues.apache.org/jira/browse/LUCENE-406
Project: Lucene - Java
Issue Type: New Feature
Components: Search
Affects Versions: 1.4
Environment: Operating System: All; Platform: All
Reporter: Yonik Seeley
Assignee: Hoss Man
Priority: Minor
Attachments: MissingStringLastComparatorSource.java, MissingStringLastComparatorSource.java, TestMissingStringLastComparatorSource.java

A SortComparatorSource for string fields that orders documents with the sort field missing after documents with the field. This is the reverse of the default Lucene implementation. The concept and first-pass implementation was done by Chris Hostetter.
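The idea can be sketched with plain JDK comparators: model a document whose sort field is missing as null, and order nulls after all present values, the reverse of the default missing-first order. This is illustrative only; the attached MissingStringLastComparatorSource implements the same ordering inside Lucene's sorting framework.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Sketch of the sort-missing-last idea with plain JDK comparators: a document
// whose sort field is missing is modeled as null, and nulls order after all
// present values - the reverse of the default missing-first behavior.
public class MissingLastSketch {
    public static void main(String[] args) {
        List<String> fieldValues =
            new ArrayList<>(Arrays.asList("banana", null, "apple", null, "cherry"));
        fieldValues.sort(Comparator.nullsLast(Comparator.naturalOrder()));
        System.out.println(fieldValues); // -> [apple, banana, cherry, null, null]
    }
}
```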
[jira] Resolved: (LUCENE-1942) NUM_THREADS is a static member of RunAddIndexesThreads and should be accessed in a static way
[ https://issues.apache.org/jira/browse/LUCENE-1942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Miller resolved LUCENE-1942.
Resolution: Won't Fix

NUM_THREADS is a static member of RunAddIndexesThreads and should be accessed in a static way
---
Key: LUCENE-1942
URL: https://issues.apache.org/jira/browse/LUCENE-1942
Project: Lucene - Java
Issue Type: Bug
Components: Other
Environment: Eclipse 3.4.2
Reporter: Hasan Diwan
Priority: Trivial
Attachments: lucene.pat

The summary contains the problem. No further description needed, I don't think.
[jira] Resolved: (LUCENE-628) Intermittent FileNotFoundException for .fnm when using rsync
[ https://issues.apache.org/jira/browse/LUCENE-628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Miller resolved LUCENE-628.
Resolution: Incomplete

Intermittent FileNotFoundException for .fnm when using rsync
---
Key: LUCENE-628
URL: https://issues.apache.org/jira/browse/LUCENE-628
Project: Lucene - Java
Issue Type: Bug
Components: Search
Affects Versions: 1.9
Environment: Linux RedHat ES3, Jboss402
Reporter: Simon Lorenz
Priority: Minor

We use Lucene 1.9.1 to create and search indexes for web applications. The application runs in Jboss402 on Redhat ES3. A single Master (Writer) Jboss instance creates and writes the indexes using the compound file format, which is optimised after all updates. These index files are replicated every few hours using rsync to a number of other application servers (Searchers). The rsync job only runs if there are no lucene lock files present on the Writer. The Searcher servers that receive the replicated files perform only searches on the index. Up to 60 searches may be performed each minute.

Everything works well most of the time, but we get the following issue on the Searcher servers about 10% of the time. Following an rsync replication, one or all of the Searcher servers throws an IOException, caught when creating an IndexSearcher:

java.io.FileNotFoundException: //_1zm.fnm (No such file or directory)
    at java.io.RandomAccessFile.open(Native Method)
    at java.io.RandomAccessFile.<init>(RandomAccessFile.java:212)
    at org.apache.lucene.store.FSIndexInput$Descriptor.<init>(FSDirectory.java:425)
    at org.apache.lucene.store.FSIndexInput.<init>(FSDirectory.java:434)
    at org.apache.lucene.store.FSDirectory.openInput(FSDirectory.java:324)
    at org.apache.lucene.index.FieldInfos.<init>(FieldInfos.java:56)
    at org.apache.lucene.index.SegmentReader.initialize(SegmentReader.java:144)
    at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:129)
    at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:110)
    at org.apache.lucene.index.IndexReader$1.doBody(IndexReader.java:154)
    at org.apache.lucene.store.Lock$With.run(Lock.java:109)
    at org.apache.lucene.index.IndexReader.open(IndexReader.java:143)

As we use the compound file format, I would not expect .fnm files to be present. When replicating, we do not delete the old .cfs index files, as these could still be referenced by old Searcher threads. We do overwrite the segments and deletable files on the Searcher servers.

My thoughts are: either we are occasionally overwriting a file at the exact time a new searcher is being created, or the lock files are removed from the Writer server before the compaction process is completed, and we then replicate a segments file that still references a ghost .fnm file. I would greatly appreciate any ideas and suggestions to solve this annoying issue.
[jira] Commented: (LUCENE-2089) explore using automaton for fuzzyquery
[ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12790368#action_12790368 ]

Mark Miller commented on LUCENE-2089:

bq. If you do take hold of it, do not hesitate to share

The original paper and C++ code likewise melt my brain, and I needed the algo in some other place. The java impl I was onto was about 75% complete according to the author, but I have not yet looked at the code. Robert was convinced it was a different, less efficient algorithm, last I heard though.

We have cracked much of the paper - that's how Robert implemented n=1 here - that's from the paper. The next step is to work out how to construct the tables for n, as Robert says above, and store those tables efficiently, as they start getting quite large rather fast - though we might only use as high as n=3 or 4 in Lucene - Robert suspects term seeking will outweigh any gains at that point.

I think we know how to do the majority of the work for the n case, but I don't really have much/any time for this, so it probably depends on if/when Robert gets to it. If he loses interest in finishing, I def plan to come back to it someday. I'd like to complete my understanding of the paper and see a full n java impl of this in either case. The main piece left that I don't understand fully (computing all possible states for n) can be computed with just a brute force check (that's how the python impl is doing it), so there may not be much more to understand. I would like to know how the paper is getting 'i' parametrized state generators though - that's much more efficient. The paper shows them for n=1 and n=2.
[jira] Commented: (LUCENE-2126) Split up IndexInput and IndexOutput into DataInput and DataOutput
[ https://issues.apache.org/jira/browse/LUCENE-2126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12789901#action_12789901 ]

Mark Miller commented on LUCENE-2126:

bq. I disagree with you here: introducing DataInput/Output makes IMO the API actually easier for the normal user to understand.

I agree with everything you say in the second paragraph, but I don't see how any of that supports the assertion you make in the first paragraph. Presumably, it's because the normal user won't touch/see the IndexInput/Output classes, but may well deal with DataInput/Output - and those classes being limited to what actually makes sense for them (only exposing methods they should use) is easier for them.

I was leaning towards Marvin's arguments - it really seems that documentation should be enough to steer users against doing something stupid - there is no doubt that writing attributes into the posting list is a fairly advanced operation (though more normal than using IndexInput/Output). On the other hand, I'm not really sold on the longer-term downsides either. The complexity argument is a bit overblown: if you understand anything down to the level of these classes, this is a ridiculously simple change. The backcompat argument is not very persuasive either - not only does it look like a slim chance of any future issues, but at this level we are fairly loose about back compat when something comes up. I think advanced users have already realized that the more you dig into Lucene's guts, the more likely you won't be able to count on jar drop-in. That's just the way things have gone. I don't see a looming concrete issue myself anyway, and if there is a hidden one, I don't think anyone is going to get in a ruffle about it.

So net/net, I'm +1. Seems worth it to me to be able to give a user (LUCENE-2125) the correct API. I could go either way on the name change. Not a fan of LuceneInput/Output though.
Split up IndexInput and IndexOutput into DataInput and DataOutput
---
Key: LUCENE-2126
URL: https://issues.apache.org/jira/browse/LUCENE-2126
Project: Lucene - Java
Issue Type: Improvement
Affects Versions: Flex Branch
Reporter: Michael Busch
Assignee: Michael Busch
Priority: Minor
Fix For: Flex Branch
Attachments: lucene-2126.patch

I'd like to introduce the two new classes DataInput and DataOutput, which contain all methods from IndexInput and IndexOutput that actually decode or encode data, such as readByte()/writeByte() and readVInt()/writeVInt(). Methods like getFilePointer(), seek(), close(), etc., which are related not to data encoding but to files as input/output sources, stay in IndexInput/IndexOutput.

This patch also changes ByteSliceReader/ByteSliceWriter to extend DataInput/DataOutput. Previously, ByteSliceReader implemented the methods that stay in IndexInput by throwing RuntimeExceptions. See also LUCENE-2125. All tests pass.
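The proposed split can be sketched as follows. This is a hedged illustration: only the method placement follows the issue description; the class names with the "Sketch" suffix and all bodies are mine, not the patch. readVInt() mirrors Lucene's 7-bits-per-byte VInt encoding.

```java
// Hedged illustration of the proposed split; not the actual patch.
// DataInput keeps the decoding methods; file-oriented methods stay on IndexInput.
public abstract class DataInputSketch {
    public abstract byte readByte();

    // VInt decoding in terms of readByte(): 7 data bits per byte, with the
    // high bit set on every byte except the last (Lucene's VInt format).
    public int readVInt() {
        byte b = readByte();
        int i = b & 0x7F;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = readByte();
            i |= (b & 0x7F) << shift;
        }
        return i;
    }
}

// Methods tied to files as an input source stay on the IndexInput side.
abstract class IndexInputSketch extends DataInputSketch {
    public abstract long getFilePointer();
    public abstract void seek(long pos);
    public abstract void close();
}

// Minimal concrete DataInput over a byte[] just to exercise readVInt();
// 0xAC, 0x02 encodes 44 + (2 << 7) = 300.
class ByteArrayDataInputSketch extends DataInputSketch {
    private final byte[] bytes;
    private int pos;
    ByteArrayDataInputSketch(byte[] bytes) { this.bytes = bytes; }
    public byte readByte() { return bytes[pos++]; }
}
```

Under this split, something like ByteSliceReader can extend the DataInput side directly and no longer needs to stub out seek()/close() with RuntimeExceptions.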
[jira] Commented: (LUCENE-2133) [PATCH] IndexCache: Refactoring of FieldCache, FieldComparator, SortField
[ https://issues.apache.org/jira/browse/LUCENE-2133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12789384#action_12789384 ]

Mark Miller commented on LUCENE-2133:

bq. Something along these lines maybe?

And we are back to 831 :)

[PATCH] IndexCache: Refactoring of FieldCache, FieldComparator, SortField

Key: LUCENE-2133
URL: https://issues.apache.org/jira/browse/LUCENE-2133
Project: Lucene - Java
Issue Type: Improvement
Components: Search
Affects Versions: 2.9.1, 3.0
Reporter: Christian Kohlschütter
Attachments: LUCENE-2133-complete.patch, LUCENE-2133.patch, LUCENE-2133.patch, LUCENE-2133.patch

Hi all,

up to the current version Lucene contains a conceptual flaw: the FieldCache. The FieldCache is a singleton which is supposed to cache certain information for every IndexReader that is currently open. The FieldCache is flawed because it is incorrect to assume that:

1. One IndexReader instance equals one index. In fact, there can be many clones (of SegmentReader) or decorators (FilterIndexReader) which all access the very same data.
2. The cached information remains valid for the lifetime of an IndexReader. In fact, some IndexReaders may be reopen()'ed and thus may contain completely different information.
3. All IndexReaders need the same type of cache. In fact, because of the limitations imposed by the singleton construct, there was no implementation other than FieldCacheImpl.

Furthermore, FieldCacheImpl and FieldComparator are bloated by several static inner classes that could be moved to package level. There have been a few attempts to improve FieldCache, namely LUCENE-831, LUCENE-1579 and LUCENE-1749, but the overall situation remains the same: there is a central registry for assigning caches to IndexReader instances.

I now propose the following:

1. Obsolete FieldCache and FieldCacheKey and provide index-specific, extensible cache instances (IndexCache). IndexCaches provide common caching functionality for all IndexReaders and may be extended (for example, SegmentReader would have a SegmentReaderIndexCache and store different data than a regular IndexCache).
2. Add the index-specific field cache (IndexFieldCache) to the IndexCache. IndexFieldCache is an interface just like FieldCache and may support different implementations.
3. The IndexCache instances may be flushed/closed by the associated IndexReaders whenever necessary.
4. Obsolete FieldCacheSanityChecker because no more insanities are expected (or at least, they do not impact overall performance).
5. Refactor FieldCacheImpl and the related classes (FieldComparator, SortField).

I have provided a patch which takes care of all these issues. It passes all JUnit tests. The patch is quite large, admittedly, but the change required several modifications, and some more to preserve backwards compatibility. Backwards compatibility is preserved by moving some of the updated functionality into the package org.apache.lucene.search.fields (field comparators and parsers, SortField) while adding wrapper instances and keeping the old code in org.apache.lucene.search.

In detail, and besides the above-mentioned improvements, the following is provided:

1. An IndexCache specific to SegmentReaders. The two ThreadLocals are moved from SegmentReader to SegmentReaderIndexCache.
2. A housekeeping improvement to CloseableThreadLocal. It now delegates the close() method to all registered instances by calling an onClose() method with the threads' instances.
3. Analyzer.close may now throw an IOException (this is already covered by java.io.Closeable).
4. A change to Collector: allow an IndexCache instead of an IndexReader to be passed to setNextReader().
5. SortField's numeric types have been replaced by direct assignments of FieldComparatorSource. This removes the switch statements and the possibility of throwing IllegalArgumentExceptions because of unsupported type values.

The following classes have been deprecated and replaced by new classes in org.apache.lucene.search.fields:
- FieldCacheRangeFilter (= IndexFieldCacheRangeFilter)
- FieldCacheTermsFilter (= IndexFieldCacheTermsFilter)
- FieldCache (= IndexFieldCache)
- FieldCacheImpl (= IndexFieldCacheImpl)
- all classes in FieldCacheImpl (= several package-level classes)
- all subclasses of FieldComparator (= several package-level classes)

Final notes:
- The patch would be simpler if no backwards compatibility was necessary. The Lucene community has to decide which classes/methods can be removed immediately, which ones later, and which not at all. Whenever new classes depend on the old ones, an appropriate notice exists in the javadocs.
- The patch introduces a new
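The heart of the proposal - hanging the cache off each reader instead of keying a global singleton by reader instance - can be sketched in plain Java. This is an illustrative sketch only, not the actual LUCENE-2133 API; all class and method names here are hypothetical:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical sketch: each reader owns an IndexCache, so clones and
// decorators share cached data, and closing the reader releases it --
// no static registry keyed by reader instances.
class IndexCacheSketch {
    private final Map<String, int[]> fieldInts = new ConcurrentHashMap<>();

    int[] getInts(String field, Function<String, int[]> loader) {
        // compute per field once; later lookups reuse the same array
        return fieldInts.computeIfAbsent(field, loader);
    }

    void close() {
        // cache lifetime is tied to the reader, not to a global singleton
        fieldInts.clear();
    }
}

class ReaderSketch {
    private final IndexCacheSketch cache;

    ReaderSketch() { this(new IndexCacheSketch()); }
    private ReaderSketch(IndexCacheSketch shared) { this.cache = shared; }

    // A clone/decorator shares the underlying cache instead of showing up
    // as a distinct key in a static FieldCache (flaw #1 above).
    ReaderSketch cloneReader() { return new ReaderSketch(cache); }

    IndexCacheSketch getIndexCache() { return cache; }

    void close() { cache.close(); }
}
```

In this shape, flaw #1 (clones treated as separate indexes) disappears because clones share one cache instance, and flaw #2 (stale entries after reopen) becomes a non-issue because a reopened reader simply carries a fresh cache.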
[jira] Commented: (LUCENE-1377) Add HTMLStripReader and WordDelimiterFilter from SOLR
[ https://issues.apache.org/jira/browse/LUCENE-1377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12788762#action_12788762 ]

Mark Miller commented on LUCENE-1377:

bq. with the exception of a few core committers.

I think the exception is the other way around, especially considering Lucene contrib. Let's look at the Solr list (and consider that some are not very active in Solr currently):

||name||status||
|Bill Au| |
|Doug Cutting|Lucene Core Committer|
|Otis Gospodnetić|Lucene Core Committer|
|Erik Hatcher|Lucene Core Committer|
|Chris Hostetter|Lucene Core Committer|
|Grant Ingersoll|Lucene Core Committer|
|Mike Klaas| |
|Shalin Shekhar Mangar| |
|Ryan McKinley|Lucene Contrib Committer|
|Mark Miller|Lucene Core Committer|
|Noble Paul| |
|Yonik Seeley|Lucene Core Committer|
|Koji Sekiguchi|Lucene Contrib Committer|

Add HTMLStripReader and WordDelimiterFilter from SOLR

Key: LUCENE-1377
URL: https://issues.apache.org/jira/browse/LUCENE-1377
Project: Lucene - Java
Issue Type: Improvement
Components: Analysis
Affects Versions: 2.3.2
Reporter: Jason Rutherglen
Priority: Minor
Original Estimate: 24h
Remaining Estimate: 24h

SOLR has two classes, HTMLStripReader and WordDelimiterFilter, which are very useful for a wide variety of use cases. It would be good to place them into core Lucene.

--
This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2133) [PATCH] IndexCache: Refactoring of FieldCache, FieldComparator, SortField
[ https://issues.apache.org/jira/browse/LUCENE-2133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12788874#action_12788874 ]

Mark Miller commented on LUCENE-2133:

I don't know that back compat is really a concern if we are just leaving the old API intact, with its own caching mechanism? Just deprecate the old API and make a new one. This is a big pain, because you have to be sure you don't straddle the two APIs when upgrading, but that's the boat we will be in anyway. Which means a new impl should provide enough benefits to make that large pain worth enduring. 831 was not committed for the same reason - it didn't bring enough to the table to be worth it after we got to a per-segment cache in another way. Since I don't see that this provides anything over 831, I don't see how it's not in the same boat.

I'm not sure we should target a specific release with this - we don't even know when 3.1 is going to happen. 2.9 took a year. It's anybody's guess - we should probably just do what makes sense and commit it when it's ready.
[jira] Commented: (LUCENE-2133) [PATCH] IndexCache: Refactoring of FieldCache, FieldComparator, SortField
[ https://issues.apache.org/jira/browse/LUCENE-2133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12788908#action_12788908 ]

Mark Miller commented on LUCENE-2133:

bq. LUCENE-831 still requires a static FieldCache, the root of all evil :)

It doesn't require one, though. It supports a cache per segment reader just like this - except it's called a ValueSource. The CacheByReaderValueSource is just there to handle a back compat issue - it's something we would want to get around, using the reader ValueSource instead - but that patch still had a long way to go. Overall, from what I can see, the approach was about the same.

bq. It probably makes sense to start from one of Hoss's original patches or even from scratch

That was said before a lot more work was done. The API was actually starting to shape up nicely.

bq. The more complex the patches are, the longer it will take to integrate them into a new version.

Of course - and this is a complex issue with a lot of upgrade pain. As with 831, it's not really worth the pain to users without more benefits.

bq. The more such patches you have, the longer it will take to get to a new release.

That's not really true. 3.1 doesn't need this patch - there would be no reason to hold it for this. Patches go in when they are ready.

bq. Let's make it simple, submit what we have and build upon that.

I don't think that's simple :) The patch can be iterated on outside of trunk as easily as in.
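The per-segment ValueSource idea referenced above can be sketched in a few lines of plain Java. This is a hypothetical illustration of the concept, not the actual LUCENE-831 API; the SegmentKey and IntValueSource names are invented for the example:

```java
import java.util.Map;
import java.util.WeakHashMap;

// Hypothetical sketch of a per-segment value source: field values are
// computed and cached once per segment, with no static FieldCache
// registry involved.
interface SegmentKey {
    int maxDoc();
}

class IntValueSource {
    // Weak keys let an entry disappear when its segment reader is
    // garbage-collected -- nothing to purge from a global singleton.
    private final Map<SegmentKey, int[]> perSegment = new WeakHashMap<>();

    synchronized int[] getInts(SegmentKey segment) {
        // load once per segment; repeated calls reuse the same array
        return perSegment.computeIfAbsent(segment, s -> new int[s.maxDoc()]);
    }
}
```

Because the cache key is the segment itself, a composite reader built from the same segments reuses the same arrays, which is the property that made the per-segment approach attractive in the first place.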
[jira] Commented: (LUCENE-2133) [PATCH] IndexCache: Refactoring of FieldCache, FieldComparator, SortField
[ https://issues.apache.org/jira/browse/LUCENE-2133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12788951#action_12788951 ]

Mark Miller commented on LUCENE-2133:

bq. That is, it adds a lot of duplicated code / different possible implementations for the same thing.

Things that were still ugly were not likely to stick around - 831 was very much a work in progress. The solution there to handle back compat issues was a working solution that would need to be improved upon. 831 was still in an experimentation state - issues that needed more thought had hacked-in working solutions. We had a more general cache at one point, and began working towards ValueSources based on discussion. The latest 831 patch is an exploration of that, not a final product.

bq. They should store arbitrary data, allow cache inspection, eviction of entries and so on.

That's extremely simple to add to an IndexReader - we were thinking of a ValueSource as something different than a basic cache.

{quote} It is indeed a complex problem but it can easily be split into several subtasks that can be addressed by different people in parallel. To allow such development, we have to somehow get the base code into SVN - not necessarily trunk, admittedly; a branch would also do. Of course, this requires additional work to keep it in sync with trunk. If we can really assume to have 3.1 in one year, we have lots of time for developing a stable, powerful new API directly in trunk. Of course, this is a decision related to release management and not to the actual problem. I can live with both ways (trunk vs. branch), but, in my opinion, managing the changes just as patch files in jira is not a viable option. {quote}

A branch is certainly a possibility, but with only one person working on it, I think it's overkill. With some additional interest, a branch can make sense - otherwise it's not worth the merging headaches. You also have to have committer(s) that are willing to take on the merging.

At one point, 831 was much more like this patch. Discussion along the lines of what Mike brought up above started transforming it into something else. We essentially decided that unless that much was brought to the table, the disruptive change just wasn't worth it for a different cache API. I'm definitely a proponent of FieldCache reform - but I think we want to fully flesh it out before committing to something in trunk.
[jira] Issue Comment Edited: (LUCENE-2133) [PATCH] IndexCache: Refactoring of FieldCache, FieldComparator, SortField
[ https://issues.apache.org/jira/browse/LUCENE-2133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12788951#action_12788951 ]

Mark Miller edited comment on LUCENE-2133 at 12/10/09 9:48 PM:

bq. That is, it adds a lot of duplicated code / different possible implementations for the same thing.

Things that were still ugly were not likely to stick around - 831 was very much a work in progress. The solution there to handle back compat issues was a working solution that would need to be improved upon. 831 was still in an experimentation state - issues that needed more thought had hacked-in working solutions. We had a more general cache at one point, and began working towards ValueSources based on discussion. The latest 831 patch is an exploration of that, not a final product.

bq. They should store arbitrary data, allow cache inspection, eviction of entries and so on.

That's extremely simple to add to an IndexReader - we were thinking of a ValueSource as something different than a basic cache.

{quote} It is indeed a complex problem but it can easily be split into several subtasks that can be addressed by different people in parallel. To allow such development, we have to somehow get the base code into SVN - not necessarily trunk, admittedly; a branch would also do. Of course, this requires additional work to keep it in sync with trunk. If we can really assume to have 3.1 in one year, we have lots of time for developing a stable, powerful new API directly in trunk. Of course, this is a decision related to release management and not to the actual problem. I can live with both ways (trunk vs. branch), but, in my opinion, managing the changes just as patch files in jira is not a viable option. {quote}

A branch is certainly a possibility, but with only one person working on it, I think it's overkill. With some additional interest, a branch can make sense - otherwise it's not worth the merging headaches. You also have to have committer(s) that are willing to take on the merging.

At one point, 831 was much more like this patch. Discussion along the lines of what Mike brought up above started transforming it into something else. We essentially decided that unless that much was brought to the table, the disruptive change just wasn't worth it for a different cache API. I'm definitely a proponent of FieldCache reform - but I think we want to fully flesh it out before committing to something in trunk.
[jira] Commented: (LUCENE-2018) Reconsider boolean max clause exception
[ https://issues.apache.org/jira/browse/LUCENE-2018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12787658#action_12787658 ]

Mark Miller commented on LUCENE-2018:

I still think this should be removed - or moved to the MTQ query itself. Then a setting on the query parser could set it, or a user could set it. It shouldn't be a sys property, and I don't necessarily think it should be on by default either.

Reconsider boolean max clause exception

Key: LUCENE-2018
URL: https://issues.apache.org/jira/browse/LUCENE-2018
Project: Lucene - Java
Issue Type: Improvement
Reporter: Mark Miller
Fix For: 3.1

Now that we have smarter multi-term queries, I think it's time to reconsider the boolean max clause setting. It made more sense before, because you could hit it unawares when the multi-term queries got huge - now it's more likely that if it happens, it's because a user built the boolean themselves. And no duh, thousands more boolean clauses means slower perf and more resources needed. We don't throw an exception when you try to use a ton of resources in a thousand other ways.

The current setting also suffers from the static hell argument - especially when you consider something like Solr's multicore feature: you can have different settings for this in different cores, and the last one is going to win. It's ugly. Yes, that could be addressed better in Solr as well - but I still think it should be less ugly in Lucene too.

I'd like to consider either doing away with it, or at the least raising it by quite a bit. Or an alternative, better solution. Right now, it ain't so great.
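The "static hell" complaint - one JVM-wide limit shared by every core - can be illustrated with a small sketch of the suggested alternative: make the clause limit an instance setting rather than a static one. The class below is hypothetical and is not Lucene's BooleanQuery API; it only shows the shape of a per-query limit:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: the max-clause limit lives on the query instance,
// so two Solr cores (or two callers) with different limits no longer
// fight over a single static field.
class PerQueryBooleanSketch {
    static final int DEFAULT_MAX_CLAUSES = 1024;

    private final int maxClauses;
    private final List<String> clauses = new ArrayList<>();

    PerQueryBooleanSketch() { this(DEFAULT_MAX_CLAUSES); }

    PerQueryBooleanSketch(int maxClauses) { this.maxClauses = maxClauses; }

    void add(String clause) {
        // the check is local to this query, not a JVM-wide setting
        if (clauses.size() >= maxClauses) {
            throw new IllegalStateException("too many clauses: " + maxClauses);
        }
        clauses.add(clause);
    }

    int size() { return clauses.size(); }
}
```

With this shape, a query parser could set a conservative limit for rewritten multi-term queries while hand-built queries opt out entirely, which is the per-use-case control the comment argues for.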
[jira] Commented: (LUCENE-2133) [PATCH] IndexCache: Refactoring of FieldCache, FieldComparator, SortField
[ https://issues.apache.org/jira/browse/LUCENE-2133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12787711#action_12787711 ]

Mark Miller commented on LUCENE-2133:

There are a bunch of unrelated changes (imports/names/exceptions thrown) that should be pulled from this patch.
[jira] Commented: (LUCENE-2133) [PATCH] IndexCache: Refactoring of FieldCache, FieldComparator, SortField
[ https://issues.apache.org/jira/browse/LUCENE-2133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12787715#action_12787715 ] Mark Miller commented on LUCENE-2133: - Hmm ... never mind. The exception is related and most of the imports are correct - brain spin. I didn't see that the import org.apache.lucene.search.SortField; // for javadocs wasn't being used anymore anyway. The import org.apache.lucene.search.fields.IndexFieldCache in NumericQuery should get a // for javadocs comment so someone doesn't accidentally remove it. And I guess the t to threadLocal change doesn't hurt with the amount you're changing that anyway. It's a better name. This looks pretty nice overall.

[PATCH] IndexCache: Refactoring of FieldCache, FieldComparator, SortField
Key: LUCENE-2133
URL: https://issues.apache.org/jira/browse/LUCENE-2133
Project: Lucene - Java
Issue Type: Improvement
Components: Search
Affects Versions: 2.9.1, 3.0
Reporter: Christian Kohlschütter
Attachments: LUCENE-2133-complete.patch, LUCENE-2133.patch, LUCENE-2133.patch

Hi all, up to the current version Lucene contains a conceptual flaw: the FieldCache. The FieldCache is a singleton which is supposed to cache certain information for every IndexReader that is currently open. The FieldCache is flawed because it is incorrect to assume that:
1. one IndexReader instance equals one index. In fact, there can be many clones (of SegmentReader) or decorators (FilterIndexReader) which all access the very same data.
2. the cache information remains valid for the lifetime of an IndexReader. In fact, some IndexReaders may be reopen()'ed and thus they may contain completely different information.
3. all IndexReaders need the same type of cache. In fact, because of the limitations imposed by the singleton construct there was no implementation other than FieldCacheImpl.
Furthermore, FieldCacheImpl and FieldComparator are bloated by several static inner classes that could be moved to package level. There have been a few attempts to improve FieldCache, namely LUCENE-831, LUCENE-1579 and LUCENE-1749, but the overall situation remains the same: there is a central registry for assigning caches to IndexReader instances. I now propose the following:
1. Obsolete FieldCache and FieldCacheKey and provide index-specific, extensible cache instances (IndexCache). IndexCaches provide common caching functionality for all IndexReaders and may be extended (for example, SegmentReader would have a SegmentReaderIndexCache and store different data than a regular IndexCache).
2. Add the index-specific field cache (IndexFieldCache) to the IndexCache. IndexFieldCache is an interface just like FieldCache and may support different implementations.
3. The IndexCache instances may be flushed/closed by the associated IndexReaders whenever necessary.
4. Obsolete FieldCacheSanityChecker because no more insanities are expected (or at least, they do not impact the overall performance).
5. Refactor FieldCacheImpl and the related classes (FieldComparator, SortField).
I have provided a patch which takes care of all these issues. It passes all JUnit tests. The patch is quite large, admittedly, but the change required several modifications and some more to preserve backwards compatibility. Backwards compatibility is preserved by moving some of the updated functionality into the package org.apache.lucene.search.fields (field comparators and parsers, SortField) while adding wrapper instances and keeping the old code in org.apache.lucene.search. In detail, and besides the above-mentioned improvements, the following is provided:
1. An IndexCache specific to SegmentReaders. The two ThreadLocals are moved from SegmentReader to SegmentReaderIndexCache.
2. A housekeeping improvement to CloseableThreadLocal. It now delegates the close() method to all registered instances by calling an onClose() method with the threads' instances.
3. Analyzer.close now may throw an IOException (this is already covered by java.io.Closeable).
4. A change to Collector: allow an IndexCache instead of an IndexReader to be passed to setNextReader().
5. SortField's numeric types have been replaced by direct assignments of FieldComparatorSource. This removes the switch statements and the possibility of throwing IllegalArgumentExceptions because of unsupported type values.
The following classes have been deprecated and replaced by new classes in org.apache.lucene.search.fields:
- FieldCacheRangeFilter (= IndexFieldCacheRangeFilter)
- FieldCacheTermsFilter (= IndexFieldCacheTermsFilter)
- FieldCache (= IndexFieldCache)
- FieldCacheImpl (= IndexFieldCacheImpl)
- all classes in FieldCacheImpl (= several package-level classes)
- all subclasses of FieldComparator (= several package-level classes)
Final notes:
- The patch would be simpler if no backwards compatibility was necessary. The Lucene community has to decide which classes/methods can immediately be removed, which ones later, which not at all. Whenever new classes depend on the old ones, an appropriate notice
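The heart of the proposal - a cache owned by each reader and closed with it, instead of a global singleton keyed by reader - can be sketched in a few lines. This is a minimal illustration under assumptions: only the name IndexCache and the reader-owns-cache rule come from the issue text; the map, the stubbed getInts loader, and the ReaderWithCache class are invented for the sketch.

```java
import java.io.Closeable;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical per-index cache: owned by one reader, closed with it. */
class IndexCache implements Closeable {
    private final Map<String, int[]> ints = new HashMap<>();

    /** Field-cache style lookup: compute once per field, then reuse. */
    int[] getInts(String field) {
        // Real code would uninvert the field here; the loader is stubbed out.
        return ints.computeIfAbsent(field, f -> new int[0]);
    }

    int size() { return ints.size(); }

    /** Flushed/closed by the owning reader, not by a global registry. */
    @Override public void close() { ints.clear(); }
}

/** Stand-in for an IndexReader that owns its cache (point 3 of the proposal). */
class ReaderWithCache implements Closeable {
    final IndexCache cache = new IndexCache();
    @Override public void close() { cache.close(); }
}
```

Because the cache's lifetime is exactly the reader's lifetime, the stale-entry problem with reopen()'ed readers disappears by construction.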
[jira] Commented: (LUCENE-2133) [PATCH] IndexCache: Refactoring of FieldCache, FieldComparator, SortField
[ https://issues.apache.org/jira/browse/LUCENE-2133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12787729#action_12787729 ] Mark Miller commented on LUCENE-2133: - A couple more quick notes: I know the FieldComparator class is ugly, but I'm not sure we should pull the rug by putting the impls in a new package. On the other hand, it's not likely to affect many and it was experimental - so it's a tough call. It's a lot of classes in there ;) I'm also not sure if fields is the right package name? And do the Filters belong in that package? Also, almost a non-issue, but extending a deprecated class is going to be an ultra-minor back-compat break when it's removed. Not likely a problem though. But we might put a note to that effect to be clear. It is almost self-documenting anyway though :) Rather than changing the tests to the new classes, we should probably copy them and make new ones - then remove them when the deprecations are removed. Also, you should pull the author tag(s) - all credit is through JIRA and CHANGES. (I only see it like once, so I bet that's Eclipse?) I haven't done a thorough review of it all, but this is pretty great stuff to appear so complete and out of nowhere :)
[jira] Commented: (LUCENE-2133) [PATCH] IndexCache: Refactoring of FieldCache, FieldComparator, SortField
[ https://issues.apache.org/jira/browse/LUCENE-2133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12787734#action_12787734 ] Mark Miller commented on LUCENE-2133: - It looks like FieldCacheTermsFilterDocIdSet is using the wrong StringIndex? And I think the FieldCache import in that class can be removed.
[jira] Issue Comment Edited: (LUCENE-2133) [PATCH] IndexCache: Refactoring of FieldCache, FieldComparator, SortField
[ https://issues.apache.org/jira/browse/LUCENE-2133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12787734#action_12787734 ] Mark Miller edited comment on LUCENE-2133 at 12/8/09 8:42 PM: -- It looks like FieldCacheTermsFilterDocIdSet is using the wrong StringIndex? And I think the FieldCache import in that class can be removed (same with IndexFieldCacheRangeFilter). was (Author: markrmil...@gmail.com): It looks like FieldCacheTermsFilterDocIdSet is using the wrong StringIndex? And I think the FieldCache import in that class can be removed.
[jira] Commented: (LUCENE-2133) [PATCH] IndexCache: Refactoring of FieldCache, FieldComparator, SortField
[ https://issues.apache.org/jira/browse/LUCENE-2133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12787748#action_12787748 ] Mark Miller commented on LUCENE-2133: - bq. I think it does not hurt either.
I didn't notice that you actually just deprecated the originals - I guess that's not a complete rug pull ... By the way, I don't think you need to deprecate something in a new class (IndexFieldCacheImpl):
{code}
/**
 * @deprecated Use {@link #clear()} instead.
 */
public void purgeAllCaches() {
  init();
}
{code}
[jira] Issue Comment Edited: (LUCENE-2133) [PATCH] IndexCache: Refactoring of FieldCache, FieldComparator, SortField
[ https://issues.apache.org/jira/browse/LUCENE-2133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12787752#action_12787752 ] Mark Miller edited comment on LUCENE-2133 at 12/8/09 9:34 PM: -- And what about the doubling-up insanity? It looks like you just commented out that check? It appears to me that that's still an issue we want to check for - we want to make sure Lucene core and users have a way to be sure they are not using a top-level reader and its sub-readers for caches unless they *really* intend to. *edit* This type of change actually even exacerbates that problem (though if we want to improve things here, it's something we will have to deal with). Now you might have a mixture of old-API/new-API caches as well if you don't properly upgrade everything at once. was (Author: markrmil...@gmail.com): And what about the doubling-up insanity? It looks like you just commented out that check? It appears to me that that's still an issue we want to check for - we want to make sure Lucene core and users have a way to be sure they are not using a top-level reader and its sub-readers for caches unless they *really* intend to.
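The "doubling up" insanity being discussed can be illustrated with a toy cache. This is not Lucene's FieldCacheSanityChecker - the ToyCache class and its heuristic are invented for this sketch - but it shows the symptom the checker looks for: caching values for a top-level reader and for each of its sub-readers stores the same data twice, and that shows up as one entry whose document count equals the sum of the others.

```java
import java.util.ArrayList;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;

/** Toy field cache keyed by reader identity (illustrative only). */
class ToyCache {
    private final Map<Object, int[]> entries = new IdentityHashMap<>();

    /** Returns a value array of docCount entries for the given reader key. */
    int[] getInts(Object readerKey, int docCount) {
        return entries.computeIfAbsent(readerKey, k -> new int[docCount]);
    }

    /**
     * Crude sanity heuristic: if one entry's doc count equals the sum of all
     * the others, a composite reader and its segments were probably both cached.
     */
    boolean looksDoubled() {
        List<Integer> sizes = new ArrayList<>();
        for (int[] v : entries.values()) sizes.add(v.length);
        for (int i = 0; i < sizes.size(); i++) {
            int sumOfOthers = 0;
            for (int j = 0; j < sizes.size(); j++) {
                if (j != i) sumOfOthers += sizes.get(j);
            }
            if (sumOfOthers > 0 && sizes.get(i) == sumOfOthers) return true;
        }
        return false;
    }
}
```

Populating two segment entries is fine; adding a top-level entry covering both is the pattern the comment wants to keep detectable.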
[jira] Commented: (LUCENE-2133) [PATCH] IndexCache: Refactoring of FieldCache, FieldComparator, SortField
[ https://issues.apache.org/jira/browse/LUCENE-2133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12787767#action_12787767 ] Mark Miller commented on LUCENE-2133: - bq. not bind the cache so hard to the IndexReader (which was also the problem with the last FieldCache), instead just make it a plugin component
At a minimum, you should be able to set the cache for the reader.
bq. For the functionality of Lucene, FieldCache is not needed, sorting is just an addon on searching
The way he has it, this is not just for the FieldCache, but also the FieldsReader and TermVectorsReader - if we go down that road, we should consider norms as well.
bq. I see no problems with applying it soon
I still think it might be a little early. This has a lot of consequences.
[jira] Resolved: (LUCENE-2106) Benchmark does not close its Reader when OpenReader/CloseReader are not used
[ https://issues.apache.org/jira/browse/LUCENE-2106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Miller resolved LUCENE-2106. - Resolution: Fixed Benchmark does not close its Reader when OpenReader/CloseReader are not used Key: LUCENE-2106 URL: https://issues.apache.org/jira/browse/LUCENE-2106 Project: Lucene - Java Issue Type: Bug Components: contrib/benchmark Affects Versions: 3.0 Reporter: Mark Miller Assignee: Mark Miller Fix For: 3.0.1, 3.1 Attachments: LUCENE-2106.patch Only the Searcher is closed, but because the reader is passed to the Searcher, the Searcher does not close the Reader, causing a resource leak. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
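The ownership rule behind this bug can be shown without any Lucene code. BenchReader and BenchSearcher below are invented stand-ins, not the benchmark classes: a searcher that is handed an already-open reader does not own it, so closing only the searcher leaks the reader - the code that opened the reader must close it too.

```java
import java.io.Closeable;

/** Stand-in for an index reader that tracks whether it was closed. */
class BenchReader implements Closeable {
    boolean open = true;
    @Override public void close() { open = false; }
}

/** Stand-in for a searcher built from a borrowed (not owned) reader. */
class BenchSearcher implements Closeable {
    private final BenchReader reader;
    BenchSearcher(BenchReader reader) { this.reader = reader; }
    // Mirrors the bug's cause: a searcher does not close a reader it was given.
    @Override public void close() { /* intentionally leaves the reader open */ }
}

class BenchTask {
    /** The fix: whoever opened the reader closes it, after the searcher. */
    static BenchReader runAndClose() {
        BenchReader reader = new BenchReader();
        BenchSearcher searcher = new BenchSearcher(reader);
        try {
            // ... run queries ...
        } finally {
            searcher.close();
            reader.close(); // without this line, the reader leaks
        }
        return reader;
    }
}
```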
[jira] Commented: (LUCENE-1844) Speed up junit tests
[ https://issues.apache.org/jira/browse/LUCENE-1844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12787004#action_12787004 ] Mark Miller commented on LUCENE-1844: - It should work fine. Speed up junit tests Key: LUCENE-1844 URL: https://issues.apache.org/jira/browse/LUCENE-1844 Project: Lucene - Java Issue Type: Improvement Reporter: Mark Miller Assignee: Michael McCandless Fix For: 3.1 Attachments: FastCnstScoreQTest.patch, hi_junit_test_runtimes.png, LUCENE-1844-Junit3.patch, LUCENE-1844.patch, LUCENE-1844.patch, LUCENE-1844.patch As Lucene grows, so does the number of JUnit tests. This is obviously a good thing, but it comes with longer and longer test times. Now that we also run back compat tests in a standard test run, this problem is essentially doubled. There are some ways this may get better, including running parallel tests. You will need the hardware to fully take advantage, but it should be a nice gain. There is already an issue for this, and Junit 4.6, 4.7 have the beginnings of something we might be able to count on soon. 4.6 was buggy, and 4.7 still doesn't come with nice ant integration. Parallel tests will come though. Beyond parallel testing, I think we also need to concentrate on keeping our tests lean. We don't want to sacrifice coverage or quality, but I'm sure there is plenty of fat to skim. I've started making a list of some of the longer tests - I think with some work we can make our tests much faster - and then with parallelization, I think we could see some really great gains.
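The parallelization idea in the issue above can be illustrated independently of any test framework: independent test tasks go onto a thread pool, and the run fails if any task fails. This is a generic sketch of the concept, not the actual ant/JUnit integration the issue discusses; the ParallelRunner class and runAll method are invented for illustration:

```java
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.Callable;

/** Toy illustration of running independent test tasks concurrently. */
class ParallelRunner {
    /** Runs all tasks on a fixed pool; returns true only if every task passes. */
    static boolean runAll(List<Callable<Boolean>> tests, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            boolean allPassed = true;
            // invokeAll blocks until every task has completed
            for (Future<Boolean> result : pool.invokeAll(tests)) {
                allPassed &= result.get();
            }
            return allPassed;
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdownNow();
        }
    }
}
```

The caveat from the thread applies here too: the speedup is bounded by the available hardware and by how independent the tests really are (shared directories, ports, or static state serialize them again).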
[jira] Created: (LUCENE-2130) Investigate Rewriting Constant Scoring MultiTermQueries per segment
Investigate Rewriting Constant Scoring MultiTermQueries per segment --- Key: LUCENE-2130 URL: https://issues.apache.org/jira/browse/LUCENE-2130 Project: Lucene - Java Issue Type: Improvement Reporter: Mark Miller Priority: Minor This issue is likely not to go anywhere, but I thought we might explore it. The only idea I have come up with is fairly ugly, and unless something better comes up, this is not likely to happen. But if we could rewrite constant score multi-term queries per segment, MTQs with auto, constant, or constant boolean rewrite could enum terms against a single segment and then apply a boolean query against each segment with just the terms that are known to be in that segment. This way, if you have a bunch of really large segments and a lot of really small segments, you wouldn't apply a huge BooleanQuery against all of the small segments which don't have those terms anyway. How advantageous this is, I'm not sure yet. No biggie, not likely, but what the heck. So the ugly way to do it is to add a property to queries and weights - lateCnstRewrite or something, that defaults to false. MTQ would return true if it's in a constant score mode. On the top level rewrite, if this is detected, an empty ConstantScoreQuery is made, and its Weight is turned to lateCnstRewrite and it keeps a ref to the original MTQ query. It also gets its boost set to the MTQ's boost. Then when we are searching per segment, if the Weight is lateCnstRewrite, we grab the orig query and actually do the rewrite against the subreader and grab the actual constant score weight. It works I think - but its a little ugly.
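The deferred-rewrite trick described above can be sketched with toy stand-ins. These are not Lucene's real Query/Weight classes; Segment, PrefixQuery, TermListQuery, and LateConstantWeight below are invented to show the shape of the idea: the weight holds a reference to the original multi-term query and only expands it against each segment's own terms, so small segments get a small (possibly empty) expansion:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Toy segment: just the set of terms it contains. */
class Segment {
    final Set<String> terms;
    Segment(Set<String> terms) { this.terms = terms; }
}

interface Query {
    Query rewrite(Segment seg);
}

/** Stand-in for an MTQ: expands a prefix against one segment's terms only. */
class PrefixQuery implements Query {
    final String prefix;
    PrefixQuery(String prefix) { this.prefix = prefix; }

    @Override
    public Query rewrite(Segment seg) {
        List<String> matched = new ArrayList<>();
        for (String t : seg.terms) {
            if (t.startsWith(prefix)) matched.add(t);
        }
        return new TermListQuery(matched);
    }
}

/** The per-segment expansion: only terms known to exist in that segment. */
class TermListQuery implements Query {
    final List<String> terms;
    TermListQuery(List<String> terms) { this.terms = terms; }

    @Override
    public Query rewrite(Segment seg) { return this; }
}

/** "lateCnstRewrite" weight: keeps the original MTQ and rewrites it per segment. */
class LateConstantWeight {
    final Query original;
    LateConstantWeight(Query original) { this.original = original; }

    Query forSegment(Segment seg) {
        return original.rewrite(seg);
    }
}
```

With a large segment containing the matching terms and a small segment without them, the small segment's rewrite comes back empty instead of carrying the full term expansion, which is exactly the saving the issue speculates about.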
[jira] Commented: (LUCENE-2130) Investigate Rewriting Constant Scoring MultiTermQueries per segment
[ https://issues.apache.org/jira/browse/LUCENE-2130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12787239#action_12787239 ] Mark Miller commented on LUCENE-2130: - Whoops - a little off in that summary - you wouldn't apply a huge boolean query - you'd just have a sparser filter. This might not be that beneficial. Investigate Rewriting Constant Scoring MultiTermQueries per segment --- Key: LUCENE-2130 URL: https://issues.apache.org/jira/browse/LUCENE-2130 Project: Lucene - Java Issue Type: Improvement Reporter: Mark Miller Priority: Minor
[jira] Commented: (LUCENE-2130) Investigate Rewriting Constant Scoring MultiTermQueries per segment
[ https://issues.apache.org/jira/browse/LUCENE-2130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12787248#action_12787248 ] Mark Miller commented on LUCENE-2130: - Okay - so talking to Robert in chat - the advantage when you are enumerating a lot of terms is that you avoid DirectoryReader's MultiTermEnum and its PQ. Investigate Rewriting Constant Scoring MultiTermQueries per segment --- Key: LUCENE-2130 URL: https://issues.apache.org/jira/browse/LUCENE-2130 Project: Lucene - Java Issue Type: Improvement Reporter: Mark Miller Priority: Minor