[jira] Created: (LUCENE-2313) Add VERBOSE to LuceneTestCase and LuceneTestCaseJ4

2010-03-13 Thread Shai Erera (JIRA)
Add VERBOSE to LuceneTestCase and LuceneTestCaseJ4
--

 Key: LUCENE-2313
 URL: https://issues.apache.org/jira/browse/LUCENE-2313
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Shai Erera
Priority: Minor
 Fix For: 3.1


component-build.xml allows tests.verbose to be defined as a system property 
when running tests. Neither LuceneTestCase nor LuceneTestCaseJ4 reads that 
property. It would be useful for subclassing tests to have one place to access 
this setting (I believe some tests currently read it on their own). Then (as a 
separate issue) we can change all tests that don't check the parameter to 
print only if VERBOSE is true.

I will post a patch soon.
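
A minimal sketch of the idea described above, not the actual patch: a shared
base class reads the tests.verbose system property once and exposes it as a
constant. The class name VerboseTestBase and the log helper are hypothetical;
only the property name and the VERBOSE constant come from the issue text.

```java
import java.io.PrintStream;

// Hypothetical stand-in for a shared test base class; not the LUCENE-2313 patch.
class VerboseTestBase {
  /** True only when the JVM was started with -Dtests.verbose=true. */
  public static final boolean VERBOSE = Boolean.getBoolean("tests.verbose");

  /** Prints the message only when verbose output was requested. */
  static void log(PrintStream out, String message) {
    if (VERBOSE) {
      out.println(message);
    }
  }

  public static void main(String[] args) {
    // Silent by default; run with -Dtests.verbose=true to see the line.
    log(System.out, "term statistics: ...");
  }
}
```

Tests would then guard their diagnostic output with the shared constant
instead of each reading the system property on their own.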

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2313) Add VERBOSE to LuceneTestCase and LuceneTestCaseJ4

2010-03-13 Thread Shai Erera (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shai Erera updated LUCENE-2313:
---

Attachment: LUCENE-2313.patch

Adds VERBOSE to LuceneTestCase and LuceneTestCaseJ4, and changes 
TestQualityRun (contrib/benchmark) to use it. I didn't find any other tests 
that check that property directly.





[jira] Commented: (LUCENE-2313) Add VERBOSE to LuceneTestCase and LuceneTestCaseJ4

2010-03-13 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844832#action_12844832
 ] 

Uwe Schindler commented on LUCENE-2313:
---

Looks good!

(I did not even know about this property, but we can add this VERBOSE check to 
more tests, too. The first that come to mind are NumericRange, Highlighter, 
Spatial)





[jira] Assigned: (LUCENE-2313) Add VERBOSE to LuceneTestCase and LuceneTestCaseJ4

2010-03-13 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler reassigned LUCENE-2313:
-

Assignee: Uwe Schindler





RE: Different behavior of Directory.fieldLength()

2010-03-13 Thread Uwe Schindler
That is not true. The API says:
"Creates a new File *instance* from a parent pathname string and a child 
pathname string."

Please note "instance": it will never create the file on disk. new File() 
just creates a File instance, not a file on disk. You can check this with a 
simple test.
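
Such a test could look like the following sketch. The class and method names
are made up for illustration; the only behavior relied on is that of
java.io.File itself, whose length() is documented to return 0L when the file
does not exist.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;

final class FileInstanceCheck {
  /** Creates an empty scratch directory for the check. */
  static File newTempDir() {
    try {
      return Files.createTempDirectory("file-instance-check").toFile();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  /** Builds a File for dir/name and reports whether anything exists on disk. */
  static boolean existsAfterConstruction(File dir, String name) {
    File file = new File(dir, name); // only an abstract pathname
    return file.exists();
  }

  /** Length reported for dir/name; 0L if no such file exists. */
  static long reportedLength(File dir, String name) {
    return new File(dir, name).length();
  }

  public static void main(String[] args) {
    File dir = newTempDir();
    System.out.println("exists: " + existsAfterConstruction(dir, "_0.fdx"));
    System.out.println("length: " + reportedLength(dir, "_0.fdx"));
    System.out.println("files in dir: " + dir.list().length);
    dir.delete();
  }
}
```

The directory stays empty after the constructor call, and the missing file
reports a length of 0 rather than raising an exception.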

Uwe

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de


> -Original Message-
> From: Marcelo Ochoa [mailto:marcelo.oc...@gmail.com]
> Sent: Saturday, March 13, 2010 12:25 AM
> To: java-dev@lucene.apache.org
> Subject: Different behavior of Directory.fieldLength()
> 
> Hi:
>   During some tests of Lucene Domain Index
> (http://docs.google.com/View?id=ddgw7sjp_54fgj9kg) with big data
> sources we found an exception caused by calling the
> Directory.fileLength() method on a non-existent file.
>   FSDirectory implements this method as:
>   /** Returns the length in bytes of a file in the directory. */
>   public long fileLength(String name) {
>     ensureOpen();
>     File file = new File(directory, name);
>     return file.length();
>   }
> 
>   According to JDK 1.5, calling the File constructor creates a file
> without throwing an exception:
> http://java.sun.com/j2se/1.5.0/docs/api/java/io/File.html#File(java.lang.String, java.lang.String)
>   But neither RAMDirectory nor OJVMDirectory does this:
> RAMDirectory:
>   /** Returns the length in bytes of a file in the directory.
>    * @throws IOException if the file does not exist
>    */
>   public final long fileLength(String name) throws IOException {
>     ensureOpen();
>     RAMFile file;
>     synchronized (this) {
>       file = (RAMFile)fileMap.get(name);
>     }
>     if (file==null)
>       throw new FileNotFoundException(name);
>     return file.getLength();
>   }
> 
>   If OJVMDirectory throws an exception when a file doesn't exist, the
> IndexWriter fails to do its job. Here is the stack trace:
> IW 3 [Root Thread]: DW:   RAM: now flush @ usedMB=15.001
> allocMB=15.001 deletesMB=0 triggerMB=15
> IW 3 [Root Thread]:   flush: segment=_0 docStoreSegment=_0
> docStoreOffset=0 flushDocs=true flushDeletes=false
> flushDocStores=false numDocs=109169 numBufDelTerms=0
> IW 3 [Root Thread]:   index before flush
> IW 3 [Root Thread]: DW: flush postings as segment _0 numDocs=109169
> *** 2010-03-11 17:27:15.696
> IW 3 [Root Thread]: DW: docWriter: now abort
> IW 3 [Root Thread]: hit exception flushing segment _0
> IFD [Root Thread]: refresh [prefix=_0]: removing newly created
> unreferenced file "_0.tii"
> IFD [Root Thread]: delete "_0.tii"
> IFD [Root Thread]: refresh [prefix=_0]: removing newly created
> unreferenced file "_0.fnm"
> IFD [Root Thread]: delete "_0.fnm"
> IFD [Root Thread]: refresh [prefix=_0]: removing newly created
> unreferenced file "_0.fdx"
> IFD [Root Thread]: delete "_0.fdx"
> IFD [Root Thread]: refresh [prefix=_0]: removing newly created
> unreferenced file "_0.fdt"
> IFD [Root Thread]: delete "_0.fdt"
> IFD [Root Thread]: refresh [prefix=_0]: removing newly created
> unreferenced file "_0.prx"
> IFD [Root Thread]: delete "_0.prx"
> IFD [Root Thread]: refresh [prefix=_0]: removing newly created
> unreferenced file "_0.nrm"
> IFD [Root Thread]: delete "_0.nrm"
> IFD [Root Thread]: refresh [prefix=_0]: removing newly created
> unreferenced file "_0.frq"
> IFD [Root Thread]: delete "_0.frq"
> IFD [Root Thread]: refresh [prefix=_0]: removing newly created
> unreferenced file "_0.tis"
> IFD [Root Thread]: delete "_0.tis"
> Mar 11, 2010 5:27:15 PM org.apache.lucene.indexer.LuceneDomainIndex
> ODCIIndexCreate
> SEVERE: failed to create index: cannot verify file: _0.fdx. Reason:
> Exhausted Resultset
> Mar 11, 2010 5:27:15 PM org.apache.lucene.indexer.LuceneDomainIndex
> ODCIIndexCreate
> FINER: THROW
> java.io.IOException: cannot verify file: _0.fdx. Reason: Exhausted Resultset
>     at org.apache.lucene.store.OJVMDirectory.fileLength(OJVMDirectory.java:633)
>     at org.apache.lucene.index.SegmentInfo.sizeInBytes(SegmentInfo.java:271)
>     at org.apache.lucene.index.DocumentsWriter.flush(DocumentsWriter.java:593)
>     at org.apache.lucene.index.IndexWriter.doFlushInternal(IndexWriter.java:4311)
>     at org.apache.lucene.index.IndexWriter.doFlush(IndexWriter.java:4209)
>     at org.apache.lucene.index.IndexWriter.flush(IndexWriter.java:4200)
>     at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:2497)
>     at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:2451)
>     at org.apache.lucene.indexer.TableIndexer.index(TableIndexer.java:374)
>     at org.apache.lucene.indexer.LuceneDomainIndex.ODCIIndexCreate(LuceneDomainIndex.java:568)
> IW 3 [Root Thread]: now flush at close
> IW 3 [Root Thread]:   flush: segment=null docStoreSegment=null
> docStoreOffset=0 flushDocs=false flushDeletes=true
> flushDocStores=false numDocs=0 numBufDelTerms=0
> IW 3 [Root Thread]:   index before

[jira] Commented: (LUCENE-2313) Add VERBOSE to LuceneTestCase and LuceneTestCaseJ4

2010-03-13 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844844#action_12844844
 ] 

Michael McCandless commented on LUCENE-2313:


This is great





[jira] Commented: (LUCENE-2293) IndexWriter has hard limit on max concurrency

2010-03-13 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844848#action_12844848
 ] 

Michael McCandless commented on LUCENE-2293:


I think this issue has these steps:

  * Allow the 5 to be changed (trivial first step) -- I'll do this
after LUCENE-2294 is in

  * Change the approach for how we buffer in RAM to a more isolated
approach, whereby IW has N fully independent RAM segments
in-process and when a doc needs to be indexed it's added to one of
them.  Each segment would also write its own doc stores and
"normal" segment merging (not the inefficient merge we now do on
flush) would merge them.  This should be a good simplification in
the chain (eg maybe we can remove the *PerThread classes).  The
segments can flush independently, letting us make much better
concurrent use of IO & CPU.

  * Enable NRT readers to directly search these RAM segments.  This
entails recording deletes on the RAM segments as an int[].  We
need to solve the Term sorting issue... (b-tree, or, simply
sort-on-demand the first time a query needs it, though that cost
increases the larger your RAM segments get, ie, not incremental to
the # docs you just added).  Also, we have to solve what happens
to a reader using a RAM segment that's been flushed.  Perhaps we
don't reuse RAM at that point, ie, rely on GC to reclaim once all
readers using that RAM segment have closed.  We should do this
part under a separate issue (LUCENE-2312).


> IndexWriter has hard limit on max concurrency
> -
>
> Key: LUCENE-2293
> URL: https://issues.apache.org/jira/browse/LUCENE-2293
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Index
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Fix For: 3.1
>
>
> DocumentsWriter has this nasty hardwired constant:
> {code}
> private final static int MAX_THREAD_STATE = 5;
> {code}
> which probably I should have attached a //nocommit to the moment I
> wrote it ;)
> That constant sets the max number of thread states to 5.  This means,
> if more than 5 threads enter IndexWriter at once, they will "share"
> only 5 thread states, meaning we gate CPU concurrency to 5 running
> threads inside IW (each thread must first wait for the last thread to
> finish using the thread state before grabbing it).
> This is bad because modern hardware can make use of more than 5
> threads.  So I think an immediate fix is to make this settable
> (expert), and increase the default (8?).
> It's tricky, though, because the more thread states, the less RAM
> efficiency you have, meaning the worse indexing throughput.  So you
> shouldn't up and set this to 50: you'll be flushing too often.
> But... I think a better fix is to re-think how threads write state
> into DocumentsWriter.  Today, a single docID stream is assigned across
> threads (eg one thread gets docID=0, next one docID=1, etc.), and each
> thread writes to a private RAM buffer (living in the thread state),
> and then on flush we do a merge sort.  The merge sort is inefficient
> (does not currently use a PQ)... and, wasteful because we must
> re-decode every posting byte.
> I think we could change this, so that threads write to private RAM
> buffers, with a private docID stream, but then instead of merging on
> flush, we directly flush each thread as its own segment (and, allocate
> private docIDs to each thread).  We can then leave merging to CMS
> which can already run merges in the BG without blocking ongoing
> indexing (unlike the merge we do in flush, today).
> This would also allow us to separately flush thread states.  Ie, we
> need not flush all thread states at once -- we can flush one when it
> gets too big, and then let the others keep running.  This should be a
> good concurrency gain since it uses IO & CPU resources "throughout"
> indexing instead of "big burst of CPU only" then "big burst of IO
> only" that we have today (flush today "stops the world").
> One downside I can think of is... docIDs would now be "less
> monotonic", meaning if N threads are indexing, you'll roughly get
> in-time-order assignment of docIDs.  But with this change, all of one
> thread state would get 0..N docIDs, the next thread state'd get
> N+1...M docIDs, etc.  However, a single thread would still get
> monotonic assignment of docIDs.
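
The docID trade-off described in the quoted issue can be illustrated with a
small sketch. It assumes each thread state is flushed as its own segment and
that segments are laid out one after another; the class and method names are
hypothetical and are not Lucene APIs.

```java
final class DocIdRangesSketch {
  /**
   * Given how many docs each thread state buffered, returns the first
   * docID each privately flushed segment would receive.
   */
  static int[] docBases(int[] docsPerThreadState) {
    int[] bases = new int[docsPerThreadState.length];
    int next = 0;
    for (int i = 0; i < docsPerThreadState.length; i++) {
      bases[i] = next;               // this segment starts where the last ended
      next += docsPerThreadState[i]; // its docs occupy one contiguous range
    }
    return bases;
  }

  public static void main(String[] args) {
    // Three thread states holding 3, 2 and 4 docs get bases 0, 3 and 5.
    for (int base : docBases(new int[] {3, 2, 4})) {
      System.out.println(base);
    }
  }
}
```

Within one thread state docIDs still increase in arrival order, but across
thread states the ranges no longer interleave by time.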




Re: Baby steps towards making Lucene's scoring more flexible...

2010-03-13 Thread Michael McCandless
On Thu, Mar 11, 2010 at 12:35 PM, Marvin Humphrey
 wrote:
> On Mon, Mar 08, 2010 at 02:10:35PM -0500, Michael McCandless wrote:
>
>> We ask it to give us a Codec.
>
> There's a conflict between the segment-wide role of the "Codec" class and its
> role as specifier for posting format.
>
> In some sense, you could argue that the "codec" reads/writes the entire index
> segment -- which includes not only postings files, but also stored fields,
> term vectors, etc.  However, the compression algorithms after which these
> codecs are named have nothing to do with those other files.  PFORCodec isn't
> relevant to stored fields.
>
> I'd argue for limiting the role of "Codec" to encoding and decoding posting
> files.

Yeah perhaps we should rename Codec -> PostingsCodec.  And with time
add different interfaces for the other components of a segment (eg
StoredFieldsCodec).

> As far as modularizing other aspects of index reading and writing, I don't
> think a simple factory is the way to go.  I favor using a composite design
> pattern for SegWriter and SegReader (rather than subclassing), and an
> initialization phase controlled by an Architecture object.
>
> It was Earwin Burrfoot who persuaded me of the merits of a user-defined
> initialization phase over a user-defined factory method:
> .

How would this work specifically for postings reading & writing?

When a segment is opened (eg via IndexReader.open/reopen,
IndexWriter.getReader), we need to fully init all components before
returning control.

>> So far my fav is still CodecProvider ;)
>
> It seems that the primary reason this object is needed is that IndexReader
> needs to be able to find the right decoder when it encounters an unfamiliar
> codec name.  Since the core doesn't know about user-created codecs, it's
> necessary for the user to register the name => codec pairing in advance so
> that core can find it.
>
> If that's this object's main role, I'd suggest "CodecRegistry".

Well, it also provides a writer for newly created segments...
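
To make the two roles concrete, here is a rough sketch of an object that both
resolves codec names for readers and hands out a codec for new segments. All
names here (CodecRegistrySketch, PostingsCodec, forNewSegment, lookup) are
invented for illustration and do not reflect any existing Lucene class.

```java
import java.util.HashMap;
import java.util.Map;

final class CodecRegistrySketch {
  /** Hypothetical codec handle: just a name, no read/write logic. */
  interface PostingsCodec {
    String name();
  }

  private final Map<String, PostingsCodec> byName = new HashMap<>();
  private final PostingsCodec writerDefault;

  CodecRegistrySketch(PostingsCodec writerDefault) {
    this.writerDefault = writerDefault;
    register(writerDefault);
  }

  /** Users register their codecs in advance so readers can find them. */
  void register(PostingsCodec codec) {
    byName.put(codec.name(), codec);
  }

  /** Writer side: which codec newly created segments should use. */
  PostingsCodec forNewSegment() {
    return writerDefault;
  }

  /** Reader side: resolve the codec name recorded in a segment. */
  PostingsCodec lookup(String name) {
    PostingsCodec codec = byName.get(name);
    if (codec == null) {
      throw new IllegalArgumentException("unknown codec: " + name);
    }
    return codec;
  }
}
```

The lookup half is what "registry" describes; the forNewSegment half is the
extra responsibility that the name "provider" tries to cover.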

>> Naming is the hardest part!!
>
> For me, the hardest parts of API design are...
>
>  A) Designing public abstract classes / interfaces.
>  B) Compensating for the curse of knowledge.

Yes both of these are hard.

Mike




[jira] Resolved: (LUCENE-2313) Add VERBOSE to LuceneTestCase and LuceneTestCaseJ4

2010-03-13 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler resolved LUCENE-2313.
---

Resolution: Fixed

Committed revision: 922525

I only changed the protected to public, to enable helper classes outside util 
to access the setting.





[jira] Commented: (LUCENE-2313) Add VERBOSE to LuceneTestCase and LuceneTestCaseJ4

2010-03-13 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844859#action_12844859
 ] 

Uwe Schindler commented on LUCENE-2313:
---

As a first test with reduced verbosity, see revision 922528 (the NumericRange 
tests no longer print the term statistics by default).





Re: Different behavior of Directory.fieldLength()

2010-03-13 Thread Marcelo Ochoa
Uwe:
> That is not true, the API says:
> "Creates a new File *instance* from a parent pathname string and a child 
> pathname string."
>
> Please note "instance", so it will never create the file on disk. New File() 
> just creates a file instance but no file on disk. You can check this with a 
> simple test.
  OK, but what about the exception?
  If creating the File instance does not throw an exception and
File.length() returns 0 when a file does not exist, then
RAMDirectory and the other classes that override this method should
be modified to return 0 as well.
  Best regards, Marcelo.
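
The inconsistency can be sketched side by side. This is a simplified
illustration, not the Lucene code: a plain Map stands in for RAMDirectory's
file map, and both method names are hypothetical.

```java
import java.io.File;
import java.io.FileNotFoundException;
import java.util.Map;

final class FileLengthContracts {
  /** File-system style: a missing file silently reports length 0. */
  static long fsStyleLength(File directory, String name) {
    return new File(directory, name).length();
  }

  /** In-memory style: a missing file is reported as an error. */
  static long ramStyleLength(Map<String, byte[]> files, String name)
      throws FileNotFoundException {
    byte[] data = files.get(name);
    if (data == null) {
      throw new FileNotFoundException(name);
    }
    return data.length;
  }
}
```

A caller written against the first behavior, which treats 0 as "nothing
there yet", fails when it is handed an implementation with the second.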

-- 
Marcelo F. Ochoa
http://marceloochoa.blogspot.com/
http://mochoa.sites.exa.unicen.edu.ar/
__
Want to integrate Lucene and Oracle?
http://marceloochoa.blogspot.com/2007/09/running-lucene-inside-your-oracle-jvm.html
Is Oracle 11g REST ready?
http://marceloochoa.blogspot.com/2008/02/is-oracle-11g-rest-ready.html




[jira] Created: (LUCENE-2314) Add AttributeSource.copyTo(AttributeSource)

2010-03-13 Thread Uwe Schindler (JIRA)
Add AttributeSource.copyTo(AttributeSource)
---

 Key: LUCENE-2314
 URL: https://issues.apache.org/jira/browse/LUCENE-2314
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Uwe Schindler
Assignee: Uwe Schindler
Priority: Minor
 Fix For: 3.1


One problem with AttributeSource at the moment is the missing "insight" into 
AttributeSource.State. If you want to create TokenStreams that inspect captured 
states, you have no chance. Making the contents of State public is a bad idea, 
as it does not help for inspecting (it's a linked list, so you have to iterate).

AttributeSource currently contains a cloneAttributes() call, which returns a 
new AttributeSource with all current attributes cloned. This is the (more 
expensive) captureState. The problem is that you cannot copy back the cloned AS 
(which is the restoreState). To use this behaviour (by the way, ShingleMatrix 
can use it), one can alternatively use cloneAttributes and copyTo. You can 
easily change the cloned attributes and store them in lists and copy them back. 
The only problem is lower performance of these calls (as State is a very 
optimized class).

One use case could be:
{code}
AttributeSource state = cloneAttributes();
//  do something ...
state.getAttribute(TermAttribute.class).setTermBuffer(foobar);
// ... more work
state.copyTo(this);
{code}
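
For readers without the Lucene sources at hand, the clone/modify/copy-back
cycle can be sketched with a small stand-in class. MiniAttributeSource is
hypothetical and stores plain string buffers keyed by name; it only mimics
the proposed semantics, under the assumption that the copyTo target must
already contain every attribute being copied.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class MiniAttributeSource {
  private final Map<String, StringBuilder> attributes = new LinkedHashMap<>();

  void addAttribute(String name, String value) {
    attributes.put(name, new StringBuilder(value));
  }

  StringBuilder getAttribute(String name) {
    StringBuilder value = attributes.get(name);
    if (value == null) {
      throw new IllegalArgumentException("no attribute: " + name);
    }
    return value;
  }

  /** Deep copy: edits to the clone do not touch this instance. */
  MiniAttributeSource cloneAttributes() {
    MiniAttributeSource copy = new MiniAttributeSource();
    for (Map.Entry<String, StringBuilder> e : attributes.entrySet()) {
      copy.addAttribute(e.getKey(), e.getValue().toString());
    }
    return copy;
  }

  /** Overwrites the target's values with this instance's values. */
  void copyTo(MiniAttributeSource target) {
    for (Map.Entry<String, StringBuilder> e : attributes.entrySet()) {
      StringBuilder dest = target.getAttribute(e.getKey());
      dest.setLength(0);
      dest.append(e.getValue());
    }
  }

  public static void main(String[] args) {
    MiniAttributeSource stream = new MiniAttributeSource();
    stream.addAttribute("term", "foo");
    MiniAttributeSource state = stream.cloneAttributes();
    state.getAttribute("term").append("bar"); // inspect or change the clone
    state.copyTo(stream);                     // then copy it back
    System.out.println(stream.getAttribute("term")); // prints "foobar"
  }
}
```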




[jira] Updated: (LUCENE-2314) Add AttributeSource.copyTo(AttributeSource)

2010-03-13 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler updated LUCENE-2314:
--

Attachment: LUCENE-2314.patch

Here is the patch.





[jira] Created: (LUCENE-2315) AttributeSource's methods for accessing attributes should be final, else its easy to corrupt the internal states

2010-03-13 Thread Uwe Schindler (JIRA)
AttributeSource's methods for accessing attributes should be final, else its 
easy to corrupt the internal states


 Key: LUCENE-2315
 URL: https://issues.apache.org/jira/browse/LUCENE-2315
 Project: Lucene - Java
  Issue Type: Bug
Reporter: Uwe Schindler
Priority: Minor







Re: Baby steps towards making Lucene's scoring more flexible...

2010-03-13 Thread Michael McCandless
On Fri, Mar 12, 2010 at 8:31 PM, Marvin Humphrey  wrote:
> On Thu, Mar 11, 2010 at 05:59:03AM -0500, Michael McCandless wrote:
>> > So there would be polymorphism in the decoding phase while we're supplying
>> > information the Similarity object needs to make its similarity judgments.
>> > However, that polymorphism would be handled internally -- it wouldn't be the
>> > responsibility of the user to determine whether a codec supported a particular
>> > scoring model.
>>
>> Is that "yes" (a user can do MatchOnlySim at search time if the field
>> were indexed with B25Sim)?
>
> In essence, yes.  Technically, no.
>
> Under the covers, doc-id-only postings iteration probably wouldn't be
> implemented by spawning a doc-id-only Similarity object.  It would probably be
> something more like, ask the Similarity for a PostingDecoder with no extra
> attributes.  And then docID-freq-boost postings iteration might be achieved by
> asking the Similarity for a PostingDecoder with TermFreq and DocBoost
> attributes.

Hmm ok so the Sim impls will expose postings with and w/o these attrs.
So then if the postings can't support TermFreq/Boost attrs, it'll
return some sort of error indicating this field can't support scoring?

>> How will Lucy "know" which switchups (Sim at indexing vs Sim at
>> searching) are "OK"...
>
> I think the theme is that each Similarity class will have a whitelist of
> supported posting iteration configurations.  So long as the requested config
> is in the whitelist, you get an iterator back -- otherwise, you get NULL.
>
> Exactly what form the request specification would take, that's up in the air.
> But it would be an implementation detail for now.  So long as the file format
> supports the data, we can build an iterator that reads it, regardless of
> encoding.

OK.

I think that white list is a postings thing, not a sim thing :)  The
index either is or isn't able to provide a postings iterator over the
requested attrs, and that means you can or cannot use the Sims requiring
those attrs.  Forcing the indirection through Sim (where Sim tells you you
cannot pull this particular postings) doesn't seem right...

It seems like we can actually do this quite cleanly if everything were
an attr (or at least referenced by an attr at read time).  Ie I make
an array of attrs and ask the index if it can give me those attrs.

[DocIdAttr] would be requested for match only.

[DocIdAttr,PositionsAttr] would be requested for match only of a
positional query (eg phrase query).

[DocIdAttr,TermDocFreqAttr] would be requested for a scoring
non-positional query.

[DocIdAttr,TermDocFreqAttr,PositionsAttr] would be requested for a
scoring positional query.

And one could stick in their custom attrs, too.

Then, any Sim impl can be created @ search time, and it asks the
reader for whatever attrs it needs.  If it gets NULL back that means
it's a non-starter -- and you throw an exception (or, silently pretend
nothing matched).
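
A rough sketch of that request/answer protocol, with made-up names throughout
(PostingsAttrSketch, Attr, Postings); it only models "ask for a set of attrs,
get an iterator or null" and reads no real postings.

```java
import java.util.EnumSet;

final class PostingsAttrSketch {
  enum Attr { DOC_ID, TERM_DOC_FREQ, POSITIONS }

  /** Hypothetical iterator handle; only records which attrs it exposes. */
  static final class Postings {
    final EnumSet<Attr> attrs;

    Postings(EnumSet<Attr> attrs) {
      this.attrs = attrs;
    }
  }

  /** What this field actually stored at indexing time. */
  private final EnumSet<Attr> indexed;

  PostingsAttrSketch(EnumSet<Attr> indexed) {
    this.indexed = EnumSet.copyOf(indexed);
  }

  /** Returns postings exposing the requested attrs, or null if any is missing. */
  Postings postings(EnumSet<Attr> requested) {
    if (!indexed.containsAll(requested)) {
      return null;
    }
    return new Postings(EnumSet.copyOf(requested));
  }
}
```

A field indexed without positions would answer a [DOC_ID, TERM_DOC_FREQ]
request but return null for [DOC_ID, POSITIONS], and the caller then decides
whether to throw or to match nothing.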

>> >> Yeah so, I don't like that in Lucene you call "Field.setOmitTFAP"
>> >> instead of saying "Field.matchOnly" (or something).  So I do agree
>> >> that it'd be better if the API made it clear what the *search* time
>> >> impact is of using this advanced Field API.
>> >
>> > In my opinion, it makes sense to communicate "match only" by way of the
>> > Similarity object as opposed to a boolean.  I think it's a good way to
>> > introduce the Similarity class and get people comfortable with it, and I also
>> > think that it's good to keep stuff out of the FieldType API when we can.
>>
>> But say we want to also allow storing tf but not positions, because
>> really the two choices should not be coupled (as they are today with
>> Lucene's omitTFAP).
>>
>> So I have omitTF and omitP (only 3 combos are allowed -- must omitP if
>> you omitTF).
>>
>> What Sim do you call that at indexing time?
>
> Well, those are pretty esoteric posting formats.  It's common to not need
> scores and therefore not need boost bytes (the Lucene omitNorms case).  It's
> also common to not need any matching info beyond doc id (the Lucene omitTFAP
> case).  But omitTF and omitP aren't common needs, or Lucene would have them by
> now, right?

I think it's a compelling use-case.  Ie, allow for proper scoring
of non-positional queries.

> And since they are infrequently used, Huffman-driven naming philosophy
> suggests that they should have long, low-value names: OmitPositionsSimilarity,
> OmitTFandPositionsSimilarity (or OmitTFAPSimilarity, which would actually be
> an accurate abbreviation in this scenario as opposed to the current Lucene
> omitTFAP).

Just minus the Similarity part ;) I still don't think similarity
should have any bearing during indexing.

> In other words, I don't much care what those are named because they aren't
> likely to be used except by people who A) have very, very specific use cases
> and B) really know what they're doing.
>
> In contrast, I think it's important that we come up with good nam

[jira] Updated: (LUCENE-2315) AttributeSource's methods for accessing attributes should be final, else its easy to corrupt the internal states

2010-03-13 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler updated LUCENE-2315:
--

  Description: 
The methods that operate on and modify the internal maps of AttributeSource should
be final, which is a backwards break. But anybody who overrides such methods
simply creates a buggy AS in either case.

I want to make all impls final (in general the class itself should be final, but
it is made for extension in TokenStream). So it's important that the
implementations are final!
Affects Version/s: 2.9
   2.9.1
   2.9.2
   3.0
   3.0.1
Fix Version/s: 3.1

> AttributeSource's methods for accessing attributes should be final, else its 
> easy to corrupt the internal states
> 
>
> Key: LUCENE-2315
> URL: https://issues.apache.org/jira/browse/LUCENE-2315
> Project: Lucene - Java
>  Issue Type: Bug
>Affects Versions: 2.9, 2.9.1, 2.9.2, 3.0, 3.0.1
>Reporter: Uwe Schindler
>Priority: Minor
> Fix For: 3.1
>
>

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2313) Add VERBOSE to LuceneTestCase and LuceneTestCaseJ4

2010-03-13 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844864#action_12844864
 ] 

Shai Erera commented on LUCENE-2313:


bq. I only changed the protected to public, to enable helper classes outside 
util to access the setting.

Makes sense Uwe - Thanks ! I went for protected to encourage tests to extend 
either of the two. Helper classes however are different indeed :).

> Add VERBOSE to LuceneTestCase and LuceneTestCaseJ4
> --
>
> Key: LUCENE-2313
> URL: https://issues.apache.org/jira/browse/LUCENE-2313
> Project: Lucene - Java
>  Issue Type: Improvement
>Reporter: Shai Erera
>Assignee: Uwe Schindler
>Priority: Minor
> Fix For: 3.1
>
> Attachments: LUCENE-2313.patch
>
>
> component-build.xml allows to define tests.verbose as a system property when 
> running tests. Both LuceneTestCase and LuceneTestCaseJ4 don't read that 
> property. It will be useful for overriding tests to access one place for this 
> setting (I believe currently some tests do it on their own). Then (as a 
> separate issue) we can move all tests that don't check the parameter to only 
> print if VERBOSE is true.
> I will post a patch soon.




Re: Different behavior of Directory.fieldLength()

2010-03-13 Thread Shai Erera
I think it falls under the semantics of dir.fileLength() and not the
semantics of the implementation, right? Unfortunately, the semantics of
Directory.fileLength() are not specified, which made it easy for extensions
to invent their own.

I myself am not sure what's better - return 0 as the length for a file that
does not exist, or throw FNFE to alert the caller that he's querying a file
that does not exist. FSDirectory got away with it by using File API which
just happens to return 0 for a non-existing file. RAMDirectory chose to
alert the caller. My feeling is that these two were written by two different
people, or at separate times, each understanding the method differently.

I think we should make the semantics clear, and declare a better contract,
by documentation and possibly also by method signature. If for example we
decide that it should return 0 for non-existing files, then I think we can
remove the IOException from the method sig? But maybe we want to allow
IOException to be thrown by Directories that could actually fail on probing
the file length.

I would propose to declare the semantics of fileLength like this:
* Returns the length of the file denoted by name if the file
exists. The return value may be anywhere between 0 and Long.MAX_VALUE.
* Throws FileNotFoundException if the file does not exist. Note that you can
call dir.fileExists(name) if you are not sure whether the file exists or
not.

That way it's clear. We can then change IW code to call fileExists if it
expects to fail on either of the two.

Question is - how do we do this w/o breaking Directory implementations out
there? I think that we might be safe with it, if we make sure all of IW code
queries fileExists before. However if someone relies on FSDir to return 0
instead of throwing exception, that will break his app.

Backwards compatibility is always tricky. This does not result in a compilation
error, but in a runtime change. We might be able to get away with it if we think
users run some tests before they deploy a new Lucene .jar ... but otherwise, we
should create a new method, w/ clear semantics? Something like:

/**
 * @deprecated the method will become abstract when #fileLength(name) has
 * been removed.
 */
public long getFileLength(String name) throws IOException {
  long len = fileLength(name);
  if (len == 0 && !fileExists(name)) {
    throw new FileNotFoundException(name);
  }
  return len;
}

The first line just calls the current impl. If it throws exception for a
non-existing file, we're ok. The second line verifies whether a 0 length is
for an existing file or not and throws an exception appropriately.
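
A minimal sketch of how that wrapper behaves against both styles of implementation. ToyDirectory and its two subclasses are hypothetical stand-ins that only mimic the FSDirectory and RAMDirectory behaviors described in this thread; they are not the real Lucene classes:

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

/** Toy directory: a map from file name to length (an assumption for illustration). */
abstract class ToyDirectory {
  final Map<String, Long> files = new HashMap<String, Long>();

  boolean fileExists(String name) {
    return files.containsKey(name);
  }

  /** Legacy method: what happens for a missing file is up to the subclass. */
  abstract long fileLength(String name) throws IOException;

  /** Proposed method: a missing file always yields FileNotFoundException. */
  long getFileLength(String name) throws IOException {
    long len = fileLength(name);
    if (len == 0 && !fileExists(name)) {
      throw new FileNotFoundException(name);
    }
    return len;
  }
}

/** Mimics the behavior attributed to FSDirectory: 0 for a missing file. */
class ZeroForMissingDirectory extends ToyDirectory {
  @Override
  long fileLength(String name) {
    Long len = files.get(name);
    return len == null ? 0L : len;
  }
}

/** Mimics the behavior attributed to RAMDirectory: throws for a missing file. */
class ThrowForMissingDirectory extends ToyDirectory {
  @Override
  long fileLength(String name) throws IOException {
    Long len = files.get(name);
    if (len == null) {
      throw new FileNotFoundException(name);
    }
    return len;
  }
}

class FileLengthDemo {
  /** Returns the length as text, or "FNFE" if the file is reported missing. */
  static String probe(ToyDirectory dir, String name) {
    try {
      return Long.toString(dir.getFileLength(name));
    } catch (FileNotFoundException e) {
      return "FNFE";
    } catch (IOException e) {
      return "IOException";
    }
  }

  public static void main(String[] args) {
    ToyDirectory fsLike = new ZeroForMissingDirectory();
    fsLike.files.put("empty", 0L);
    ToyDirectory ramLike = new ThrowForMissingDirectory();

    System.out.println(probe(fsLike, "empty"));    // existing zero-length file
    System.out.println(probe(fsLike, "missing"));  // missing file, legacy impl returns 0
    System.out.println(probe(ramLike, "missing")); // missing file, legacy impl throws
  }
}
```

Both toy directories report a missing file as FNFE through getFileLength, while an existing zero-length file still returns 0; fileExists is only consulted on the 0-length path.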

That is of course only if everybody else agrees w/ these semantics.

Shai

On Sat, Mar 13, 2010 at 1:21 PM, Marcelo Ochoa wrote:

> Uwe:
> > That is not true, the API says:
> > "Creates a new File *instance* from a parent pathname string and a child
> > pathname string."
> >
> > Please note "instance", so it will never create the file on disk. New
> > File() just creates a file instance but no file on disk. You can check this
> > with a simple test.
>   OK, but what about the exception?
>  If the creation of the File instance does not throw an exception and
> the method File.length() returns 0 if a file does not exist,
> RAMDirectory and other classes which also override this method should
> be modified to return 0.
>   Best regards, Marcelo.
>
> --
> Marcelo F. Ochoa
> http://marceloochoa.blogspot.com/
> http://mochoa.sites.exa.unicen.edu.ar/
> __
> Want to integrate Lucene and Oracle?
>
> http://marceloochoa.blogspot.com/2007/09/running-lucene-inside-your-oracle-jvm.html
> Is Oracle 11g REST ready?
> http://marceloochoa.blogspot.com/2008/02/is-oracle-11g-rest-ready.html


[jira] Commented: (LUCENE-2314) Add AttributeSource.copyTo(AttributeSource)

2010-03-13 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844868#action_12844868
 ] 

Shai Erera commented on LUCENE-2314:


Minor comment - in copyTo, can you put state.attribute.getClass() in the 
message of the thrown exception, so whoever encounters it will know what's the 
invalid attribute?

On a more general note, can State implement Iterable?

> Add AttributeSource.copyTo(AttributeSource)
> ---
>
> Key: LUCENE-2314
> URL: https://issues.apache.org/jira/browse/LUCENE-2314
> Project: Lucene - Java
>  Issue Type: Improvement
>Reporter: Uwe Schindler
>Assignee: Uwe Schindler
>Priority: Minor
> Fix For: 3.1
>
> Attachments: LUCENE-2314.patch
>
>
> One problem with AttributeSource at the moment is the missing "insight" into
> AttributeSource.State. If you want to create TokenStreams that inspect
> captured states, you have no chance. Making the contents of State public is a
> bad idea, as it does not help for inspecting (it's a linked list, so you have
> to iterate).
> AttributeSource currently contains a cloneAttributes() call, which returns a
> new AttributeSource with all current attributes cloned. This is the (more
> expensive) captureState. The problem is that you cannot copy back the cloned
> AS (which is the restoreState). To use this behaviour (by the way,
> ShingleMatrix can use it), one can alternatively use cloneAttributes and
> copyTo. You can easily change the cloned attributes and store them in lists
> and copy them back. The only problem is lower performance of these calls (as
> State is a very optimized class).
> One use case could be:
> {code}
> AttributeSource state = cloneAttributes();
> //  do something ...
> state.getAttribute(TermAttribute.class).setTermBuffer(foobar);
> // ... more work
> state.copyTo(this);
> {code}




[jira] Commented: (LUCENE-2315) AttributeSource's methods for accessing attributes should be final, else its easy to corrupt the internal states

2010-03-13 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844869#action_12844869
 ] 

Shai Erera commented on LUCENE-2315:


bq. in general the class should be final at all

How can AttributeSource be final? We want people to develop their own
AttributeSources, no? Can you please list the methods that you want to make
final? I want to check that none of our AttributeSources override them.




[jira] Commented: (LUCENE-2315) AttributeSource's methods for accessing attributes should be final, else its easy to corrupt the internal states

2010-03-13 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844870#action_12844870
 ] 

Uwe Schindler commented on LUCENE-2315:
---

bq. How can AttributeSource be final?

This was just a comment about the class, but it's not possible because it is
extended by TokenStreams and similar classes - but the implementation of the
methods should not be alterable. So *all* methods should be final, at least all
methods that access/modify the private maps.

A correct plan for "own implementations of AttributeSource" would be to create
an abstract AttributeSource base class that defines the behaviour, with all impls
in the current AttributeSource being final, because there may be other
implementations that work without maps or have a hardcoded number of attributes
with optimized implementations.
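
A minimal sketch of the idea of final accessors guarding a private map. MiniAttributeSource and MiniTokenStream are hypothetical miniatures written for illustration; the method names echo AttributeSource, but the signatures and bodies are assumptions, not the Lucene code:

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical miniature of an attribute container with a private map. */
class MiniAttributeSource {
  private final Map<Class<?>, Object> attributes = new LinkedHashMap<Class<?>, Object>();

  /** Registers the instance unless one is already present; final so subclasses cannot bypass the map. */
  public final <T> T addAttribute(Class<T> type, T instance) {
    Object existing = attributes.get(type);
    if (existing == null) {
      attributes.put(type, instance);
      return instance;
    }
    return type.cast(existing);
  }

  public final boolean hasAttribute(Class<?> type) {
    return attributes.containsKey(type);
  }

  public final <T> T getAttribute(Class<T> type) {
    Object attribute = attributes.get(type);
    if (attribute == null) {
      throw new IllegalArgumentException(
          "This source does not have the attribute '" + type.getName() + "'.");
    }
    return type.cast(attribute);
  }
}

/** Subclasses can still add behaviour, but cannot override the accessors above. */
class MiniTokenStream extends MiniAttributeSource {
  boolean incrementToken() {
    return false; // toy stream with no tokens
  }
}
```

Declaring an override of addAttribute in MiniTokenStream would be a compile error, so the map and the accessors cannot drift apart; the class itself stays extensible.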




[jira] Commented: (LUCENE-2314) Add AttributeSource.copyTo(AttributeSource)

2010-03-13 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844871#action_12844871
 ] 

Uwe Schindler commented on LUCENE-2314:
---

bq. Minor comment - in copyTo, can you put state.attribute.getClass() in the 
message of the thrown exception, so whoever encounters it will know what's the 
invalid attribute?

Good idea, the same should be done for restoreState (the code was copied from
there).

bq. On a more general note, can State implement Iterable?

It could, but as State is itself a linked list element it would be... strange.
But of course we could make it Iterable. But the internal
implementations of AttributeSource should not use this interface as they are
optimized for speed, so the creation of iterators is a no-go here.
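
A minimal sketch of the two traversal styles under discussion, assuming a toy node class. MiniState is hypothetical and only loosely modeled on the description of State as a linked list element; the String field stands in for a captured attribute:

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

/** Hypothetical linked-list node that is also Iterable over itself and its successors. */
final class MiniState implements Iterable<MiniState> {
  final String attribute; // stand-in for the captured attribute
  final MiniState next;

  MiniState(String attribute, MiniState next) {
    this.attribute = attribute;
    this.next = next;
  }

  public Iterator<MiniState> iterator() {
    // One iterator object is allocated per traversal.
    return new Iterator<MiniState>() {
      private MiniState current = MiniState.this;

      public boolean hasNext() {
        return current != null;
      }

      public MiniState next() {
        if (current == null) {
          throw new NoSuchElementException();
        }
        MiniState result = current;
        current = current.next;
        return result;
      }

      public void remove() {
        throw new UnsupportedOperationException();
      }
    };
  }
}

class StateTraversalDemo {
  /** Walks the list directly, without allocating an iterator. */
  static String joinDirect(MiniState head) {
    StringBuilder sb = new StringBuilder();
    for (MiniState s = head; s != null; s = s.next) {
      sb.append(s.attribute).append(';');
    }
    return sb.toString();
  }

  /** Same result via the for-each form, which goes through iterator(). */
  static String joinIterable(MiniState head) {
    StringBuilder sb = new StringBuilder();
    for (MiniState s : head) {
      sb.append(s.attribute).append(';');
    }
    return sb.toString();
  }
}
```

Both methods visit the same nodes in the same order; the difference is only the extra iterator object in the for-each form, which is the cost the internal code would want to avoid. The sketch makes no claim about how large that cost is.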




Re: Different behavior of Directory.fieldLength()

2010-03-13 Thread Michael McCandless
I like the proposed new semantics (throw FNFE if the file does not
exist), and the migration path (new method, deprecate old).

Mike




[jira] Commented: (LUCENE-2314) Add AttributeSource.copyTo(AttributeSource)

2010-03-13 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844874#action_12844874
 ] 

Shai Erera commented on LUCENE-2314:


I just thought that instead of the for loop you have now you could have written 
something like: "for (State state : this)" ... a Java 5.0 style iteration. But 
it's not critical.




[jira] Commented: (LUCENE-2314) Add AttributeSource.copyTo(AttributeSource)

2010-03-13 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844876#action_12844876
 ] 

Uwe Schindler commented on LUCENE-2314:
---

Because of speed we do not do this. That was performance tested during 2.9
development. The for-loop using the linked list directly is far faster.
captureState is one of the most optimized methods.




[jira] Commented: (LUCENE-2314) Add AttributeSource.copyTo(AttributeSource)

2010-03-13 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844877#action_12844877
 ] 

Shai Erera commented on LUCENE-2314:


Ok. Performance is always preferred over beautiful-looking code :).




Re: Different behavior of Directory.fieldLength()

2010-03-13 Thread Shai Erera
Ok, opened LUCENE-2316 to track this.

Shai



[jira] Created: (LUCENE-2316) Define clear semantics for Directory.fileLength

2010-03-13 Thread Shai Erera (JIRA)
Define clear semantics for Directory.fileLength
---

 Key: LUCENE-2316
 URL: https://issues.apache.org/jira/browse/LUCENE-2316
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Shai Erera
Priority: Minor
 Fix For: 3.1


On this thread: 
http://mail-archives.apache.org/mod_mbox/lucene-java-dev/201003.mbox/%3c126142c1003121525v24499625u1589bbef4c079...@mail.gmail.com%3e
 it was mentioned that Directory's fileLength behavior is not consistent 
between Directory implementations if the given file name does not exist. 
FSDirectory returns a 0 length while RAMDirectory throws FNFE.

The problem is that the semantics of fileLength() are not defined. As proposed 
in the thread, we'll define the following semantics:

* Returns the length of the file denoted by name if the file 
exists. The return value may be anything between 0 and Long.MAX_VALUE.
* Throws FileNotFoundException if the file does not exist. Note that you can 
call dir.fileExists(name) if you are not sure whether the file exists or not.

For backwards compatibility we'll create a new method w/ clear semantics. Something like:

{code}
/**
 * @deprecated the method will become abstract when #fileLength(name) has been
 * removed.
 */
public long getFileLength(String name) throws IOException {
  long len = fileLength(name);
  if (len == 0 && !fileExists(name)) {
    throw new FileNotFoundException(name);
  }
  return len;
}
{code}

The first line just calls the current impl. If it throws exception for a 
non-existing file, we're ok. The second line verifies whether a 0 length is for 
an existing file or not and throws an exception appropriately.
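
To make the effect of the wrapper concrete, here is a minimal, self-contained sketch. It does not use Lucene: SketchDirectory, ZeroOnMissingDirectory and ThrowOnMissingDirectory are hypothetical stand-ins that only mimic the two behaviors described above (return 0 vs. throw FNFE). Under that assumption, getFileLength gives the same answer on both.

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for Directory; only the methods discussed in this issue.
abstract class SketchDirectory {
  protected final Map<String, Long> files = new HashMap<>();

  void put(String name, long length) { files.put(name, length); }

  boolean fileExists(String name) { return files.containsKey(name); }

  // Legacy method: behavior for a missing file is left to the implementation.
  abstract long fileLength(String name) throws IOException;

  // Proposed wrapper: a missing file always ends in FileNotFoundException.
  long getFileLength(String name) throws IOException {
    long len = fileLength(name);
    if (len == 0 && !fileExists(name)) {
      throw new FileNotFoundException(name);
    }
    return len;
  }
}

// Mimics the "return 0 for a missing file" behavior.
class ZeroOnMissingDirectory extends SketchDirectory {
  @Override long fileLength(String name) {
    Long len = files.get(name);
    return len == null ? 0L : len;
  }
}

// Mimics the "throw FNFE for a missing file" behavior.
class ThrowOnMissingDirectory extends SketchDirectory {
  @Override long fileLength(String name) throws IOException {
    Long len = files.get(name);
    if (len == null) throw new FileNotFoundException(name);
    return len;
  }
}

class FileLengthDemo {
  // Returns the length, or -1 if getFileLength reports the file as missing.
  static long lengthOrMissing(SketchDirectory dir, String name) {
    try {
      return dir.getFileLength(name);
    } catch (FileNotFoundException e) {
      return -1L;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    SketchDirectory[] dirs = { new ZeroOnMissingDirectory(), new ThrowOnMissingDirectory() };
    for (SketchDirectory dir : dirs) {
      dir.put("empty.bin", 0L); // an existing file of length 0
      System.out.println(dir.getClass().getSimpleName()
          + " empty=" + lengthOrMissing(dir, "empty.bin")
          + " missing=" + lengthOrMissing(dir, "nope.bin"));
    }
  }
}
```

Note the cost of the compatibility path: an existing zero-length file triggers one extra fileExists call, which is the price of not touching existing implementations.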




[jira] Commented: (LUCENE-2316) Define clear semantics for Directory.fileLength

2010-03-13 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844882#action_12844882
 ] 

Shai Erera commented on LUCENE-2316:


I am not sure we should mark getFileLength deprecated though, in order to alert 
users that it will become abstract. Can we instead just note that in its 
Javadocs? It will be awkward if we deprecate both fileLength and getFileLength 
:).





[jira] Commented: (LUCENE-2315) AttributeSource's methods for accessing attributes should be final, else its easy to corrupt the internal states

2010-03-13 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844884#action_12844884
 ] 

Shai Erera commented on LUCENE-2315:


Ok I see. I think that instead of creating another class to introduce new users 
to, we can stick w/ AS and make final all the methods that no one should have 
any reason to ever override. We can keep the methods that define the 'behavior' 
not final, though I don't see any at the moment. Maybe 
getAttributeImplsIterator.

But if it makes sense to factor out just these methods to a separate class, so 
that a custom AS doesn't need to be a sub-class of AS for just that purpose, 
then I think it'll also be ok.

> AttributeSource's methods for accessing attributes should be final, else its 
> easy to corrupt the internal states
> 
>
> Key: LUCENE-2315
> URL: https://issues.apache.org/jira/browse/LUCENE-2315
> Project: Lucene - Java
>  Issue Type: Bug
>Affects Versions: 2.9, 2.9.1, 2.9.2, 3.0, 3.0.1
>Reporter: Uwe Schindler
>Priority: Minor
> Fix For: 3.1
>
>
> The methods that operate on and modify the internal maps of AttributeSource 
> should be final, which is a backwards break. But anybody that overrides such 
> methods simply creates a buggy AS in either case.
> I want to make all impls final (in general the class should be final, 
> but it is made for extension in TokenStream). So it's important that the 
> implementations are final!




Re: Different behavior of Directory.fieldLength()

2010-03-13 Thread Michael McCandless
Thanks!

Mike

On Sat, Mar 13, 2010 at 9:10 AM, Shai Erera  wrote:
> Ok, opened LUCENE-2316 to track this.
>
> Shai
>
> On Sat, Mar 13, 2010 at 3:49 PM, Michael McCandless
>  wrote:
>>
>> I like the proposed new semantics (throw FNFE if the file does not
>> exist), and the migration path (new method, deprecate old).
>>
>> Mike
>>
>> On Sat, Mar 13, 2010 at 7:46 AM, Shai Erera  wrote:
>> > I think it falls under the semantics of dir.fileLength() and not the
>> > semantics of the implementation right? Unfortunately, the semantics of
>> > Directory.fileLength() are not specified, which made it easy for
>> > extensions to invent their own.
>> >
>> > I myself am not sure what's better - return 0 as the length for a file
>> > that does not exist, or throw FNFE to alert the caller that he's
>> > querying a file that does not exist. FSDirectory got away with it by
>> > using File API which just happens to return 0 for a non-existing file.
>> > RAMDirectory chose to alert the caller. My feeling is that these two
>> > were written by two different persons, or separate times, each
>> > understanding the method differently.
>> >
>> > I think we should make the semantics clear, and declare a better
>> > contract, by documentation and possible also by method signature. If
>> > for example we decide that it should return 0 for non-existing files,
>> > then I think we can remove the IOException from the method sig? But
>> > maybe we want to allow IOException to be thrown by Directories that
>> > could actually fail on probing the file length.
>> >
>> > I would propose to declare the semantics of fileLength like this:
>> > * Returns the length of the file denoted by name if the file exists.
>> >   The return value may be anywhere between 0 and Long.MAX_VALUE.
>> > * Throws FileNotFoundException if the file does not exist. Note that
>> >   you can call dir.fileExists(name) if you are not sure whether the
>> >   file exists or not.
>> >
>> > That way it's clear. We can then change IW code to call fileExists if it
>> > expects to fail on either of the two.
>> >
>> > Question is - how do we do this w/o breaking Directory implementations
>> > out there? I think that we might be safe with it, if we make sure all
>> > of IW code queries fileExists before. However if someone relies on
>> > FSDir to return 0 instead of throwing exception, that will break his
>> > app.
>> >
>> > Backwards is always tricky. This does not result in compilation error,
>> > but a runtime change. We might be able to get away with it if we think
>> > users run some tests before they deploy a new Lucene .jar ... but
>> > otherwise, we should create a new method, w/ clear semantics?
>> > Something like:
>> >
>> > /**
>> >  * @deprecated the method will become abstract when #fileLength(name)
>> >  * has been removed.
>> >  */
>> > public long getFileLength(String name) throws IOException {
>> >   long len = fileLength(name);
>> >   if (len == 0 && !fileExists(name)) {
>> >     throw new FileNotFoundException(name);
>> >   }
>> >   return len;
>> > }
>> >
>> > The first line just calls the current impl. If it throws exception for a
>> > non-existing file, we're ok. The second line verifies whether a 0 length
>> > is for an existing file or not and throws an exception appropriately.
>> >
>> > That is of course only if everybody else agree w/ these semantics.
>> >
>> > Shai
>> >
>> > On Sat, Mar 13, 2010 at 1:21 PM, Marcelo Ochoa 
>> > wrote:
>> >>
>> >> Uwe:
>> >> > That is not true, the API says:
>> >> > "Creates a new File *instance* from a parent pathname string and a
>> >> > child pathname string."
>> >> >
>> >> > Please note "instance", so it will never create the file on disk. New
>> >> > File() just creates a file instance but no file on disk. You can
>> >> > check this with a simple test.
>> >>  OK but what the about the exception?
>> >>  If the creation of the File instance do not throw an exception and
>> >> the method File.length() returns 0 if a file does not exists
>> >> RAMDirectory and other classes which also override this method should
>> >> be modified to return 0.
>> >>  Best regards, Marcelo.
>> >>
>> >> --
>> >> Marcelo F. Ochoa
>> >> http://marceloochoa.blogspot.com/
>> >> http://mochoa.sites.exa.unicen.edu.ar/
>> >> __
>> >> Want to integrate Lucene and Oracle?
>> >>
>> >>
>> >> http://marceloochoa.blogspot.com/2007/09/running-lucene-inside-your-oracle-jvm.html
>> >> Is Oracle 11g REST ready?
>> >> http://marceloochoa.blogspot.com/2008/02/is-oracle-11g-rest-ready.html
>> >>
>> >>
>> >
>> >
>>

[jira] Commented: (LUCENE-2312) Search on IndexWriter's RAM Buffer

2010-03-13 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844891#action_12844891
 ] 

Jason Rutherglen commented on LUCENE-2312:
--

From LUCENE-2293: {quote}(b-tree, or, simply sort-on-demand the
first time a query needs it, though that cost increases the
larger your RAM segments get, ie, not incremental to the # docs
you just added){quote}

For the terms dictionary, perhaps a terms array (this could be a
RawPostingList[], or an array of objects with pointers to a
RawPostingList with some helper methods like getTerm and
compareTo), is kept in sorted order, we then binary search and
insert new RawPostingLists/terms into the array. We *could*
implement a 2 dimensional array, allowing us to make a per
reader copy of the 1st dimension of array. This would maintain
transactional consistency (ie, a reader's array isn't changing
as a term enum is traversing in another thread). 

{quote}Also, we have to solve what happens to a reader using a
RAM segment that's been flushed. Perhaps we don't reuse RAM at
that point, ie, rely on GC to reclaim once all readers using
that RAM segment have closed.{quote}

I don't think we have a choice here? 

> Search on IndexWriter's RAM Buffer
> --
>
> Key: LUCENE-2312
> URL: https://issues.apache.org/jira/browse/LUCENE-2312
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Search
>Affects Versions: 3.0.1
>Reporter: Jason Rutherglen
> Fix For: 3.0.2
>
>
> In order to offer users near realtime search, without incurring
> an indexing performance penalty, we can implement search on
> IndexWriter's RAM buffer. This is the buffer that is filled in
> RAM as documents are indexed. Currently the RAM buffer is
> flushed to the underlying directory (usually disk) before being
> made searchable. 
> Today's Lucene based NRT systems must incur the cost of merging
> segments, which can slow indexing. 
> Michael Busch has good suggestions regarding how to handle deletes using max 
> doc ids.  
> https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841923&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841923
> The area that isn't fully fleshed out is the terms dictionary,
> which needs to be sorted prior to queries executing. Currently
> IW implements a specialized hash table. Michael B has a
> suggestion here: 
> https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841915&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841915




[jira] Created: (LUCENE-2317) allow separate control of whether docTermFreq and positions are indexed

2010-03-13 Thread Michael McCandless (JIRA)
allow separate control of whether docTermFreq and positions are indexed
---

 Key: LUCENE-2317
 URL: https://issues.apache.org/jira/browse/LUCENE-2317
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Michael McCandless
 Fix For: 3.1


[Spinoff of LUCENE-2308... we keep spinning things off... I feel like we live 
inside a particle accelerator]

Right now Lucene indexes the docTermFreq and positions into the postings, by 
default.

You can use omitTFAP to turn them both off, which if you also omit norms gives 
you "match only" scoring.

But, really, they ought to be separately controllable -- one may want to 
include docTermFreq but not positions, to get full scoring for non-positional 
phrases.

Probably we should wait until LUCENE-2308 is done, and make the API change on 
*FieldType.




[jira] Commented: (LUCENE-2293) IndexWriter has hard limit on max concurrency

2010-03-13 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844893#action_12844893
 ] 

Jason Rutherglen commented on LUCENE-2293:
--

{quote}Change the approach for how we buffer in RAM to a more
isolated approach{quote}

Would we reuse the DocumentsWriter class, and assign one to each
thread? Then start to rework DW on down in the code tree,
removing the per thread logic? Or do we need to do something
more dramatic?

> IndexWriter has hard limit on max concurrency
> -
>
> Key: LUCENE-2293
> URL: https://issues.apache.org/jira/browse/LUCENE-2293
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Index
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Fix For: 3.1
>
>
> DocumentsWriter has this nasty hardwired constant:
> {code}
> private final static int MAX_THREAD_STATE = 5;
> {code}
> which probably I should have attached a //nocommit to the moment I
> wrote it ;)
> That constant sets the max number of thread states to 5.  This means,
> if more than 5 threads enter IndexWriter at once, they will "share"
> only 5 thread states, meaning we gate CPU concurrency to 5 running
> threads inside IW (each thread must first wait for the last thread to
> finish using the thread state before grabbing it).
> This is bad because modern hardware can make use of more than 5
> threads.  So I think an immediate fix is to make this settable
> (expert), and increase the default (8?).
> It's tricky, though, because the more thread states, the less RAM
> efficiency you have, meaning the worse indexing throughput.  So you
> shouldn't up and set this to 50: you'll be flushing too often.
> But... I think a better fix is to re-think how threads write state
> into DocumentsWriter.  Today, a single docID stream is assigned across
> threads (eg one thread gets docID=0, next one docID=1, etc.), and each
> thread writes to a private RAM buffer (living in the thread state),
> and then on flush we do a merge sort.  The merge sort is inefficient
> (does not currently use a PQ)... and, wasteful because we must
> re-decode every posting byte.
> I think we could change this, so that threads write to private RAM
> buffers, with a private docID stream, but then instead of merging on
> flush, we directly flush each thread as its own segment (and, allocate
> private docIDs to each thread).  We can then leave merging to CMS
> which can already run merges in the BG without blocking ongoing
> indexing (unlike the merge we do in flush, today).
> This would also allow us to separately flush thread states.  Ie, we
> need not flush all thread states at once -- we can flush one when it
> gets too big, and then let the others keep running.  This should be a
> good concurrency gain since it uses IO & CPU resources "throughout"
> indexing instead of "big burst of CPU only" then "big burst of IO
> only" that we have today (flush today "stops the world").
> One downside I can think of is... docIDs would now be "less
> monotonic", meaning if N threads are indexing, you'll roughly get
> in-time-order assignment of docIDs.  But with this change, all of one
> thread state would get 0..N docIDs, the next thread state'd get
> N+1...M docIDs, etc.  However, a single thread would still get
> monotonic assignment of docIDs.
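
The docID trade-off in the quoted description can be illustrated with a small, single-threaded sketch. Nothing here is DocumentsWriter code: PrivateBuffer and SegmentedIndex are hypothetical names, and the "threads" are simulated by two buffers that receive documents in interleaved order. It only shows how private docID streams, flushed as separate segments, map to global docIDs.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical per-thread buffer: private docIDs start at 0 in every buffer.
class PrivateBuffer {
  private final List<String> docs = new ArrayList<>();

  // Returns the buffer-private docID assigned to the document.
  int add(String doc) {
    docs.add(doc);
    return docs.size() - 1;
  }

  // Hands the buffered docs over as one "segment" and resets the buffer.
  List<String> flush() {
    List<String> segment = new ArrayList<>(docs);
    docs.clear();
    return segment;
  }
}

// Hypothetical index in which each flushed buffer becomes its own segment.
class SegmentedIndex {
  private final List<List<String>> segments = new ArrayList<>();

  void addSegment(List<String> segment) { segments.add(segment); }

  // Global docID = total size of earlier segments + private docID.
  String doc(int globalDocId) {
    int base = 0;
    for (List<String> segment : segments) {
      if (globalDocId < base + segment.size()) {
        return segment.get(globalDocId - base);
      }
      base += segment.size();
    }
    throw new IndexOutOfBoundsException("docID " + globalDocId);
  }
}

class PerThreadFlushDemo {
  public static void main(String[] args) {
    PrivateBuffer t1 = new PrivateBuffer();
    PrivateBuffer t2 = new PrivateBuffer();
    // Interleaved arrival order: a1, b1, a2, b2.
    t1.add("a1");
    t2.add("b1");
    t1.add("a2");
    t2.add("b2");

    SegmentedIndex index = new SegmentedIndex();
    index.addSegment(t1.flush()); // flushed on its own, no merge with t2
    index.addSegment(t2.flush());

    for (int id = 0; id < 4; id++) {
      System.out.println(id + " -> " + index.doc(id));
    }
  }
}
```

The global order comes out as a1, a2, b1, b2 rather than arrival order, which is the "less monotonic" effect; within one buffer the order is preserved.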




[jira] Commented: (LUCENE-2294) Create IndexWriterConfiguration and store all of IW configuration there

2010-03-13 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844895#action_12844895
 ] 

Robert Muir commented on LUCENE-2294:
-

bq. I can't wait for this to be in ... an exhausting issue

Shai, thanks for taking the time to redo this massive patch. I'm sorry 
again I dropped the ball and didn't notice till the commit, forcing you 
to redo a lot of work.

+1

> Create IndexWriterConfiguration and store all of IW configuration there
> ---
>
> Key: LUCENE-2294
> URL: https://issues.apache.org/jira/browse/LUCENE-2294
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Shai Erera
>Assignee: Michael McCandless
> Fix For: 3.1
>
> Attachments: check.py, LUCENE-2294.patch, LUCENE-2294.patch, 
> LUCENE-2294.patch, LUCENE-2294.patch, LUCENE-2294.patch, LUCENE-2294.patch
>
>
> I would like to factor all IW configuration parameters out into a single 
> configuration class, which I propose to name IndexWriterConfiguration (or 
> IndexWriterConfig). I want to store there almost everything besides the 
> Directory, and to reduce all the ctors down to one: IndexWriter(Directory, 
> IndexWriterConfiguration). What I was thinking of storing there are the 
> following parameters:
> * All of ctors parameters, except for Directory.
> * The different setters where it makes sense. For example I still think 
> infoStream should be set on IW directly.
> I'm thinking that IWC should expose everything in a setter/getter methods, 
> and defaults to whatever IW defaults today. Except for Analyzer which will 
> need to be defined in the ctor of IWC and won't have a setter.
> I am not sure why MaxFieldLength is required in all IW ctors, yet IW declares 
> a DEFAULT (which is an int and not MaxFieldLength). Do we still think that 
> 1 should be the default? Why not default to UNLIMITED and otherwise let 
> the application decide what LIMITED means for it? I would like to make MFL 
> optional on IWC and default to something, and I hope that default will be 
> UNLIMITED. We can document that on IWC, so that if anyone chooses to move to 
> the new API, he should be aware of that ...
> I plan to deprecate all the ctors and getters/setters and replace them by:
> * One ctor as described above
> * getIndexWriterConfiguration, or simply getConfig, which can then be queried 
> for the setting of interest.
> * About the setters, I think maybe we can just introduce a setConfig method 
> which will override everything that is overridable today, except for 
> Analyzer. So someone could do iw.getConfig().setSomething(); 
> iw.setConfig(newConfig);
> ** The setters on IWC can return an IWC to allow chaining set calls ... so 
> the above will turn into 
> iw.setConfig(iw.getConfig().setSomething1().setSomething2()); 
> BTW, this is needed for Parallel Indexing (see LUCENE-1879), but I think it 
> will greatly simplify IW's API.
> I'll start to work on a patch.
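
The chained-setter idea in the quoted description could look roughly like the sketch below. WriterConfigSketch and WriterSketch are hypothetical names, the two settings and their default values are placeholders chosen for illustration, and the analyzer is reduced to a plain string; none of this is taken from the patch.

```java
// Hypothetical config holder; settings and defaults are illustrative only.
class WriterConfigSketch {
  private final String analyzerName; // fixed in the ctor, deliberately no setter
  private int maxBufferedDocs = 10;
  private double ramBufferSizeMB = 16.0;

  WriterConfigSketch(String analyzerName) {
    this.analyzerName = analyzerName;
  }

  // Each setter returns this so calls can be chained.
  WriterConfigSketch setMaxBufferedDocs(int maxBufferedDocs) {
    this.maxBufferedDocs = maxBufferedDocs;
    return this;
  }

  WriterConfigSketch setRamBufferSizeMB(double ramBufferSizeMB) {
    this.ramBufferSizeMB = ramBufferSizeMB;
    return this;
  }

  String getAnalyzerName() { return analyzerName; }
  int getMaxBufferedDocs() { return maxBufferedDocs; }
  double getRamBufferSizeMB() { return ramBufferSizeMB; }
}

// Hypothetical writer with a single ctor argument for all settings.
class WriterSketch {
  private WriterConfigSketch config;

  WriterSketch(WriterConfigSketch config) { this.config = config; }

  WriterConfigSketch getConfig() { return config; }

  void setConfig(WriterConfigSketch config) { this.config = config; }
}

class WriterConfigDemo {
  public static void main(String[] args) {
    WriterSketch iw = new WriterSketch(new WriterConfigSketch("standard"));
    // The chained form from the description: iw.setConfig(iw.getConfig().setX().setY()).
    iw.setConfig(iw.getConfig().setMaxBufferedDocs(100).setRamBufferSizeMB(32.0));
    System.out.println(iw.getConfig().getAnalyzerName()
        + " " + iw.getConfig().getMaxBufferedDocs()
        + " " + iw.getConfig().getRamBufferSizeMB());
  }
}
```

One design point the sketch makes visible: because getConfig returns the live object, the setters already take effect before setConfig is called, so a real implementation would have to decide whether getConfig hands out a copy.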




[jira] Commented: (LUCENE-2317) allow separate control of whether docTermFreq and positions are indexed

2010-03-13 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844897#action_12844897
 ] 

Shai Erera commented on LUCENE-2317:


This will turn into another setting. Can we introduce on the fly a setMatchOnly 
method which will turn all off (in addition to the one you're proposing)? Maybe 
it should become its own FieldType constant ... it will be bundling together a 
bunch of options that can be set individually if one wants finer grained 
control.
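
A rough sketch of what I mean, with made-up names (FieldOptionsSketch is not the FieldType from LUCENE-2308, and the three flags are only my reading of the options discussed in this issue). Each option has its own setter, and setMatchOnly is just a shortcut that bundles them:

```java
// Hypothetical field options; every postings component is toggled separately.
class FieldOptionsSketch {
  private boolean indexTermFreqs = true;
  private boolean indexPositions = true;
  private boolean indexNorms = true;

  FieldOptionsSketch setIndexTermFreqs(boolean value) {
    indexTermFreqs = value;
    return this;
  }

  FieldOptionsSketch setIndexPositions(boolean value) {
    indexPositions = value;
    return this;
  }

  FieldOptionsSketch setIndexNorms(boolean value) {
    indexNorms = value;
    return this;
  }

  // Convenience bundle: turns off freqs, positions and norms in one call.
  FieldOptionsSketch setMatchOnly() {
    return setIndexTermFreqs(false).setIndexPositions(false).setIndexNorms(false);
  }

  boolean indexesTermFreqs() { return indexTermFreqs; }
  boolean indexesPositions() { return indexPositions; }
  boolean indexesNorms() { return indexNorms; }

  boolean isMatchOnly() {
    return !indexTermFreqs && !indexPositions && !indexNorms;
  }
}

class FieldOptionsDemo {
  public static void main(String[] args) {
    // Term freqs without positions: the combination this issue asks for.
    FieldOptionsSketch freqsOnly = new FieldOptionsSketch().setIndexPositions(false);
    FieldOptionsSketch matchOnly = new FieldOptionsSketch().setMatchOnly();
    System.out.println("freqsOnly: freqs=" + freqsOnly.indexesTermFreqs()
        + " positions=" + freqsOnly.indexesPositions());
    System.out.println("matchOnly: " + matchOnly.isMatchOnly());
  }
}
```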





[jira] Commented: (LUCENE-2312) Search on IndexWriter's RAM Buffer

2010-03-13 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844898#action_12844898
 ] 

Michael McCandless commented on LUCENE-2312:


{quote}
For the terms dictionary, perhaps a terms array (this could be a
RawPostingList[], or an array of objects with pointers to a
RawPostingList with some helper methods like getTerm and
compareTo), is kept in sorted order, we then binary search and
insert new RawPostingLists/terms into the array. We could
implement a 2 dimensional array, allowing us to make a per
reader copy of the 1st dimension of array. This would maintain
transactional consistency (ie, a reader's array isn't changing
as a term enum is traversing in another thread).
{quote}

I don't think we can do term insertion into an array -- that's O(N^2)
insertion cost -- we should use a btree instead.

Also, we could store the first docID stored into the term, too -- this
way we could have an ordered collection of terms, that's shared across
several open readers even as changes are still being made, but each
reader skips a given term if its first docID is greater than the
maxDoc it's searching.  That'd give us point in time searching even
while we add terms with time...

{quote}
bq. Also, we have to solve what happens to a reader using a RAM segment that's 
been flushed. Perhaps we don't reuse RAM at that point, ie, rely on GC to 
reclaim once all readers using that RAM segment have closed.

I don't think we have a choice here?
{quote}

I think we do have a choice.

EG we could force the reader to cutover to the newly flushed segment
(which should be identical to the RAM segment), eg by making [say] a
DelegatingSegmentReader.

Still... we'd probably have to not re-use in that case, since there
can be queries in-flight stepping through the RAM postings, and, we
have no way to accurately detect they are done.  But at least with
this approach we wouldn't tie up RAM indefinitely...

Or maybe we simply state that the APP must aggressively close NRT
readers with time else memory use grows and grows... but I don't
really like that.  We don't have such a restriction today...






[jira] Commented: (LUCENE-2294) Create IndexWriterConfiguration and store all of IW configuration there

2010-03-13 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844899#action_12844899
 ] 

Shai Erera commented on LUCENE-2294:


Thanks a lot Robert for reviewing this. No harm done ... I've had the chance to 
exercise some Eclipse tricks in the process of re-doing it. Unfortunately it 
introduced some changes, but luckily we have Mike and his 
mighty-python-scripting-ability to protect us :).

Now I have a bunch of other issues I need to open, that were waiting for this 
guy to go in. Stay tuned.





[jira] Commented: (LUCENE-2293) IndexWriter has hard limit on max concurrency

2010-03-13 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844900#action_12844900
 ] 

Michael McCandless commented on LUCENE-2293:


Probably one DW instance per thread?  Seems like that'd work?

And possibly remove *PerThread throughout the default indexing chain?

> IndexWriter has hard limit on max concurrency
> -
>
> Key: LUCENE-2293
> URL: https://issues.apache.org/jira/browse/LUCENE-2293
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Index
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Fix For: 3.1
>
>
> DocumentsWriter has this nasty hardwired constant:
> {code}
> private final static int MAX_THREAD_STATE = 5;
> {code}
> which I probably should have attached a //nocommit to the moment I
> wrote it ;)
> That constant sets the max number of thread states to 5.  This means,
> if more than 5 threads enter IndexWriter at once, they will "share"
> only 5 thread states, meaning we gate CPU concurrency to 5 running
> threads inside IW (each thread must first wait for the last thread to
> finish using the thread state before grabbing it).
> This is bad because modern hardware can make use of more than 5
> threads.  So I think an immediate fix is to make this settable
> (expert), and increase the default (8?).
> It's tricky, though, because the more thread states, the less RAM
> efficiency you have, meaning the worse indexing throughput.  So you
> shouldn't up and set this to 50: you'll be flushing too often.
> But... I think a better fix is to re-think how threads write state
> into DocumentsWriter.  Today, a single docID stream is assigned across
> threads (eg one thread gets docID=0, next one docID=1, etc.), and each
> thread writes to a private RAM buffer (living in the thread state),
> and then on flush we do a merge sort.  The merge sort is inefficient
> (does not currently use a PQ)... and, wasteful because we must
> re-decode every posting byte.
> I think we could change this, so that threads write to private RAM
> buffers, with a private docID stream, but then instead of merging on
> flush, we directly flush each thread as its own segment (and, allocate
> private docIDs to each thread).  We can then leave merging to CMS
> which can already run merges in the BG without blocking ongoing
> indexing (unlike the merge we do in flush, today).
> This would also allow us to separately flush thread states.  Ie, we
> need not flush all thread states at once -- we can flush one when it
> gets too big, and then let the others keep running.  This should be a
> good concurrency gain since it uses IO & CPU resources "throughout"
> indexing instead of the "big burst of CPU only" then "big burst of IO
> only" that we have today (flush today "stops the world").
> One downside I can think of is... docIDs would now be "less
> monotonic".  Today, if N threads are indexing, you roughly get
> in-time-order assignment of docIDs.  But with this change, all of one
> thread state would get 0..N docIDs, the next thread state'd get
> N+1...M docIDs, etc.  However, a single thread would still get
> monotonic assignment of docIDs.
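
To make the proposal above a bit more concrete, here is a toy sketch of "private RAM buffer, private docID stream, flush each buffer as its own segment". It is not DocumentsWriter code: PrivateBufferSketch, the String documents, and the list-of-lists "segments" are all made up for illustration, and merging is left out entirely.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: none of these names exist in Lucene.
final class PrivateBufferSketch {
    private final int maxBufferedDocs;
    // Shared across all buffers; each flushed buffer becomes one "segment".
    private final List<List<String>> flushedSegments;
    // Private to one indexing thread, so adding a document needs no lock.
    private List<String> buffer = new ArrayList<>();

    PrivateBufferSketch(int maxBufferedDocs, List<List<String>> flushedSegments) {
        this.maxBufferedDocs = maxBufferedDocs;
        this.flushedSegments = flushedSegments;
    }

    /** Buffers a document and returns its docID, which is private to this buffer. */
    int addDocument(String doc) {
        int docID = buffer.size();
        buffer.add(doc);
        if (buffer.size() >= maxBufferedDocs) {
            flush();
        }
        return docID;
    }

    /** Flushes only this buffer as its own segment; other buffers keep running. */
    void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        List<String> segment = buffer;
        buffer = new ArrayList<>();
        // Only the hand-off to the shared segment list is synchronized.
        synchronized (flushedSegments) {
            flushedSegments.add(segment);
        }
    }
}
```

Each indexing thread would own one such buffer, so docIDs restart at 0 after every flush of that buffer and are monotonic only within it, which is the "less monotonic" downside mentioned in the description.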




[jira] Updated: (LUCENE-2048) Omit positions but keep termFreq

2010-03-13 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-2048:
---

Fix Version/s: 3.1

> Omit positions but keep termFreq
> 
>
> Key: LUCENE-2048
> URL: https://issues.apache.org/jira/browse/LUCENE-2048
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Affects Versions: 3.1
>Reporter: Andrzej Bialecki 
> Fix For: 3.1
>
>
> It would be useful to have an option to discard positional information but 
> still keep the term frequency - currently setOmitTermFreqAndPositions 
> discards both. Even though position-dependent queries wouldn't work in such a 
> case, any other queries would still work fine and we would get the right 
> scoring.




[jira] Resolved: (LUCENE-2317) allow separate control of whether docTermFreq and positions are indexed

2010-03-13 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless resolved LUCENE-2317.


Resolution: Duplicate

Duh, this is a dup of LUCENE-2048.  I can't keep track of all the particles 
anymore!

> allow separate control of whether docTermFreq and positions are indexed
> ---
>
> Key: LUCENE-2317
> URL: https://issues.apache.org/jira/browse/LUCENE-2317
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Michael McCandless
> Fix For: 3.1
>
>
> [Spinoff of LUCENE-2308... we keep spinning things off... I feel like we live 
> inside a particle accelerator]
> Right now Lucene indexes the docTermFreq and positions into the postings, by 
> default.
> You can use omitTFAP to turn them both off, which if you also omit norms 
> gives you "match only" scoring.
> But, really, they ought to be separately controllable -- one may want to 
> include docTermFreq but not positions, to get full scoring for non-positional 
> phrases.
> Probably we should wait until LUCENE-2308 is done, and make the API change on 
> *FieldType.




[jira] Resolved: (LUCENE-2294) Create IndexWriterConfiguration and store all of IW configuration there

2010-03-13 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless resolved LUCENE-2294.


Resolution: Fixed

Take 2!  Thanks Shai.

> Create IndexWriterConfiguration and store all of IW configuration there
> ---
>
> Key: LUCENE-2294
> URL: https://issues.apache.org/jira/browse/LUCENE-2294
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Shai Erera
>Assignee: Michael McCandless
> Fix For: 3.1
>
> Attachments: check.py, LUCENE-2294.patch, LUCENE-2294.patch, 
> LUCENE-2294.patch, LUCENE-2294.patch, LUCENE-2294.patch, LUCENE-2294.patch
>
>




[jira] Commented: (LUCENE-2316) Define clear semantics for Directory.fileLength

2010-03-13 Thread Marvin Humphrey (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844906#action_12844906
 ] 

Marvin Humphrey commented on LUCENE-2316:
-

Is it really necessary to obtain the length of a file from the Directory? Lucy
doesn't implement that functionality, and we haven't missed it -- we're able
to get away with using the length() method on InStream and OutStream. 

I see that IndexInput and IndexOutput already have length() methods. Can you
simply eliminate all uses of Directory.fileLength() within core and deprecate
it without introducing a new method?

> Define clear semantics for Directory.fileLength
> ---
>
> Key: LUCENE-2316
> URL: https://issues.apache.org/jira/browse/LUCENE-2316
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Shai Erera
>Priority: Minor
> Fix For: 3.1
>
>
> On this thread: 
> http://mail-archives.apache.org/mod_mbox/lucene-java-dev/201003.mbox/%3c126142c1003121525v24499625u1589bbef4c079...@mail.gmail.com%3e
>  it was mentioned that Directory's fileLength behavior is not consistent 
> between Directory implementations if the given file name does not exist. 
> FSDirectory returns a 0 length while RAMDirectory throws FNFE.
> The problem is that the semantics of fileLength() are not defined. As 
> proposed in the thread, we'll define the following semantics:
> * Returns the length of the file denoted by name if the file 
> exists. The return value may be anything between 0 and Long.MAX_VALUE.
> * Throws FileNotFoundException if the file does not exist. Note that you can 
> call dir.fileExists(name) if you are not sure whether the file exists or not.
> For backwards compatibility we'll create a new method w/ clear semantics. 
> Something like:
> {code}
> /**
>  * @deprecated the method will become abstract when #fileLength(name) has been removed.
>  */
> public long getFileLength(String name) throws IOException {
>   long len = fileLength(name);
>   if (len == 0 && !fileExists(name)) {
>     throw new FileNotFoundException(name);
>   }
>   return len;
> }
> {code}
> The first line just calls the current impl. If it throws an exception for a 
> non-existing file, we're ok. The second line verifies whether a 0 length is 
> for an existing file or not and throws an exception appropriately.
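
For reference, the wrapper above can be tried out against a toy in-memory directory. This is a sketch only: FileLengthSketch and its map-backed storage are invented for the example, and its fileLength() simply models the "return 0 for a missing file" behaviour that the description attributes to FSDirectory.

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a Directory; not a Lucene class.
final class FileLengthSketch {
    private final Map<String, Long> lengths = new HashMap<>();

    void put(String name, long length) { lengths.put(name, length); }

    boolean fileExists(String name) { return lengths.containsKey(name); }

    /** Legacy behaviour being modelled: a missing file reports length 0. */
    long fileLength(String name) {
        Long len = lengths.get(name);
        return len == null ? 0L : len;
    }

    /** Proposed semantics: real length for existing files, FNFE otherwise. */
    long getFileLength(String name) throws IOException {
        long len = fileLength(name);
        // A 0 length is ambiguous, so only then pay for the existence check.
        if (len == 0 && !fileExists(name)) {
            throw new FileNotFoundException(name);
        }
        return len;
    }
}
```

An existing empty file still returns 0 here, while a missing name throws FileNotFoundException, which is the distinction the new semantics are after.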




[jira] Updated: (LUCENE-2293) IndexWriter has hard limit on max concurrency

2010-03-13 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-2293:
---

Attachment: LUCENE-2293.patch

Simple patch, just adds maxThreadStates setting to IndexWriterConfig.

> IndexWriter has hard limit on max concurrency
> -
>
> Key: LUCENE-2293
> URL: https://issues.apache.org/jira/browse/LUCENE-2293
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Index
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Fix For: 3.1
>
> Attachments: LUCENE-2293.patch
>
>




[jira] Updated: (LUCENE-2314) Add AttributeSource.copyTo(AttributeSource)

2010-03-13 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler updated LUCENE-2314:
--

Attachment: LUCENE-2314.patch

New patch with some improvements in cloneAttributes() and the requested class 
names in the IAEs.

> Add AttributeSource.copyTo(AttributeSource)
> ---
>
> Key: LUCENE-2314
> URL: https://issues.apache.org/jira/browse/LUCENE-2314
> Project: Lucene - Java
>  Issue Type: Improvement
>Reporter: Uwe Schindler
>Assignee: Uwe Schindler
>Priority: Minor
> Fix For: 3.1
>
> Attachments: LUCENE-2314.patch, LUCENE-2314.patch
>
>
> One problem with AttributeSource at the moment is the missing "insight" into 
> AttributeSource.State. If you want to create TokenStreams that inspect 
> captured states, you have no chance. Making the contents of State public is a 
> bad idea, as it does not help for inspecting (it's a linked list, so you have 
> to iterate).
> AttributeSource currently contains a cloneAttributes() call, which returns a 
> new AttributeSource with all current attributes cloned. This is the (more 
> expensive) captureState. The problem is that you cannot copy back the cloned 
> AS (which is the restoreState). To use this behaviour (by the way, 
> ShingleMatrix can use it), one can alternatively use cloneAttributes and 
> copyTo. You can easily change the cloned attributes and store them in lists 
> and copy them back. The only problem is lower performance of these calls (as 
> State is a very optimized class).
> One use case could be:
> {code}
> AttributeSource state = cloneAttributes();
> //  do something ...
> state.getAttribute(TermAttribute.class).setTermBuffer(foobar);
> // ... more work
> state.copyTo(this);
> {code}
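
The clone / modify / copy-back pattern from the use case above can be illustrated without Lucene on the classpath. The following is a toy model only: AttributeBagSketch and its name-keyed StringBuilder attributes are assumptions made for the sketch and say nothing about how AttributeSource is implemented.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical toy model of the pattern; not Lucene's AttributeSource.
final class AttributeBagSketch {
    private final Map<String, StringBuilder> attributes = new LinkedHashMap<>();

    void addAttribute(String name, String value) {
        attributes.put(name, new StringBuilder(value));
    }

    StringBuilder getAttribute(String name) {
        StringBuilder att = attributes.get(name);
        if (att == null) {
            throw new IllegalArgumentException("no such attribute: " + name);
        }
        return att;
    }

    /** Deep copy, so changes to the clone do not touch the original. */
    AttributeBagSketch cloneAttributes() {
        AttributeBagSketch copy = new AttributeBagSketch();
        for (Map.Entry<String, StringBuilder> e : attributes.entrySet()) {
            copy.addAttribute(e.getKey(), e.getValue().toString());
        }
        return copy;
    }

    /** Copies every attribute value into the target, which must already have it. */
    void copyTo(AttributeBagSketch target) {
        for (Map.Entry<String, StringBuilder> e : attributes.entrySet()) {
            StringBuilder dest = target.attributes.get(e.getKey());
            if (dest == null) {
                throw new IllegalArgumentException("target lacks attribute: " + e.getKey());
            }
            dest.setLength(0);
            dest.append(e.getValue());
        }
    }
}
```

The clone can be edited and stored freely, and only copyTo writes the values back, failing with an IllegalArgumentException if the target does not provide one of the attributes.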




[jira] Resolved: (LUCENE-1990) Add unsigned packed int impls in oal.util

2010-03-13 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless resolved LUCENE-1990.


Resolution: Fixed

Thanks Toke!

> Add unsigned packed int impls in oal.util
> -
>
> Key: LUCENE-1990
> URL: https://issues.apache.org/jira/browse/LUCENE-1990
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Affects Versions: Flex Branch
>Reporter: Michael McCandless
>Priority: Minor
> Fix For: Flex Branch
>
> Attachments: generated_performance-te20100226.txt, 
> LUCENE-1990-te20100122.patch, LUCENE-1990-te20100210.patch, 
> LUCENE-1990-te20100212.patch, LUCENE-1990-te20100223.patch, 
> LUCENE-1990-te20100226.patch, LUCENE-1990-te20100226b.patch, 
> LUCENE-1990-te20100226c.patch, LUCENE-1990-te20100301.patch, 
> LUCENE-1990.patch, LUCENE-1990.patch, 
> LUCENE-1990_PerformanceMeasurements20100104.zip, perf-mkm-20100227.txt, 
> performance-20100301.txt, performance-te20100226.txt
>
>
> There are various places in Lucene that could take advantage of an
> efficient packed unsigned int/long impl.  EG the terms dict index in
> the standard codec in LUCENE-1458 could substantially reduce its RAM
> usage.  FieldCache.StringIndex could as well.  And I think "load into
> RAM" codecs like the one in TestExternalCodecs could use this too.
> I'm picturing something very basic like:
> {code}
> interface PackedUnsignedLongs  {
>   long get(long index);
>   void set(long index, long value);
> }
> {code}
> Plus maybe an iterator for getting and maybe also for setting.  If it
> helps, most of the usages of this inside Lucene will be "write once"
> so eg the set could make that an assumption/requirement.
> And a factory somewhere:
> {code}
>   PackedUnsignedLongs create(int count, long maxValue);
> {code}
> I think we should simply autogen the code (we can start from the
> autogen code in LUCENE-1410), or, if there is a good existing impl
> that has a compatible license that'd be great.
> I don't have time near-term to do this... so if anyone has the itch,
> please jump!
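
As a rough idea of what such a structure does, here is a minimal bit-packing sketch. It is not the committed oal.util code: PackedLongsSketch is a made-up name, it uses an int index rather than the long index in the interface above, and it is a plain hand-written loop-free version rather than autogenerated per-bit-width classes.

```java
// Hypothetical sketch: not the oal.util implementation from the patch.
final class PackedLongsSketch {
    private final long[] blocks;
    private final int count;
    private final int bitsPerValue;
    private final long mask;

    /** Mirrors the proposed factory signature: create(count, maxValue). */
    PackedLongsSketch(int count, long maxValue) {
        if (count < 0 || maxValue < 0) {
            throw new IllegalArgumentException("count and maxValue must be >= 0");
        }
        this.count = count;
        // Bits needed for maxValue: at least 1, at most 63 because maxValue >= 0.
        this.bitsPerValue = Math.max(1, 64 - Long.numberOfLeadingZeros(maxValue));
        this.mask = (1L << bitsPerValue) - 1;
        long totalBits = (long) count * bitsPerValue;
        this.blocks = new long[(int) ((totalBits + 63) >>> 6)];
    }

    int getBitsPerValue() { return bitsPerValue; }

    long get(int index) {
        checkIndex(index);
        long bitPos = (long) index * bitsPerValue;
        int block = (int) (bitPos >>> 6);
        int shift = (int) (bitPos & 63);
        long value = blocks[block] >>> shift;
        // Number of bits that spill over into the next 64-bit block.
        int spill = shift + bitsPerValue - 64;
        if (spill > 0) {
            value |= blocks[block + 1] << (bitsPerValue - spill);
        }
        return value & mask;
    }

    void set(int index, long value) {
        checkIndex(index);
        if (value < 0 || value > mask) {
            throw new IllegalArgumentException("value out of range: " + value);
        }
        long bitPos = (long) index * bitsPerValue;
        int block = (int) (bitPos >>> 6);
        int shift = (int) (bitPos & 63);
        // Clear the old bits in this block, then write the low part of the value.
        blocks[block] = (blocks[block] & ~(mask << shift)) | (value << shift);
        int spill = shift + bitsPerValue - 64;
        if (spill > 0) {
            int lowBits = bitsPerValue - spill;
            blocks[block + 1] = (blocks[block + 1] & ~(mask >>> lowBits)) | (value >>> lowBits);
        }
    }

    private void checkIndex(int index) {
        if (index < 0 || index >= count) {
            throw new IndexOutOfBoundsException("index: " + index);
        }
    }
}
```

With maxValue = 1000 each value takes 10 bits instead of 64, at the cost of a few shifts and a possible second block read per access.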




[jira] Commented: (LUCENE-2314) Add AttributeSource.copyTo(AttributeSource)

2010-03-13 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844916#action_12844916
 ] 

Simon Willnauer commented on LUCENE-2314:
-

Small comment on javadoc wording. 

Maybe like that:
{code}
/**
 * Copies the contents of this AttributeSource to the given AttributeSource.
 * The given instance has to provide all {@link Attribute}s this instance contains.
 * The actual attribute implementations must be identical in both {@link AttributeSource} instances.
 * Ideally both AttributeSource instances should use the same {@link AttributeFactory}.
 */
{code}




> Add AttributeSource.copyTo(AttributeSource)
> ---
>
> Key: LUCENE-2314
> URL: https://issues.apache.org/jira/browse/LUCENE-2314
> Project: Lucene - Java
>  Issue Type: Improvement
>Reporter: Uwe Schindler
>Assignee: Uwe Schindler
>Priority: Minor
> Fix For: 3.1
>
> Attachments: LUCENE-2314.patch, LUCENE-2314.patch
>
>




[jira] Created: (LUCENE-2318) Add System.getProperty("tempDir") as final static to LuceneTestCase(J4)

2010-03-13 Thread Uwe Schindler (JIRA)
Add System.getProperty("tempDir") as final static to LuceneTestCase(J4)
---

 Key: LUCENE-2318
 URL: https://issues.apache.org/jira/browse/LUCENE-2318
 Project: Lucene - Java
  Issue Type: Test
Reporter: Uwe Schindler
Assignee: Uwe Schindler
 Fix For: 3.1


Almost every test calls System.getProperty("tempDir") and some of them check 
the return value for null. In other cases the test simply fails when run from 
within Eclipse.

We should add this to LuceneTestCase(J4) as a static final constant. To enable 
running tests in Eclipse, we can add a fallback to "." if the system property 
is not defined.
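
A minimal sketch of the fallback, assuming nothing beyond the JDK: the two-argument System.getProperty returns the given default when the property is not set. The class and member names (TempDirSketch, TEMP_DIR, resolveTempDir) are placeholders, not what will end up in LuceneTestCase.

```java
import java.io.File;

// Hypothetical holder class; the real constant would live in LuceneTestCase(J4).
final class TempDirSketch {
    /** Resolved once at class load, like the proposed static final constant. */
    static final File TEMP_DIR = resolveTempDir();

    /** Reads the tempDir system property, falling back to the working directory. */
    static File resolveTempDir() {
        return new File(System.getProperty("tempDir", "."));
    }
}
```

Tests would then use the constant instead of calling System.getProperty("tempDir") themselves and checking for null.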




[jira] Updated: (LUCENE-2310) Reduce Fieldable, AbstractField and Field complexity

2010-03-13 Thread Chris Male (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Male updated LUCENE-2310:
---

Attachment: LUCENE-2310-Deprecate-AbstractField.patch

Attaching first version of the patch which deprecates AbstractField.

- Moves the properties and getters/setters down into Field.
- Field now only implements Fieldable
- Field now allows its value to be set to null through its construction.  This 
allows subclasses to set the fieldData to their own 
- NumericField now extends Field, overriding the setValue methods as they are 
not supported
- LazyField also now extends Field
- AbstractField is now no longer used anywhere.

> Reduce Fieldable, AbstractField and Field complexity
> 
>
> Key: LUCENE-2310
> URL: https://issues.apache.org/jira/browse/LUCENE-2310
> Project: Lucene - Java
>  Issue Type: Sub-task
>  Components: Index
>Reporter: Chris Male
> Attachments: LUCENE-2310-Deprecate-AbstractField.patch
>
>
> In order to move field-type-like functionality into its own class, we really 
> need to try to tackle the hierarchy of Fieldable, AbstractField and Field.  
> Currently AbstractField depends on Field, and does not provide much more 
> functionality than storing fields, most of which are being moved over to 
> FieldType.  Therefore it seems ideal to try to deprecate AbstractField (and 
> possibly Fieldable), moving much of the functionality into Field and 
> FieldType.




[jira] Commented: (LUCENE-2310) Reduce Fieldable, AbstractField and Field complexity

2010-03-13 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844923#action_12844923
 ] 

Uwe Schindler commented on LUCENE-2310:
---

You should also not be able to set the TokenStream in NF.

IMO, I would keep AbstractField and only remove Fieldable, as interfaces are 
not wanted in Lucene.

-1 for this patch in its current form.

> Reduce Fieldable, AbstractField and Field complexity
> 
>
> Key: LUCENE-2310
> URL: https://issues.apache.org/jira/browse/LUCENE-2310
> Project: Lucene - Java
>  Issue Type: Sub-task
>  Components: Index
>Reporter: Chris Male
> Attachments: LUCENE-2310-Deprecate-AbstractField.patch
>
>




[jira] Updated: (LUCENE-2314) Add AttributeSource.copyTo(AttributeSource)

2010-03-13 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler updated LUCENE-2314:
--

Attachment: LUCENE-2314.patch

Updated javadocs. Will commit tomorrow.

> Add AttributeSource.copyTo(AttributeSource)
> ---
>
> Key: LUCENE-2314
> URL: https://issues.apache.org/jira/browse/LUCENE-2314
> Project: Lucene - Java
>  Issue Type: Improvement
>Reporter: Uwe Schindler
>Assignee: Uwe Schindler
>Priority: Minor
> Fix For: 3.1
>
> Attachments: LUCENE-2314.patch, LUCENE-2314.patch, LUCENE-2314.patch
>
>




[jira] Commented: (LUCENE-2293) IndexWriter has hard limit on max concurrency

2010-03-13 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844926#action_12844926
 ] 

Jason Rutherglen commented on LUCENE-2293:
--

bq. Probably one DW instance per thread? Seems like that'd work? 

Ok

bq. And possibly remove *PerThread throughout the default indexing chain?

I like removing this, as there are many loops per thread right now and it's 
not easy to glance at the code and know what's going on.  

> IndexWriter has hard limit on max concurrency
> -
>
> Key: LUCENE-2293
> URL: https://issues.apache.org/jira/browse/LUCENE-2293
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Index
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Fix For: 3.1
>
> Attachments: LUCENE-2293.patch
>
>
> DocumentsWriter has this nasty hardwired constant:
> {code}
> private final static int MAX_THREAD_STATE = 5;
> {code}
> which probably I should have attached a //nocommit to the moment I
> wrote it ;)
> That constant sets the max number of thread states to 5.  This means,
> if more than 5 threads enter IndexWriter at once, they will "share"
> only 5 thread states, meaning we gate CPU concurrency to 5 running
> threads inside IW (each thread must first wait for the last thread to
> finish using the thread state before grabbing it).
> This is bad because modern hardware can make use of more than 5
> threads.  So I think an immediate fix is to make this settable
> (expert), and increase the default (8?).
> It's tricky, though, because the more thread states, the less RAM
> efficiency you have, meaning the worse indexing throughput.  So you
> shouldn't up and set this to 50: you'll be flushing too often.
> But... I think a better fix is to re-think how threads write state
> into DocumentsWriter.  Today, a single docID stream is assigned across
> threads (eg one thread gets docID=0, next one docID=1, etc.), and each
> thread writes to a private RAM buffer (living in the thread state),
> and then on flush we do a merge sort.  The merge sort is inefficient
> (does not currently use a PQ)... and, wasteful because we must
> re-decode every posting byte.
> I think we could change this, so that threads write to private RAM
> buffers, with a private docID stream, but then instead of merging on
> flush, we directly flush each thread as its own segment (and, allocate
> private docIDs to each thread).  We can then leave merging to CMS
> which can already run merges in the BG without blocking ongoing
> indexing (unlike the merge we do in flush, today).
> This would also allow us to separately flush thread states.  Ie, we
> need not flush all thread states at once -- we can flush one when it
> gets too big, and then let the others keep running.  This should be a
> good concurrency gain since is uses IO & CPU resources "throughout"
> indexing instead of "big burst of CPU only" then "big burst of IO
> only" that we have today (flush today "stops the world").
> One downside I can think of is... docIDs would now be "less
> monotonic".  Today, if N threads are indexing, you'll roughly get
> in-time-order assignment of docIDs.  But with this change, all of one
> thread state would get 0..N docIDs, the next thread state'd get
> N+1...M docIDs, etc.  However, a single thread would still get
> monotonic assignment of docIDs.
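
The docID trade-off described above can be illustrated with a small toy model. This is only a sketch under stated assumptions: the class and method names are made up for illustration and are not part of Lucene, and the segments are assumed to be flushed in the order in which each thread was first seen.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model only: none of these names exist in Lucene.
class DocIdAssignmentSketch {

    // Today: one shared docID stream, so IDs follow arrival order across threads.
    // Each arrival is {threadName, documentName}.
    static Map<String, Integer> sharedStream(List<String[]> arrivals) {
        Map<String, Integer> ids = new LinkedHashMap<>();
        int next = 0;
        for (String[] arrival : arrivals) {
            ids.put(arrival[1], next++);
        }
        return ids;
    }

    // Proposed: each thread state numbers its own docs from 0 and is flushed as
    // its own segment; the index-wide ID is then segment base + private ID.
    // Assumption: segments are flushed in first-seen thread order.
    static Map<String, Integer> perThreadSegments(List<String[]> arrivals) {
        Map<String, List<String>> buffers = new LinkedHashMap<>();
        for (String[] arrival : arrivals) {
            buffers.computeIfAbsent(arrival[0], k -> new ArrayList<>()).add(arrival[1]);
        }
        Map<String, Integer> ids = new LinkedHashMap<>();
        int base = 0;
        for (List<String> segment : buffers.values()) {
            for (int i = 0; i < segment.size(); i++) {
                ids.put(segment.get(i), base + i);
            }
            base += segment.size();
        }
        return ids;
    }

    public static void main(String[] args) {
        List<String[]> arrivals = Arrays.asList(
            new String[] {"T1", "a"}, new String[] {"T2", "b"},
            new String[] {"T1", "c"}, new String[] {"T2", "d"});
        System.out.println(sharedStream(arrivals));      // {a=0, b=1, c=2, d=3}
        System.out.println(perThreadSegments(arrivals)); // {a=0, c=1, b=2, d=3}
    }
}
```

Within each thread the IDs stay monotonic (a before c, b before d), but the index-wide order no longer follows arrival time.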




[jira] Commented: (LUCENE-2314) Add AttributeSource.copyTo(AttributeSource)

2010-03-13 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844927#action_12844927
 ] 

Simon Willnauer commented on LUCENE-2314:
-

looks good to me!

> Add AttributeSource.copyTo(AttributeSource)
> ---
>
> Key: LUCENE-2314
> URL: https://issues.apache.org/jira/browse/LUCENE-2314
> Project: Lucene - Java
>  Issue Type: Improvement
>Reporter: Uwe Schindler
>Assignee: Uwe Schindler
>Priority: Minor
> Fix For: 3.1
>
> Attachments: LUCENE-2314.patch, LUCENE-2314.patch, LUCENE-2314.patch
>
>
> One problem with AttributeSource at the moment is the missing "insight" into 
> AttributeSource.State. If you want to create TokenStreams that inspect 
> captured states, you have no chance. Making the contents of State public is a 
> bad idea, as it does not help for inspecting (it's a linked list, so you have 
> to iterate).
> AttributeSource currently contains a cloneAttributes() call, which returns a 
> new AttributeSource with all current attributes cloned. This is the (more 
> expensive) captureState. The problem is that you cannot copy back the cloned 
> AS (which is the restoreState). To use this behaviour (by the way, 
> ShingleMatrix can use it), one can alternatively use cloneAttributes and 
> copyTo. You can easily change the cloned attributes and store them in lists 
> and copy them back. The only problem is lower performance of these calls (as 
> State is a very optimized class).
> One use case could be:
> {code}
> AttributeSource state = cloneAttributes();
> //  do something ...
> state.getAttribute(TermAttribute.class).setTermBuffer(foobar);
> // ... more work
> state.copyTo(this);
> {code}
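
The clone, modify, copy-back flow from the use case above can be tried out without Lucene on the classpath. The following is a minimal sketch, assuming a toy stand-in class (ToyAttributeSource is a made-up name) that keeps attributes as plain strings; it mirrors only the flow, not AttributeSource's real implementation or its performance characteristics.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy stand-in for AttributeSource: attributes are plain name -> value strings.
// This is NOT Lucene's implementation, only the clone / modify / copy-back flow.
class ToyAttributeSource {
    private final Map<String, String> attributes = new LinkedHashMap<>();

    void set(String name, String value) {
        attributes.put(name, value);
    }

    String get(String name) {
        return attributes.get(name);
    }

    // Analogue of cloneAttributes(): an independent copy that can be inspected
    // and changed without touching the original.
    ToyAttributeSource cloneAttributes() {
        ToyAttributeSource copy = new ToyAttributeSource();
        copy.attributes.putAll(attributes);
        return copy;
    }

    // Analogue of the proposed copyTo(target): write our values into the target.
    void copyTo(ToyAttributeSource target) {
        target.attributes.putAll(attributes);
    }

    public static void main(String[] args) {
        ToyAttributeSource stream = new ToyAttributeSource();
        stream.set("term", "foo");

        ToyAttributeSource state = stream.cloneAttributes();
        state.set("term", "foobar");            // change the clone only
        System.out.println(stream.get("term")); // foo

        state.copyTo(stream);                   // copy the changed values back
        System.out.println(stream.get("term")); // foobar
    }
}
```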




[jira] Commented: (LUCENE-2310) Reduce Fieldable, AbstractField and Field complexity

2010-03-13 Thread Chris Male (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844929#action_12844929
 ] 

Chris Male commented on LUCENE-2310:


{quote}
You should also not be able to set the TokenStream in NF.
{quote}

Yes good point.

{quote}
IMO, i would keep AbstractField and only remove Fieldable, as interfaces are 
not wanted in Lucene
{quote}

Actually, I would like to remove both.  There doesn't seem to be much reason 
to keep AbstractField, especially since it's already dependent on Field.XYZ and 
seems only to store the various properties, most of which will be 
moved away to FieldType anyway.

Would a compromise be to also throw a UOE when setting the TokenStream in 
NumericField? It does still have the concept of a TokenStream, so it is a 
Field, but a specialisation which handles the TokenStream itself.

> Reduce Fieldable, AbstractField and Field complexity
> 
>
> Key: LUCENE-2310
> URL: https://issues.apache.org/jira/browse/LUCENE-2310
> Project: Lucene - Java
>  Issue Type: Sub-task
>  Components: Index
>Reporter: Chris Male
> Attachments: LUCENE-2310-Deprecate-AbstractField.patch
>
>
> In order to move field type like functionality into its own class, we really 
> need to try to tackle the hierarchy of Fieldable, AbstractField and Field.  
> Currently AbstractField depends on Field, and does not provide much more 
> functionality than storing fields, most of which are being moved over to 
> FieldType.  Therefore it seems ideal to try to deprecate AbstractField (and 
> possibly Fieldable), moving much of the functionality into Field and 
> FieldType.




[jira] Commented: (LUCENE-2310) Reduce Fieldable, AbstractField and Field complexity

2010-03-13 Thread Tim Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844930#action_12844930
 ] 

Tim Smith commented on LUCENE-2310:
---

Personally, I like keeping Fieldable (or having AbstractField with only 
abstract methods and no actual implementation).

For feeding documents, I use custom Fieldable implementations to reduce the 
number of setters called, as Fields of different types have different constant 
settings.

> Reduce Fieldable, AbstractField and Field complexity
> 
>
> Key: LUCENE-2310
> URL: https://issues.apache.org/jira/browse/LUCENE-2310
> Project: Lucene - Java
>  Issue Type: Sub-task
>  Components: Index
>Reporter: Chris Male
> Attachments: LUCENE-2310-Deprecate-AbstractField.patch
>
>
> In order to move field type like functionality into its own class, we really 
> need to try to tackle the hierarchy of Fieldable, AbstractField and Field.  
> Currently AbstractField depends on Field, and does not provide much more 
> functionality than storing fields, most of which are being moved over to 
> FieldType.  Therefore it seems ideal to try to deprecate AbstractField (and 
> possibly Fieldable), moving much of the functionality into Field and 
> FieldType.




[jira] Commented: (LUCENE-2310) Reduce Fieldable, AbstractField and Field complexity

2010-03-13 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844931#action_12844931
 ] 

Earwin Burrfoot commented on LUCENE-2310:
-

These settings will go to FieldType?

> Reduce Fieldable, AbstractField and Field complexity
> 
>
> Key: LUCENE-2310
> URL: https://issues.apache.org/jira/browse/LUCENE-2310
> Project: Lucene - Java
>  Issue Type: Sub-task
>  Components: Index
>Reporter: Chris Male
> Attachments: LUCENE-2310-Deprecate-AbstractField.patch
>
>
> In order to move field type like functionality into its own class, we really 
> need to try to tackle the hierarchy of Fieldable, AbstractField and Field.  
> Currently AbstractField depends on Field, and does not provide much more 
> functionality than storing fields, most of which are being moved over to 
> FieldType.  Therefore it seems ideal to try to deprecate AbstractField (and 
> possibly Fieldable), moving much of the functionality into Field and 
> FieldType.




[jira] Commented: (LUCENE-2310) Reduce Fieldable, AbstractField and Field complexity

2010-03-13 Thread Chris Male (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844933#action_12844933
 ] 

Chris Male commented on LUCENE-2310:


I should note, to prevent confusion, that my patch is just the beginning of 
this work, designed to illustrate the direction I'm heading.

> Reduce Fieldable, AbstractField and Field complexity
> 
>
> Key: LUCENE-2310
> URL: https://issues.apache.org/jira/browse/LUCENE-2310
> Project: Lucene - Java
>  Issue Type: Sub-task
>  Components: Index
>Reporter: Chris Male
> Attachments: LUCENE-2310-Deprecate-AbstractField.patch
>
>
> In order to move field type like functionality into its own class, we really 
> need to try to tackle the hierarchy of Fieldable, AbstractField and Field.  
> Currently AbstractField depends on Field, and does not provide much more 
> functionality than storing fields, most of which are being moved over to 
> FieldType.  Therefore it seems ideal to try to deprecate AbstractField (and 
> possibly Fieldable), moving much of the functionality into Field and 
> FieldType.




[jira] Commented: (LUCENE-2310) Reduce Fieldable, AbstractField and Field complexity

2010-03-13 Thread Chris Male (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844932#action_12844932
 ] 

Chris Male commented on LUCENE-2310:


Hi Tim,

Yeah I see what you are saying, but as Earwin says, the 'settings' will be 
pushed into the FieldType, so they'll be removed from Fieldable as well.

> Reduce Fieldable, AbstractField and Field complexity
> 
>
> Key: LUCENE-2310
> URL: https://issues.apache.org/jira/browse/LUCENE-2310
> Project: Lucene - Java
>  Issue Type: Sub-task
>  Components: Index
>Reporter: Chris Male
> Attachments: LUCENE-2310-Deprecate-AbstractField.patch
>
>
> In order to move field type like functionality into its own class, we really 
> need to try to tackle the hierarchy of Fieldable, AbstractField and Field.  
> Currently AbstractField depends on Field, and does not provide much more 
> functionality than storing fields, most of which are being moved over to 
> FieldType.  Therefore it seems ideal to try to deprecate AbstractField (and 
> possibly Fieldable), moving much of the functionality into Field and 
> FieldType.




[jira] Updated: (LUCENE-2310) Reduce Fieldable, AbstractField and Field complexity

2010-03-13 Thread Chris Male (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Male updated LUCENE-2310:
---

Attachment: LUCENE-2310-Deprecate-AbstractField.patch

Addressed the issues raised by Uwe about the TokenStream in NumericField.  
NumericField now throws a UOE on setTokenStream.  Since it also extends Field, 
which has its own TokenStream field, NumericField now uses the field from 
Field rather than its own.

The more this is discussed, the clearer it is that Field should be the 
base class of the Field hierarchy, and not AbstractField or Fieldable.  The 
issue of having all the setters and configurations will be addressed in 
LUCENE-2308 when we move them all to FieldType.  Field will become a simple 
tuple consisting of at least a value and type, and possibly a TokenStream.

NumericField and LazyField are customisations of Field controlling certain 
aspects of the tuple.  For NumericField that is the TokenStream and setting the 
value.  For LazyField that is the value.
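
A rough sketch of the shape described above, where a field is a tuple of value, type and an optional token stream, and the numeric specialisation rejects setTokenStream with a UOE. All class names here (SketchFieldType, SketchField, SketchNumericField) are hypothetical and chosen for illustration; the classes in the attached patch may look quite different, and a plain Object stands in for a TokenStream.

```java
// Hypothetical shapes: SketchFieldType, SketchField and SketchNumericField are
// illustration names, not the classes in the attached patch.
final class SketchFieldType {
    final boolean stored;
    final boolean indexed;

    SketchFieldType(boolean stored, boolean indexed) {
        this.stored = stored;
        this.indexed = indexed;
    }
}

// A field reduced to a tuple: name, value, type, and an optional token stream.
class SketchField {
    private final String name;
    private final SketchFieldType type;
    private Object value;
    private Object tokenStream; // stand-in for a TokenStream; may stay null

    SketchField(String name, Object value, SketchFieldType type) {
        this.name = name;
        this.value = value;
        this.type = type;
    }

    String name() { return name; }
    SketchFieldType type() { return type; }
    Object value() { return value; }
    Object tokenStream() { return tokenStream; }

    protected void setValue(Object value) { this.value = value; }

    void setTokenStream(Object tokenStream) { this.tokenStream = tokenStream; }
}

// A specialisation that owns its token stream, so callers may not replace it.
class SketchNumericField extends SketchField {
    SketchNumericField(String name) {
        super(name, null, new SketchFieldType(false, true));
    }

    void setIntValue(int value) { setValue(Integer.valueOf(value)); }

    @Override
    void setTokenStream(Object tokenStream) {
        throw new UnsupportedOperationException(
            "this field manages its own token stream");
    }
}

class FieldTupleSketch {
    public static void main(String[] args) {
        SketchNumericField price = new SketchNumericField("price");
        price.setIntValue(42);
        System.out.println(price.name() + "=" + price.value()); // price=42
        try {
            price.setTokenStream("custom stream");
        } catch (UnsupportedOperationException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```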

> Reduce Fieldable, AbstractField and Field complexity
> 
>
> Key: LUCENE-2310
> URL: https://issues.apache.org/jira/browse/LUCENE-2310
> Project: Lucene - Java
>  Issue Type: Sub-task
>  Components: Index
>Reporter: Chris Male
> Attachments: LUCENE-2310-Deprecate-AbstractField.patch, 
> LUCENE-2310-Deprecate-AbstractField.patch
>
>
> In order to move field type like functionality into its own class, we really 
> need to try to tackle the hierarchy of Fieldable, AbstractField and Field.  
> Currently AbstractField depends on Field, and does not provide much more 
> functionality than storing fields, most of which are being moved over to 
> FieldType.  Therefore it seems ideal to try to deprecate AbstractField (and 
> possibly Fieldable), moving much of the functionality into Field and 
> FieldType.




[jira] Commented: (LUCENE-2310) Reduce Fieldable, AbstractField and Field complexity

2010-03-13 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844945#action_12844945
 ] 

Uwe Schindler commented on LUCENE-2310:
---

There is one backwards-compatibility problem.
If somebody has the following code:
{code}
AbstractField field = new Field(...);
{code}
This will no longer work.
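
To make the concern concrete, here is a toy hierarchy (made-up class names, not Lucene source) showing the kind of call site at stake. The assignment only compiles while the concrete class still extends the old abstract parent, which is why keeping that parent in place, as a later patch in this thread does, preserves existing user code.

```java
// Made-up names mirroring the hierarchy under discussion; not Lucene source.
// LegacyAbstractField plays the role of the class that would be deprecated.
abstract class LegacyAbstractField {
    abstract String name();
}

// Keeping the old parent in the "extends" clause preserves old call sites.
class CurrentField extends LegacyAbstractField {
    private final String name;

    CurrentField(String name) {
        this.name = name;
    }

    @Override
    String name() {
        return name;
    }
}

class BackCompatSketch {
    // Existing user code may accept the old abstract type as a parameter.
    static String describe(LegacyAbstractField field) {
        return "field:" + field.name();
    }

    public static void main(String[] args) {
        // This is the pattern from the comment above. It would stop compiling
        // if CurrentField no longer extended LegacyAbstractField.
        LegacyAbstractField field = new CurrentField("title");
        System.out.println(describe(field)); // field:title
    }
}
```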

> Reduce Fieldable, AbstractField and Field complexity
> 
>
> Key: LUCENE-2310
> URL: https://issues.apache.org/jira/browse/LUCENE-2310
> Project: Lucene - Java
>  Issue Type: Sub-task
>  Components: Index
>Reporter: Chris Male
> Attachments: LUCENE-2310-Deprecate-AbstractField.patch, 
> LUCENE-2310-Deprecate-AbstractField.patch
>
>
> In order to move field type like functionality into its own class, we really 
> need to try to tackle the hierarchy of Fieldable, AbstractField and Field.  
> Currently AbstractField depends on Field, and does not provide much more 
> functionality than storing fields, most of which are being moved over to 
> FieldType.  Therefore it seems ideal to try to deprecate AbstractField (and 
> possibly Fieldable), moving much of the functionality into Field and 
> FieldType.




[jira] Updated: (LUCENE-2310) Reduce Fieldable, AbstractField and Field complexity

2010-03-13 Thread Chris Male (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Male updated LUCENE-2310:
---

Attachment: LUCENE-2310-Deprecate-AbstractField.patch

Addressed Uwe's issue again.

The only solution is to change Field to extend AbstractField again, even though 
AbstractField is dead code.

Also fixed:

- Added final to setter methods that are also final in AbstractField, for 
consistency's sake
- Fixed import for javadocs in CheckIndex and FieldsReader

> Reduce Fieldable, AbstractField and Field complexity
> 
>
> Key: LUCENE-2310
> URL: https://issues.apache.org/jira/browse/LUCENE-2310
> Project: Lucene - Java
>  Issue Type: Sub-task
>  Components: Index
>Reporter: Chris Male
> Attachments: LUCENE-2310-Deprecate-AbstractField.patch, 
> LUCENE-2310-Deprecate-AbstractField.patch, 
> LUCENE-2310-Deprecate-AbstractField.patch
>
>
> In order to move field type like functionality into its own class, we really 
> need to try to tackle the hierarchy of Fieldable, AbstractField and Field.  
> Currently AbstractField depends on Field, and does not provide much more 
> functionality than storing fields, most of which are being moved over to 
> FieldType.  Therefore it seems ideal to try to deprecate AbstractField (and 
> possibly Fieldable), moving much of the functionality into Field and 
> FieldType.




[jira] Commented: (LUCENE-2312) Search on IndexWriter's RAM Buffer

2010-03-13 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845030#action_12845030
 ] 

Jason Rutherglen commented on LUCENE-2312:
--

Mike, why does DocFieldConsumers have DocFieldConsumer one and two?  How is 
this class used?  Thanks.

> Search on IndexWriter's RAM Buffer
> --
>
> Key: LUCENE-2312
> URL: https://issues.apache.org/jira/browse/LUCENE-2312
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Search
>Affects Versions: 3.0.1
>Reporter: Jason Rutherglen
> Fix For: 3.0.2
>
>
> In order to offer users near-realtime search without incurring
> an indexing performance penalty, we can implement search on
> IndexWriter's RAM buffer. This is the buffer that is filled in
> RAM as documents are indexed. Currently the RAM buffer is
> flushed to the underlying directory (usually disk) before being
> made searchable. 
> Today's Lucene-based NRT systems must incur the cost of merging
> segments, which can slow indexing. 
> Michael Busch has good suggestions regarding how to handle deletes using max 
> doc ids.  
> https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841923&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841923
> The area that isn't fully fleshed out is the terms dictionary,
> which needs to be sorted prior to queries executing. Currently
> IW implements a specialized hash table. Michael B has a
> suggestion here: 
> https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841915&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841915




[jira] Updated: (LUCENE-2319) IndexReader # doCommit - typo nit about v3.0 in trunk

2010-03-13 Thread Kay Kay (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kay Kay updated LUCENE-2319:


Attachment: LUCENE-2319.patch

> IndexReader # doCommit - typo nit about v3.0 in trunk
> -
>
> Key: LUCENE-2319
> URL: https://issues.apache.org/jira/browse/LUCENE-2319
> Project: Lucene - Java
>  Issue Type: Improvement
>Reporter: Kay Kay
> Attachments: LUCENE-2319.patch
>
>
> Trunk is already at 3.0.1+, but the documentation says "In 3.0, this will 
> become ...".  Since trunk is already past 3.0, the note might as well be removed.




[jira] Created: (LUCENE-2319) IndexReader # doCommit - typo nit about v3.0 in trunk

2010-03-13 Thread Kay Kay (JIRA)
IndexReader # doCommit - typo nit about v3.0 in trunk
-

 Key: LUCENE-2319
 URL: https://issues.apache.org/jira/browse/LUCENE-2319
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Kay Kay
 Attachments: LUCENE-2319.patch

Trunk is already at 3.0.1+, but the documentation says "In 3.0, this will 
become ...".  Since trunk is already past 3.0, the note might as well be removed.





[jira] Assigned: (LUCENE-2312) Search on IndexWriter's RAM Buffer

2010-03-13 Thread Michael Busch (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Busch reassigned LUCENE-2312:
-

Assignee: Michael Busch

> Search on IndexWriter's RAM Buffer
> --
>
> Key: LUCENE-2312
> URL: https://issues.apache.org/jira/browse/LUCENE-2312
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Search
>Affects Versions: 3.0.1
>Reporter: Jason Rutherglen
>Assignee: Michael Busch
> Fix For: 3.0.2
>
>
> In order to offer users near-realtime search without incurring
> an indexing performance penalty, we can implement search on
> IndexWriter's RAM buffer. This is the buffer that is filled in
> RAM as documents are indexed. Currently the RAM buffer is
> flushed to the underlying directory (usually disk) before being
> made searchable. 
> Today's Lucene-based NRT systems must incur the cost of merging
> segments, which can slow indexing. 
> Michael Busch has good suggestions regarding how to handle deletes using max 
> doc ids.  
> https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841923&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841923
> The area that isn't fully fleshed out is the terms dictionary,
> which needs to be sorted prior to queries executing. Currently
> IW implements a specialized hash table. Michael B has a
> suggestion here: 
> https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841915&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841915




[jira] Commented: (LUCENE-2312) Search on IndexWriter's RAM Buffer

2010-03-13 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845031#action_12845031
 ] 

Michael Busch commented on LUCENE-2312:
---

{quote}
Also, we could store the first docID stored into the term, too - this
way we could have a ordered collection of terms, that's shared across
several open readers even as changes are still being made, but each
reader skips a given term if its first docID is greater than the
maxDoc it's searching. That'd give us point in time searching even
while we add terms with time...
{quote}

Exactly. This is what I meant in my comment: 
https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841915&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841915

But I mistakenly said lastDocID; of course firstDocID is correct.
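
The firstDocID idea can be sketched with a shared sorted map. This is a toy under stated assumptions: SharedTermsSketch and its methods are made-up names, ConcurrentSkipListMap is just one convenient sorted concurrent structure rather than what IndexWriter would use, and maxDoc is treated as an exclusive upper bound (one greater than the largest docID the reader may see).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentSkipListMap;

// Toy model: one sorted term collection shared by all readers. Names are made up.
class SharedTermsSketch {
    // term -> docID of the first document that contained it, sorted by term.
    private final ConcurrentSkipListMap<String, Integer> firstDocIdByTerm =
        new ConcurrentSkipListMap<>();

    // Called by the indexing side; only the first docID for a term is kept.
    void addTerm(String term, int docId) {
        firstDocIdByTerm.putIfAbsent(term, docId);
    }

    // A reader fixed at maxDoc skips terms that first appeared at or after
    // maxDoc (assumption: maxDoc is an exclusive bound on visible docIDs).
    List<String> termsVisibleTo(int maxDoc) {
        List<String> visible = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : firstDocIdByTerm.entrySet()) {
            if (entry.getValue() < maxDoc) {
                visible.add(entry.getKey());
            }
        }
        return visible;
    }

    public static void main(String[] args) {
        SharedTermsSketch terms = new SharedTermsSketch();
        terms.addTerm("lucene", 0);
        terms.addTerm("index", 1);
        System.out.println(terms.termsVisibleTo(2)); // [index, lucene]

        // Indexing continues while the reader with maxDoc=2 is still open.
        terms.addTerm("apple", 2);
        terms.addTerm("lucene", 2); // already known, first docID stays 0
        System.out.println(terms.termsVisibleTo(2)); // [index, lucene]
        System.out.println(terms.termsVisibleTo(3)); // [apple, index, lucene]
    }
}
```

The older reader keeps its point-in-time view of the terms even though the shared collection keeps growing.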

> Search on IndexWriter's RAM Buffer
> --
>
> Key: LUCENE-2312
> URL: https://issues.apache.org/jira/browse/LUCENE-2312
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Search
>Affects Versions: 3.0.1
>Reporter: Jason Rutherglen
> Fix For: 3.0.2
>
>
> In order to offer users near-realtime search without incurring
> an indexing performance penalty, we can implement search on
> IndexWriter's RAM buffer. This is the buffer that is filled in
> RAM as documents are indexed. Currently the RAM buffer is
> flushed to the underlying directory (usually disk) before being
> made searchable. 
> Today's Lucene-based NRT systems must incur the cost of merging
> segments, which can slow indexing. 
> Michael Busch has good suggestions regarding how to handle deletes using max 
> doc ids.  
> https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841923&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841923
> The area that isn't fully fleshed out is the terms dictionary,
> which needs to be sorted prior to queries executing. Currently
> IW implements a specialized hash table. Michael B has a
> suggestion here: 
> https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841915&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841915




[jira] Commented: (LUCENE-2312) Search on IndexWriter's RAM Buffer

2010-03-13 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845032#action_12845032
 ] 

Michael Busch commented on LUCENE-2312:
---

I'll try to tackle this one!

> Search on IndexWriter's RAM Buffer
> --
>
> Key: LUCENE-2312
> URL: https://issues.apache.org/jira/browse/LUCENE-2312
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Search
>Affects Versions: 3.0.1
>Reporter: Jason Rutherglen
>Assignee: Michael Busch
> Fix For: 3.0.2
>
>
> In order to offer users near-realtime search without incurring
> an indexing performance penalty, we can implement search on
> IndexWriter's RAM buffer. This is the buffer that is filled in
> RAM as documents are indexed. Currently the RAM buffer is
> flushed to the underlying directory (usually disk) before being
> made searchable. 
> Today's Lucene-based NRT systems must incur the cost of merging
> segments, which can slow indexing. 
> Michael Busch has good suggestions regarding how to handle deletes using max 
> doc ids.  
> https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841923&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841923
> The area that isn't fully fleshed out is the terms dictionary,
> which needs to be sorted prior to queries executing. Currently
> IW implements a specialized hash table. Michael B has a
> suggestion here: 
> https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841915&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841915




[jira] Commented: (LUCENE-2312) Search on IndexWriter's RAM Buffer

2010-03-13 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845036#action_12845036
 ] 

Jason Rutherglen commented on LUCENE-2312:
--

A few notes so far:

* IW flush could become thread dependent (eg, it'll only flush
for the current doc writer) or maybe it should flush all doc
writers? Close will shut down and flush all doc writers.

* A new term will first check the hash table for existence (as
currently); only if it's not in the term hash table will it
be added to the btree (btw, binary search is O(log N) on
average?). This way we're avoiding the somewhat costlier btree
existence check per token.

* The algorithm for flushing doc writers based on RAM
consumption can simply be, on exceed, flush the doc writer
consuming the most RAM? 

* I gutted the PerThread classes, then realized, it's all too
intertwined. I'd rather get *something* working, than spend an
excessive amount of time rearranging code that already works. 
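
Two of the notes above, the hash-first term lookup and the "flush the biggest doc writer" policy, can be sketched as follows. Everything here is an assumption made for illustration: the class and method names are invented, java.util.TreeSet (a red-black tree) stands in for the btree, and RAM usage is passed in as plain numbers.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeSet;

// Toy sketch; names are invented and TreeSet only stands in for the btree.
class TermDictSketch {
    private final Map<String, Integer> termIds = new HashMap<>();
    private final TreeSet<String> sortedTerms = new TreeSet<>();
    int sortedInserts = 0; // counts how often the sorted structure was touched

    // Check the hash table first; only unseen terms reach the sorted structure,
    // so repeated tokens cost one hash lookup and nothing else.
    int addToken(String term) {
        Integer existing = termIds.get(term);
        if (existing != null) {
            return existing;
        }
        int id = termIds.size();
        termIds.put(term, id);
        sortedTerms.add(term);
        sortedInserts++;
        return id;
    }

    String firstTerm() {
        return sortedTerms.isEmpty() ? null : sortedTerms.first();
    }

    // Flush policy from the notes: once total RAM exceeds the budget, pick the
    // doc writer consuming the most RAM. Returns null while under budget.
    static String pickWriterToFlush(Map<String, Long> ramBytesByWriter, long budgetBytes) {
        long total = 0;
        String biggest = null;
        long biggestBytes = -1;
        for (Map.Entry<String, Long> entry : ramBytesByWriter.entrySet()) {
            total += entry.getValue();
            if (entry.getValue() > biggestBytes) {
                biggestBytes = entry.getValue();
                biggest = entry.getKey();
            }
        }
        return total > budgetBytes ? biggest : null;
    }

    public static void main(String[] args) {
        TermDictSketch dict = new TermDictSketch();
        dict.addToken("b");
        dict.addToken("a");
        dict.addToken("b"); // repeat: hash hit, sorted structure untouched
        System.out.println(dict.sortedInserts + " sorted inserts, first term " + dict.firstTerm());

        Map<String, Long> ram = new HashMap<>();
        ram.put("dw0", 10L);
        ram.put("dw1", 30L);
        ram.put("dw2", 20L);
        System.out.println(pickWriterToFlush(ram, 50L)); // dw1
    }
}
```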

> Search on IndexWriter's RAM Buffer
> --
>
> Key: LUCENE-2312
> URL: https://issues.apache.org/jira/browse/LUCENE-2312
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Search
>Affects Versions: 3.0.1
>Reporter: Jason Rutherglen
>Assignee: Michael Busch
> Fix For: 3.0.2
>
>
> In order to offer users near-realtime search without incurring
> an indexing performance penalty, we can implement search on
> IndexWriter's RAM buffer. This is the buffer that is filled in
> RAM as documents are indexed. Currently the RAM buffer is
> flushed to the underlying directory (usually disk) before being
> made searchable. 
> Today's Lucene-based NRT systems must incur the cost of merging
> segments, which can slow indexing. 
> Michael Busch has good suggestions regarding how to handle deletes using max 
> doc ids.  
> https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841923&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841923
> The area that isn't fully fleshed out is the terms dictionary,
> which needs to be sorted prior to queries executing. Currently
> IW implements a specialized hash table. Michael B has a
> suggestion here: 
> https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841915&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841915




[jira] Commented: (LUCENE-2318) Add System.getProperty("tempDir") as final static to LuceneTestCase(J4)

2010-03-13 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845038#action_12845038
 ] 

Shai Erera commented on LUCENE-2318:


Uwe, can you default to "java.io.tmpdir" instead? "." is not properly defined. 
It will create indexes in the current directory where the tests run from, which 
is different if I run "ant test" from , /contrib and 
/benchmark ...

Or, we can tweak common-build.xml to fallback to /test. In fact, looking 
in common-build.xml, I already see tempDir defaults to {build.dir}/test. Look 
at lines 448 (where it is set), 417 where it is used and 418 where 
java.io.tmpdir is set to that value.

Maybe we need to change the definition of build.dir from location="build" to 
location="{common.dir}/build" so that it always references /build.

And if run from eclipse, default TEMP_DIR constant to "java.io.tmpdir"?
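
The fallback suggested here could look roughly like the sketch below. TempDirSketch and resolveTempDir are made-up names, and this is only one way to read the proposal: use the "tempDir" system property when the build supplies it, otherwise fall back to the JVM's "java.io.tmpdir".

```java
import java.io.File;

// Sketch only: the class and method names are illustrative, not from the patch.
class TempDirSketch {

    // Prefer the build-supplied "tempDir" property; if it is not defined
    // (for example when a test is launched from an IDE), fall back to the
    // JVM's standard temp directory instead of ".".
    static File resolveTempDir() {
        String dir = System.getProperty("tempDir", System.getProperty("java.io.tmpdir"));
        return new File(dir);
    }

    // Resolved once, so tests can share a single constant.
    static final File TEMP_DIR = resolveTempDir();

    public static void main(String[] args) {
        System.out.println("tests would write under: " + TEMP_DIR);
    }
}
```

Unlike ".", the result does not depend on the directory the tests were started from.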

> Add System.getProperty("tempDir") as final static to LuceneTestCase(J4)
> ---
>
> Key: LUCENE-2318
> URL: https://issues.apache.org/jira/browse/LUCENE-2318
> Project: Lucene - Java
>  Issue Type: Test
>Reporter: Uwe Schindler
>Assignee: Uwe Schindler
> Fix For: 3.1
>
>
> Almost every test calls System.getProperty("tempDir") and some of them check 
> the return value for null. In other cases the test simply fails from within 
> eclipse.
> We should add this to LuceneTestCase(J4) as a static final constant. For 
> enabling tests run in eclipse, we can add a fallback to ".", if the Sysprop 
> is not defined.




[jira] Commented: (LUCENE-2312) Search on IndexWriter's RAM Buffer

2010-03-13 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845041#action_12845041
 ] 

Jason Rutherglen commented on LUCENE-2312:
--

* IW commitMerge calls docWriter's remapDeletes, a synchronized method to 
prevent concurrent updates.  I'm not sure how we should efficiently block calls 
to the different DW's.  

* _mergeInit calls docWriter getDocStoreSegment - unsure what to change

* Some of the config settings (such as maxBufferedDocs) can simply be removed 
from DW, and instead accessed via WriterConfig

> Search on IndexWriter's RAM Buffer
> --
>
> Key: LUCENE-2312
> URL: https://issues.apache.org/jira/browse/LUCENE-2312
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Search
>Affects Versions: 3.0.1
>Reporter: Jason Rutherglen
>Assignee: Michael Busch
> Fix For: 3.0.2
>
>
> In order to offer users near-realtime search without incurring
> an indexing performance penalty, we can implement search on
> IndexWriter's RAM buffer. This is the buffer that is filled in
> RAM as documents are indexed. Currently the RAM buffer is
> flushed to the underlying directory (usually disk) before being
> made searchable. 
> Today's Lucene-based NRT systems must incur the cost of merging
> segments, which can slow indexing. 
> Michael Busch has good suggestions regarding how to handle deletes using max 
> doc ids.  
> https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841923&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841923
> The area that isn't fully fleshed out is the terms dictionary,
> which needs to be sorted prior to queries executing. Currently
> IW implements a specialized hash table. Michael B has a
> suggestion here: 
> https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841915&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841915




[jira] Created: (LUCENE-2320) Add MergePolicy to IndexWriterConfig

2010-03-13 Thread Shai Erera (JIRA)
Add MergePolicy to IndexWriterConfig


 Key: LUCENE-2320
 URL: https://issues.apache.org/jira/browse/LUCENE-2320
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Shai Erera
 Fix For: 3.1


Now that IndexWriterConfig is in place, I'd like to move MergePolicy to it as 
well. The change is not straightforward, so I've kept it for a separate 
issue. MergePolicy requires an IndexWriter in its ctor, but none can be 
passed to it before an IndexWriter actually exists. And today IW may create an 
MP only for it to be overridden by the application one line afterwards. I don't 
want to make the iw member of MP non-final, or settable by extending classes, 
but it needs to remain protected so they can access it directly. So the 
proposed changes are:

* Add a SetOnce object (to o.a.l.util), or Immutable, which can only be set 
once (hence its name). It'll have the signature SetOnce<T> w/ *synchronized 
set(T)* and *T get()*. The T member will be declared volatile, so that get() 
won't need to be synchronized.
* MP will define a *protected final SetOnce<IndexWriter> writer* instead of the 
current writer. *NOTE: this is a bw break*. Any suggestions are welcome.
* MP will offer a public default ctor, together with a set(IndexWriter).
* IndexWriter will set itself on MP using set(this). Note that if set is 
called more than once, it will throw an exception (AlreadySetException - or 
does someone have a better suggestion, preferably an already existing Java 
exception?).
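
To make the list above concrete, here is a rough sketch of the holder, 
assuming a generic SetOnce<T> with a synchronized set(T) and an unsynchronized 
get() over a volatile field. Making AlreadySetException a subclass of 
IllegalStateException is my own assumption (one way to stay close to an 
existing Java exception), and the message text is made up. This illustrates 
the idea only and is not the patch.

```java
/** Sketch of the proposed set-once holder; details are assumptions, not the patch. */
final class SetOnce<T> {

  /** Thrown on a second set(T) call. Extending IllegalStateException is an assumption. */
  static final class AlreadySetException extends IllegalStateException {
    AlreadySetException() {
      super("The object cannot be set twice!");
    }
  }

  // volatile so that get() sees the value without having to synchronize
  private volatile T obj;

  // only read and written inside the synchronized set(T)
  private boolean set;

  /** Stores the value on the first call; every later call throws. */
  synchronized void set(T obj) {
    if (set) {
      throw new AlreadySetException();
    }
    this.obj = obj;
    set = true;
  }

  /** Returns the stored value, or null if set(T) has not been called yet. */
  T get() {
    return obj;
  }
}
```

With something like this, MP could hold a final SetOnce<IndexWriter> and IW 
would call set(this) exactly once; a second call fails fast instead of 
silently replacing the writer.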

That's the core idea. I'd like to post a patch soon, so I'd appreciate your 
review and proposals.




[jira] Updated: (LUCENE-2312) Search on IndexWriter's RAM Buffer

2010-03-13 Thread Michael Busch (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Busch updated LUCENE-2312:
--

Fix Version/s: (was: 3.0.2)
   3.1

> Search on IndexWriter's RAM Buffer
> --
>
> Key: LUCENE-2312
> URL: https://issues.apache.org/jira/browse/LUCENE-2312
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Search
>Affects Versions: 3.0.1
>Reporter: Jason Rutherglen
>Assignee: Michael Busch
> Fix For: 3.1
>
>
> In order to offer users near-realtime search without incurring
> an indexing performance penalty, we can implement search on
> IndexWriter's RAM buffer. This is the buffer that is filled in
> RAM as documents are indexed. Currently the RAM buffer is
> flushed to the underlying directory (usually disk) before being
> made searchable. 
> Today's Lucene-based NRT systems must incur the cost of merging
> segments, which can slow indexing. 
> Michael Busch has good suggestions regarding how to handle deletes using max 
> doc ids.  
> https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841923&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841923
> The area that isn't fully fleshed out is the terms dictionary,
> which needs to be sorted prior to queries executing. Currently
> IW implements a specialized hash table. Michael B has a
> suggestion here: 
> https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841915&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841915
