[jira] Commented: (LUCENE-1401) Deprecation of autoCommit in 2.4 leads to compile problems, when autoCommit should be false

2008-09-24 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12634059#action_12634059
 ] 

Uwe Schindler commented on LUCENE-1401:
---

This patch seems to work: the IndexWriters created by the MaxFieldLength ctors 
use autoCommit=false. I verified this by observing that the segments file does 
not change during indexing.

There is one small thing (it was also there before your patch):
I use writer.setUseCompoundFile(true) to use compound files (which is also the 
default). Normally this generates only CFS files (on index creation, when 
optimizing, ...). There is only one use case in which both cfs and cfx files 
are generated:

- Use IndexWriter with create=true
- add documents to the index
- optimize the index (without closing in between)

After that, the optimized index consists of one cfs and one cfx file. During 
indexing (before optimization), I only ever see cfs files for new segments 
(and, for short times as usual, the component files such as tfx, ...).

When I optimize the index after closing it, or later after adding more 
documents, I get only one cfs file.

Two questions:
- Is this a small bug? It would not be release critical, but it is strange.
- How can I always enable creation of separate doc store (cfx) and cfs files? 
I found nothing in the docs. In my opinion the separate cfs/cfx files are good 
for search performance (right?).

> Deprecation of autoCommit in 2.4 leads to compile problems, when autoCommit 
> should be false
> ---
>
> Key: LUCENE-1401
> URL: https://issues.apache.org/jira/browse/LUCENE-1401
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Index
>Affects Versions: 2.4, 2.9
>Reporter: Uwe Schindler
>Assignee: Michael McCandless
>Priority: Trivial
> Fix For: 2.4, 2.9
>
> Attachments: LUCENE-1401.patch
>
>
> I am currently changing my code to be as compatible as possible with 2.4. I 
> switched on deprecation warnings and got a warning about the autoCommit 
> parameter in the IndexWriter constructors.
> My code *should* use autoCommit=false, so I want to use the new semantics. 
> The default of IndexWriter is still autoCommit=true. My problem now: how do I 
> disable autoCommit without deprecation warnings?
> Maybe the "old" constructors that are deprecated should use 
> autoCommit=true. But there are constructors taking an 
> "IndexWriter.MaxFieldLength mfl" parameter that are new in 2.4 and yet are 
> already deprecated:
>   IndexWriter(Directory d, boolean autoCommit, Analyzer a, boolean create, 
>   IndexDeletionPolicy deletionPolicy, IndexWriter.MaxFieldLength mfl) 
>   Deprecated. This will be removed in 3.0, when autoCommit will be 
>   hardwired to false. Use 
>   IndexWriter(Directory,Analyzer,boolean,IndexDeletionPolicy,MaxFieldLength) 
>   instead, and call commit() when needed.
> What the hell is meant by this, a new constructor that is deprecated? And the 
> hint is wrong: if I use the other constructor named in the warning, I get 
> autoCommit=true.
> Something is completely wrong here.
> It should be clear which constructors set autoCommit=true and which default 
> it to false (perhaps the new ones); the Deprecated text is wrong if 
> autoCommit does not default to false.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-1402) CheckIndex API changed without backwards compatibility

2008-09-24 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12634060#action_12634060
 ] 

Uwe Schindler commented on LUCENE-1402:
---

Patch looks OK. In my opinion, for consistency, the fix() method should be 
named fixIndex(), like checkIndex().
Maybe the CheckIndexStatus class could also be an inner class, like the others.

> CheckIndex API changed without backwards compatibility
> ---
>
> Key: LUCENE-1402
> URL: https://issues.apache.org/jira/browse/LUCENE-1402
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Index
>Affects Versions: 2.4
>Reporter: Uwe Schindler
>Assignee: Michael McCandless
>Priority: Minor
> Attachments: LUCENE-1402-uwe.patch, LUCENE-1402.patch, 
> LUCENE-1402.patch
>
>
> The API of CheckIndex changed. The check function now returns a 
> CheckIndexStatus and not a boolean, while the JavaDocs still describe the 
> boolean return value.
> I am not sure if it works, but it would be good to keep the check method that 
> returns boolean available as @Deprecated, i.e.
> @Deprecated public static boolean check(Directory dir, boolean doFix) 
>     throws IOException {
>   final CheckIndexStatus stat = check(dir, doFix);
>   return stat.clean;
> }
> I am not sure if it can be done with the same method name, but without it 
> drop-in replacement of Lucene does not work.
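
The compatibility shim proposed above could look roughly like the following sketch. It is self-contained and does not use Lucene: IndexChecker, Status, checkIndex, and the numBadSegments parameter are hypothetical stand-ins, not the real CheckIndex API. Java cannot overload methods on return type alone, so in this sketch the status-returning method gets a different name from the deprecated boolean one.

```java
// Hypothetical stand-in for a checker class; not the real Lucene CheckIndex API.
final class IndexChecker {

    /** Detailed result of a check (assumed shape: a 'clean' flag and a count). */
    static final class Status {
        final boolean clean;
        final int numBadSegments;

        Status(boolean clean, int numBadSegments) {
            this.clean = clean;
            this.numBadSegments = numBadSegments;
        }
    }

    /** New-style API: returns a detailed status object. */
    static Status checkIndex(int numBadSegments) {
        // The argument only simulates what a real check would discover.
        return new Status(numBadSegments == 0, numBadSegments);
    }

    /**
     * Old-style API kept for drop-in compatibility: same name and boolean
     * return type as before, delegating to the new method.
     */
    @Deprecated
    static boolean check(int numBadSegments) {
        return checkIndex(numBadSegments).clean;
    }
}
```

Old callers keep compiling against check() and only see a deprecation warning, while new callers can read the richer Status object.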




[jira] Commented: (LUCENE-1402) CheckIndex API changed without backwards compatibility

2008-09-24 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12634064#action_12634064
 ] 

Michael McCandless commented on LUCENE-1402:


bq. In my opinion, for consistency, the fix() method should spell fixIndex() 
like checkIndex().

OK I'll do this.

bq. Maybe the CheckIndexStatus class could be an inner class like the others. 

And this too.

Thanks!




[jira] Updated: (LUCENE-1380) Patch for ShingleFilter.enablePositions (or PositionFilter)

2008-09-24 Thread Mck SembWever (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mck SembWever updated LUCENE-1380:
--

Attachment: (was: LUCENE-1380-PositionFilter.patch)

> Patch for ShingleFilter.enablePositions (or PositionFilter)
> ---
>
> Key: LUCENE-1380
> URL: https://issues.apache.org/jira/browse/LUCENE-1380
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/analyzers
>Reporter: Mck SembWever
>Priority: Trivial
> Attachments: LUCENE-1380.patch, LUCENE-1380.patch
>
>
> Make it possible for *all* words and shingles to be placed at the same 
> position, that is for _all_ shingles (and unigrams if included) to be treated 
> as synonyms of each other.
> Today the shingles generated are synonyms only to the first term in the 
> shingle.
> For example the query "abcd efgh ijkl" results in:
>    ("abcd" "abcd efgh" "abcd efgh ijkl") ("efgh" "efgh ijkl") ("ijkl")
> where "abcd efgh" and "abcd efgh ijkl" are synonyms of "abcd", and "efgh 
> ijkl" is a synonym of "efgh".
> There exists no way today to alter which token a particular shingle is a 
> synonym for.
> This patch takes the first step in making it possible to make all shingles 
> (and unigrams if included) synonyms of each other.
> See http://comments.gmane.org/gmane.comp.jakarta.lucene.user/34746 for 
> mailing list thread.
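
To make the two position layouts concrete, here is a small self-contained sketch. It does not use Lucene's ShingleFilter or the attached patch; the class and method names are hypothetical and only reproduce the grouping from the example in the description.

```java
import java.util.ArrayList;
import java.util.List;

// Self-contained illustration; it does not use Lucene's ShingleFilter.
final class ShinglePositions {

    /**
     * Groups unigrams and shingles (up to maxSize words) by the position of
     * their first word, mirroring the current behaviour described above.
     */
    static List<List<String>> groupedByFirstTerm(String[] words, int maxSize) {
        List<List<String>> positions = new ArrayList<>();
        for (int start = 0; start < words.length; start++) {
            List<String> tokens = new ArrayList<>();
            StringBuilder shingle = new StringBuilder();
            for (int end = start; end < words.length && end - start < maxSize; end++) {
                if (end > start) {
                    shingle.append(' ');
                }
                shingle.append(words[end]);
                tokens.add(shingle.toString());
            }
            positions.add(tokens);
        }
        return positions;
    }

    /** Places every token at one position, so all are synonyms of each other. */
    static List<String> allAtSamePosition(String[] words, int maxSize) {
        List<String> flat = new ArrayList<>();
        for (List<String> tokens : groupedByFirstTerm(words, maxSize)) {
            flat.addAll(tokens);
        }
        return flat;
    }
}
```

For the words "abcd", "efgh", "ijkl" with a maximum shingle size of 3, the first method yields the three groups from the example, and the second yields all six tokens in a single group.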




[jira] Updated: (LUCENE-1380) Patch for ShingleFilter.enablePositions (or PositionFilter)

2008-09-24 Thread Mck SembWever (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mck SembWever updated LUCENE-1380:
--

Attachment: LUCENE-1380-PositionFilter.patch

Re-attached the PositionFilter patch addressing Steve's moderation comments. 




[jira] Updated: (LUCENE-1402) CheckIndex API changed without backwards compatibility

2008-09-24 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-1402:
---

Fix Version/s: 2.9
   2.4




[jira] Resolved: (LUCENE-1404) NPE in NearSpansUnordered.isPayloadAvailable()

2008-09-24 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless resolved LUCENE-1404.


   Resolution: Fixed
Fix Version/s: 2.9

Committed revision 698487 on trunk and 698488 on 2.4 branch.  Thanks Tim!

> NPE in NearSpansUnordered.isPayloadAvailable() 
> ---
>
> Key: LUCENE-1404
> URL: https://issues.apache.org/jira/browse/LUCENE-1404
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Query/Scoring
>Affects Versions: 2.4
>Reporter: Tim Smith
>Assignee: Michael McCandless
> Fix For: 2.4, 2.9
>
> Attachments: SpanQueryTest.java
>
>
> Using RC1 of Lucene 2.4 resulted in a NullPointerException with some 
> constructed SpanNearQueries.
> Implementation of isPayloadAvailable() (results in exception):
> {code}
>   public boolean isPayloadAvailable() {
>     SpansCell pointer = min();
>     do {
>       if (pointer.isPayloadAvailable()) {
>         return true;
>       }
>       pointer = pointer.next;
>     } while (pointer.next != null);
>     return false;
>   }
> {code}
> "Fixed" isPayloadAvailable():
> {code}
>   public boolean isPayloadAvailable() {
>     SpansCell pointer = min();
>     while (pointer != null) {
>       if (pointer.isPayloadAvailable()) {
>         return true;
>       }
>       pointer = pointer.next;
>     }
>     return false;
>   }
> {code}
> Exception produced:
> {code}
> [junit] java.lang.NullPointerException
> [junit] at org.apache.lucene.search.spans.NearSpansUnordered$SpansCell.access$300(NearSpansUnordered.java:65)
> [junit] at org.apache.lucene.search.spans.NearSpansUnordered.isPayloadAvailable(NearSpansUnordered.java:235)
> [junit] at org.apache.lucene.search.spans.NearSpansOrdered.shrinkToAfterShortestMatch(NearSpansOrdered.java:246)
> [junit] at org.apache.lucene.search.spans.NearSpansOrdered.advanceAfterOrdered(NearSpansOrdered.java:154)
> [junit] at org.apache.lucene.search.spans.NearSpansOrdered.next(NearSpansOrdered.java:122)
> [junit] at org.apache.lucene.search.spans.SpanScorer.next(SpanScorer.java:54)
> [junit] at org.apache.lucene.search.Scorer.score(Scorer.java:57)
> [junit] at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:137)
> [junit] at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:113)
> [junit] at org.apache.lucene.search.Hits.getMoreDocs(Hits.java:113)
> [junit] at org.apache.lucene.search.Hits.<init>(Hits.java:80)
> [junit] at org.apache.lucene.search.Searcher.search(Searcher.java:50)
> [junit] at org.apache.lucene.search.Searcher.search(Searcher.java:40)
> [junit] at com.attivio.lucene.SpanQueryTest.search(SpanQueryTest.java:79)
> [junit] at com.attivio.lucene.SpanQueryTest.assertHitCount(SpanQueryTest.java:75)
> [junit] at com.attivio.lucene.SpanQueryTest.test(SpanQueryTest.java:67)
> {code}
> will attach unit test that causes exception (and passes with updated 
> isPayloadAvailable())
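
The difference between the two loop shapes can be reproduced outside Lucene. The sketch below uses a hypothetical Cell class as a stand-in for SpansCell and assumes a plain null-terminated list. In this stand-in, the do/while form throws a NullPointerException when the list has a single cell, and with longer lists it exits before inspecting the last cell.

```java
// Minimal stand-in for a singly linked cell list; not the real SpansCell class.
final class PayloadScan {

    static final class Cell {
        final boolean payloadAvailable;
        Cell next; // null marks the end of the list in this sketch

        Cell(boolean payloadAvailable) {
            this.payloadAvailable = payloadAvailable;
        }
    }

    /** Loop shape from the report: dereferences 'pointer' after it may have become null. */
    static boolean anyPayloadDoWhile(Cell first) {
        Cell pointer = first;
        do {
            if (pointer.payloadAvailable) {
                return true;
            }
            pointer = pointer.next;
        } while (pointer.next != null); // throws if 'pointer' just became null
        return false;
    }

    /** Corrected shape: tests the cell itself, so every cell is visited exactly once. */
    static boolean anyPayloadWhile(Cell first) {
        Cell pointer = first;
        while (pointer != null) {
            if (pointer.payloadAvailable) {
                return true;
            }
            pointer = pointer.next;
        }
        return false;
    }
}
```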




[jira] Updated: (LUCENE-1402) CheckIndex API changed without backwards compatibility

2008-09-24 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-1402:
---

Attachment: LUCENE-1402.patch

OK new rev of patch with changes above folded in!




[jira] Resolved: (LUCENE-1400) Add Apache RAT (Release Audit Tool) target to build.xml

2008-09-24 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless resolved LUCENE-1400.


Resolution: Fixed

Committed revision 698495 on trunk and 698493 on 2.4 branch.

> Add Apache RAT (Release Audit Tool) target to build.xml
> ---
>
> Key: LUCENE-1400
> URL: https://issues.apache.org/jira/browse/LUCENE-1400
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.4, 2.9
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 2.4, 2.9
>
> Attachments: LUCENE-1400.patch
>
>
> Apache RAT is a useful tool to check for common mistakes in our source code 
> (eg missing copyright headers):
> http://incubator.apache.org/rat/
> I'm just copying the patch Grant worked out for Solr (SOLR-762).  I plan to 
> commit to 2.4 & 2.9.




[jira] Commented: (LUCENE-1401) Deprecation of autoCommit in 2.4 leads to compile problems, when autoCommit should be false

2008-09-24 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12634094#action_12634094
 ] 

Michael McCandless commented on LUCENE-1401:



That (cfx/cfs file creation) is actually "normal" behavior for
Lucene.

With autoCommit=false, in a single session of IndexWriter, Lucene
will share the doc store files (stored fields, term vectors) across
multiple segments.  This saves a lot of merge time because those files
don't need to be merged if we are merging segments that all share the
same doc store files.  When building up a large index anew this saves
a lot of time.

A cfx file is the compound-file format of the doc store files.

However, when segments spanning multiple doc stores are merged, then
the doc store files are in fact merged, and written privately for that
one segment, and then folded into that segment's cfs file.  When all
segments that reference a given doc store segment are merged away,
then that doc store segment is deleted.

So it's currently only the "level 0" segments that may share a cfx
file.  As a future optimization we could consider extending Lucene's
index format so that a single segment could reference multiple doc
stores.  This would require logic in FieldsReader and
TermVectorsReader to do a binary search when locating which doc store
segment holds a given document, but, would enable merging non level 0
segments to skip having to merge the doc store.  This is an invasive
optimization.

So you can't separately control when Lucene uses a cfx file; it's the
merge policy that indirectly controls this.
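
The binary search mentioned for the possible future optimization could be sketched as follows. This is only an illustration under assumptions: DocStoreLocator and its docStarts array are hypothetical, nothing like it exists in FieldsReader or TermVectorsReader, and it assumes each doc store covers one contiguous, non-empty docID range within the segment.

```java
import java.util.Arrays;

// Hypothetical sketch; Lucene has no such class.
final class DocStoreLocator {

    // docStarts[i] is the first docID (within the segment) served by doc
    // store i. Must be strictly ascending and start at 0.
    private final int[] docStarts;

    DocStoreLocator(int[] docStarts) {
        this.docStarts = docStarts.clone();
    }

    /** Returns the index of the doc store that holds docID. */
    int storeFor(int docID) {
        if (docID < 0) {
            throw new IllegalArgumentException("docID must be >= 0: " + docID);
        }
        int idx = Arrays.binarySearch(docStarts, docID);
        // Exact hit: docID is the first document of store idx.
        // Miss: binarySearch returns -(insertionPoint) - 1, and the owning
        // store is the one just before the insertion point.
        return idx >= 0 ? idx : -idx - 2;
    }

    /** Returns the position of docID inside its doc store. */
    int offsetInStore(int docID) {
        return docID - docStarts[storeFor(docID)];
    }
}
```

With starts {0, 100, 250}, document 249 maps to store 1 and document 260 maps to store 2 at offset 10. The lookup costs O(log n) in the number of doc stores per document access.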




Re: Could positions/payloads in SegmentMerger be copied directly?

2008-09-24 Thread Michael McCandless


Paul Elschot wrote:


On Tuesday 23 September 2008 20:26:18, Michael McCandless wrote:

Paul Elschot wrote:

So, adding a document offset from the  documents/frequencies
into the positions/payloads for each document would allow:
-  bulk copying of the position/payloads during merging, and
-  a more efficient implementation of TermPositions.skipTo()
 in that decoding the positions from the last available skip
 document to the target of skipTo() could be avoided.
Is that correct?


Yes, though this would also add cost of computing/writing/reading
that new offset, and would increase the index size.


That would indeed be invasive.


Yes.  I think our time would likely be better spent working on using
PForDelta for freq/prox.


To change the prox data to PForDelta, it would be nice to have
bulk copies on prox working first. That would make it easy to change
the total size of the prox data.

But it appears to be easier to start with the doc/freq data, add
more prox pointers there, and then change the prox data.

PForDelta is fundamentally different from the existing index data
because an encoded number cannot be accessed on a byte
border. I don't know yet how to deal with that in the index
data structures.


PForDelta encodes multiples of 32 ints at a time; so, the pointers
stored in the term dict, and in skip data, would presumably have to be
block number (or byte position in the file) plus offset within the
block.

And then an entire block must be fully decoded when loaded (I don't
think it's easy to partially decode with PForDelta, unless the block
luckily had no exceptions?), and then you start from the
offset-within-block you need.

I think a single block would hold more than one term's postings data
in general.  Ie these blocks are like "pages" in virtual memory.

Also I wonder how PForDelta would impact performance of queries that
rely heavily on skipping (AND queries), because the entire block must
be decoded to read a few of its ints.

However, with PForDelta I don't think we'd be able to do byte block
copying when merging, unless we were willing to keep the "seams" of
past merges present in the index files (the invasive change I was
referring to), and, no deletions applied.
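
The block-number-plus-offset addressing can be illustrated with a simplified sketch. It is not PForDelta: it uses plain fixed-width bit packing per block of 32 values, with no exceptions, and the class and method names are hypothetical. It only shows that a pointer becomes a (block, offset) pair and that a read decodes the whole block before picking one value.

```java
// Simplified stand-in for PForDelta: fixed-width bit packing per block, no
// exceptions. Only the block-plus-offset addressing is the point here.
final class BlockPostings {
    static final int BLOCK_SIZE = 32;

    private final int[] bitsPerValue; // bit width used by each block
    private final long[][] packed;    // packed payload of each block

    /** Packs non-negative values into blocks of BLOCK_SIZE ints. */
    BlockPostings(int[] values) {
        int numBlocks = (values.length + BLOCK_SIZE - 1) / BLOCK_SIZE;
        bitsPerValue = new int[numBlocks];
        packed = new long[numBlocks][];
        for (int b = 0; b < numBlocks; b++) {
            int from = b * BLOCK_SIZE;
            int to = Math.min(values.length, from + BLOCK_SIZE);
            int max = 0;
            for (int i = from; i < to; i++) {
                if (values[i] < 0) {
                    throw new IllegalArgumentException("negative value at " + i);
                }
                max |= values[i];
            }
            int bits = Math.max(1, 32 - Integer.numberOfLeadingZeros(max));
            bitsPerValue[b] = bits;
            packed[b] = new long[(BLOCK_SIZE * bits + 63) / 64];
            for (int i = from; i < to; i++) {
                long bitPos = (long) (i - from) * bits;
                int word = (int) (bitPos >>> 6);
                int shift = (int) (bitPos & 63);
                packed[b][word] |= ((long) values[i]) << shift;
                if (shift + bits > 64) { // value straddles two longs
                    packed[b][word + 1] |= ((long) values[i]) >>> (64 - shift);
                }
            }
        }
    }

    /** Decodes a whole block as one unit, the way a PForDelta block would be. */
    int[] decodeBlock(int block) {
        int bits = bitsPerValue[block];
        long mask = (1L << bits) - 1;
        int[] out = new int[BLOCK_SIZE];
        for (int i = 0; i < BLOCK_SIZE; i++) {
            long bitPos = (long) i * bits;
            int word = (int) (bitPos >>> 6);
            int shift = (int) (bitPos & 63);
            long v = packed[block][word] >>> shift;
            if (shift + bits > 64) {
                v |= packed[block][word + 1] << (64 - shift);
            }
            out[i] = (int) (v & mask);
        }
        return out;
    }

    /** Reads the value addressed by (block number, offset within block). */
    int get(int block, int offsetInBlock) {
        return decodeBlock(block)[offsetInBlock];
    }

    /** Splits a flat value index into the two-part pointer. */
    static int[] pointerFor(int valueIndex) {
        return new int[] {valueIndex / BLOCK_SIZE, valueIndex % BLOCK_SIZE};
    }
}
```

In this sketch, get() pays for decoding all 32 slots even when a skip lands on a single value, which is the cost raised above for skip-heavy AND queries.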

Mike




[jira] Commented: (LUCENE-1401) Deprecation of autoCommit in 2.4 leads to compile problems, when autoCommit should be false

2008-09-24 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12634151#action_12634151
 ] 

Uwe Schindler commented on LUCENE-1401:
---

Thanks for the info, I did not know this!




[jira] Resolved: (LUCENE-1385) IndexReader.isIndexCurrent()==false -> IndexReader.reopen() -> still index not current

2008-09-24 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler resolved LUCENE-1385.
---

   Resolution: Won't Fix
Fix Version/s: 2.4

I am closing this bug, as the problem is fixed in Lucene 2.4.
Thanks for the investigation, Michael - good work!

> IndexReader.isIndexCurrent()==false -> IndexReader.reopen() -> still index 
> not current
> --
>
> Key: LUCENE-1385
> URL: https://issues.apache.org/jira/browse/LUCENE-1385
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Index
>Affects Versions: 2.3.2
> Environment: Linux, Solaris, Windows XP
>Reporter: Uwe Schindler
>Assignee: Michael McCandless
> Fix For: 2.4
>
> Attachments: LUCENE-1385.patch
>
>
> I found a strange error occurring with IndexReader.reopen(). It is not always 
> reproducible; it only happens sometimes, but strangely on all my computers 
> with different platforms at the same time. Maybe it has something to do with 
> the timestamp used in index versions.
> I have a search server using an IndexReader that is opened at webapp 
> startup and should stay open. Every half hour this web application checks 
> whether the index is still current using IndexReader.isCurrent(). When a 
> parallel job (in another virtual machine) indexes documents and modifies the 
> index, isCurrent() returns false. The half-hourly cron job then uses 
> IndexReader.reopen() to reopen the index. But sometimes, directly after 
> reopen(), the index is still not current (and no updates have occurred). 
> Calling reopen() again does not change this either. Searching the index shows 
> all new/updated documents, but isCurrent() still returns false. The problem 
> with this is that the index is now reopened all the time, because the 
> detection of a current index no longer works.
> I now have a workaround in my code to handle this: after calling 
> IndexReader.reopen(), I test IndexReader.isCurrent(), and if it returns 
> false, I close the reader hard and open a new instance.
> Most of the time IndexReader.reopen() works correctly, but sometimes this 
> error occurs. Looking into the code of reopen(), I realized that there is an 
> extra check whether the index has modifications, and if so, the reopen call 
> returns the original reader (this may be the problem I have). But the 
> IndexReader is only used for searching; no updates occur.
> My questions: Why is there this check for modifications in reopen()? Why does 
> this happen only at certain times, on all my servers with different platforms?
> I want to use reopen() because in the future, when the new FieldCache is 
> reopen-aware and does not rebuild the full cache every time, it will be very 
> important to have this fixed. At the moment, I have no problem with the 
> case that reopen() may fail and I have to do a rough reopen.
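
The workaround described in the report can be sketched as below. To stay self-contained, the code uses a hypothetical Reader interface that mirrors only the three IndexReader methods named above (isCurrent, reopen, close) and a hypothetical openFresh supplier; it is not the reporter's actual code and omits Lucene's checked exceptions.

```java
import java.util.function.Supplier;

// Hypothetical stand-ins; the interface only mirrors the three IndexReader
// methods named in the report and is not Lucene's API.
final class ReaderRefresher {

    interface Reader {
        boolean isCurrent();
        Reader reopen(); // may return the same instance if nothing changed
        void close();
    }

    /**
     * Returns a reader that should be current: tries reopen() first and falls
     * back to closing the reader and opening a fresh instance if reopen()
     * did not help.
     */
    static Reader refresh(Reader reader, Supplier<Reader> openFresh) {
        if (reader.isCurrent()) {
            return reader;
        }
        Reader reopened = reader.reopen();
        if (reopened != reader) {
            reader.close(); // the old instance is no longer needed
        }
        if (reopened.isCurrent()) {
            return reopened;
        }
        reopened.close(); // the "hard" fallback described in the report
        return openFresh.get();
    }
}
```

The fallback gives up whatever reopen() would have saved, so it only makes sense as a stopgap until reopen() behaves as expected.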




[jira] Commented: (LUCENE-1385) IndexReader.isIndexCurrent()==false -> IndexReader.reopen() -> still index not current

2008-09-24 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12634155#action_12634155
 ] 

Michael McCandless commented on LUCENE-1385:


Super, thanks Uwe!

> IndexReader.isIndexCurrent()==false -> IndexReader.reopen() -> still index 
> not current
> --
>
> Key: LUCENE-1385
> URL: https://issues.apache.org/jira/browse/LUCENE-1385
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Index
>Affects Versions: 2.3.2
> Environment: Linux, Solaris, Windows XP
>Reporter: Uwe Schindler
>Assignee: Michael McCandless
> Fix For: 2.4
>
> Attachments: LUCENE-1385.patch
>
>
> I found a strange error occurring with IndexReader.reopen(). It is not always 
> reproducible; it only happens sometimes, but strangely on all my computers 
> with different platforms at the same time. Maybe it has something to do with 
> the timestamp used in index versions.
> I have a search server using an IndexReader that is opened at webapp 
> startup and should stay open. Every half hour this web application checks 
> whether the index is still current using IndexReader.isCurrent(). When a 
> parallel job (in another virtual machine) indexes documents and modifies the 
> indexes, isCurrent() returns false. The half-hourly cron job then uses 
> IndexReader.reopen() to reopen the index. But sometimes, directly after 
> reopen(), the index is still not current (and no updates occur). Calling 
> reopen() again does not change that either. Searching the index shows all 
> new/updated documents, but isCurrent() still returns false. The problem with 
> this is that the index is now reopened all the time, because the detection 
> of a current index no longer works.
> I now have a workaround in my code to handle this: after calling 
> IndexReader.reopen(), I test IndexReader.isCurrent(), and if it returns 
> false, I close the reader hard and open a new instance.
> Most of the time IndexReader.reopen() works correctly, but sometimes this 
> error occurs. Looking into the code of reopen(), I realized that there is an 
> extra check whether the index has modifications, and if so, the reopen call 
> returns the original reader (this may be the problem I have). But the 
> IndexReader is only used for searching; no updates occur.
> My questions: Why is there this check for modifications in reopen()? Why does 
> this happen only at certain times on all my servers with different platforms?
> I want to use reopen() because in the future, when the new FieldCache is 
> reopen-aware and does not rebuild the full cache every time, it will be very 
> important to have this fixed. At the moment, I have no problem with the case 
> that reopen() may fail and I have to do a rough reopen.
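
The workaround described in the report could be sketched roughly as follows. 
This is a minimal, self-contained model and not Lucene code: ReaderLike is a 
hypothetical stand-in for the three IndexReader methods the report mentions 
(reopen(), isCurrent(), close()), and openFresh is an assumed callback that 
opens a brand-new reader. The real methods also throw IOException, which the 
sketch leaves out.

```java
import java.util.function.Supplier;

/** Hypothetical stand-in for the IndexReader methods used by the workaround. */
interface ReaderLike {
  ReaderLike reopen();   // returns this if nothing changed, otherwise a new reader
  boolean isCurrent();   // true if the reader sees the latest index version
  void close();
}

final class ReopenWorkaround {
  /**
   * Returns a reader that should be current: tries reopen() first and falls
   * back to a hard close plus a fresh open if the result is still stale.
   */
  static ReaderLike refresh(ReaderLike current, Supplier<ReaderLike> openFresh) {
    if (current.isCurrent()) {
      return current; // nothing to do
    }
    ReaderLike reopened = current.reopen();
    if (reopened != current) {
      current.close(); // reopen() handed back a new instance; release the old one
    }
    if (reopened.isCurrent()) {
      return reopened;
    }
    // reopen() did not help: close hard and open a new instance
    reopened.close();
    return openFresh.get();
  }
}
```

A caller would replace its shared reader with the return value of refresh() 
on each half-hourly check.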

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-1380) Patch for ShingleFilter.enablePositions (or PositionFilter)

2008-09-24 Thread Steven Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12634191#action_12634191
 ] 

Steven Rowe commented on LUCENE-1380:
-

When I wrote:
bq. 4.  You should provide a standalone test for the PositionFilter, in 
addition to the ShingleFilterTest tests.

I meant that testing of PositionFilter should be separate from testing its 
functionality with ShingleFilter.  Your PositionFilter tests look at offsets, 
which PositionFilter doesn't affect at all.  It is possible that PositionFilter 
will be used for things other than ShingleFilter.  Hence, there should be basic 
test(s) that evaluate PositionFilter without ShingleFilter.

I also think a test to make sure a single instance of PositionFilter will work 
with multiple documents should be added.

BTW, you don't need to delete JIRA attachments if you want to upload a new 
version - when you upload a same-named file, the most recent version of the 
file will be colored black, and older versions will be colored gray.  This is 
the conventional way Lucene uses JIRA.  It allows people to follow the JIRA 
comments in the progressive versions of the patch(es).

A typo on line 66 of PositionFilterTest: 
{code:java}
// end of stream so reset firstTokePositioned
{code}


> Patch for ShingleFilter.enablePositions (or PositionFilter)
> ---
>
> Key: LUCENE-1380
> URL: https://issues.apache.org/jira/browse/LUCENE-1380
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/analyzers
>Reporter: Mck SembWever
>Priority: Trivial
> Attachments: LUCENE-1380-PositionFilter.patch, LUCENE-1380.patch, 
> LUCENE-1380.patch
>
>
> Make it possible for *all* words and shingles to be placed at the same 
> position, that is for _all_ shingles (and unigrams if included) to be treated 
> as synonyms of each other.
> Today the shingles generated are synonyms only to the first term in the 
> shingle.
> For example the query "abcd efgh ijkl" results in:
>    ("abcd" "abcd efgh" "abcd efgh ijkl") ("efgh" "efgh ijkl") ("ijkl")
> where "abcd efgh" and "abcd efgh ijkl" are synonyms of "abcd", and "efgh 
> ijkl" is a synonym of "efgh".
> There exists no way today to alter which token a particular shingle is a 
> synonym for.
> This patch takes the first step in making it possible to make all shingles 
> (and unigrams if included) synonyms of each other.
> See http://comments.gmane.org/gmane.comp.jakarta.lucene.user/34746 for 
> mailing list thread.
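
The grouping in the description can be illustrated with a small plain-Java 
sketch. It is only a model of the behaviour described, not ShingleFilter 
itself: ShingleSketch, shingles() and maxShingleSize are names assumed for 
the illustration. Each inner list holds the tokens that share one position, 
i.e. the shingles that are synonyms of the word they start with.

```java
import java.util.ArrayList;
import java.util.List;

final class ShingleSketch {
  /** Groups unigrams and shingles by the word they start at. */
  static List<List<String>> shingles(String text, int maxShingleSize) {
    String[] words = text.trim().split("\\s+");
    List<List<String>> groups = new ArrayList<>();
    for (int start = 0; start < words.length; start++) {
      List<String> group = new ArrayList<>();
      StringBuilder shingle = new StringBuilder();
      // Grow the shingle one word at a time, up to maxShingleSize words.
      for (int n = 0; n < maxShingleSize && start + n < words.length; n++) {
        if (n > 0) {
          shingle.append(' ');
        }
        shingle.append(words[start + n]);
        group.add(shingle.toString());
      }
      groups.add(group);
    }
    return groups;
  }
}
```

Calling ShingleSketch.shingles("abcd efgh ijkl", 3) yields the three groups 
from the example above; the patch aims at letting all of these tokens share a 
single position instead.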




[jira] Updated: (LUCENE-1380) Patch for ShingleFilter.enablePositions (or PositionFilter)

2008-09-24 Thread Mck SembWever (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mck SembWever updated LUCENE-1380:
--

Attachment: LUCENE-1380-PositionFilter.patch

Re-attached the PositionFilter patch addressing Steve's moderation comments. (2)
Steve, can you look at the difference between using reset() and a null token 
in the stream? Are both approaches valid to test? (I had not overridden 
TokenStream.reset() in the previous patch.)

> Patch for ShingleFilter.enablePositions (or PositionFilter)
> ---
>
> Key: LUCENE-1380
> URL: https://issues.apache.org/jira/browse/LUCENE-1380
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/analyzers
>Reporter: Mck SembWever
>Priority: Trivial
> Attachments: LUCENE-1380-PositionFilter.patch, 
> LUCENE-1380-PositionFilter.patch, LUCENE-1380.patch, LUCENE-1380.patch
>
>
> Make it possible for *all* words and shingles to be placed at the same 
> position, that is for _all_ shingles (and unigrams if included) to be treated 
> as synonyms of each other.
> Today the shingles generated are synonyms only to the first term in the 
> shingle.
> For example the query "abcd efgh ijkl" results in:
>    ("abcd" "abcd efgh" "abcd efgh ijkl") ("efgh" "efgh ijkl") ("ijkl")
> where "abcd efgh" and "abcd efgh ijkl" are synonyms of "abcd", and "efgh 
> ijkl" is a synonym of "efgh".
> There exists no way today to alter which token a particular shingle is a 
> synonym for.
> This patch takes the first step in making it possible to make all shingles 
> (and unigrams if included) synonyms of each other.
> See http://comments.gmane.org/gmane.comp.jakarta.lucene.user/34746 for 
> mailing list thread.




[jira] Updated: (LUCENE-1380) Patch for ShingleFilter.enablePositions (or PositionFilter)

2008-09-24 Thread Steven Rowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steven Rowe updated LUCENE-1380:


Attachment: LUCENE-1380-PositionFilter.patch

Mck, I was wrong about Filter testing over multiple docs - each instance of a 
Filter is defined only over a single doc, so this doesn't make sense.

However, you are completely on the right track with the reset() operation, 
since PositionFilter is sensitive to whether it's at the beginning of a stream, 
and it should respond as you have written it.

So, since I was wrong about PositionFilter needing to handle usage with 
multiple documents, the else clause that I said should go in (upon receiving 
null from the input stream) should come back out.  In fact, the proper response 
from a filter in the analysis chain upon encountering null is to stop 
processing, since it means end-of-stream, so I've removed your tests with null 
embedded in this revised patch.

bq. Steve, can you look at the reset versus null token in stream difference. 
Are both approaches valid to test? (I'd not overridden TokenStream.reset() in 
the previous patch).

I removed the void-return filterTest(), since it wasn't called from anywhere, 
and it only used ShingleFilter, and no PositionFilter.  In its place I've added 
another test named testReset().

I added a test that checks for non-default positionIncrement: 
testNonZeroPositionIncrement().

I removed PositionFilter.setPositionIncrement(), because using it one could 
potentially change the position increment in mid-stream, which makes little 
sense.  The alternate constructor provides a way to set it.
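
To make the points about reset() and the constructor-only increment 
concrete, here is a small plain-Java model. It is a sketch under 
assumptions, not the code from the patch: PositionSketch, next() and apply() 
are hypothetical names, and tokens are reduced to their position increments. 
The assumed behaviour is that the first token of a stream keeps its own 
increment and every later token gets the configured one.

```java
import java.util.ArrayList;
import java.util.List;

final class PositionSketch {
  private final int positionIncrement;   // fixed at construction, no setter
  private boolean firstTokenPositioned = false;

  PositionSketch() {
    this(0); // default: stack all later tokens on the first token's position
  }

  PositionSketch(int positionIncrement) {
    this.positionIncrement = positionIncrement;
  }

  /** Returns the increment to emit for a token with the given original increment. */
  int next(int originalIncrement) {
    if (firstTokenPositioned) {
      return positionIncrement;
    }
    firstTokenPositioned = true;
    return originalIncrement; // first token of the stream is left untouched
  }

  /** Marks the start of a new stream. */
  void reset() {
    firstTokenPositioned = false;
  }

  /** Convenience helper: runs a sequence of increments through the filter. */
  static List<Integer> apply(PositionSketch filter, int... increments) {
    List<Integer> out = new ArrayList<>();
    for (int inc : increments) {
      out.add(filter.next(inc));
    }
    return out;
  }
}
```

Because the increment is final, it cannot change in mid-stream, and a test 
along the lines of testReset() only has to check that the first token after 
reset() keeps its original increment again.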

In the patch, I have modified the formatting a little to conform to Lucene 
convention, which is outlined on the [HowToContribute wiki 
page|http://wiki.apache.org/lucene-java/HowToContribute#head-59ae13df098fbdcc46abdf980aa8ee76d3ee2e3b]:

{quote}
* Code should be formatted according to [Sun's 
conventions|http://java.sun.com/docs/codeconv/] with one exception:
** indent two spaces per level, not four.
{quote}

I ran "svn diff" under the trunk/ directory, instead of in 
trunk/contrib/analyzers/ (where you based your patches) - it's simpler for 
people who look at a lot of these things to have them always be based from 
trunk/.

Take a look and make sure things are as they should be - the tests pass for me, 
and I think it's doing what it should do.

If you agree, then hopefully we can get Karl (or another committer, which I'm 
not) to take a look and see if they think it can be committed.


> Patch for ShingleFilter.enablePositions (or PositionFilter)
> ---
>
> Key: LUCENE-1380
> URL: https://issues.apache.org/jira/browse/LUCENE-1380
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/analyzers
>Reporter: Mck SembWever
>Priority: Trivial
> Attachments: LUCENE-1380-PositionFilter.patch, 
> LUCENE-1380-PositionFilter.patch, LUCENE-1380-PositionFilter.patch, 
> LUCENE-1380.patch, LUCENE-1380.patch
>
>
> Make it possible for *all* words and shingles to be placed at the same 
> position, that is for _all_ shingles (and unigrams if included) to be treated 
> as synonyms of each other.
> Today the shingles generated are synonyms only to the first term in the 
> shingle.
> For example the query "abcd efgh ijkl" results in:
>    ("abcd" "abcd efgh" "abcd efgh ijkl") ("efgh" "efgh ijkl") ("ijkl")
> where "abcd efgh" and "abcd efgh ijkl" are synonyms of "abcd", and "efgh 
> ijkl" is a synonym of "efgh".
> There exists no way today to alter which token a particular shingle is a 
> synonym for.
> This patch takes the first step in making it possible to make all shingles 
> (and unigrams if included) synonyms of each other.
> See http://comments.gmane.org/gmane.comp.jakarta.lucene.user/34746 for 
> mailing list thread.




[jira] Created: (LUCENE-1405) Support for new Resources model in ant 1.7 in Lucene ant task.

2008-09-24 Thread Przemyslaw Sztoch (JIRA)

Support for new Resources model in ant 1.7 in Lucene ant task.
--

 Key: LUCENE-1405
 URL: https://issues.apache.org/jira/browse/LUCENE-1405
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/*
Affects Versions: 2.3.2
Reporter: Przemyslaw Sztoch
 Fix For: 2.3.3


Ant Task for Lucene should use modern Resource model (not only FileSet child 
element).
There is a patch with required changes.

Supported by old (ant 1.6) and new (ant 1.7) resources model:
 
  
 

Supported only by new (ant 1.7) resources model:
 
  
 

 
  
 




[jira] Updated: (LUCENE-1405) Support for new Resources model in ant 1.7 in Lucene ant task.

2008-09-24 Thread Przemyslaw Sztoch (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Przemyslaw Sztoch updated LUCENE-1405:
--

Attachment: lucene-ant1.7-newresources.patch

Patch for current lucene SVN (rev 698454).

> Support for new Resources model in ant 1.7 in Lucene ant task.
> --
>
> Key: LUCENE-1405
> URL: https://issues.apache.org/jira/browse/LUCENE-1405
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/*
>Affects Versions: 2.3.2
>Reporter: Przemyslaw Sztoch
> Fix For: 2.3.3
>
> Attachments: lucene-ant1.7-newresources.patch
>
>
> Ant Task for Lucene should use modern Resource model (not only FileSet child 
> element).
> There is a patch with required changes.
> Supported by old (ant 1.6) and new (ant 1.7) resources model:
>  
>   
>  
> Supported only by new (ant 1.7) resources model:
>  
>   
>  
>  
>   
>  




[jira] Commented: (LUCENE-973) Token of "" returns in CJK

2008-09-24 Thread Toru Matsuzawa (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12634363#action_12634363
 ] 

Toru Matsuzawa commented on LUCENE-973:
---

Thank you, Sekiguchi-san and Steven, for your comments. 
I am sorry for the slow reply. 

{quote}
The following part of your patch appears to address a problem that you haven't 
covered in your comments - is this so? If it is a problem separate from the 
empty-string issue, can you describe the effects of this change?:
{quote}
In the current CJKTokenizer, the non-ASCII character "C3" is emitted with 
type "single", as the following example shows. 
{noformat} 
// C1, C2 and C3 are non-ASCII characters
String str = "C1C2abcC3def";
Tokenizer tokenizer = new CJKTokenizer( new StringReader( str ) );
for( Token token = tokenizer.next(); token != null; token = tokenizer.next() ) {
  System.out.println( "token=\"" + token.termText() + "\""
      + " type=\"" + token.type() + "\"" );
}
{noformat} 
The current CJKTokenizer outputs:
{noformat} 
token="C1C2" type="double"
token="" type="single"
token="abc" type="single"
token="C3" type="single"
token="def" type="single"
{noformat} 
After applying the patch, it outputs:
{noformat} 
token="C1C2" type="double"
token="C2" type="double"
token="abc" type="single"
token="C3" type="double"
token="def" type="single"
{noformat} 

{quote}
Wouldn't it be simpler/clearer to test length for zero instead of constructing 
a String and testing it for equality with the empty string?:
{quote}
I think that your correction is better. 
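
For illustration, the two ways of detecting an empty token could look like 
this in plain Java. This is a sketch only: EmptyTokenCheck, buffer and length 
are assumed names and not taken from the patch or from CJKTokenizer.

```java
final class EmptyTokenCheck {
  /** Builds a String just to compare it with "" (allocates on every call). */
  static boolean isEmptyViaString(char[] buffer, int length) {
    return new String(buffer, 0, length).equals("");
  }

  /** Same answer without any allocation. */
  static boolean isEmptyViaLength(int length) {
    return length == 0;
  }
}
```

Both checks agree for every valid length, so the length test is the simpler 
choice.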

> Token of  "" returns in CJK
> ---
>
> Key: LUCENE-973
> URL: https://issues.apache.org/jira/browse/LUCENE-973
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Analysis
>Affects Versions: 2.3
>Reporter: Toru Matsuzawa
> Attachments: CJKTokenizer20070807.patch, with-patch.jpg, 
> without-patch.jpg
>
>
> An empty string ("") is returned as a Token at the boundary between a 
> two-byte character and a one-byte character. 
> This is not a problem with CJKAnalyzer. 
> It becomes a problem when CJKTokenizer is used on its own (for example, 
> with Solr).
