Build failed in Hudson: Lucene-trunk #720
See http://hudson.zones.apache.org/hudson/job/Lucene-trunk/720/changes Changes: [mikemccand] LUCENE-1483: switch to newly added MultiReaderHitCollector for all core collectors, that is aware of segment transitions during searching, to improve performance of searching and warming [uschindler] Implement a shortcut, when range has min>max. In this case a static empty SortedVIntList is returned. [uschindler] LUCENE-1530: Support inclusive/exclusive for TrieRangeQuery/-Filter, remove default trie variant setters/getters -- [...truncated 10489 lines...] init: compile-test: [echo] Building swing... javacc-uptodate-check: javacc-notice: jflex-uptodate-check: jflex-notice: common.init: build-lucene: build-lucene-tests: init: clover.setup: clover.info: clover: compile-core: common.compile-test: common.test: [mkdir] Created dir: http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/contrib/swing/test [junit] Testsuite: org.apache.lucene.swing.models.TestBasicList [junit] Tests run: 2, Failures: 0, Errors: 0, Time elapsed: 0.614 sec [junit] [junit] Testsuite: org.apache.lucene.swing.models.TestBasicTable [junit] Tests run: 3, Failures: 0, Errors: 0, Time elapsed: 0.6 sec [junit] [junit] Testsuite: org.apache.lucene.swing.models.TestSearchingList [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.595 sec [junit] [junit] Testsuite: org.apache.lucene.swing.models.TestSearchingTable [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.636 sec [junit] [junit] Testsuite: org.apache.lucene.swing.models.TestUpdatingList [junit] Tests run: 4, Failures: 0, Errors: 0, Time elapsed: 0.73 sec [junit] [junit] Testsuite: org.apache.lucene.swing.models.TestUpdatingTable [junit] Tests run: 4, Failures: 0, Errors: 0, Time elapsed: 1.09 sec [junit] [delete] Deleting: http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/contrib/swing/test/junitfailed.flag [echo] Building wikipedia... javacc-uptodate-check: javacc-notice: jflex-uptodate-check: jflex-notice: common.init: build-lucene: build-lucene-tests: init: test: [echo] Building wikipedia... javacc-uptodate-check: javacc-notice: jflex-uptodate-check: jflex-notice: common.init: build-lucene: build-lucene-tests: init: compile-test: [echo] Building wikipedia... javacc-uptodate-check: javacc-notice: jflex-uptodate-check: jflex-notice: common.init: build-lucene: build-lucene-tests: init: clover.setup: clover.info: clover: compile-core: common.compile-test: common.test: [mkdir] Created dir: http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/contrib/wikipedia/test [junit] Testsuite: org.apache.lucene.wikipedia.analysis.WikipediaTokenizerTest [junit] Tests run: 5, Failures: 0, Errors: 0, Time elapsed: 0.407 sec [junit] [delete] Deleting: http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/contrib/wikipedia/test/junitfailed.flag [echo] Building wordnet... javacc-uptodate-check: javacc-notice: jflex-uptodate-check: jflex-notice: common.init: build-lucene: build-lucene-tests: init: test: [echo] Building xml-query-parser... javacc-uptodate-check: javacc-notice: jflex-uptodate-check: jflex-notice: common.init: build-lucene: build-lucene-tests: init: test: [echo] Building xml-query-parser... javacc-uptodate-check: javacc-notice: jflex-uptodate-check: jflex-notice: common.init: build-lucene: build-lucene-tests: init: compile-test: [echo] Building xml-query-parser... 
build-queries:
javacc-uptodate-check:
javacc-notice:
jflex-uptodate-check:
jflex-notice:
common.init:
build-lucene:
build-lucene-tests:
init:
clover.setup:
clover.info:
clover:
common.compile-core:
compile-core:
common.compile-test:
common.test:
    [mkdir] Created dir: http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/contrib/xml-query-parser/test
    [junit] Testsuite: org.apache.lucene.xmlparser.TestParser
    [junit] Tests run: 17, Failures: 0, Errors: 0, Time elapsed: 3.062 sec
    [junit]
    [junit] Testsuite: org.apache.lucene.xmlparser.TestQueryTemplateManager
    [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 2.782 sec
    [junit]
    [delete] Deleting: http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/contrib/xml-query-parser/test/junitfailed.flag
test:
download-tag:
    [mkdir] Created dir: http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/tags/tags/lucene_2_4_back_compat_tests_20090127
    [exec] Error validating server certificate for 'https://svn.apache.org:443':
    [exec]  - The certificate is not issued by a trusted authority. Use the
    [exec]    fingerprint to validate the certificate manually!
    [exec] Certificate information:
    [exec]  - Hostname: s
[jira] Commented: (LUCENE-1484) Remove SegmentReader.document synchronization
[ https://issues.apache.org/jira/browse/LUCENE-1484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12667917#action_12667917 ] Jason Bennett commented on LUCENE-1484: --- Is there any chance this patch could be released in 2.4.1, instead of waiting for 2.9? > Remove SegmentReader.document synchronization > - > > Key: LUCENE-1484 > URL: https://issues.apache.org/jira/browse/LUCENE-1484 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Affects Versions: 2.4 >Reporter: Jason Rutherglen >Assignee: Michael McCandless > Fix For: 2.9 > > Attachments: LUCENE-1484.patch, LUCENE-1484.patch > > Original Estimate: 96h > Remaining Estimate: 96h > > This is probably the last synchronization issue in Lucene. It is the > document method in SegmentReader. It is avoidable by using a threadlocal for > FieldsReader. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
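For anyone curious what the ThreadLocal approach boils down to, a minimal sketch is below. It is not taken from the attached patch: the field name fieldsReaderOrig and the assumption that FieldsReader can be cloned cheaply per thread are illustrative only.
{code:java}
// Sketch: give each thread its own FieldsReader so that
// SegmentReader.document() no longer needs a synchronized block.
private final ThreadLocal fieldsReaderLocal = new ThreadLocal() {
  protected Object initialValue() {
    // fieldsReaderOrig: the FieldsReader opened with the segment (hypothetical name)
    return fieldsReaderOrig.clone();
  }
};

public Document document(int n, FieldSelector fieldSelector) throws IOException {
  FieldsReader fieldsReader = (FieldsReader) fieldsReaderLocal.get();
  return fieldsReader.doc(n, fieldSelector);
}
{code}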
what is the correct javacc version to generate the queryparser?
Hi, Javacc 4.2 is out, but the code generated by this version is different from the code generated by javacc 4.1, which I think is the version used to generate the Lucene queryparser files. What is the official javacc version used when generating the queryparser classes? Is it a good idea to submit a patch with code generated by javacc 4.2? -Lafa - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
RE: RE: Hudson Java Docs?
Hi, When looking through the All-Javadocs index generated on hudson, I have seen some small bugs due to the new spatial contrib and the trie package. I modified build.xml and site.xml (attached patch). Can somebody with commit rights apply this? Maybe, the spatial contrib should go into CHANGES.txt of contrib, too. Uwe - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de > -Original Message- > From: Uwe Schindler [mailto:u...@thetaphi.de] > Sent: Tuesday, January 27, 2009 11:56 AM > To: java-dev@lucene.apache.org > Subject: RE: RE: Hudson Java Docs? > > > Alternately, we could turn off the "Publish Javadoc" feature, and > instead > > add trunk/build/docs/api to the list of files to "Archive" and then > start > > refering to a URL like this (doesn't work at the moment) for all the > > javadocs... > > > > http://hudson.zones.apache.org/hudson/view/Lucene/job/Lucene- > > trunk/lastSuccessfulBuild/artifact/trunk/build/docs/api/ > > > > turning that Javadoc feature off should eliminate the existing Javadoc > > links in the hudson navigation, but I suspect the old files would still > be > > there (and in search engine caches) > > Can we do a one-time cleanup (rm -rf in the directory) and then have a new > and clean start (maybe ask the hudson team at Apache)? The index.html file > for the javadocs in the root javadoc folder is possible, but it would not > remove the old files (the others, not index.html) from Googles > cache/index. > > Uwe > > > - > To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-dev-h...@lucene.apache.org fix-javadocs-build.patch Description: Binary data - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1528) Add support for Ideographic Space to the queryparser - also known as fullwidth space and wide-space
[ https://issues.apache.org/jira/browse/LUCENE-1528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12667837#action_12667837 ] Luis Alves commented on LUCENE-1528: Hi Michael, I checked the book "Generating parser with JavaCC" and I checked the javacc website (https://javacc.dev.java.net/doc/javaccgrm.html) for grammar, here is the syntax for a character list: character_list ::= [ "~" ] "[" [ character_descriptor ( "," character_descriptor )* ] "]" character_descriptor::= java_string_literal [ "-" java_string_literal ] also the '|' character in javacc syntax is used like an XOR, and there is no OR or AND operator to be used in the javacc syntax that I'm aware. So the expression <_WHITESPACE> | [ "+", ... ] would have to look like ~(<_WHITESPACE> & [ "+", ... ]) but this is not possible in javacc grammar. So I think the best option for now, is to keep the current syntax. If you like, I can change <#_WHITESPACE: ( " " | "\t" | "\n" | "\r") > to a character_list to make it more consistent, but that would not help to remove the duplicated list of characters. <#_WHITESPACE: [ " ", "\t", "\n", "\r" ] > > Add support for Ideographic Space to the queryparser - also know as fullwith > space and wide-space > - > > Key: LUCENE-1528 > URL: https://issues.apache.org/jira/browse/LUCENE-1528 > Project: Lucene - Java > Issue Type: Improvement > Components: QueryParser >Affects Versions: 2.4.1 >Reporter: Luis Alves >Assignee: Michael Busch >Priority: Minor > Fix For: 2.4.1 > > Attachments: lucene_wide_space_v1_src.patch > > Original Estimate: 4h > Remaining Estimate: 4h > > The Ideographic Space is a space character that is as wide as a normal CJK > character cell. > It is also known as wide-space or fullwith space.This type of space is used > in CJK languages. > This patch adds support for the wide space, making the queryparser component > more friendly > to queries that contain CJK text. > Reference: > 'http://en.wikipedia.org/wiki/Space_(punctuation)' - see Table of spaces, > char U+3000. > I also added a new testcase that fails before the patch. > After the patch is applied all junits pass. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
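As a quick illustration of what the patch is meant to achieve, a check along these lines could be used (a rough sketch, not part of the attached patch; the field name "f" and WhitespaceAnalyzer are arbitrary choices, and it assumes the patched grammar treats U+3000 exactly like an ASCII space):
{code:java}
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;

public class IdeographicSpaceCheck {
  public static void main(String[] args) throws Exception {
    QueryParser qp = new QueryParser("f", new WhitespaceAnalyzer());
    // U+3000 IDEOGRAPHIC SPACE between the two terms
    Query wide = qp.parse("foo\u3000bar");
    Query ascii = qp.parse("foo bar");
    // with the patch applied, both strings should parse to the same two-clause query
    System.out.println(wide.equals(ascii) ? "OK: " + wide : "differs: " + wide + " vs " + ascii);
  }
}
{code}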
[jira] Commented: (LUCENE-1507) adding EmptyDocIdSet/Iterator
[ https://issues.apache.org/jira/browse/LUCENE-1507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12667793#action_12667793 ] Michael McCandless commented on LUCENE-1507: New patch looks good, thanks Uwe. > adding EmptyDocIdSet/Iterator > - > > Key: LUCENE-1507 > URL: https://issues.apache.org/jira/browse/LUCENE-1507 > Project: Lucene - Java > Issue Type: Improvement > Components: Search >Affects Versions: 2.4 >Reporter: John Wang >Assignee: Michael McCandless > Attachments: emptydocidset.txt, LUCENE-1507.patch, LUCENE-1507.patch > > Original Estimate: 1h > Remaining Estimate: 1h > > Adding convenience classes for EmptyDocIdSet and EmptyDocIdSetIterator -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Resolved: (LUCENE-1483) Change IndexSearcher multisegment searches to search each individual segment using a single HitCollector
[ https://issues.apache.org/jira/browse/LUCENE-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless resolved LUCENE-1483. Resolution: Fixed Fix Version/s: 2.9 Committed revision 738219. Thanks to everyone who helped out here...and many thanks to Mark for working through so many iterations as we explored the different approaches here! > Change IndexSearcher multisegment searches to search each individual segment > using a single HitCollector > > > Key: LUCENE-1483 > URL: https://issues.apache.org/jira/browse/LUCENE-1483 > Project: Lucene - Java > Issue Type: Improvement >Affects Versions: 2.9 >Reporter: Mark Miller >Priority: Minor > Fix For: 2.9 > > Attachments: LUCENE-1483-backcompat.patch, LUCENE-1483-partial.patch, > LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, > LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, > LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, > LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, > LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, > LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, > LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, > LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, > LUCENE-1483.patch, LUCENE-1483.patch, sortBench.py, sortCollate.py > > > This issue changes how an IndexSearcher searches over multiple segments. The > current method of searching multiple segments is to use a MultiSegmentReader > and treat all of the segments as one. This causes filters and FieldCaches to > be keyed to the MultiReader and makes reopen expensive. If only a few > segments change, the FieldCache is still loaded for all of them. > This patch changes things by searching each individual segment one at a time, > but sharing the HitCollector used across each segment. This allows > FieldCaches and Filters to be keyed on individual SegmentReaders, making > reopen much cheaper. FieldCache loading over multiple segments can be much > faster as well - with the old method, all unique terms for every segment is > enumerated against each segment - because of the likely logarithmic change in > terms per segment, this can be very wasteful. Searching individual segments > avoids this cost. The term/document statistics from the multireader are used > to score results for each segment. > When sorting, its more difficult to use a single HitCollector for each sub > searcher. Ordinals are not comparable across segments. To account for this, a > new field sort enabled HitCollector is introduced that is able to collect and > sort across segments (because of its ability to compare ordinals across > segments). This TopFieldCollector class will collect the values/ordinals for > a given segment, and upon moving to the next segment, translate any > ordinals/values so that they can be compared against the values for the new > segment. This is done lazily. > All and all, the switch seems to provide numerous performance benefits, in > both sorted and non sorted search. We were seeing a good loss on indices with > lots of segments (1000?) and certain queue sizes / queries, but the latest > results seem to show thats been mostly taken care of (you shouldnt be using > such a large queue on such a segmented index anyway). > * Introduces > ** MultiReaderHitCollector - a HitCollector that can collect across multiple > IndexReaders. 
Old HitCollectors are wrapped to support multiple IndexReaders. > ** TopFieldCollector - a HitCollector that can compare values/ordinals across > IndexReaders and sort on fields. > ** FieldValueHitQueue - a Priority queue that is part of the > TopFieldCollector implementation. > ** FieldComparator - a new Comparator class that works across IndexReaders. > Part of the TopFieldCollector implementation. > ** FieldComparatorSource - new class to allow for custom Comparators. > * Alters > ** IndexSearcher uses a single HitCollector to collect hits against each > individual SegmentReader. All the other changes stem from this ;) > * Deprecates > ** TopFieldDocCollector > ** FieldSortedHitQueue -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
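To give a feel for the new API, here is a rough sketch of a minimal collector under this change. It assumes the setNextReader(IndexReader reader, int docBase) callback described in this issue and is not code from the committed patch:
{code:java}
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.MultiReaderHitCollector;

// Counts hits across all segments and remembers the last hit in the
// top-level (MultiReader) doc id space.
public class CountingCollector extends MultiReaderHitCollector {
  private int docBase;
  private int count;
  private int lastDoc = -1;

  public void setNextReader(IndexReader reader, int base) {
    // called once per segment before collect(); per-segment FieldCache
    // lookups would be done here, keyed on the SegmentReader
    this.docBase = base;
  }

  public void collect(int doc, float score) {
    count++;
    lastDoc = docBase + doc; // translate the segment-relative doc id
  }

  public int getCount() { return count; }
  public int getLastDoc() { return lastDoc; }
}
{code}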
[jira] Updated: (LUCENE-1507) adding EmptyDocIdSet/Iterator
[ https://issues.apache.org/jira/browse/LUCENE-1507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated LUCENE-1507: -- Attachment: LUCENE-1507.patch Hi Mike, I just updated the patch a little bit to supply javadocs for iterator() method, too. It also contains the first example usage in TrieRangeFilter (where a private instance was used until now). This can be committed together with this. Maybe the conventional RangeFilter/RangeQuery can be optimized in that way, too. > adding EmptyDocIdSet/Iterator > - > > Key: LUCENE-1507 > URL: https://issues.apache.org/jira/browse/LUCENE-1507 > Project: Lucene - Java > Issue Type: Improvement > Components: Search >Affects Versions: 2.4 >Reporter: John Wang >Assignee: Michael McCandless > Attachments: emptydocidset.txt, LUCENE-1507.patch, LUCENE-1507.patch > > Original Estimate: 1h > Remaining Estimate: 1h > > Adding convenience classes for EmptyDocIdSet and EmptyDocIdSetIterator -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1483) Change IndexSearcher multisegment searches to search each individual segment using a single HitCollector
[ https://issues.apache.org/jira/browse/LUCENE-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12667775#action_12667775 ] Michael McCandless commented on LUCENE-1483: bq. The only problem with parallelization is that the MultiReaderHitCollector must be synchronized in some way. I think we'd have to collect to separate collectors and then merge (like ParallelMultiSearcher does today)? I think this (separate thread for the "big" segments, and one thread for the "long tail") would be a good approach, except I don't like that the performance would depend so much on the structure of the index. EG after you've optimized your index you'd suddenly get no concurrency, and presumably worse performance than when you had a few big segments. Could we instead divide the index into chunks and have each thread skipTo the start of its chunk? EG if the index has N docs, and you want to use M threads, each thread visits N/M docs. If that can work it should be less dependent on the index structure. > Change IndexSearcher multisegment searches to search each individual segment > using a single HitCollector > > > Key: LUCENE-1483 > URL: https://issues.apache.org/jira/browse/LUCENE-1483 > Project: Lucene - Java > Issue Type: Improvement >Affects Versions: 2.9 >Reporter: Mark Miller >Priority: Minor > Attachments: LUCENE-1483-backcompat.patch, LUCENE-1483-partial.patch, > LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, > LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, > LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, > LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, > LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, > LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, > LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, > LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, > LUCENE-1483.patch, LUCENE-1483.patch, sortBench.py, sortCollate.py > > > This issue changes how an IndexSearcher searches over multiple segments. The > current method of searching multiple segments is to use a MultiSegmentReader > and treat all of the segments as one. This causes filters and FieldCaches to > be keyed to the MultiReader and makes reopen expensive. If only a few > segments change, the FieldCache is still loaded for all of them. > This patch changes things by searching each individual segment one at a time, > but sharing the HitCollector used across each segment. This allows > FieldCaches and Filters to be keyed on individual SegmentReaders, making > reopen much cheaper. FieldCache loading over multiple segments can be much > faster as well - with the old method, all unique terms for every segment is > enumerated against each segment - because of the likely logarithmic change in > terms per segment, this can be very wasteful. Searching individual segments > avoids this cost. The term/document statistics from the multireader are used > to score results for each segment. > When sorting, its more difficult to use a single HitCollector for each sub > searcher. Ordinals are not comparable across segments. To account for this, a > new field sort enabled HitCollector is introduced that is able to collect and > sort across segments (because of its ability to compare ordinals across > segments). 
This TopFieldCollector class will collect the values/ordinals for > a given segment, and upon moving to the next segment, translate any > ordinals/values so that they can be compared against the values for the new > segment. This is done lazily. > All and all, the switch seems to provide numerous performance benefits, in > both sorted and non sorted search. We were seeing a good loss on indices with > lots of segments (1000?) and certain queue sizes / queries, but the latest > results seem to show thats been mostly taken care of (you shouldnt be using > such a large queue on such a segmented index anyway). > * Introduces > ** MultiReaderHitCollector - a HitCollector that can collect across multiple > IndexReaders. Old HitCollectors are wrapped to support multiple IndexReaders. > ** TopFieldCollector - a HitCollector that can compare values/ordinals across > IndexReaders and sort on fields. > ** FieldValueHitQueue - a Priority queue that is part of the > TopFieldCollector implementation. > ** FieldComparator - a new Comparator class that works across IndexReaders. > Part of the TopFieldCollector implementation. > ** FieldComparatorSource - new class to allow for custom Comparators. > * Alters > ** IndexSearcher uses a sing
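The doc-range splitting suggested above is just arithmetic over maxDoc; a sketch of how the chunks might be computed is below (illustrative only — each worker would still need its own Scorer positioned with skipTo(start) and stopped at end):
{code:java}
public class DocRangeSplitter {
  // Split [0, maxDoc) into numThreads roughly equal, contiguous ranges.
  // ranges[t][0] is the inclusive start, ranges[t][1] the exclusive end.
  static int[][] docRanges(int maxDoc, int numThreads) {
    int chunkSize = (maxDoc + numThreads - 1) / numThreads;
    int[][] ranges = new int[numThreads][2];
    for (int t = 0; t < numThreads; t++) {
      ranges[t][0] = Math.min(t * chunkSize, maxDoc);
      ranges[t][1] = Math.min((t + 1) * chunkSize, maxDoc);
    }
    return ranges;
  }
}
{code}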
[jira] Commented: (LUCENE-1483) Change IndexSearcher multisegment searches to search each individual segment using a single HitCollector
[ https://issues.apache.org/jira/browse/LUCENE-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12667774#action_12667774 ] Michael McCandless commented on LUCENE-1483: bq. since last Friday, we had no problem with the new sort implementation. OK, excellent. I will commit shortly! > Change IndexSearcher multisegment searches to search each individual segment > using a single HitCollector > > > Key: LUCENE-1483 > URL: https://issues.apache.org/jira/browse/LUCENE-1483 > Project: Lucene - Java > Issue Type: Improvement >Affects Versions: 2.9 >Reporter: Mark Miller >Priority: Minor > Attachments: LUCENE-1483-backcompat.patch, LUCENE-1483-partial.patch, > LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, > LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, > LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, > LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, > LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, > LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, > LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, > LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, > LUCENE-1483.patch, LUCENE-1483.patch, sortBench.py, sortCollate.py > > > This issue changes how an IndexSearcher searches over multiple segments. The > current method of searching multiple segments is to use a MultiSegmentReader > and treat all of the segments as one. This causes filters and FieldCaches to > be keyed to the MultiReader and makes reopen expensive. If only a few > segments change, the FieldCache is still loaded for all of them. > This patch changes things by searching each individual segment one at a time, > but sharing the HitCollector used across each segment. This allows > FieldCaches and Filters to be keyed on individual SegmentReaders, making > reopen much cheaper. FieldCache loading over multiple segments can be much > faster as well - with the old method, all unique terms for every segment is > enumerated against each segment - because of the likely logarithmic change in > terms per segment, this can be very wasteful. Searching individual segments > avoids this cost. The term/document statistics from the multireader are used > to score results for each segment. > When sorting, its more difficult to use a single HitCollector for each sub > searcher. Ordinals are not comparable across segments. To account for this, a > new field sort enabled HitCollector is introduced that is able to collect and > sort across segments (because of its ability to compare ordinals across > segments). This TopFieldCollector class will collect the values/ordinals for > a given segment, and upon moving to the next segment, translate any > ordinals/values so that they can be compared against the values for the new > segment. This is done lazily. > All and all, the switch seems to provide numerous performance benefits, in > both sorted and non sorted search. We were seeing a good loss on indices with > lots of segments (1000?) and certain queue sizes / queries, but the latest > results seem to show thats been mostly taken care of (you shouldnt be using > such a large queue on such a segmented index anyway). > * Introduces > ** MultiReaderHitCollector - a HitCollector that can collect across multiple > IndexReaders. Old HitCollectors are wrapped to support multiple IndexReaders. 
> ** TopFieldCollector - a HitCollector that can compare values/ordinals across > IndexReaders and sort on fields. > ** FieldValueHitQueue - a Priority queue that is part of the > TopFieldCollector implementation. > ** FieldComparator - a new Comparator class that works across IndexReaders. > Part of the TopFieldCollector implementation. > ** FieldComparatorSource - new class to allow for custom Comparators. > * Alters > ** IndexSearcher uses a single HitCollector to collect hits against each > individual SegmentReader. All the other changes stem from this ;) > * Deprecates > ** TopFieldDocCollector > ** FieldSortedHitQueue -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1476) BitVector implement DocIdSet, IndexReader returns DocIdSet deleted docs
[ https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12667770#action_12667770 ] Michael McCandless commented on LUCENE-1476: bq. Perhaps this should be an option in the benchmark output? That's a great idea! Something silly must be going on... 99% performance drop can't be right. > BitVector implement DocIdSet, IndexReader returns DocIdSet deleted docs > --- > > Key: LUCENE-1476 > URL: https://issues.apache.org/jira/browse/LUCENE-1476 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Affects Versions: 2.4 >Reporter: Jason Rutherglen >Priority: Trivial > Attachments: LUCENE-1476.patch, LUCENE-1476.patch, LUCENE-1476.patch, > quasi_iterator_deletions.diff, quasi_iterator_deletions_r2.diff, > searchdeletes.alg > > Original Estimate: 12h > Remaining Estimate: 12h > > Update BitVector to implement DocIdSet. Expose deleted docs DocIdSet from > IndexReader. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1507) adding EmptyDocIdSet/Iterator
[ https://issues.apache.org/jira/browse/LUCENE-1507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12667762#action_12667762 ] Michael McCandless commented on LUCENE-1507: That looks great to me! I'll commit in a day or two. > adding EmptyDocIdSet/Iterator > - > > Key: LUCENE-1507 > URL: https://issues.apache.org/jira/browse/LUCENE-1507 > Project: Lucene - Java > Issue Type: Improvement > Components: Search >Affects Versions: 2.4 >Reporter: John Wang > Attachments: emptydocidset.txt, LUCENE-1507.patch > > Original Estimate: 1h > Remaining Estimate: 1h > > Adding convenience classes for EmptyDocIdSet and EmptyDocIdSetIterator -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Assigned: (LUCENE-1507) adding EmptyDocIdSet/Iterator
[ https://issues.apache.org/jira/browse/LUCENE-1507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless reassigned LUCENE-1507: -- Assignee: Michael McCandless > adding EmptyDocIdSet/Iterator > - > > Key: LUCENE-1507 > URL: https://issues.apache.org/jira/browse/LUCENE-1507 > Project: Lucene - Java > Issue Type: Improvement > Components: Search >Affects Versions: 2.4 >Reporter: John Wang >Assignee: Michael McCandless > Attachments: emptydocidset.txt, LUCENE-1507.patch > > Original Estimate: 1h > Remaining Estimate: 1h > > Adding convenience classes for EmptyDocIdSet and EmptyDocIdSetIterator -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1483) Change IndexSearcher multisegment searches to search each individual segment using a single HitCollector
[ https://issues.apache.org/jira/browse/LUCENE-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12667755#action_12667755 ] Uwe Schindler commented on LUCENE-1483: --- Jason: We should open a new issue for that after this one is solved. Maybe we can create a good parallelized implementation after solving the problems with MultiReaderHitCollector (if more than one thread call setNextReader with collect calls inbetween, it would not work). > Change IndexSearcher multisegment searches to search each individual segment > using a single HitCollector > > > Key: LUCENE-1483 > URL: https://issues.apache.org/jira/browse/LUCENE-1483 > Project: Lucene - Java > Issue Type: Improvement >Affects Versions: 2.9 >Reporter: Mark Miller >Priority: Minor > Attachments: LUCENE-1483-backcompat.patch, LUCENE-1483-partial.patch, > LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, > LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, > LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, > LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, > LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, > LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, > LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, > LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, > LUCENE-1483.patch, LUCENE-1483.patch, sortBench.py, sortCollate.py > > > This issue changes how an IndexSearcher searches over multiple segments. The > current method of searching multiple segments is to use a MultiSegmentReader > and treat all of the segments as one. This causes filters and FieldCaches to > be keyed to the MultiReader and makes reopen expensive. If only a few > segments change, the FieldCache is still loaded for all of them. > This patch changes things by searching each individual segment one at a time, > but sharing the HitCollector used across each segment. This allows > FieldCaches and Filters to be keyed on individual SegmentReaders, making > reopen much cheaper. FieldCache loading over multiple segments can be much > faster as well - with the old method, all unique terms for every segment is > enumerated against each segment - because of the likely logarithmic change in > terms per segment, this can be very wasteful. Searching individual segments > avoids this cost. The term/document statistics from the multireader are used > to score results for each segment. > When sorting, its more difficult to use a single HitCollector for each sub > searcher. Ordinals are not comparable across segments. To account for this, a > new field sort enabled HitCollector is introduced that is able to collect and > sort across segments (because of its ability to compare ordinals across > segments). This TopFieldCollector class will collect the values/ordinals for > a given segment, and upon moving to the next segment, translate any > ordinals/values so that they can be compared against the values for the new > segment. This is done lazily. > All and all, the switch seems to provide numerous performance benefits, in > both sorted and non sorted search. We were seeing a good loss on indices with > lots of segments (1000?) and certain queue sizes / queries, but the latest > results seem to show thats been mostly taken care of (you shouldnt be using > such a large queue on such a segmented index anyway). 
> * Introduces > ** MultiReaderHitCollector - a HitCollector that can collect across multiple > IndexReaders. Old HitCollectors are wrapped to support multiple IndexReaders. > ** TopFieldCollector - a HitCollector that can compare values/ordinals across > IndexReaders and sort on fields. > ** FieldValueHitQueue - a Priority queue that is part of the > TopFieldCollector implementation. > ** FieldComparator - a new Comparator class that works across IndexReaders. > Part of the TopFieldCollector implementation. > ** FieldComparatorSource - new class to allow for custom Comparators. > * Alters > ** IndexSearcher uses a single HitCollector to collect hits against each > individual SegmentReader. All the other changes stem from this ;) > * Deprecates > ** TopFieldDocCollector > ** FieldSortedHitQueue -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1483) Change IndexSearcher multisegment searches to search each individual segment using a single HitCollector
[ https://issues.apache.org/jira/browse/LUCENE-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12667752#action_12667752 ] Uwe Schindler commented on LUCENE-1483: --- Hi Mike, since last Friday, we had no problem with the new sort implementation. No exceptions from Lucene or any problems with Lucene. The sorting of results was (as far as I have seen) always correct (tested was SortField.INT, SortField.STRING). The index was updated each half hour and reopened, really great performance. There were also no errors after an optimize() and reopen again on Sunday (only that it took longer than to warmup the sorting). > Change IndexSearcher multisegment searches to search each individual segment > using a single HitCollector > > > Key: LUCENE-1483 > URL: https://issues.apache.org/jira/browse/LUCENE-1483 > Project: Lucene - Java > Issue Type: Improvement >Affects Versions: 2.9 >Reporter: Mark Miller >Priority: Minor > Attachments: LUCENE-1483-backcompat.patch, LUCENE-1483-partial.patch, > LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, > LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, > LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, > LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, > LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, > LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, > LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, > LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, > LUCENE-1483.patch, LUCENE-1483.patch, sortBench.py, sortCollate.py > > > This issue changes how an IndexSearcher searches over multiple segments. The > current method of searching multiple segments is to use a MultiSegmentReader > and treat all of the segments as one. This causes filters and FieldCaches to > be keyed to the MultiReader and makes reopen expensive. If only a few > segments change, the FieldCache is still loaded for all of them. > This patch changes things by searching each individual segment one at a time, > but sharing the HitCollector used across each segment. This allows > FieldCaches and Filters to be keyed on individual SegmentReaders, making > reopen much cheaper. FieldCache loading over multiple segments can be much > faster as well - with the old method, all unique terms for every segment is > enumerated against each segment - because of the likely logarithmic change in > terms per segment, this can be very wasteful. Searching individual segments > avoids this cost. The term/document statistics from the multireader are used > to score results for each segment. > When sorting, its more difficult to use a single HitCollector for each sub > searcher. Ordinals are not comparable across segments. To account for this, a > new field sort enabled HitCollector is introduced that is able to collect and > sort across segments (because of its ability to compare ordinals across > segments). This TopFieldCollector class will collect the values/ordinals for > a given segment, and upon moving to the next segment, translate any > ordinals/values so that they can be compared against the values for the new > segment. This is done lazily. > All and all, the switch seems to provide numerous performance benefits, in > both sorted and non sorted search. We were seeing a good loss on indices with > lots of segments (1000?) 
and certain queue sizes / queries, but the latest > results seem to show thats been mostly taken care of (you shouldnt be using > such a large queue on such a segmented index anyway). > * Introduces > ** MultiReaderHitCollector - a HitCollector that can collect across multiple > IndexReaders. Old HitCollectors are wrapped to support multiple IndexReaders. > ** TopFieldCollector - a HitCollector that can compare values/ordinals across > IndexReaders and sort on fields. > ** FieldValueHitQueue - a Priority queue that is part of the > TopFieldCollector implementation. > ** FieldComparator - a new Comparator class that works across IndexReaders. > Part of the TopFieldCollector implementation. > ** FieldComparatorSource - new class to allow for custom Comparators. > * Alters > ** IndexSearcher uses a single HitCollector to collect hits against each > individual SegmentReader. All the other changes stem from this ;) > * Deprecates > ** TopFieldDocCollector > ** FieldSortedHitQueue -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe
[jira] Updated: (LUCENE-1507) adding EmptyDocIdSet/Iterator
[ https://issues.apache.org/jira/browse/LUCENE-1507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated LUCENE-1507: -- Attachment: LUCENE-1507.patch How about that patch? It just a static final for usage like this in filters: {code} if (shortcut condition) return DocIdSet.EMPTY_DOCIDSET {code} > adding EmptyDocIdSet/Iterator > - > > Key: LUCENE-1507 > URL: https://issues.apache.org/jira/browse/LUCENE-1507 > Project: Lucene - Java > Issue Type: Improvement > Components: Search >Affects Versions: 2.4 >Reporter: John Wang > Attachments: emptydocidset.txt, LUCENE-1507.patch > > Original Estimate: 1h > Remaining Estimate: 1h > > Adding convenience classes for EmptyDocIdSet and EmptyDocIdSetIterator -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
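In a filter the shortcut would look roughly like the sketch below. It assumes the static DocIdSet.EMPTY_DOCIDSET field this patch adds and the trunk Filter API where getDocIdSet(IndexReader) is the method to override; the filter class itself is made up for illustration:
{code:java}
import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.DocIdSet;
import org.apache.lucene.search.Filter;

public class ExampleLongRangeFilter extends Filter {
  private final long min, max;

  public ExampleLongRangeFilter(long min, long max) {
    this.min = min;
    this.max = max;
  }

  public DocIdSet getDocIdSet(IndexReader reader) throws IOException {
    if (min > max) {
      // empty range: return the shared empty set instead of allocating one
      return DocIdSet.EMPTY_DOCIDSET;
    }
    // ... build and return the real DocIdSet for [min..max] here ...
    return DocIdSet.EMPTY_DOCIDSET;
  }
}
{code}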
[jira] Commented: (LUCENE-1476) BitVector implement DocIdSet, IndexReader returns DocIdSet deleted docs
[ https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12667748#action_12667748 ] Marvin Humphrey commented on LUCENE-1476: - > The percentage performance decrease in the previous > results is 99%. That's pretty strange. I look forward to seeing profiling data. > BitVector implement DocIdSet, IndexReader returns DocIdSet deleted docs > --- > > Key: LUCENE-1476 > URL: https://issues.apache.org/jira/browse/LUCENE-1476 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Affects Versions: 2.4 >Reporter: Jason Rutherglen >Priority: Trivial > Attachments: LUCENE-1476.patch, LUCENE-1476.patch, LUCENE-1476.patch, > quasi_iterator_deletions.diff, quasi_iterator_deletions_r2.diff, > searchdeletes.alg > > Original Estimate: 12h > Remaining Estimate: 12h > > Update BitVector to implement DocIdSet. Expose deleted docs DocIdSet from > IndexReader. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1476) BitVector implement DocIdSet, IndexReader returns DocIdSet deleted docs
[ https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12667713#action_12667713 ] Jason Rutherglen commented on LUCENE-1476: -- The percentage performance decrease in the previous results is 99%. {quote} Jason can you format those results using a Jira table? {quote} Perhaps this should be an option in the benchmark output? {quote} M.M. LUCENE-1516 comment: "I think the larger number of [harder-for-cpu-to-predict] if statements may be the cause of the slowdown once %tg deletes gets high enough?" {quote} I have been looking at the performance with YourKit and don't have any conclusions yet. The main difference between using skipto and BV.get is the if statements and some added method calls, which even if they are inlined I suspect will not make up the difference. Next steps: 1. Deletes as a NOT boolean query which probably should be it's own patch 2. Pluggable alternative representations such as OpenBitSet and int array, part of this patch? > BitVector implement DocIdSet, IndexReader returns DocIdSet deleted docs > --- > > Key: LUCENE-1476 > URL: https://issues.apache.org/jira/browse/LUCENE-1476 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Affects Versions: 2.4 >Reporter: Jason Rutherglen >Priority: Trivial > Attachments: LUCENE-1476.patch, LUCENE-1476.patch, LUCENE-1476.patch, > quasi_iterator_deletions.diff, quasi_iterator_deletions_r2.diff, > searchdeletes.alg > > Original Estimate: 12h > Remaining Estimate: 12h > > Update BitVector to implement DocIdSet. Expose deleted docs DocIdSet from > IndexReader. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
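To make the comparison concrete, the two styles of deletion check differ roughly as in the sketch below (not the patch code; it assumes the deletions iterator has already been positioned with an initial next() or skipTo(0)):
{code:java}
import java.io.IOException;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.util.BitVector;

public class DeletionCheckSketch {

  // (a) random-access check, as the BitVector-based code does today
  static boolean deletedByGet(BitVector deletedDocs, int doc) {
    return deletedDocs != null && deletedDocs.get(doc);
  }

  // (b) iterator-style check from the quasi-iterator patches: keep the
  // deletions iterator in lock step with the candidate doc. The extra
  // branch and method call per doc are the suspected cost discussed above.
  static boolean deletedBySkipTo(DocIdSetIterator deleted, int doc) throws IOException {
    if (deleted.doc() < doc && !deleted.skipTo(doc)) {
      return false; // iterator exhausted: no deletions at or after doc
    }
    return deleted.doc() == doc;
  }
}
{code}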
[jira] Resolved: (LUCENE-1530) Support inclusive/exclusive for TrieRangeQuery/-Filter, remove default trie variant setters/getters
[ https://issues.apache.org/jira/browse/LUCENE-1530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler resolved LUCENE-1530. --- Resolution: Fixed Lucene Fields: [New, Patch Available] (was: [Patch Available, New]) Comitted revision 738109 > Support inclusive/exclusive for TrieRangeQuery/-Filter, remove default trie > variant setters/getters > --- > > Key: LUCENE-1530 > URL: https://issues.apache.org/jira/browse/LUCENE-1530 > Project: Lucene - Java > Issue Type: New Feature > Components: contrib/* >Affects Versions: 2.9 >Reporter: Uwe Schindler >Assignee: Uwe Schindler > Fix For: 2.9 > > Attachments: LUCENE-1530.patch, LUCENE-1530.patch > > > TrieRangeQuery/Filter is missing one thing: Ranges that have exclusive > bounds. For TrieRangeQuery this may not be important for ranges on long or > Date (==long) values (because [1..5] is the same like ]0..6[ or ]0..5]). This > is not so simple for doubles because you must add/substract 1 from the trie > encoded unsigned long. > To be conform with the other range queries, I will submit a patch that has > two additional boolean parameters in the ctors to support inclusive/exclusive > ranges for both ends. Internally it will be implemented using > TrieUtils.incrementTrieCoded/decrementTrieCoded() but makes life simplier for > double ranges (a simple exclusive replacement for the floating point range > [0.0..1.0] is not possible without having the underlying unsigned long). > In December, when trie contrib was included (LUCENE-1470), 3 trie variants > were supplied by TrieUtils. For new APIs a statically configureable default > Trie variant does not conform to an API we want in Lucene (currently we want > to deprecate all these static setters/getters). The important thing: It does > not make code shorter or easier to understand, its more error prone. Before > release of 2.9 it is a good time to remove the default trie variant and > always force the parameter in TrieRangeQuery/Filter. It is better to choose > the variant in the application and do not automatically manage it. > As Lucene 2.9 was not yet released, I will change the ctors and not preserve > the old ones. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
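For long ranges the inclusive/exclusive handling is just a one-step adjustment of the bounds, as in the sketch below; per the description, for doubles the same step is applied to the trie-encoded unsigned long via TrieUtils.incrementTrieCoded/decrementTrieCoded rather than to the double value itself:
{code:java}
public class BoundsSketch {
  // Normalize exclusive bounds to inclusive ones for a long range.
  // This is why ]0..6[ selects exactly the same values as [1..5].
  static long lowerInclusive(long min, boolean minInclusive) {
    return minInclusive ? min : min + 1;
  }

  static long upperInclusive(long max, boolean maxInclusive) {
    return maxInclusive ? max : max - 1;
  }
}
{code}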
[jira] Commented: (LUCENE-1489) highlighter problem with n-gram tokens
[ https://issues.apache.org/jira/browse/LUCENE-1489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12667654#action_12667654 ] Mark Harwood commented on LUCENE-1489: -- It looks to me like this could be fixed in the "Formatter" classes when marking up the output string. Currently classes such as SimpleHTMLFormatter in their "highlightTerm" method put a tag around the whole section of text, if it contains a hit, i.e.
{code:title=SimpleHTMLFormatter.java|borderStyle=solid}
public String highlightTerm(String originalText, TokenGroup tokenGroup)
{
  StringBuffer returnBuffer;
  if(tokenGroup.getTotalScore()>0)
  {
    returnBuffer=new StringBuffer();
    returnBuffer.append(preTag);
    returnBuffer.append(originalText);
    returnBuffer.append(postTag);
    return returnBuffer.toString();
  }
  return originalText;
}
{code}
The TokenGroup object passed to this method contains all of the tokens and their scores so it should be possible to use this information to deconstruct the originalText parameter and inject markup according to which tokens in the group had a match rather than putting a tag around the whole block. Some complexity may lie in handling token streams that produce tokens that "rewind" to earlier offsets. SimpleHtmlFormatter suddenly seems less simple! TokenStreams that produce entirely overlapping streams of tokens will automatically be broken into multiple TokenGroups because TokenGroup has a maximum number of linked Tokens it will ever hold in a single group. I haven't got the time to fix this right now but if someone has a burning need to leap in, the above seems like what may be required. Cheers Mark
> highlighter problem with n-gram tokens
> --
>
> Key: LUCENE-1489
> URL: https://issues.apache.org/jira/browse/LUCENE-1489
> Project: Lucene - Java
> Issue Type: Bug
> Components: contrib/highlighter
> Reporter: Koji Sekiguchi
> Priority: Minor
>
> I have a problem when using n-gram and highlighter. I thought it had been
> solved in LUCENE-627...
> Actually, I found this problem when I was using CJKTokenizer on Solr, though,
> here is lucene program to reproduce it using NGramTokenizer(min=2,max=2)
> instead of CJKTokenizer:
> {code:java}
> public class TestNGramHighlighter {
>   public static void main(String[] args) throws Exception {
>     Analyzer analyzer = new NGramAnalyzer();
>     final String TEXT = "Lucene can make index. Then Lucene can search.";
>     final String QUERY = "can";
>     QueryParser parser = new QueryParser("f",analyzer);
>     Query query = parser.parse(QUERY);
>     QueryScorer scorer = new QueryScorer(query,"f");
>     Highlighter h = new Highlighter( scorer );
>     System.out.println( h.getBestFragment(analyzer, "f", TEXT) );
>   }
>   static class NGramAnalyzer extends Analyzer {
>     public TokenStream tokenStream(String field, Reader input) {
>       return new NGramTokenizer(input,2,2);
>     }
>   }
> }
> {code}
> expected output is:
> Lucene can make index. Then Lucene can search.
> but the actual output is:
> Lucene can make index. Then Lucene can search. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
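A rough sketch of the per-token markup Mark describes might look like the following. It is not an attached patch: it assumes originalText starts at the first token's start offset and it simply skips tokens that overlap or rewind:
{code:java}
import org.apache.lucene.analysis.Token;
import org.apache.lucene.search.highlight.Formatter;
import org.apache.lucene.search.highlight.TokenGroup;

// Wraps only the tokens that actually scored, instead of the whole group.
public class PerTokenHTMLFormatter implements Formatter {
  public String highlightTerm(String originalText, TokenGroup tokenGroup) {
    if (tokenGroup.getTotalScore() <= 0) {
      return originalText;
    }
    int base = tokenGroup.getToken(0).startOffset();
    StringBuffer sb = new StringBuffer();
    int written = 0; // how much of originalText has been copied so far
    for (int i = 0; i < tokenGroup.getNumTokens(); i++) {
      Token t = tokenGroup.getToken(i);
      int start = t.startOffset() - base;
      int end = t.endOffset() - base;
      if (start < written || end > originalText.length()) {
        continue; // overlapping or rewinding token: out of scope for this sketch
      }
      sb.append(originalText.substring(written, start));
      String body = originalText.substring(start, end);
      if (tokenGroup.getScore(i) > 0) {
        sb.append("<B>").append(body).append("</B>");
      } else {
        sb.append(body);
      }
      written = end;
    }
    sb.append(originalText.substring(written));
    return sb.toString();
  }
}
{code}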
[jira] Commented: (LUCENE-1507) adding EmptyDocIdSet/Iterator
[ https://issues.apache.org/jira/browse/LUCENE-1507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12667645#action_12667645 ] Michael McCandless commented on LUCENE-1507: We could simply add a static method somewhere (getEmptyDocIdSet()) to retrieve a single re-used instance of 0-sized SortedVIntList? > adding EmptyDocIdSet/Iterator > - > > Key: LUCENE-1507 > URL: https://issues.apache.org/jira/browse/LUCENE-1507 > Project: Lucene - Java > Issue Type: Improvement > Components: Search >Affects Versions: 2.4 >Reporter: John Wang > Attachments: emptydocidset.txt > > Original Estimate: 1h > Remaining Estimate: 1h > > Adding convenience classes for EmptyDocIdSet and EmptyDocIdSetIterator -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1530) Support inclusive/exclusive for TrieRangeQuery/-Filter, remove default trie variant setters/getters
[ https://issues.apache.org/jira/browse/LUCENE-1530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12667644#action_12667644 ] Michael McCandless commented on LUCENE-1530: bq. If nobody complains and the removal of not yet released constructors without the boolean parameters is ok, This is perfectly fine. Not yet released APIs are free to change. > Support inclusive/exclusive for TrieRangeQuery/-Filter, remove default trie > variant setters/getters > --- > > Key: LUCENE-1530 > URL: https://issues.apache.org/jira/browse/LUCENE-1530 > Project: Lucene - Java > Issue Type: New Feature > Components: contrib/* >Affects Versions: 2.9 >Reporter: Uwe Schindler >Assignee: Uwe Schindler > Fix For: 2.9 > > Attachments: LUCENE-1530.patch, LUCENE-1530.patch > > > TrieRangeQuery/Filter is missing one thing: Ranges that have exclusive > bounds. For TrieRangeQuery this may not be important for ranges on long or > Date (==long) values (because [1..5] is the same like ]0..6[ or ]0..5]). This > is not so simple for doubles because you must add/substract 1 from the trie > encoded unsigned long. > To be conform with the other range queries, I will submit a patch that has > two additional boolean parameters in the ctors to support inclusive/exclusive > ranges for both ends. Internally it will be implemented using > TrieUtils.incrementTrieCoded/decrementTrieCoded() but makes life simplier for > double ranges (a simple exclusive replacement for the floating point range > [0.0..1.0] is not possible without having the underlying unsigned long). > In December, when trie contrib was included (LUCENE-1470), 3 trie variants > were supplied by TrieUtils. For new APIs a statically configureable default > Trie variant does not conform to an API we want in Lucene (currently we want > to deprecate all these static setters/getters). The important thing: It does > not make code shorter or easier to understand, its more error prone. Before > release of 2.9 it is a good time to remove the default trie variant and > always force the parameter in TrieRangeQuery/Filter. It is better to choose > the variant in the application and do not automatically manage it. > As Lucene 2.9 was not yet released, I will change the ctors and not preserve > the old ones. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
RE: RE: Hudson Java Docs?
> Alternately, we could turn off the "Publish Javadoc" feature, and instead > add trunk/build/docs/api to the list of files to "Archive" and then start > refering to a URL like this (doesn't work at the moment) for all the > javadocs... > > http://hudson.zones.apache.org/hudson/view/Lucene/job/Lucene- > trunk/lastSuccessfulBuild/artifact/trunk/build/docs/api/ > > turning that Javadoc feature off should eliminate the existing Javadoc > links in the hudson navigation, but I suspect the old files would still be > there (and in search engine caches) Can we do a one-time cleanup (rm -rf in the directory) and then have a new and clean start (maybe ask the hudson team at Apache)? The index.html file for the javadocs in the root javadoc folder is possible, but it would not remove the old files (the others, not index.html) from Googles cache/index. Uwe - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-1530) Support inclusive/exclusive for TrieRangeQuery/-Filter, remove default trie variant setters/getters
[ https://issues.apache.org/jira/browse/LUCENE-1530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated LUCENE-1530: -- Description: TrieRangeQuery/Filter is missing one thing: Ranges that have exclusive bounds. For TrieRangeQuery this may not be important for ranges on long or Date (==long) values (because [1..5] is the same like ]0..6[ or ]0..5]). This is not so simple for doubles because you must add/substract 1 from the trie encoded unsigned long. To be conform with the other range queries, I will submit a patch that has two additional boolean parameters in the ctors to support inclusive/exclusive ranges for both ends. Internally it will be implemented using TrieUtils.incrementTrieCoded/decrementTrieCoded() but makes life simplier for double ranges (a simple exclusive replacement for the floating point range [0.0..1.0] is not possible without having the underlying unsigned long). In December, when trie contrib was included (LUCENE-1470), 3 trie variants were supplied by TrieUtils. For new APIs a statically configureable default Trie variant does not conform to an API we want in Lucene (currently we want to deprecate all these static setters/getters). The important thing: It does not make code shorter or easier to understand, its more error prone. Before release of 2.9 it is a good time to remove the default trie variant and always force the parameter in TrieRangeQuery/Filter. It is better to choose the variant in the application and do not automatically manage it. As Lucene 2.9 was not yet released, I will change the ctors and not preserve the old ones. was: TrieRangeQuery/Filter is missing one thing: Ranges that have exclusive bounds. For TrieRangeQuery this may not be important for ranges on long or Date (==long) values (because [1..5] is the same like ]0..6[ or ]0..5]). This is not so simple for doubles because you must add/substract 1 from the trie encoded unsigned long. To be conform with the other range queries, I will submit a patch that has two additional boolean parameters in the ctors to support inclusive/exclusive ranges for both ends. Internally it will be implemented using TrieUtils.incrementTrieCoded/decrementTrieCoded() but makes life simplier for double ranges (a simple exclusive replacement for the floating point range [0.0..1.0] is not possible without having the underlying unsigned long). As Lucene 2.9 was not yet released, I will change the ctors and not preserve the old ones. Lucene Fields: [New, Patch Available] (was: [New]) Update issue description to include both changes. > Support inclusive/exclusive for TrieRangeQuery/-Filter, remove default trie > variant setters/getters > --- > > Key: LUCENE-1530 > URL: https://issues.apache.org/jira/browse/LUCENE-1530 > Project: Lucene - Java > Issue Type: New Feature > Components: contrib/* >Affects Versions: 2.9 >Reporter: Uwe Schindler >Assignee: Uwe Schindler > Fix For: 2.9 > > Attachments: LUCENE-1530.patch, LUCENE-1530.patch > > > TrieRangeQuery/Filter is missing one thing: Ranges that have exclusive > bounds. For TrieRangeQuery this may not be important for ranges on long or > Date (==long) values (because [1..5] is the same like ]0..6[ or ]0..5]). This > is not so simple for doubles because you must add/substract 1 from the trie > encoded unsigned long. > To be conform with the other range queries, I will submit a patch that has > two additional boolean parameters in the ctors to support inclusive/exclusive > ranges for both ends. 
Internally it will be implemented using > TrieUtils.incrementTrieCoded/decrementTrieCoded() but makes life simplier for > double ranges (a simple exclusive replacement for the floating point range > [0.0..1.0] is not possible without having the underlying unsigned long). > In December, when trie contrib was included (LUCENE-1470), 3 trie variants > were supplied by TrieUtils. For new APIs a statically configureable default > Trie variant does not conform to an API we want in Lucene (currently we want > to deprecate all these static setters/getters). The important thing: It does > not make code shorter or easier to understand, its more error prone. Before > release of 2.9 it is a good time to remove the default trie variant and > always force the parameter in TrieRangeQuery/Filter. It is better to choose > the variant in the application and do not automatically manage it. > As Lucene 2.9 was not yet released, I will change the ctors and not preserve > the old ones. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To u
RE: Hudson Java Docs?
> Chris Hostetter wrote: > > : > I think, the outdated docs should be removed from the server to also > > : > disappear from search engines. > > We do not want unofficial builds to be indexed by search engines anyway. > Folks who're searching for information about Lucene should not be > referred to unreleased documentation on an Apache host that can easily > be confused with official documentation. I am frankly appalled to see > that nightly build documentation still appears at the top of the search > results for queries such as "lucene api". > > We should add a robots.txt for Hudson that prohibits crawling, no? > > Why waste effort on documentation for use only by those very same people > who can easily create their own copy? > > > Alternately, we could turn off the "Publish Javadoc" feature, and > instead > > add trunk/build/docs/api to the list of files to "Archive" and then > start > > referring to a URL like this (doesn't work at the moment) for all the > > javadocs... > > > > http://hudson.zones.apache.org/hudson/view/Lucene/job/Lucene- > trunk/lastSuccessfulBuild/artifact/trunk/build/docs/api/ > > +1, except the referring part. Why not refer? A robots.txt is OK, but the docs should be accessible via a link from Hudson and the developer resources page. If search engines do not harvest them, there is no problem with the linking; I think it would be fine. Uwe - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
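For reference, the crawl block suggested above would just be a robots.txt served from the root of the Hudson host, along these lines (a minimal sketch; the exact rules would be up to the Hudson admins, and a narrower Disallow could keep other projects crawlable):

    User-agent: *
    Disallow: /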