[jira] Updated: (LUCENE-1257) Port to Java5

2009-10-20 Thread Paul Cowan (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Cowan updated LUCENE-1257:
---

Attachment: LUCENE-1257-clone_covariance.patch

OK, thought I'd jump in and help out here with one of my Java 5 favourites. 
Haven't seen anyone discuss this, and don't believe any of the patches address 
this, so thought I'd throw a patch out there (against SVN HEAD @ revision 
827821) which uses Java 5 covariant return types for (almost) all of the 
Object#clone() implementations in core.

i.e. this:

  public Object clone() {
changes to:
  public SpanNotQuery clone() {

which lets us get rid of a whole bunch of now-unnecessary casts, so e.g.

  if (clone == null) clone = (SpanNotQuery) this.clone();
becomes
  if (clone == null) clone = this.clone();

Almost everything in core has been done and all downcasts removed, with the 
exception of:

* Some SpanQuery stuff, where it's assumed that it's safe to cast the clone() 
of a SpanQuery to a SpanQuery -- this can't be made covariant without declaring 
"abstract SpanQuery clone()" in SpanQuery itself, which breaks those SpanQuerys 
that don't declare their own clone()
* Some IndexReaders, e.g. DirectoryReader -- we can't be more specific than 
changing .clone() to return IndexReader, because it returns the result of 
IndexReader.clone(boolean). We could use covariant types for THAT, which would 
work fine, but that doesn't follow the pattern of the others, so it could be a 
later commit.

Two changes were also made in contrib/, where not making the changes would have 
broken code by trying to widen IndexInput#clone() back out to returning Object, 
which is not permitted. contrib/ was otherwise left untouched.

Let me know what you think, or if you have any other questions.
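
For anyone unfamiliar with the feature, here is a minimal self-contained sketch 
of the covariant-clone pattern the patch applies (the class names are invented 
for illustration; they are not the actual Lucene classes):

  // Java 5 covariant return types on clone(): the cast lives once inside the
  // override, and call sites no longer need a downcast.
  class BaseQuery implements Cloneable {
    @Override
    public BaseQuery clone() {
      try {
        return (BaseQuery) super.clone();
      } catch (CloneNotSupportedException e) {
        throw new AssertionError(e); // cannot happen, we implement Cloneable
      }
    }
  }

  class NotQuery extends BaseQuery {
    @Override
    public NotQuery clone() {
      return (NotQuery) super.clone(); // narrow the return type further
    }
  }

  class Caller {
    NotQuery copyOf(NotQuery q) {
      return q.clone(); // no "(NotQuery)" cast needed at the call site
    }
  }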


> Port to Java5
> -
>
> Key: LUCENE-1257
> URL: https://issues.apache.org/jira/browse/LUCENE-1257
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Analysis, Examples, Index, Other, Query/Scoring, 
> QueryParser, Search, Store, Term Vectors
>Affects Versions: 3.0
>Reporter: Cédric Champeau
>Assignee: Uwe Schindler
>Priority: Minor
> Fix For: 3.0
>
> Attachments: instantiated_fieldable.patch, 
> LUCENE-1257-BooleanQuery.patch, LUCENE-1257-BooleanScorer_2.patch, 
> LUCENE-1257-BufferedDeletes_DocumentsWriter.patch, 
> LUCENE-1257-CheckIndex.patch, LUCENE-1257-clone_covariance.patch, 
> LUCENE-1257-CloseableThreadLocal.patch, 
> LUCENE-1257-CompoundFileReaderWriter.patch, 
> LUCENE-1257-ConcurrentMergeScheduler.patch, 
> LUCENE-1257-DirectoryReader.patch, 
> LUCENE-1257-DisjunctionMaxQuery-more_type_safety.patch, 
> LUCENE-1257-DocFieldProcessorPerThread.patch, LUCENE-1257-Document.patch, 
> LUCENE-1257-FieldCacheImpl.patch, LUCENE-1257-FieldCacheRangeFilter.patch, 
> LUCENE-1257-IndexDeleter.patch, 
> LUCENE-1257-IndexDeletionPolicy_IndexFileDeleter.patch, LUCENE-1257-iw.patch, 
> LUCENE-1257-MTQWF.patch, LUCENE-1257-NormalizeCharMap.patch, 
> LUCENE-1257-o.a.l.util.patch, LUCENE-1257-org_apache_lucene_document.patch, 
> LUCENE-1257-org_apache_lucene_document.patch, 
> LUCENE-1257-org_apache_lucene_document.patch, LUCENE-1257-SegmentInfos.patch, 
> LUCENE-1257-StringBuffer.patch, LUCENE-1257-StringBuffer.patch, 
> LUCENE-1257-StringBuffer.patch, LUCENE-1257-TopDocsCollector.patch, 
> LUCENE-1257-WordListLoader.patch, LUCENE-1257_analysis.patch, 
> LUCENE-1257_BooleanFilter_Generics.patch, 
> LUCENE-1257_contrib_highlighting.patch, LUCENE-1257_javacc_upgrade.patch, 
> LUCENE-1257_messages.patch, LUCENE-1257_more_unnecessary_casts.patch, 
> LUCENE-1257_MultiFieldQueryParser.patch, LUCENE-1257_o.a.l.queryParser.patch, 
> LUCENE-1257_o.a.l.store.patch, LUCENE-1257_o_a_l_index_test.patch, 
> LUCENE-1257_o_a_l_index_test.patch, LUCENE-1257_o_a_l_search.patch, 
> LUCENE-1257_o_a_l_search_spans.patch, 
> LUCENE-1257_org_apache_lucene_index.patch, 
> LUCENE-1257_org_apache_lucene_index.patch, LUCENE-1257_queryParser_jj.patch, 
> LUCENE-1257_unnecessary_casts.patch, lucene1257surround1.patch, 
> lucene1257surround1.patch, shinglematrixfilter_generified.patch
>
>
> For my needs I've updated Lucene so that it uses Java 5 constructs. I know 
> Java 5 migration had been planned for 2.1 someday in the past, but don't know 
> when it is planned now. This patch against the trunk includes :
> - most obvious generics usage (there are tons of usages of sets, ... Those 
> which are commonly used have been generified)
> - PriorityQueue generification
> - replacement of indexed for loops with for each constructs
> - removal of unnececessary unboxing
> The code is to my opinion much more readable with those features (you 
> actually *know* what is stored in collections reading the code, without the 
> need to lookup for field definitions everytime) and it simplifies many 
> algorithms.

Re: lucene 2.9 sorting algorithm

2009-10-20 Thread John Wang
Hi Mike:
That's weird. Let me take a look at the patch. Need to brush up on
python though :)
Thanks
-John

On Tue, Oct 20, 2009 at 10:25 AM, Michael McCandless <
luc...@mikemccandless.com> wrote:

> OK I posted a patch that folds the MultiPQ approach into
> contrib/benchmark, plus a simple python wrapper to run old/new tests
> across different queries, sort, topN, etc.
>
> But I got different results... MultiPQ looks generally slower than
> SinglePQ.  So I think we now need to reconcile what's different
> between our tests.
>
> Mike
>
> On Mon, Oct 19, 2009 at 9:28 PM, John Wang  wrote:
> > Hi Michael:
> >  Was wondering if you got a chance to take a look at this.
> >  Since deprecated APIs are being removed in 3.0, I was wondering
> > if/when we would decide on keeping the ScoreDocComparator API so that it
> > would be kept for Lucene 3.0.
> > Thanks
> > -John
> >
> > On Fri, Oct 16, 2009 at 9:53 AM, Michael McCandless
> >  wrote:
> >>
> >> Oh, no problem...
> >>
> >> Mike
> >>
> >> On Fri, Oct 16, 2009 at 12:33 PM, John Wang 
> wrote:
> >> > Mike, just a clarification on my first perf report email.
> >> > The first section, numHits is incorrectly labeled, it should be 20
> >> > instead
> >> > of 50. Sorry about the possible confusion.
> >> > Thanks
> >> > -John
> >> >
> >> > On Fri, Oct 16, 2009 at 3:21 AM, Michael McCandless
> >> >  wrote:
> >> >>
> >> >> Thanks John; I'll have a look.
> >> >>
> >> >> Mike
> >> >>
> >> >> On Fri, Oct 16, 2009 at 12:57 AM, John Wang 
> >> >> wrote:
> >> >> > Hi Michael:
> >> >> > I added classes ScoreDocComparatorQueue and OneSortNoScoreCollector
> >> >> > as a more general case. I think keeping the old api for
> >> >> > ScoreDocComparator and SortComparatorSource would work.
> >> >> >   Please take a look.
> >> >> > Thanks
> >> >> > -John
> >> >> >
> >> >> > On Thu, Oct 15, 2009 at 6:52 PM, John Wang 
> >> >> > wrote:
> >> >> >>
> >> >> >> Hi Michael:
> >> >> >>  It is open: http://code.google.com/p/lucene-book/source/checkout
> >> >> >>  I think I sent the https url instead, sorry.
> >> >> >> The multi PQ sorting is fairly self-contained; I have 2 versions,
> >> >> >> 1 for string and 1 for int, each a Collector impl.
> >> >> >>  I shouldn't say the Multi Q is faster on int sort; it is within
> >> >> >> the error boundary. The diff is very very small, I would say they
> >> >> >> are more or less equal.
> >> >> >>  If you think it is a good thing to go this way (if not for the
> >> >> >> perf, just for the simpler api) I'd be happy to work on a patch.
> >> >> >> Thanks
> >> >> >> -John
> >> >> >> On Thu, Oct 15, 2009 at 5:18 PM, Michael McCandless
> >> >> >>  wrote:
> >> >> >>>
> >> >> >>> John, looks like this requires login -- any plans to open that
> >> >> >>> up, or post the code on an issue?
> >> >> >>>
> >> >> >>> How self-contained is your Multi PQ sorting?  EG is it a
> >> >> >>> standalone Collector impl that I can test?
> >> >> >>>
> >> >> >>> Mike
> >> >> >>>
> >> >> >>> On Thu, Oct 15, 2009 at 6:33 PM, John Wang 
> >> >> >>> wrote:
> >> >> >>> > BTW, we have a little sandbox for these experiments, and all
> >> >> >>> > my test code is there. It is not very polished.
> >> >> >>> >
> >> >> >>> > https://lucene-book.googlecode.com/svn/trunk
> >> >> >>> >
> >> >> >>> > -John
> >> >> >>> >
> >> >> >>> > On Thu, Oct 15, 2009 at 3:29 PM, John Wang <
> john.w...@gmail.com>
> >> >> >>> > wrote:
> >> >> >>> >>
> >> >> >>> >> Numbers Mike requested for Int types:
> >> >> >>> >>
> >> >> >>> >> Only the time/cputime are posted; the others are all the same
> >> >> >>> >> since the algorithm is the same.
> >> >> >>> >>
> >> >> >>> >> Lucene 2.9:
> >> >> >>> >> numhits: 10
> >> >> >>> >> time: 14619495
> >> >> >>> >> cpu: 146126
> >> >> >>> >>
> >> >> >>> >> numhits: 20
> >> >> >>> >> time: 14550568
> >> >> >>> >> cpu: 163242
> >> >> >>> >>
> >> >> >>> >> numhits: 100
> >> >> >>> >> time: 16467647
> >> >> >>> >> cpu: 178379
> >> >> >>> >>
> >> >> >>> >>
> >> >> >>> >> my test:
> >> >> >>> >> numHits: 10
> >> >> >>> >> time: 14101094
> >> >> >>> >> cpu: 144715
> >> >> >>> >>
> >> >> >>> >> numHits: 20
> >> >> >>> >> time: 14804821
> >> >> >>> >> cpu: 151305
> >> >> >>> >>
> >> >> >>> >> numHits: 100
> >> >> >>> >> time: 15372157
> >> >> >>> >> cpu time: 158842
> >> >> >>> >>
> >> >> >>> >> Conclusions:
> >> >> >>> >> They are very similar; the differences are all within error
> >> >> >>> >> bounds, especially with lower PQ sizes, where the second sort
> >> >> >>> >> alg is again slightly faster.
> >> >> >>> >>
> >> >> >>> >> Hope this helps.
> >> >> >>> >>
> >> >> >>> >> -John
> >> >> >>> >>
> >> >> >>> >>
> >> >> >>> >> On Thu, Oct 15, 2009 at 3:04 PM, Yonik Seeley
> >> >> >>> >> 
> >> >> >>> >> wrote:
> >> >> >>> >>>
> >> >> >>> >>> On Thu, Oct 15, 2009 at

[jira] Updated: (LUCENE-1999) Match spotter for all query types

2009-10-20 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-1999:
-

Attachment: matchflagger.patch

> Match spotter for all query types
> -
>
> Key: LUCENE-1999
> URL: https://issues.apache.org/jira/browse/LUCENE-1999
> Project: Lucene - Java
>  Issue Type: New Feature
>Affects Versions: 2.9
>Reporter: Mark Harwood
> Attachments: matchflagger.patch
>
>
> Related to LUCENE-1929 and the current inability to highlight 
> NumericRangeQuery, spatial, cached term filters and other exotica.
> This patch provides the ability to wrap *any* Query object and record match 
> info as flags encoded in the overall document score.
> Using this approach it would be possible to understand (and therefore 
> highlight) which fields matched clauses in a query.
> The match encoding approach loses some precision in scores as noted here: 
> http://tinyurl.com/ykt8nx7
> Avoiding these precision issues would require a change to Lucene core to 
> record docId, score AND a matchFlag byte in ScoreDoc objects and collector 
> APIs.
> This may be something we should consider.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Created: (LUCENE-1999) Match spotter for all query types

2009-10-20 Thread Mark Harwood (JIRA)
Match spotter for all query types
-

 Key: LUCENE-1999
 URL: https://issues.apache.org/jira/browse/LUCENE-1999
 Project: Lucene - Java
  Issue Type: New Feature
Affects Versions: 2.9
Reporter: Mark Harwood
 Attachments: matchflagger.patch

Related to LUCENE-1929 and the current inability to highlight 
NumericRangeQuery, spatial, cached term filters and other exotica.

This patch provides the ability to wrap *any* Query object and record match 
info as flags encoded in the overall document score.
Using this approach it would be possible to understand (and therefore 
highlight) which fields matched clauses in a query.

The match encoding approach loses some precision in scores as noted here: 
http://tinyurl.com/ykt8nx7

Avoiding these precision issues would require a change to Lucene core to record 
docId, score AND a matchFlag byte in ScoreDoc objects and collector APIs.
This may be something we should consider.
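
The flag-in-score idea can be illustrated without any Lucene API; the sketch 
below only demonstrates the precision trade-off described above and is not the 
code in matchflagger.patch:

  // Illustration only: steal the low mantissa bits of a float score to carry
  // a small "which clause matched" flag value. This is exactly the precision
  // loss mentioned in the description.
  class ScoreFlags {
    static final int FLAG_BITS = 3;                // room for 3 one-bit flags
    static final int FLAG_MASK = (1 << FLAG_BITS) - 1;

    static float encode(float score, int flags) {
      int bits = Float.floatToIntBits(score);
      return Float.intBitsToFloat((bits & ~FLAG_MASK) | (flags & FLAG_MASK));
    }

    static int decodeFlags(float score) {
      return Float.floatToIntBits(score) & FLAG_MASK;
    }

    public static void main(String[] args) {
      float flagged = encode(3.14159f, 0x5);       // clauses 0 and 2 matched
      System.out.println(decodeFlags(flagged));    // prints 5
      System.out.println(flagged);                 // ~3.14159, slightly perturbed
    }
  }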


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1257) Port to Java5

2009-10-20 Thread DM Smith (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DM Smith updated LUCENE-1257:
-

Attachment: (was: LUCENE-1257_enum.patch)

> Port to Java5
> -
>
> Key: LUCENE-1257
> URL: https://issues.apache.org/jira/browse/LUCENE-1257
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Analysis, Examples, Index, Other, Query/Scoring, 
> QueryParser, Search, Store, Term Vectors
>Affects Versions: 3.0
>Reporter: Cédric Champeau
>Assignee: Uwe Schindler
>Priority: Minor
> Fix For: 3.0
>
> Attachments: instantiated_fieldable.patch, 
> LUCENE-1257-BooleanQuery.patch, LUCENE-1257-BooleanScorer_2.patch, 
> LUCENE-1257-BufferedDeletes_DocumentsWriter.patch, 
> LUCENE-1257-CheckIndex.patch, LUCENE-1257-CloseableThreadLocal.patch, 
> LUCENE-1257-CompoundFileReaderWriter.patch, 
> LUCENE-1257-ConcurrentMergeScheduler.patch, 
> LUCENE-1257-DirectoryReader.patch, 
> LUCENE-1257-DisjunctionMaxQuery-more_type_safety.patch, 
> LUCENE-1257-DocFieldProcessorPerThread.patch, LUCENE-1257-Document.patch, 
> LUCENE-1257-FieldCacheImpl.patch, LUCENE-1257-FieldCacheRangeFilter.patch, 
> LUCENE-1257-IndexDeleter.patch, 
> LUCENE-1257-IndexDeletionPolicy_IndexFileDeleter.patch, LUCENE-1257-iw.patch, 
> LUCENE-1257-MTQWF.patch, LUCENE-1257-NormalizeCharMap.patch, 
> LUCENE-1257-o.a.l.util.patch, LUCENE-1257-org_apache_lucene_document.patch, 
> LUCENE-1257-org_apache_lucene_document.patch, 
> LUCENE-1257-org_apache_lucene_document.patch, LUCENE-1257-SegmentInfos.patch, 
> LUCENE-1257-StringBuffer.patch, LUCENE-1257-StringBuffer.patch, 
> LUCENE-1257-StringBuffer.patch, LUCENE-1257-TopDocsCollector.patch, 
> LUCENE-1257-WordListLoader.patch, LUCENE-1257_analysis.patch, 
> LUCENE-1257_BooleanFilter_Generics.patch, 
> LUCENE-1257_contrib_highlighting.patch, LUCENE-1257_javacc_upgrade.patch, 
> LUCENE-1257_messages.patch, LUCENE-1257_more_unnecessary_casts.patch, 
> LUCENE-1257_MultiFieldQueryParser.patch, LUCENE-1257_o.a.l.queryParser.patch, 
> LUCENE-1257_o.a.l.store.patch, LUCENE-1257_o_a_l_index_test.patch, 
> LUCENE-1257_o_a_l_index_test.patch, LUCENE-1257_o_a_l_search.patch, 
> LUCENE-1257_o_a_l_search_spans.patch, 
> LUCENE-1257_org_apache_lucene_index.patch, 
> LUCENE-1257_org_apache_lucene_index.patch, LUCENE-1257_queryParser_jj.patch, 
> LUCENE-1257_unnecessary_casts.patch, lucene1257surround1.patch, 
> lucene1257surround1.patch, shinglematrixfilter_generified.patch
>
>
> For my needs I've updated Lucene so that it uses Java 5 constructs. I know 
> Java 5 migration had been planned for 2.1 someday in the past, but don't know 
> when it is planned now. This patch against the trunk includes :
> - most obvious generics usage (there are tons of usages of sets, ... Those 
> which are commonly used have been generified)
> - PriorityQueue generification
> - replacement of indexed for loops with for each constructs
> - removal of unnececessary unboxing
> The code is to my opinion much more readable with those features (you 
> actually *know* what is stored in collections reading the code, without the 
> need to lookup for field definitions everytime) and it simplifies many 
> algorithms.
> Note that this patch also includes an interface for the Query class. This has 
> been done for my company's needs for building custom Query classes which add 
> some behaviour to the base Lucene queries. It prevents multiple unnnecessary 
> casts. I know this introduction is not wanted by the team, but it really 
> makes our developments easier to maintain. If you don't want to use this, 
> replace all /Queriable/ calls with standard /Query/.
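
As a small aside, a before/after sketch of the kind of generics/for-each 
cleanup the issue description mentions (the method and list names are invented 
for illustration, not taken from the patches):

  import java.util.List;

  class GenericsExample {
    // Before: raw collection, indexed loop, cast on every access.
    static int totalLengthOld(List terms) {
      int total = 0;
      for (int i = 0; i < terms.size(); i++) {
        total += ((String) terms.get(i)).length();
      }
      return total;
    }

    // After: generified collection plus for-each; the element type is explicit
    // and the cast disappears.
    static int totalLength(List<String> terms) {
      int total = 0;
      for (String term : terms) {
        total += term.length();
      }
      return total;
    }
  }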

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: lucene 2.9 sorting algorithm

2009-10-20 Thread Earwin Burrfoot
That's quite possible to reimplement, I believe. You can keep your
docid->ordinal map bound to the top-level reader, as it was before, and then
have your FieldComparator rebase incoming compare() docids based on what
setNextReader() was last called with.
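
A rough sketch of that rebasing idea, assuming the 2.9 FieldComparator API
(setNextReader(IndexReader reader, int docBase), compareBottom, copy, ...);
the globalOrds array stands in for whatever pre-built top-level
docid->ordinal map the application maintains:

  import java.io.IOException;
  import org.apache.lucene.index.IndexReader;
  import org.apache.lucene.search.FieldComparator;

  // Sketch only: sorts by a pre-computed top-level docid -> ordinal array.
  class OrdinalMapComparator extends FieldComparator {
    private final int[] globalOrds; // ordinal per top-level docid, built elsewhere
    private final int[] slotOrds;   // ordinal stored per priority-queue slot
    private int docBase;            // offset of the current segment reader
    private int bottom;

    OrdinalMapComparator(int[] globalOrds, int numHits) {
      this.globalOrds = globalOrds;
      this.slotOrds = new int[numHits];
    }

    public void setNextReader(IndexReader reader, int docBase) throws IOException {
      this.docBase = docBase;       // rebase per-segment docids to top-level ids
    }

    // Ordinals are assumed non-negative, so subtraction cannot overflow.
    public int compare(int slot1, int slot2) { return slotOrds[slot1] - slotOrds[slot2]; }
    public void setBottom(int slot) { bottom = slotOrds[slot]; }
    public int compareBottom(int doc) { return bottom - globalOrds[docBase + doc]; }
    public void copy(int slot, int doc) { slotOrds[slot] = globalOrds[docBase + doc]; }
    public Comparable value(int slot) { return Integer.valueOf(slotOrds[slot]); }
  }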

On Wed, Oct 21, 2009 at 02:07, TomS  wrote:
> Hi,
>
> I can confirm the below mentioned problems trying to migrate to 2.9.
>
> Our Lucene-based (2.4) app uses custom multi-level sorting on a lot of
> different fields and pretty large indexes (> 100m docs). Most of the fields
> that we sort on are strings, some with up to 400 characters in length. A lot
> of the strings share a common prefix, so comparing these
> character-by-character is costly. As the comparison proved to be a hotspot
> (and ordinary string field caches a memory-hog) we've been using an approach
> similar to the one mentioned by John Wang for years now, starting with
> Lucene 1.x: Using a map from doc id to ordinals (int->int) where the ordinal
> represents the order imposed by the (costly) comparison.
>
> This has been rather simple to implement and gives our app sizeable
> performance gains. Furthermore, it saves a lot of memory compared to
> standard field caches.
>
> With the new API, though, the simple mapping from doc id to pre-generated
> sort ordinals won't work anymore. This is the only reason that has prevented
> us from upgrading to 2.9. Currently it seems to us that adapting to the new
> sorting API would be either much more complicated to implement, or - when
> reverting to standard field cache-based sorting - slower and use far more
> memory. On the other hand, we might not have a full understanding of 2.9's
> inner workings yet, being used to (hopefully not stuck with) our
> aforementioned custom solution.
>
> Other than that, 2.9 seems to be a very nice release. Thanks for the great
> work!
>
> -Tom
>
>
> On Tue, Oct 20, 2009 at 8:55 AM, John Wang  wrote:
> [...]
> Let me provide an example:
>     We have a multi valued field on integers; we define a sort on this set
> of strings by defining a comparator on each value, similar to a lex order
> but comparing on strings instead of characters. We also want to keep the
> multi value representation as we do filtering and facet counting on it.
> The in memory representation is similar to the UnInvertedField in Solr.
>    Implementing a sort with the old API was rather simple, as we only needed
> a mapping from a docid to a set of ordinals. With the new api, we needed to
> do a "conversion", which would mean mapping a set of String/ordinals back to
> a doc, which to me is not trivial, let alone the performance implications.
>    That actually gave us the motivation to see if the old api can handle the
> segment level changes that were made in 2.9 (which in my opinion is the best
> thing in lucene since payloads :) )
>    So after some investigation, with code and big O analysis, and
> discussions with Mike and Yonik, on our end we feel that, given the
> performance numbers, it is unnecessary to go with the more complicated API.
> Thanks
> -John
>
>



-- 
Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
ICQ: 104465785

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1998) Use Java 5 enums

2009-10-20 Thread DM Smith (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DM Smith updated LUCENE-1998:
-

Attachment: LUCENE-1998_enum.patch

This issue and patch were part of LUCENE-1257, but may have backward 
compatibility issues. (I'll remove the patch from there.)

> Use Java 5 enums
> 
>
> Key: LUCENE-1998
> URL: https://issues.apache.org/jira/browse/LUCENE-1998
> Project: Lucene - Java
>  Issue Type: Improvement
>Affects Versions: 3.0
>Reporter: DM Smith
>Priority: Minor
> Fix For: 3.0
>
> Attachments: LUCENE-1998_enum.patch
>
>
> Replace the use of o.a.l.util.Parameter with Java 5 enums, deprecating 
> Parameter.
> Replace other custom enum patterns with Java 5 enums.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Created: (LUCENE-1998) Use Java 5 enums

2009-10-20 Thread DM Smith (JIRA)
Use Java 5 enums


 Key: LUCENE-1998
 URL: https://issues.apache.org/jira/browse/LUCENE-1998
 Project: Lucene - Java
  Issue Type: Improvement
Affects Versions: 3.0
Reporter: DM Smith
Priority: Minor
 Fix For: 3.0


Replace the use of o.a.l.util.Parameter with Java 5 enums, deprecating 
Parameter.

Replace other custom enum patterns with Java 5 enums.
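
A rough before/after sketch of the migration (the option name below is
hypothetical, not one of Lucene's actual Parameter subclasses):

  // Old style: a type-safe-constant class in the spirit of o.a.l.util.Parameter.
  final class OldRewriteOption {
    public static final OldRewriteOption NO = new OldRewriteOption("NO");
    public static final OldRewriteOption YES = new OldRewriteOption("YES");
    private final String name;
    private OldRewriteOption(String name) { this.name = name; }
    public String toString() { return name; }
  }

  // Java 5 replacement: same constants, plus values()/valueOf(), switch support
  // and built-in serialization, with less code.
  enum RewriteOption { NO, YES }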

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1257) Port to Java5

2009-10-20 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12767977#action_12767977
 ] 

Uwe Schindler commented on LUCENE-1257:
---

DM Smith: Can you open a new issue? This is more complicated because of 
backwards compatibility, so a separate issue would be great.

I like your code rewrite (it could even be done with Parameter in the same 
way). By the way: Java5's Enum is also Serializable, so no problem.

> Port to Java5
> -
>
> Key: LUCENE-1257
> URL: https://issues.apache.org/jira/browse/LUCENE-1257
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Analysis, Examples, Index, Other, Query/Scoring, 
> QueryParser, Search, Store, Term Vectors
>Affects Versions: 3.0
>Reporter: Cédric Champeau
>Assignee: Uwe Schindler
>Priority: Minor
> Fix For: 3.0
>
> Attachments: instantiated_fieldable.patch, 
> LUCENE-1257-BooleanQuery.patch, LUCENE-1257-BooleanScorer_2.patch, 
> LUCENE-1257-BufferedDeletes_DocumentsWriter.patch, 
> LUCENE-1257-CheckIndex.patch, LUCENE-1257-CloseableThreadLocal.patch, 
> LUCENE-1257-CompoundFileReaderWriter.patch, 
> LUCENE-1257-ConcurrentMergeScheduler.patch, 
> LUCENE-1257-DirectoryReader.patch, 
> LUCENE-1257-DisjunctionMaxQuery-more_type_safety.patch, 
> LUCENE-1257-DocFieldProcessorPerThread.patch, LUCENE-1257-Document.patch, 
> LUCENE-1257-FieldCacheImpl.patch, LUCENE-1257-FieldCacheRangeFilter.patch, 
> LUCENE-1257-IndexDeleter.patch, 
> LUCENE-1257-IndexDeletionPolicy_IndexFileDeleter.patch, LUCENE-1257-iw.patch, 
> LUCENE-1257-MTQWF.patch, LUCENE-1257-NormalizeCharMap.patch, 
> LUCENE-1257-o.a.l.util.patch, LUCENE-1257-org_apache_lucene_document.patch, 
> LUCENE-1257-org_apache_lucene_document.patch, 
> LUCENE-1257-org_apache_lucene_document.patch, LUCENE-1257-SegmentInfos.patch, 
> LUCENE-1257-StringBuffer.patch, LUCENE-1257-StringBuffer.patch, 
> LUCENE-1257-StringBuffer.patch, LUCENE-1257-TopDocsCollector.patch, 
> LUCENE-1257-WordListLoader.patch, LUCENE-1257_analysis.patch, 
> LUCENE-1257_BooleanFilter_Generics.patch, 
> LUCENE-1257_contrib_highlighting.patch, LUCENE-1257_enum.patch, 
> LUCENE-1257_javacc_upgrade.patch, LUCENE-1257_messages.patch, 
> LUCENE-1257_more_unnecessary_casts.patch, 
> LUCENE-1257_MultiFieldQueryParser.patch, LUCENE-1257_o.a.l.queryParser.patch, 
> LUCENE-1257_o.a.l.store.patch, LUCENE-1257_o_a_l_index_test.patch, 
> LUCENE-1257_o_a_l_index_test.patch, LUCENE-1257_o_a_l_search.patch, 
> LUCENE-1257_o_a_l_search_spans.patch, 
> LUCENE-1257_org_apache_lucene_index.patch, 
> LUCENE-1257_org_apache_lucene_index.patch, LUCENE-1257_queryParser_jj.patch, 
> LUCENE-1257_unnecessary_casts.patch, lucene1257surround1.patch, 
> lucene1257surround1.patch, shinglematrixfilter_generified.patch
>
>
> For my needs I've updated Lucene so that it uses Java 5 constructs. I know 
> Java 5 migration had been planned for 2.1 someday in the past, but don't know 
> when it is planned now. This patch against the trunk includes :
> - most obvious generics usage (there are tons of usages of sets, ... Those 
> which are commonly used have been generified)
> - PriorityQueue generification
> - replacement of indexed for loops with for each constructs
> - removal of unnececessary unboxing
> The code is to my opinion much more readable with those features (you 
> actually *know* what is stored in collections reading the code, without the 
> need to lookup for field definitions everytime) and it simplifies many 
> algorithms.
> Note that this patch also includes an interface for the Query class. This has 
> been done for my company's needs for building custom Query classes which add 
> some behaviour to the base Lucene queries. It prevents multiple unnnecessary 
> casts. I know this introduction is not wanted by the team, but it really 
> makes our developments easier to maintain. If you don't want to use this, 
> replace all /Queriable/ calls with standard /Query/.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1257) Port to Java5

2009-10-20 Thread Kay Kay (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kay Kay updated LUCENE-1257:


Attachment: LUCENE-1257_contrib_highlighting.patch

> Port to Java5
> -
>
> Key: LUCENE-1257
> URL: https://issues.apache.org/jira/browse/LUCENE-1257
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Analysis, Examples, Index, Other, Query/Scoring, 
> QueryParser, Search, Store, Term Vectors
>Affects Versions: 3.0
>Reporter: Cédric Champeau
>Assignee: Uwe Schindler
>Priority: Minor
> Fix For: 3.0
>
> Attachments: instantiated_fieldable.patch, 
> LUCENE-1257-BooleanQuery.patch, LUCENE-1257-BooleanScorer_2.patch, 
> LUCENE-1257-BufferedDeletes_DocumentsWriter.patch, 
> LUCENE-1257-CheckIndex.patch, LUCENE-1257-CloseableThreadLocal.patch, 
> LUCENE-1257-CompoundFileReaderWriter.patch, 
> LUCENE-1257-ConcurrentMergeScheduler.patch, 
> LUCENE-1257-DirectoryReader.patch, 
> LUCENE-1257-DisjunctionMaxQuery-more_type_safety.patch, 
> LUCENE-1257-DocFieldProcessorPerThread.patch, LUCENE-1257-Document.patch, 
> LUCENE-1257-FieldCacheImpl.patch, LUCENE-1257-FieldCacheRangeFilter.patch, 
> LUCENE-1257-IndexDeleter.patch, 
> LUCENE-1257-IndexDeletionPolicy_IndexFileDeleter.patch, LUCENE-1257-iw.patch, 
> LUCENE-1257-MTQWF.patch, LUCENE-1257-NormalizeCharMap.patch, 
> LUCENE-1257-o.a.l.util.patch, LUCENE-1257-org_apache_lucene_document.patch, 
> LUCENE-1257-org_apache_lucene_document.patch, 
> LUCENE-1257-org_apache_lucene_document.patch, LUCENE-1257-SegmentInfos.patch, 
> LUCENE-1257-StringBuffer.patch, LUCENE-1257-StringBuffer.patch, 
> LUCENE-1257-StringBuffer.patch, LUCENE-1257-TopDocsCollector.patch, 
> LUCENE-1257-WordListLoader.patch, LUCENE-1257_analysis.patch, 
> LUCENE-1257_BooleanFilter_Generics.patch, 
> LUCENE-1257_contrib_highlighting.patch, LUCENE-1257_enum.patch, 
> LUCENE-1257_javacc_upgrade.patch, LUCENE-1257_messages.patch, 
> LUCENE-1257_more_unnecessary_casts.patch, 
> LUCENE-1257_MultiFieldQueryParser.patch, LUCENE-1257_o.a.l.queryParser.patch, 
> LUCENE-1257_o.a.l.store.patch, LUCENE-1257_o_a_l_index_test.patch, 
> LUCENE-1257_o_a_l_index_test.patch, LUCENE-1257_o_a_l_search.patch, 
> LUCENE-1257_o_a_l_search_spans.patch, 
> LUCENE-1257_org_apache_lucene_index.patch, 
> LUCENE-1257_org_apache_lucene_index.patch, LUCENE-1257_queryParser_jj.patch, 
> LUCENE-1257_unnecessary_casts.patch, lucene1257surround1.patch, 
> lucene1257surround1.patch, shinglematrixfilter_generified.patch
>
>
> For my needs I've updated Lucene so that it uses Java 5 constructs. I know 
> Java 5 migration had been planned for 2.1 someday in the past, but don't know 
> when it is planned now. This patch against the trunk includes :
> - most obvious generics usage (there are tons of usages of sets, ... Those 
> which are commonly used have been generified)
> - PriorityQueue generification
> - replacement of indexed for loops with for each constructs
> - removal of unnececessary unboxing
> The code is to my opinion much more readable with those features (you 
> actually *know* what is stored in collections reading the code, without the 
> need to lookup for field definitions everytime) and it simplifies many 
> algorithms.
> Note that this patch also includes an interface for the Query class. This has 
> been done for my company's needs for building custom Query classes which add 
> some behaviour to the base Lucene queries. It prevents multiple unnnecessary 
> casts. I know this introduction is not wanted by the team, but it really 
> makes our developments easier to maintain. If you don't want to use this, 
> replace all /Queriable/ calls with standard /Query/.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1257) Port to Java5

2009-10-20 Thread Kay Kay (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kay Kay updated LUCENE-1257:


Attachment: (was: LUCENE-1257_contrib_highlighting.patch)

> Port to Java5
> -
>
> Key: LUCENE-1257
> URL: https://issues.apache.org/jira/browse/LUCENE-1257
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Analysis, Examples, Index, Other, Query/Scoring, 
> QueryParser, Search, Store, Term Vectors
>Affects Versions: 3.0
>Reporter: Cédric Champeau
>Assignee: Uwe Schindler
>Priority: Minor
> Fix For: 3.0
>
> Attachments: instantiated_fieldable.patch, 
> LUCENE-1257-BooleanQuery.patch, LUCENE-1257-BooleanScorer_2.patch, 
> LUCENE-1257-BufferedDeletes_DocumentsWriter.patch, 
> LUCENE-1257-CheckIndex.patch, LUCENE-1257-CloseableThreadLocal.patch, 
> LUCENE-1257-CompoundFileReaderWriter.patch, 
> LUCENE-1257-ConcurrentMergeScheduler.patch, 
> LUCENE-1257-DirectoryReader.patch, 
> LUCENE-1257-DisjunctionMaxQuery-more_type_safety.patch, 
> LUCENE-1257-DocFieldProcessorPerThread.patch, LUCENE-1257-Document.patch, 
> LUCENE-1257-FieldCacheImpl.patch, LUCENE-1257-FieldCacheRangeFilter.patch, 
> LUCENE-1257-IndexDeleter.patch, 
> LUCENE-1257-IndexDeletionPolicy_IndexFileDeleter.patch, LUCENE-1257-iw.patch, 
> LUCENE-1257-MTQWF.patch, LUCENE-1257-NormalizeCharMap.patch, 
> LUCENE-1257-o.a.l.util.patch, LUCENE-1257-org_apache_lucene_document.patch, 
> LUCENE-1257-org_apache_lucene_document.patch, 
> LUCENE-1257-org_apache_lucene_document.patch, LUCENE-1257-SegmentInfos.patch, 
> LUCENE-1257-StringBuffer.patch, LUCENE-1257-StringBuffer.patch, 
> LUCENE-1257-StringBuffer.patch, LUCENE-1257-TopDocsCollector.patch, 
> LUCENE-1257-WordListLoader.patch, LUCENE-1257_analysis.patch, 
> LUCENE-1257_BooleanFilter_Generics.patch, 
> LUCENE-1257_contrib_highlighting.patch, LUCENE-1257_enum.patch, 
> LUCENE-1257_javacc_upgrade.patch, LUCENE-1257_messages.patch, 
> LUCENE-1257_more_unnecessary_casts.patch, 
> LUCENE-1257_MultiFieldQueryParser.patch, LUCENE-1257_o.a.l.queryParser.patch, 
> LUCENE-1257_o.a.l.store.patch, LUCENE-1257_o_a_l_index_test.patch, 
> LUCENE-1257_o_a_l_index_test.patch, LUCENE-1257_o_a_l_search.patch, 
> LUCENE-1257_o_a_l_search_spans.patch, 
> LUCENE-1257_org_apache_lucene_index.patch, 
> LUCENE-1257_org_apache_lucene_index.patch, LUCENE-1257_queryParser_jj.patch, 
> LUCENE-1257_unnecessary_casts.patch, lucene1257surround1.patch, 
> lucene1257surround1.patch, shinglematrixfilter_generified.patch
>
>
> For my needs I've updated Lucene so that it uses Java 5 constructs. I know 
> Java 5 migration had been planned for 2.1 someday in the past, but don't know 
> when it is planned now. This patch against the trunk includes :
> - most obvious generics usage (there are tons of usages of sets, ... Those 
> which are commonly used have been generified)
> - PriorityQueue generification
> - replacement of indexed for loops with for each constructs
> - removal of unnececessary unboxing
> The code is to my opinion much more readable with those features (you 
> actually *know* what is stored in collections reading the code, without the 
> need to lookup for field definitions everytime) and it simplifies many 
> algorithms.
> Note that this patch also includes an interface for the Query class. This has 
> been done for my company's needs for building custom Query classes which add 
> some behaviour to the base Lucene queries. It prevents multiple unnnecessary 
> casts. I know this introduction is not wanted by the team, but it really 
> makes our developments easier to maintain. If you don't want to use this, 
> replace all /Queriable/ calls with standard /Query/.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1257) Port to Java5

2009-10-20 Thread Kay Kay (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kay Kay updated LUCENE-1257:


Attachment: LUCENE-1257_contrib_highlighting.patch

> Port to Java5
> -
>
> Key: LUCENE-1257
> URL: https://issues.apache.org/jira/browse/LUCENE-1257
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Analysis, Examples, Index, Other, Query/Scoring, 
> QueryParser, Search, Store, Term Vectors
>Affects Versions: 3.0
>Reporter: Cédric Champeau
>Assignee: Uwe Schindler
>Priority: Minor
> Fix For: 3.0
>
> Attachments: instantiated_fieldable.patch, 
> LUCENE-1257-BooleanQuery.patch, LUCENE-1257-BooleanScorer_2.patch, 
> LUCENE-1257-BufferedDeletes_DocumentsWriter.patch, 
> LUCENE-1257-CheckIndex.patch, LUCENE-1257-CloseableThreadLocal.patch, 
> LUCENE-1257-CompoundFileReaderWriter.patch, 
> LUCENE-1257-ConcurrentMergeScheduler.patch, 
> LUCENE-1257-DirectoryReader.patch, 
> LUCENE-1257-DisjunctionMaxQuery-more_type_safety.patch, 
> LUCENE-1257-DocFieldProcessorPerThread.patch, LUCENE-1257-Document.patch, 
> LUCENE-1257-FieldCacheImpl.patch, LUCENE-1257-FieldCacheRangeFilter.patch, 
> LUCENE-1257-IndexDeleter.patch, 
> LUCENE-1257-IndexDeletionPolicy_IndexFileDeleter.patch, LUCENE-1257-iw.patch, 
> LUCENE-1257-MTQWF.patch, LUCENE-1257-NormalizeCharMap.patch, 
> LUCENE-1257-o.a.l.util.patch, LUCENE-1257-org_apache_lucene_document.patch, 
> LUCENE-1257-org_apache_lucene_document.patch, 
> LUCENE-1257-org_apache_lucene_document.patch, LUCENE-1257-SegmentInfos.patch, 
> LUCENE-1257-StringBuffer.patch, LUCENE-1257-StringBuffer.patch, 
> LUCENE-1257-StringBuffer.patch, LUCENE-1257-TopDocsCollector.patch, 
> LUCENE-1257-WordListLoader.patch, LUCENE-1257_analysis.patch, 
> LUCENE-1257_BooleanFilter_Generics.patch, 
> LUCENE-1257_contrib_highlighting.patch, LUCENE-1257_enum.patch, 
> LUCENE-1257_javacc_upgrade.patch, LUCENE-1257_messages.patch, 
> LUCENE-1257_more_unnecessary_casts.patch, 
> LUCENE-1257_MultiFieldQueryParser.patch, LUCENE-1257_o.a.l.queryParser.patch, 
> LUCENE-1257_o.a.l.store.patch, LUCENE-1257_o_a_l_index_test.patch, 
> LUCENE-1257_o_a_l_index_test.patch, LUCENE-1257_o_a_l_search.patch, 
> LUCENE-1257_o_a_l_search_spans.patch, 
> LUCENE-1257_org_apache_lucene_index.patch, 
> LUCENE-1257_org_apache_lucene_index.patch, LUCENE-1257_queryParser_jj.patch, 
> LUCENE-1257_unnecessary_casts.patch, lucene1257surround1.patch, 
> lucene1257surround1.patch, shinglematrixfilter_generified.patch
>
>
> For my needs I've updated Lucene so that it uses Java 5 constructs. I know 
> Java 5 migration had been planned for 2.1 someday in the past, but don't know 
> when it is planned now. This patch against the trunk includes :
> - most obvious generics usage (there are tons of usages of sets, ... Those 
> which are commonly used have been generified)
> - PriorityQueue generification
> - replacement of indexed for loops with for each constructs
> - removal of unnececessary unboxing
> The code is to my opinion much more readable with those features (you 
> actually *know* what is stored in collections reading the code, without the 
> need to lookup for field definitions everytime) and it simplifies many 
> algorithms.
> Note that this patch also includes an interface for the Query class. This has 
> been done for my company's needs for building custom Query classes which add 
> some behaviour to the base Lucene queries. It prevents multiple unnnecessary 
> casts. I know this introduction is not wanted by the team, but it really 
> makes our developments easier to maintain. If you don't want to use this, 
> replace all /Queriable/ calls with standard /Query/.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: lucene 2.9 sorting algorithm

2009-10-20 Thread TomS
Hi,

I can confirm the below mentioned problems trying to migrate to 2.9.

Our Lucene-based (2.4) app uses custom multi-level sorting on a lot of
different fields and pretty large indexes (> 100m docs). Most of the fields
that we sort on are strings, some with up to 400 characters in length. A lot
of the strings share a common prefix, so comparing these
character-by-character is costly. As the comparison proved to be a hotspot
(and ordinary string field caches a memory-hog) we've been using an approach
similar to the one mentioned by John Wang for years now, starting with
Lucene 1.x: Using a map from doc id to ordinals (int->int) where the ordinal
represents the order imposed by the (costly) comparison.

This has been rather simple to implement and gives our app sizeable
performance gains. Furthermore, it saves a lot of memory compared to
standard field caches.

With the new API, though, the simple mapping from doc id to pre-generated
sort ordinals won't work anymore. This is the only reason that has prevented
us from upgrading to 2.9. Currently it seems to us that adapting to the new
sorting API would be either much more complicated to implement, or - when
reverting to standard field cache-based sorting - slower and use far more
memory. On the other hand, we might not have a full understanding of 2.9's
inner workings yet, being used to (hopefully not stuck with) our
aforementioned custom solution.

Other than that, 2.9 seems to be a very nice release. Thanks for the great
work!

-Tom


On Tue, Oct 20, 2009 at 8:55 AM, John Wang  wrote:
[...]
Let me provide an example:

We have a multi valued field on integers; we define a sort on this set
of strings by defining a comparator on each value, similar to a lex order
but comparing on strings instead of characters. We also want to keep the
multi value representation as we do filtering and facet counting on it.
The in memory representation is similar to the UnInvertedField in Solr.

   Implementing a sort with the old API was rather simple, as we only needed
a mapping from a docid to a set of ordinals. With the new api, we needed to
do a "conversion", which would mean mapping a set of String/ordinals back to
a doc, which to me is not trivial, let alone the performance implications.

   That actually gave us the motivation to see if the old api can handle the
segment level changes that were made in 2.9 (which in my opinion is the best
thing in lucene since payloads :) )

   So after some investigation, with code and big O analysis, and
discussions with Mike and Yonik, on our end we feel that, given the
performance numbers, it is unnecessary to go with the more complicated API.

Thanks

-John


[jira] Updated: (LUCENE-1257) Port to Java5

2009-10-20 Thread DM Smith (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DM Smith updated LUCENE-1257:
-

Attachment: LUCENE-1257_enum.patch

Migrates to Java 5 enums in core and contrib. All tests pass.
Deprecates o.a.l.util.Parameter. It probably can be removed.

> Port to Java5
> -
>
> Key: LUCENE-1257
> URL: https://issues.apache.org/jira/browse/LUCENE-1257
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Analysis, Examples, Index, Other, Query/Scoring, 
> QueryParser, Search, Store, Term Vectors
>Affects Versions: 3.0
>Reporter: Cédric Champeau
>Assignee: Uwe Schindler
>Priority: Minor
> Fix For: 3.0
>
> Attachments: instantiated_fieldable.patch, 
> LUCENE-1257-BooleanQuery.patch, LUCENE-1257-BooleanScorer_2.patch, 
> LUCENE-1257-BufferedDeletes_DocumentsWriter.patch, 
> LUCENE-1257-CheckIndex.patch, LUCENE-1257-CloseableThreadLocal.patch, 
> LUCENE-1257-CompoundFileReaderWriter.patch, 
> LUCENE-1257-ConcurrentMergeScheduler.patch, 
> LUCENE-1257-DirectoryReader.patch, 
> LUCENE-1257-DisjunctionMaxQuery-more_type_safety.patch, 
> LUCENE-1257-DocFieldProcessorPerThread.patch, LUCENE-1257-Document.patch, 
> LUCENE-1257-FieldCacheImpl.patch, LUCENE-1257-FieldCacheRangeFilter.patch, 
> LUCENE-1257-IndexDeleter.patch, 
> LUCENE-1257-IndexDeletionPolicy_IndexFileDeleter.patch, LUCENE-1257-iw.patch, 
> LUCENE-1257-MTQWF.patch, LUCENE-1257-NormalizeCharMap.patch, 
> LUCENE-1257-o.a.l.util.patch, LUCENE-1257-org_apache_lucene_document.patch, 
> LUCENE-1257-org_apache_lucene_document.patch, 
> LUCENE-1257-org_apache_lucene_document.patch, LUCENE-1257-SegmentInfos.patch, 
> LUCENE-1257-StringBuffer.patch, LUCENE-1257-StringBuffer.patch, 
> LUCENE-1257-StringBuffer.patch, LUCENE-1257-TopDocsCollector.patch, 
> LUCENE-1257-WordListLoader.patch, LUCENE-1257_analysis.patch, 
> LUCENE-1257_BooleanFilter_Generics.patch, LUCENE-1257_enum.patch, 
> LUCENE-1257_javacc_upgrade.patch, LUCENE-1257_messages.patch, 
> LUCENE-1257_more_unnecessary_casts.patch, 
> LUCENE-1257_MultiFieldQueryParser.patch, LUCENE-1257_o.a.l.queryParser.patch, 
> LUCENE-1257_o.a.l.store.patch, LUCENE-1257_o_a_l_index_test.patch, 
> LUCENE-1257_o_a_l_index_test.patch, LUCENE-1257_o_a_l_search.patch, 
> LUCENE-1257_o_a_l_search_spans.patch, 
> LUCENE-1257_org_apache_lucene_index.patch, 
> LUCENE-1257_org_apache_lucene_index.patch, LUCENE-1257_queryParser_jj.patch, 
> LUCENE-1257_unnecessary_casts.patch, lucene1257surround1.patch, 
> lucene1257surround1.patch, shinglematrixfilter_generified.patch
>
>
> For my needs I've updated Lucene so that it uses Java 5 constructs. I know 
> Java 5 migration had been planned for 2.1 someday in the past, but don't know 
> when it is planned now. This patch against the trunk includes :
> - most obvious generics usage (there are tons of usages of sets, ... Those 
> which are commonly used have been generified)
> - PriorityQueue generification
> - replacement of indexed for loops with for each constructs
> - removal of unnececessary unboxing
> The code is to my opinion much more readable with those features (you 
> actually *know* what is stored in collections reading the code, without the 
> need to lookup for field definitions everytime) and it simplifies many 
> algorithms.
> Note that this patch also includes an interface for the Query class. This has 
> been done for my company's needs for building custom Query classes which add 
> some behaviour to the base Lucene queries. It prevents multiple unnnecessary 
> casts. I know this introduction is not wanted by the team, but it really 
> makes our developments easier to maintain. If you don't want to use this, 
> replace all /Queriable/ calls with standard /Query/.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)

2009-10-20 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12767963#action_12767963
 ] 

Uwe Schindler commented on LUCENE-1606:
---

No prob! I will help you, I am on heavy committing :-)

> Automaton Query/Filter (scalable regex)
> ---
>
> Key: LUCENE-1606
> URL: https://issues.apache.org/jira/browse/LUCENE-1606
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: contrib/*
>Reporter: Robert Muir
>Assignee: Robert Muir
>Priority: Minor
> Fix For: 3.0
>
> Attachments: automaton.patch, automatonMultiQuery.patch, 
> automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, 
> automatonWithWildCard.patch, automatonWithWildCard2.patch, LUCENE-1606.patch, 
> LUCENE-1606.patch
>
>
> Attached is a patch for an AutomatonQuery/Filter (name can change if it's not 
> suitable).
> Whereas the out-of-box contrib RegexQuery is nice, I have some very large 
> indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. 
> Additionally all of the existing RegexQuery implementations in Lucene are 
> really slow if there is no constant prefix. This implementation does not 
> depend upon constant prefix, and runs the same query in 640ms.
> Some use cases I envision:
>  1. lexicography/etc on large text corpora
>  2. looking for things such as urls where the prefix is not constant (http:// 
> or ftp://)
> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert 
> regular expressions into a DFA. Then, the filter "enumerates" terms in a 
> special way, by using the underlying state machine. Here is my short 
> description from the comments:
>  The algorithm here is pretty basic. Enumerate terms but instead of a 
> binary accept/reject do:
>   
>  1. Look at the portion that is OK (did not enter a reject state in the 
> DFA)
>  2. Generate the next possible String and seek to that.
> the Query simply wraps the filter with ConstantScoreQuery.
> I did not include the automaton.jar inside the patch but it can be downloaded 
> from http://www.brics.dk/automaton/ and is BSD-licensed.
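
For context, a small sketch of the DFA building blocks described above,
assuming the BRICS automaton API (dk.brics.automaton.RegExp / RunAutomaton);
it only shows the "portion that is OK" idea, not the actual term-seeking
enumerator in the patch:

  import dk.brics.automaton.Automaton;
  import dk.brics.automaton.RegExp;
  import dk.brics.automaton.RunAutomaton;

  class AutomatonSketch {
    public static void main(String[] args) {
      // Compile a regex with no constant prefix into a DFA.
      Automaton a = new RegExp("(http|ftp)://.*").toAutomaton();
      RunAutomaton dfa = new RunAutomaton(a);

      System.out.println(dfa.run("http://lucene.apache.org"));  // true
      System.out.println(dfa.run("mailto:foo@bar"));            // false

      // "Portion that is OK": walk the DFA until we hit the reject state (-1).
      String term = "httq://example.com";
      int state = dfa.getInitialState();
      int ok = 0;
      for (; ok < term.length(); ok++) {
        int next = dfa.step(state, term.charAt(ok));
        if (next == -1) break;   // entered a reject state
        state = next;
      }
      System.out.println(term.substring(0, ok));  // "htt" -- the accepted prefix
    }
  }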

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)

2009-10-20 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12767961#action_12767961
 ] 

Robert Muir commented on LUCENE-1606:
-

If no one objects, I'd like to commit this in a few days. Can someone help out 
and commit the update to NOTICE?

> Automaton Query/Filter (scalable regex)
> ---
>
> Key: LUCENE-1606
> URL: https://issues.apache.org/jira/browse/LUCENE-1606
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: contrib/*
>Reporter: Robert Muir
>Assignee: Robert Muir
>Priority: Minor
> Fix For: 3.0
>
> Attachments: automaton.patch, automatonMultiQuery.patch, 
> automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, 
> automatonWithWildCard.patch, automatonWithWildCard2.patch, LUCENE-1606.patch, 
> LUCENE-1606.patch
>
>
> Attached is a patch for an AutomatonQuery/Filter (name can change if it's not 
> suitable).
> Whereas the out-of-box contrib RegexQuery is nice, I have some very large 
> indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. 
> Additionally all of the existing RegexQuery implementations in Lucene are 
> really slow if there is no constant prefix. This implementation does not 
> depend upon constant prefix, and runs the same query in 640ms.
> Some use cases I envision:
>  1. lexicography/etc on large text corpora
>  2. looking for things such as urls where the prefix is not constant (http:// 
> or ftp://)
> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert 
> regular expressions into a DFA. Then, the filter "enumerates" terms in a 
> special way, by using the underlying state machine. Here is my short 
> description from the comments:
>  The algorithm here is pretty basic. Enumerate terms but instead of a 
> binary accept/reject do:
>   
>  1. Look at the portion that is OK (did not enter a reject state in the 
> DFA)
>  2. Generate the next possible String and seek to that.
> the Query simply wraps the filter with ConstantScoreQuery.
> I did not include the automaton.jar inside the patch but it can be downloaded 
> from http://www.brics.dk/automaton/ and is BSD-licensed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1257) Port to Java5

2009-10-20 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12767953#action_12767953
 ] 

Uwe Schindler commented on LUCENE-1257:
---

Thanks! Much cleaner code. -- Committed revision: 827811

> Port to Java5
> -
>
> Key: LUCENE-1257
> URL: https://issues.apache.org/jira/browse/LUCENE-1257
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Analysis, Examples, Index, Other, Query/Scoring, 
> QueryParser, Search, Store, Term Vectors
>Affects Versions: 3.0
>Reporter: Cédric Champeau
>Assignee: Uwe Schindler
>Priority: Minor
> Fix For: 3.0
>
> Attachments: instantiated_fieldable.patch, 
> LUCENE-1257-BooleanQuery.patch, LUCENE-1257-BooleanScorer_2.patch, 
> LUCENE-1257-BufferedDeletes_DocumentsWriter.patch, 
> LUCENE-1257-CheckIndex.patch, LUCENE-1257-CloseableThreadLocal.patch, 
> LUCENE-1257-CompoundFileReaderWriter.patch, 
> LUCENE-1257-ConcurrentMergeScheduler.patch, 
> LUCENE-1257-DirectoryReader.patch, 
> LUCENE-1257-DisjunctionMaxQuery-more_type_safety.patch, 
> LUCENE-1257-DocFieldProcessorPerThread.patch, LUCENE-1257-Document.patch, 
> LUCENE-1257-FieldCacheImpl.patch, LUCENE-1257-FieldCacheRangeFilter.patch, 
> LUCENE-1257-IndexDeleter.patch, 
> LUCENE-1257-IndexDeletionPolicy_IndexFileDeleter.patch, LUCENE-1257-iw.patch, 
> LUCENE-1257-MTQWF.patch, LUCENE-1257-NormalizeCharMap.patch, 
> LUCENE-1257-o.a.l.util.patch, LUCENE-1257-org_apache_lucene_document.patch, 
> LUCENE-1257-org_apache_lucene_document.patch, 
> LUCENE-1257-org_apache_lucene_document.patch, LUCENE-1257-SegmentInfos.patch, 
> LUCENE-1257-StringBuffer.patch, LUCENE-1257-StringBuffer.patch, 
> LUCENE-1257-StringBuffer.patch, LUCENE-1257-TopDocsCollector.patch, 
> LUCENE-1257-WordListLoader.patch, LUCENE-1257_analysis.patch, 
> LUCENE-1257_BooleanFilter_Generics.patch, LUCENE-1257_javacc_upgrade.patch, 
> LUCENE-1257_messages.patch, LUCENE-1257_more_unnecessary_casts.patch, 
> LUCENE-1257_MultiFieldQueryParser.patch, LUCENE-1257_o.a.l.queryParser.patch, 
> LUCENE-1257_o.a.l.store.patch, LUCENE-1257_o_a_l_index_test.patch, 
> LUCENE-1257_o_a_l_index_test.patch, LUCENE-1257_o_a_l_search.patch, 
> LUCENE-1257_o_a_l_search_spans.patch, 
> LUCENE-1257_org_apache_lucene_index.patch, 
> LUCENE-1257_org_apache_lucene_index.patch, LUCENE-1257_queryParser_jj.patch, 
> LUCENE-1257_unnecessary_casts.patch, lucene1257surround1.patch, 
> lucene1257surround1.patch, shinglematrixfilter_generified.patch
>
>
> For my needs I've updated Lucene so that it uses Java 5 constructs. I know 
> Java 5 migration had been planned for 2.1 someday in the past, but don't know 
> when it is planned now. This patch against the trunk includes :
> - most obvious generics usage (there are tons of usages of sets, ... Those 
> which are commonly used have been generified)
> - PriorityQueue generification
> - replacement of indexed for loops with for each constructs
> - removal of unnececessary unboxing
> The code is to my opinion much more readable with those features (you 
> actually *know* what is stored in collections reading the code, without the 
> need to lookup for field definitions everytime) and it simplifies many 
> algorithms.
> Note that this patch also includes an interface for the Query class. This has 
> been done for my company's needs for building custom Query classes which add 
> some behaviour to the base Lucene queries. It prevents multiple unnnecessary 
> casts. I know this introduction is not wanted by the team, but it really 
> makes our developments easier to maintain. If you don't want to use this, 
> replace all /Queriable/ calls with standard /Query/.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1257) Port to Java5

2009-10-20 Thread Kay Kay (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kay Kay updated LUCENE-1257:


Attachment: LUCENE-1257_more_unnecessary_casts.patch

> Port to Java5
> -
>
> Key: LUCENE-1257
> URL: https://issues.apache.org/jira/browse/LUCENE-1257
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Analysis, Examples, Index, Other, Query/Scoring, 
> QueryParser, Search, Store, Term Vectors
>Affects Versions: 3.0
>Reporter: Cédric Champeau
>Assignee: Uwe Schindler
>Priority: Minor
> Fix For: 3.0
>
> Attachments: instantiated_fieldable.patch, 
> LUCENE-1257-BooleanQuery.patch, LUCENE-1257-BooleanScorer_2.patch, 
> LUCENE-1257-BufferedDeletes_DocumentsWriter.patch, 
> LUCENE-1257-CheckIndex.patch, LUCENE-1257-CloseableThreadLocal.patch, 
> LUCENE-1257-CompoundFileReaderWriter.patch, 
> LUCENE-1257-ConcurrentMergeScheduler.patch, 
> LUCENE-1257-DirectoryReader.patch, 
> LUCENE-1257-DisjunctionMaxQuery-more_type_safety.patch, 
> LUCENE-1257-DocFieldProcessorPerThread.patch, LUCENE-1257-Document.patch, 
> LUCENE-1257-FieldCacheImpl.patch, LUCENE-1257-FieldCacheRangeFilter.patch, 
> LUCENE-1257-IndexDeleter.patch, 
> LUCENE-1257-IndexDeletionPolicy_IndexFileDeleter.patch, LUCENE-1257-iw.patch, 
> LUCENE-1257-MTQWF.patch, LUCENE-1257-NormalizeCharMap.patch, 
> LUCENE-1257-o.a.l.util.patch, LUCENE-1257-org_apache_lucene_document.patch, 
> LUCENE-1257-org_apache_lucene_document.patch, 
> LUCENE-1257-org_apache_lucene_document.patch, LUCENE-1257-SegmentInfos.patch, 
> LUCENE-1257-StringBuffer.patch, LUCENE-1257-StringBuffer.patch, 
> LUCENE-1257-StringBuffer.patch, LUCENE-1257-TopDocsCollector.patch, 
> LUCENE-1257-WordListLoader.patch, LUCENE-1257_analysis.patch, 
> LUCENE-1257_BooleanFilter_Generics.patch, LUCENE-1257_javacc_upgrade.patch, 
> LUCENE-1257_messages.patch, LUCENE-1257_more_unnecessary_casts.patch, 
> LUCENE-1257_MultiFieldQueryParser.patch, LUCENE-1257_o.a.l.queryParser.patch, 
> LUCENE-1257_o.a.l.store.patch, LUCENE-1257_o_a_l_index_test.patch, 
> LUCENE-1257_o_a_l_index_test.patch, LUCENE-1257_o_a_l_search.patch, 
> LUCENE-1257_o_a_l_search_spans.patch, 
> LUCENE-1257_org_apache_lucene_index.patch, 
> LUCENE-1257_org_apache_lucene_index.patch, LUCENE-1257_queryParser_jj.patch, 
> LUCENE-1257_unnecessary_casts.patch, lucene1257surround1.patch, 
> lucene1257surround1.patch, shinglematrixfilter_generified.patch
>
>
> For my needs I've updated Lucene so that it uses Java 5 constructs. I know 
> Java 5 migration had been planned for 2.1 someday in the past, but don't know 
> when it is planned now. This patch against the trunk includes :
> - most obvious generics usage (there are tons of usages of sets, ... Those 
> which are commonly used have been generified)
> - PriorityQueue generification
> - replacement of indexed for loops with for-each constructs
> - removal of unnecessary unboxing
> The code is in my opinion much more readable with those features (you 
> actually *know* what is stored in collections reading the code, without the 
> need to look up field definitions every time) and it simplifies many 
> algorithms.
> Note that this patch also includes an interface for the Query class. This has 
> been done for my company's needs for building custom Query classes which add 
> some behaviour to the base Lucene queries. It prevents multiple unnecessary 
> casts. I know this introduction is not wanted by the team, but it really 
> makes our developments easier to maintain. If you don't want to use this, 
> replace all /Queriable/ calls with standard /Query/.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Resolved: (LUCENE-1183) TRStringDistance uses way too much memory (with patch)

2009-10-20 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless resolved LUCENE-1183.


   Resolution: Fixed
Fix Version/s: 3.0

Thanks Cédrik!

> TRStringDistance uses way too much memory (with patch)
> --
>
> Key: LUCENE-1183
> URL: https://issues.apache.org/jira/browse/LUCENE-1183
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/*
>Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3
>Reporter: Cédrik LIME
>Assignee: Otis Gospodnetic
>Priority: Minor
> Fix For: 3.0
>
> Attachments: FuzzyTermEnum.patch, TRStringDistance.java, 
> TRStringDistance.patch
>
>   Original Estimate: 0.17h
>  Remaining Estimate: 0.17h
>
> The implementation of TRStringDistance is based on version 2.1 of 
> org.apache.commons.lang.StringUtils#getLevenshteinDistance(String, String), 
> which uses an un-optimized implementation of the Levenshtein Distance 
> algorithm (it uses way too much memory). Please see Bug 38911 
> (http://issues.apache.org/bugzilla/show_bug.cgi?id=38911) for more 
> information.
> The commons-lang implementation has been heavily optimized as of version 2.2 
> (3x speed-up). I have ported the new implementation to TRStringDistance.
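
For context on the fix: the memory saving comes from keeping only two rows of
the edit-distance matrix instead of the full matrix. What follows is a minimal
sketch of that standard technique, purely illustrative and not the actual
TRStringDistance or commons-lang code:

  public static int levenshtein(String s, String t) {
    int n = s.length(), m = t.length();
    int[] prev = new int[m + 1];  // costs for the previous character of s
    int[] curr = new int[m + 1];  // costs for the current character of s
    for (int j = 0; j <= m; j++) prev[j] = j;
    for (int i = 1; i <= n; i++) {
      curr[0] = i;
      for (int j = 1; j <= m; j++) {
        int cost = (s.charAt(i - 1) == t.charAt(j - 1)) ? 0 : 1;
        curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                           prev[j - 1] + cost);
      }
      int[] tmp = prev; prev = curr; curr = tmp;  // reuse the two arrays
    }
    return prev[m];  // after the final swap, prev holds the last computed row
  }

Only two rows are ever allocated, rather than the full n*m table, which is the
point of the optimization referenced above.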

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1183) TRStringDistance uses way too much memory (with patch)

2009-10-20 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12767935#action_12767935
 ] 

Michael McCandless commented on LUCENE-1183:


OK I had 2 hunks fail but I managed to apply them.

> TRStringDistance uses way too much memory (with patch)
> --
>
> Key: LUCENE-1183
> URL: https://issues.apache.org/jira/browse/LUCENE-1183
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/*
>Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3
>Reporter: Cédrik LIME
>Assignee: Otis Gospodnetic
>Priority: Minor
> Attachments: FuzzyTermEnum.patch, TRStringDistance.java, 
> TRStringDistance.patch
>
>   Original Estimate: 0.17h
>  Remaining Estimate: 0.17h
>
> The implementation of TRStringDistance is based on version 2.1 of 
> org.apache.commons.lang.StringUtils#getLevenshteinDistance(String, String), 
> which uses an un-optimized implementation of the Levenshtein Distance 
> algorithm (it uses way too much memory). Please see Bug 38911 
> (http://issues.apache.org/bugzilla/show_bug.cgi?id=38911) for more 
> information.
> The commons-lang implementation has been heavily optimized as of version 2.2 
> (3x speed-up). I have ported the new implementation to TRStringDistance.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1257) Port to Java5

2009-10-20 Thread Kay Kay (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kay Kay updated LUCENE-1257:


Attachment: LUCENE-1257_unnecessary_casts.patch

Remove unnecessary casts across the codebase (as a result of generification) 
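
As a hypothetical illustration of what generification buys here (not a snippet
from the patch itself): once a collection declares its element type, the
compiler knows what get() returns and the downcast disappears.

  import java.util.ArrayList;
  import java.util.List;

  public class CastExample {
    // Raw list, pre-Java 5 style: callers must downcast on every get().
    static List makeRaw() {
      List l = new ArrayList();
      l.add("term");
      return l;
    }

    // Generified list: the element type is part of the declaration.
    static List<String> makeTyped() {
      List<String> l = new ArrayList<String>();
      l.add("term");
      return l;
    }

    public static void main(String[] args) {
      String fromRaw = (String) makeRaw().get(0);  // cast required
      String fromTyped = makeTyped().get(0);       // cast no longer needed
      System.out.println(fromRaw + " " + fromTyped);
    }
  }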

> Port to Java5
> -
>
> Key: LUCENE-1257
> URL: https://issues.apache.org/jira/browse/LUCENE-1257
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Analysis, Examples, Index, Other, Query/Scoring, 
> QueryParser, Search, Store, Term Vectors
>Affects Versions: 3.0
>Reporter: Cédric Champeau
>Assignee: Uwe Schindler
>Priority: Minor
> Fix For: 3.0
>
> Attachments: instantiated_fieldable.patch, 
> LUCENE-1257-BooleanQuery.patch, LUCENE-1257-BooleanScorer_2.patch, 
> LUCENE-1257-BufferedDeletes_DocumentsWriter.patch, 
> LUCENE-1257-CheckIndex.patch, LUCENE-1257-CloseableThreadLocal.patch, 
> LUCENE-1257-CompoundFileReaderWriter.patch, 
> LUCENE-1257-ConcurrentMergeScheduler.patch, 
> LUCENE-1257-DirectoryReader.patch, 
> LUCENE-1257-DisjunctionMaxQuery-more_type_safety.patch, 
> LUCENE-1257-DocFieldProcessorPerThread.patch, LUCENE-1257-Document.patch, 
> LUCENE-1257-FieldCacheImpl.patch, LUCENE-1257-FieldCacheRangeFilter.patch, 
> LUCENE-1257-IndexDeleter.patch, 
> LUCENE-1257-IndexDeletionPolicy_IndexFileDeleter.patch, LUCENE-1257-iw.patch, 
> LUCENE-1257-MTQWF.patch, LUCENE-1257-NormalizeCharMap.patch, 
> LUCENE-1257-o.a.l.util.patch, LUCENE-1257-org_apache_lucene_document.patch, 
> LUCENE-1257-org_apache_lucene_document.patch, 
> LUCENE-1257-org_apache_lucene_document.patch, LUCENE-1257-SegmentInfos.patch, 
> LUCENE-1257-StringBuffer.patch, LUCENE-1257-StringBuffer.patch, 
> LUCENE-1257-StringBuffer.patch, LUCENE-1257-TopDocsCollector.patch, 
> LUCENE-1257-WordListLoader.patch, LUCENE-1257_analysis.patch, 
> LUCENE-1257_BooleanFilter_Generics.patch, LUCENE-1257_javacc_upgrade.patch, 
> LUCENE-1257_messages.patch, LUCENE-1257_MultiFieldQueryParser.patch, 
> LUCENE-1257_o.a.l.queryParser.patch, LUCENE-1257_o.a.l.store.patch, 
> LUCENE-1257_o_a_l_index_test.patch, LUCENE-1257_o_a_l_index_test.patch, 
> LUCENE-1257_o_a_l_search.patch, LUCENE-1257_o_a_l_search_spans.patch, 
> LUCENE-1257_org_apache_lucene_index.patch, 
> LUCENE-1257_org_apache_lucene_index.patch, LUCENE-1257_queryParser_jj.patch, 
> LUCENE-1257_unnecessary_casts.patch, lucene1257surround1.patch, 
> lucene1257surround1.patch, shinglematrixfilter_generified.patch
>
>
> For my needs I've updated Lucene so that it uses Java 5 constructs. I know 
> Java 5 migration had been planned for 2.1 someday in the past, but don't know 
> when it is planned now. This patch against the trunk includes :
> - most obvious generics usage (there are tons of usages of sets, ... Those 
> which are commonly used have been generified)
> - PriorityQueue generification
> - replacement of indexed for loops with for-each constructs
> - removal of unnecessary unboxing
> The code is in my opinion much more readable with those features (you 
> actually *know* what is stored in collections reading the code, without the 
> need to look up field definitions every time) and it simplifies many 
> algorithms.
> Note that this patch also includes an interface for the Query class. This has 
> been done for my company's needs for building custom Query classes which add 
> some behaviour to the base Lucene queries. It prevents multiple unnecessary 
> casts. I know this introduction is not wanted by the team, but it really 
> makes our developments easier to maintain. If you don't want to use this, 
> replace all /Queriable/ calls with standard /Query/.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1183) TRStringDistance uses way too much memory (with patch)

2009-10-20 Thread JIRA

[ 
https://issues.apache.org/jira/browse/LUCENE-1183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12767928#action_12767928
 ] 

Cédrik LIME commented on LUCENE-1183:
-

Thanks Michael.
FuzzyTermEnum.java has not changed for more than 2 years. The uploaded patch 
(FuzzyTermEnum.patch) is still valid for trunk.

> TRStringDistance uses way too much memory (with patch)
> --
>
> Key: LUCENE-1183
> URL: https://issues.apache.org/jira/browse/LUCENE-1183
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/*
>Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3
>Reporter: Cédrik LIME
>Assignee: Otis Gospodnetic
>Priority: Minor
> Attachments: FuzzyTermEnum.patch, TRStringDistance.java, 
> TRStringDistance.patch
>
>   Original Estimate: 0.17h
>  Remaining Estimate: 0.17h
>
> The implementation of TRStringDistance is based on version 2.1 of 
> org.apache.commons.lang.StringUtils#getLevenshteinDistance(String, String), 
> which uses an un-optimized implementation of the Levenshtein Distance 
> algorithm (it uses way too much memory). Please see Bug 38911 
> (http://issues.apache.org/bugzilla/show_bug.cgi?id=38911) for more 
> information.
> The commons-lang implementation has been heavily optimized as of version 2.2 
> (3x speed-up). I have ported the new implementation to TRStringDistance.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: lucene 2.9 sorting algorithm

2009-10-20 Thread Michael McCandless
OK I posted a patch that folds the MultiPQ approach into
contrib/benchmark, plus a simple python wrapper to run old/new tests
across different queries, sort, topN, etc.

But I got different results... MultiPQ looks generally slower than
SinglePQ.  So I think we now need to reconcile what's different
between our tests.

Mike

On Mon, Oct 19, 2009 at 9:28 PM, John Wang  wrote:
> Hi Michael:
>      Was wondering if you got a chance to take a look at this.
>      Since deprecated APIs are being removed in 3.0, I was wondering if/when
> we would decide on keeping the ScoreDocComparator API, and thus whether it
> would be kept for Lucene 3.0.
> Thanks
> -John
>
> On Fri, Oct 16, 2009 at 9:53 AM, Michael McCandless
>  wrote:
>>
>> Oh, no problem...
>>
>> Mike
>>
>> On Fri, Oct 16, 2009 at 12:33 PM, John Wang  wrote:
>> > Mike, just a clarification on my first perf report email.
>> > In the first section, numHits is incorrectly labeled; it should be 20
>> > instead of 50. Sorry about the possible confusion.
>> > Thanks
>> > -John
>> >
>> > On Fri, Oct 16, 2009 at 3:21 AM, Michael McCandless
>> >  wrote:
>> >>
>> >> Thanks John; I'll have a look.
>> >>
>> >> Mike
>> >>
>> >> On Fri, Oct 16, 2009 at 12:57 AM, John Wang 
>> >> wrote:
>> >> > Hi Michael:
>> >> >     I added classes: ScoreDocComparatorQueue
>> >> > and OneSortNoScoreCollector
>> >> > as
>> >> > a more general case. I think keeping the old api for
>> >> > ScoreDocComparator
>> >> > and
>> >> > SortComparatorSource would work.
>> >> >   Please take a look.
>> >> > Thanks
>> >> > -John
>> >> >
>> >> > On Thu, Oct 15, 2009 at 6:52 PM, John Wang 
>> >> > wrote:
>> >> >>
>> >> >> Hi Michael:
>> >> >>      It is
>> >> >> open, http://code.google.com/p/lucene-book/source/checkout
>> >> >>      I think I sent the https url instead, sorry.
>> >> >>     The multi PQ sorting is fairly self-contained, I have 2
>> >> >> versions, 1
>> >> >> for string and 1 for int, each are Collector impls.
>> >> >>      I shouldn't say the Multi Q is faster on int sort, it is within
>> >> >> the
>> >> >> error boundary. The diff is very, very small; I would say they are
>> >> >> more or less equal.
>> >> >>      If you think it is a good thing to go this way, (if not for the
>> >> >> perf,
>> >> >> just for the simpler api) I'd be happy to work on a patch.
>> >> >> Thanks
>> >> >> -John
>> >> >> On Thu, Oct 15, 2009 at 5:18 PM, Michael McCandless
>> >> >>  wrote:
>> >> >>>
>> >> >>> John, looks like this requires login -- any plans to open that up,
>> >> >>> or,
>> >> >>> post the code on an issue?
>> >> >>>
>> >> >>> How self-contained is your Multi PQ sorting?  EG is it a standalone
>> >> >>> Collector impl that I can test?
>> >> >>>
>> >> >>> Mike
>> >> >>>
>> >> >>> On Thu, Oct 15, 2009 at 6:33 PM, John Wang 
>> >> >>> wrote:
>> >> >>> > BTW, we have a little sandbox for these experiments, and all my
>> >> >>> > test code is at the URL below. It is not very polished.
>> >> >>> >
>> >> >>> > https://lucene-book.googlecode.com/svn/trunk
>> >> >>> >
>> >> >>> > -John
>> >> >>> >
>> >> >>> > On Thu, Oct 15, 2009 at 3:29 PM, John Wang 
>> >> >>> > wrote:
>> >> >>> >>
>> >> >>> >> Numbers Mike requested for Int types:
>> >> >>> >>
>> >> >>> >> only the time/cputime are posted, others are all the same since
>> >> >>> >> the
>> >> >>> >> algorithm is the same.
>> >> >>> >>
>> >> >>> >> Lucene 2.9:
>> >> >>> >> numhits: 10
>> >> >>> >> time: 14619495
>> >> >>> >> cpu: 146126
>> >> >>> >>
>> >> >>> >> numhits: 20
>> >> >>> >> time: 14550568
>> >> >>> >> cpu: 163242
>> >> >>> >>
>> >> >>> >> numhits: 100
>> >> >>> >> time: 16467647
>> >> >>> >> cpu: 178379
>> >> >>> >>
>> >> >>> >>
>> >> >>> >> my test:
>> >> >>> >> numHits: 10
>> >> >>> >> time: 14101094
>> >> >>> >> cpu: 144715
>> >> >>> >>
>> >> >>> >> numHits: 20
>> >> >>> >> time: 14804821
>> >> >>> >> cpu: 151305
>> >> >>> >>
>> >> >>> >> numHits: 100
>> >> >>> >> time: 15372157
>> >> >>> >> cpu time: 158842
>> >> >>> >>
>> >> >>> >> Conclusions:
>> >> >>> >> They are very similar; the differences are all within error
>> >> >>> >> bounds, especially with lower PQ sizes, with the second sort
>> >> >>> >> algorithm again slightly faster.
>> >> >>> >>
>> >> >>> >> Hope this helps.
>> >> >>> >>
>> >> >>> >> -John
>> >> >>> >>
>> >> >>> >>
>> >> >>> >> On Thu, Oct 15, 2009 at 3:04 PM, Yonik Seeley
>> >> >>> >> 
>> >> >>> >> wrote:
>> >> >>> >>>
>> >> >>> >>> On Thu, Oct 15, 2009 at 5:33 PM, Michael McCandless
>> >> >>> >>>  wrote:
>> >> >>> >>> > Though it'd be odd if the switch to searching by segment
>> >> >>> >>> > really was most of the gains here.
>> >> >>> >>>
>> >> >>> >>> I had assumed that much of the improvement was due to ditching
>> >> >>> >>> MultiTermEnum/MultiTermDocs.
>> >> >>> >>> Note that LUCENE-1483 was before LUCENE-1596... but that only
>> >> >>> >>> helps
>> >> >>> >>> with queries that use a TermEnum (range, prefix, etc).
>> >> >>> >>>
>> >> >>> >>> -Yonik
>> >> >>> >>> http://www.lucidima

Re: lucene 2.9 sorting algorithm

2009-10-20 Thread Michael McCandless
On Tue, Oct 20, 2009 at 11:47 AM, Michael McCandless
 wrote:
> On Tue, Oct 20, 2009 at 10:49 AM, Mark Miller  wrote:
>> bq. One trivial thing that could be improved is to perhaps move all of
>> the methods to the top of the class?
>>
>> +1 - I think Mike and I silently fought on that one once in the patches :)
>> Though I don't know how conscious it was. I prefer the methods at the
>> top myself.
>
> +1
>
> I didn't mean to fight that :)

I'll move them up...

Mike

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1997) Explore performance of multi-PQ vs single-PQ sorting API

2009-10-20 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-1997:
---

Attachment: LUCENE-1997.patch

New patch attached:

  * Turn off testing on the balanced index by default (set DO_BALANCED to True 
if you want to change this)

  * Minor formatting fixes in generating the report

> Explore performance of multi-PQ vs single-PQ sorting API
> 
>
> Key: LUCENE-1997
> URL: https://issues.apache.org/jira/browse/LUCENE-1997
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Affects Versions: 2.9
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Attachments: LUCENE-1997.patch, LUCENE-1997.patch
>
>
> Spinoff from recent "lucene 2.9 sorting algorithm" thread on java-dev,
> where a simpler (non-segment-based) comparator API is proposed that
> gathers results into multiple PQs (one per segment) and then merges
> them in the end.
> I started from John's multi-PQ code and worked it into
> contrib/benchmark so that we could run perf tests.  Then I generified
> the Python script I use for running search benchmarks (in
> contrib/benchmark/sortBench.py).
> The script first creates indexes with 1M docs (based on
> SortableSingleDocSource, and based on wikipedia, if available).  Then
> it runs various combinations:
>   * Index with 20 balanced segments vs index with the "normal" log
> segment size
>   * Queries with different numbers of hits (only for wikipedia index)
>   * Different top N
>   * Different sorts (by title, for wikipedia, and by random string,
> random int, and country for the random index)
> For each test, 7 search rounds are run and the best QPS is kept.  The
> script runs singlePQ then multiPQ, and records the resulting best QPS
> for each and produces table (in Jira format) as output.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1997) Explore performance of multi-PQ vs single-PQ sorting API

2009-10-20 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12767870#action_12767870
 ] 

Michael McCandless commented on LUCENE-1997:


OK I ran sortBench.py on opensolaris 2009.06 box, Java 1.6.0_13.

It'd be great if others with more mainstream platforms (Linux,
Windows) could run this and post back.

Raw results (only ran on the log-sized segments):

||Seg size||Query||Tot hits||Sort||Top N||QPS old||QPS new||Pct change||
|log|1|318481|title|10|114.26|112.40|{color:red}-1.6%{color}|
|log|1|318481|title|25|117.59|110.08|{color:red}-6.4%{color}|
|log|1|318481|title|50|116.22|106.96|{color:red}-8.0%{color}|
|log|1|318481|title|100|114.48|100.07|{color:red}-12.6%{color}|
|log|1|318481|title|500|103.16|73.98|{color:red}-28.3%{color}|
|log|1|318481|title|1000|95.60|57.85|{color:red}-39.5%{color}|
|log||100|title|10|95.71|109.41|{color:green}14.3%{color}|
|log||100|title|25|111.56|101.73|{color:red}-8.8%{color}|
|log||100|title|50|110.56|98.84|{color:red}-10.6%{color}|
|log||100|title|100|104.09|93.02|{color:red}-10.6%{color}|
|log||100|title|500|93.36|66.67|{color:red}-28.6%{color}|
|log||100|title|1000|97.07|50.03|{color:red}-48.5%{color}|
|log||100|rand string|10|118.10|109.63|{color:red}-7.2%{color}|
|log||100|rand string|25|107.68|102.33|{color:red}-5.0%{color}|
|log||100|rand string|50|107.12|100.37|{color:red}-6.3%{color}|
|log||100|rand string|100|110.63|95.17|{color:red}-14.0%{color}|
|log||100|rand string|500|79.97|72.09|{color:red}-9.9%{color}|
|log||100|rand string|1000|76.82|54.67|{color:red}-28.8%{color}|
|log||100|country|10|129.49|103.63|{color:red}-20.0%{color}|
|log||100|country|25|111.74|102.60|{color:red}-8.2%{color}|
|log||100|country|50|108.82|100.90|{color:red}-7.3%{color}|
|log||100|country|100|108.01|96.84|{color:red}-10.3%{color}|
|log||100|country|500|97.60|72.02|{color:red}-26.2%{color}|
|log||100|country|1000|85.19|54.56|{color:red}-36.0%{color}|
|log||100|rand int|10|151.75|110.37|{color:red}-27.3%{color}|
|log||100|rand int|25|138.06|109.15|{color:red}-20.9%{color}|
|log||100|rand int|50|135.40|106.49|{color:red}-21.4%{color}|
|log||100|rand int|100|108.30|101.86|{color:red}-5.9%{color}|
|log||100|rand int|500|94.45|73.42|{color:red}-22.3%{color}|
|log||100|rand int|1000|88.30|54.71|{color:red}-38.0%{color}|

Some observations:
 
  * MultiPQ seems like it's generally slower, though it is faster in
one case, when topN = 10, sorting by title.  It's only faster with
the *:* (MatchAllDocsQuery) query, not with the TermQuery for
term=1, which is odd.

  * MultiPQ slows down, relatively, as topN increases.

  * Sorting by int acts differently: MultiPQ is quite a bit slower
across the board, except for topN=100 


> Explore performance of multi-PQ vs single-PQ sorting API
> 
>
> Key: LUCENE-1997
> URL: https://issues.apache.org/jira/browse/LUCENE-1997
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Affects Versions: 2.9
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Attachments: LUCENE-1997.patch, LUCENE-1997.patch
>
>
> Spinoff from recent "lucene 2.9 sorting algorithm" thread on java-dev,
> where a simpler (non-segment-based) comparator API is proposed that
> gathers results into multiple PQs (one per segment) and then merges
> them in the end.
> I started from John's multi-PQ code and worked it into
> contrib/benchmark so that we could run perf tests.  Then I generified
> the Python script I use for running search benchmarks (in
> contrib/benchmark/sortBench.py).
> The script first creates indexes with 1M docs (based on
> SortableSingleDocSource, and based on wikipedia, if available).  Then
> it runs various combinations:
>   * Index with 20 balanced segments vs index with the "normal" log
> segment size
>   * Queries with different numbers of hits (only for wikipedia index)
>   * Different top N
>   * Different sorts (by title, for wikipedia, and by random string,
> random int, and country for the random index)
> For each test, 7 search rounds are run and the best QPS is kept.  The
> script runs singlePQ then multiPQ, and records the resulting best QPS
> for each and produces table (in Jira format) as output.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1257) Port to Java5

2009-10-20 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12767852#action_12767852
 ] 

Uwe Schindler commented on LUCENE-1257:
---

Committed:
   LUCENE-1257_javacc_upgrade.patch 2009-10-20 12:06 AM Kay Kay 0.8 kB 
   LUCENE-1257_MultiFieldQueryParser.patch 2009-10-19 11:50 PM Kay Kay 3 kB 
   LUCENE-1257_queryParser_jj.patch 2009-10-19 11:45 PM Kay Kay 6 kB 

Also removed deprecated API from QueryParser.

At revision: 827717

> Port to Java5
> -
>
> Key: LUCENE-1257
> URL: https://issues.apache.org/jira/browse/LUCENE-1257
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Analysis, Examples, Index, Other, Query/Scoring, 
> QueryParser, Search, Store, Term Vectors
>Affects Versions: 3.0
>Reporter: Cédric Champeau
>Assignee: Uwe Schindler
>Priority: Minor
> Fix For: 3.0
>
> Attachments: instantiated_fieldable.patch, 
> LUCENE-1257-BooleanQuery.patch, LUCENE-1257-BooleanScorer_2.patch, 
> LUCENE-1257-BufferedDeletes_DocumentsWriter.patch, 
> LUCENE-1257-CheckIndex.patch, LUCENE-1257-CloseableThreadLocal.patch, 
> LUCENE-1257-CompoundFileReaderWriter.patch, 
> LUCENE-1257-ConcurrentMergeScheduler.patch, 
> LUCENE-1257-DirectoryReader.patch, 
> LUCENE-1257-DisjunctionMaxQuery-more_type_safety.patch, 
> LUCENE-1257-DocFieldProcessorPerThread.patch, LUCENE-1257-Document.patch, 
> LUCENE-1257-FieldCacheImpl.patch, LUCENE-1257-FieldCacheRangeFilter.patch, 
> LUCENE-1257-IndexDeleter.patch, 
> LUCENE-1257-IndexDeletionPolicy_IndexFileDeleter.patch, LUCENE-1257-iw.patch, 
> LUCENE-1257-MTQWF.patch, LUCENE-1257-NormalizeCharMap.patch, 
> LUCENE-1257-o.a.l.util.patch, LUCENE-1257-org_apache_lucene_document.patch, 
> LUCENE-1257-org_apache_lucene_document.patch, 
> LUCENE-1257-org_apache_lucene_document.patch, LUCENE-1257-SegmentInfos.patch, 
> LUCENE-1257-StringBuffer.patch, LUCENE-1257-StringBuffer.patch, 
> LUCENE-1257-StringBuffer.patch, LUCENE-1257-TopDocsCollector.patch, 
> LUCENE-1257-WordListLoader.patch, LUCENE-1257_analysis.patch, 
> LUCENE-1257_BooleanFilter_Generics.patch, LUCENE-1257_javacc_upgrade.patch, 
> LUCENE-1257_messages.patch, LUCENE-1257_MultiFieldQueryParser.patch, 
> LUCENE-1257_o.a.l.queryParser.patch, LUCENE-1257_o.a.l.store.patch, 
> LUCENE-1257_o_a_l_index_test.patch, LUCENE-1257_o_a_l_index_test.patch, 
> LUCENE-1257_o_a_l_search.patch, LUCENE-1257_o_a_l_search_spans.patch, 
> LUCENE-1257_org_apache_lucene_index.patch, 
> LUCENE-1257_org_apache_lucene_index.patch, LUCENE-1257_queryParser_jj.patch, 
> lucene1257surround1.patch, lucene1257surround1.patch, 
> shinglematrixfilter_generified.patch
>
>
> For my needs I've updated Lucene so that it uses Java 5 constructs. I know 
> Java 5 migration had been planned for 2.1 someday in the past, but don't know 
> when it is planned now. This patch against the trunk includes :
> - most obvious generics usage (there are tons of usages of sets, ... Those 
> which are commonly used have been generified)
> - PriorityQueue generification
> - replacement of indexed for loops with for-each constructs
> - removal of unnecessary unboxing
> The code is in my opinion much more readable with those features (you 
> actually *know* what is stored in collections reading the code, without the 
> need to look up field definitions every time) and it simplifies many 
> algorithms.
> Note that this patch also includes an interface for the Query class. This has 
> been done for my company's needs for building custom Query classes which add 
> some behaviour to the base Lucene queries. It prevents multiple unnecessary 
> casts. I know this introduction is not wanted by the team, but it really 
> makes our developments easier to maintain. If you don't want to use this, 
> replace all /Queriable/ calls with standard /Query/.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: lucene 2.9 sorting algorithm

2009-10-20 Thread John Wang
Sorry, mistyped again: we have a multivalued field of STRINGS, not integers.
-John

On Tue, Oct 20, 2009 at 8:55 AM, John Wang  wrote:

> Hi guys:
> I am not suggesting just simply changing the deprecated signatures.
> There is some work to be done, of course. In the beginning of the thread, we
> discussed two algorithms (both handling per-segment field loading), and the
> conclusion (still to be verified by Mike) is that both algorithms perform
> the same. (We do see that once the queue size increases, the performance cost
> increases more for the single-Q approach, the one in the trunk, than for the
> multi-Q approach; please see the numbers I posted earlier in this thread.)
>
> However, the multiQ algorithm would allow us to keep the old simpler
> api, and the simpler api places less restriction on the type of custom
> sorting that can be done.
>
> Let me provide an example:
>
> We have a multi-valued field on integers; we define a sort on this set
> of strings by defining a comparator on each value, similar to a lex
> order, except that instead of comparing characters we compare strings.
> We also want to keep the multi-value representation, as we do filtering
> and facet counting on it. The in-memory representation is similar to the
> UnInvertedField in Solr.
>
> Implementing a sort with the old API was rather simple, as we only
> needed a mapping from a docid to a set of ordinals. With the new API, we
> needed to do a "conversion", which would mean mapping a set of
> Strings/ordinals back to a doc, which to me is not trivial, to say
> nothing of the performance implications.
>
> That actually gave us the motivation to see if the old API could handle the
> segment-level changes that were made in 2.9 (which in my opinion is the best
> thing in Lucene since payloads :) )
>
> So after some investigation, with code and big-O analysis, and
> discussions with Mike and Yonik, on our end we feel that, given the performance
> numbers, it is unnecessary to go with the more complicated API.
>
> Thanks
>
> -John
>
>
>
> On Tue, Oct 20, 2009 at 6:00 AM, Mark Miller wrote:
>
>> Actually though - how are we supposed to get back there? I don't think
>> it's as simple as just not removing the deprecated APIs. Doesn't even
>> seem close to that simple. It's another nightmare. It would have to be
>> some serious wins to go through that pain starting at a 3.0 release
>> wouldn't it? We just had a ton of people switch. We would have to
>> deprecate a bunch of stuff. Hard to imagine wanting to switch now - the
>> new API is certainly not that bad.
>>
>> -
>> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-dev-h...@lucene.apache.org
>>
>>
>


Re: lucene 2.9 sorting algorithm

2009-10-20 Thread John Wang
Hi guys:
I am not suggesting just simply changing the deprecated signatures.
There is some work to be done, of course. In the beginning of the thread, we
discussed two algorithms (both handling per-segment field loading), and the
conclusion (still to be verified by Mike) is that both algorithms perform
the same. (We do see that once the queue size increases, the performance cost
increases more for the single-Q approach, the one in the trunk, than for the
multi-Q approach; please see the numbers I posted earlier in this thread.)

However, the multiQ algorithm would allow us to keep the old simpler
api, and the simpler api places less restriction on the type of custom
sorting that can be done.

Let me provide an example:

We have a multi-valued field on integers; we define a sort on this set
of strings by defining a comparator on each value, similar to a lex
order, except that instead of comparing characters we compare strings.
We also want to keep the multi-value representation, as we do filtering
and facet counting on it. The in-memory representation is similar to the
UnInvertedField in Solr.

   Implementing a sort with the old API was rather simple, as we only needed
a mapping from a docid to a set of ordinals. With the new API, we needed to do
a "conversion", which would mean mapping a set of Strings/ordinals back to a
doc, which to me is not trivial, to say nothing of the performance implications.
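
As a minimal, hypothetical sketch of the docid-to-ordinals comparison described
above (plain Java, no Lucene API, and not the actual sandbox code): sorting such
documents reduces to a lexicographic compare of per-document ordinal arrays.

  // Each document maps to a sorted array of term ordinals; compare element by
  // element, with the shorter array winning a tie.
  public class OrdinalListComparator {
    static int compare(int[] ordsA, int[] ordsB) {
      int len = Math.min(ordsA.length, ordsB.length);
      for (int i = 0; i < len; i++) {
        if (ordsA[i] != ordsB[i]) {
          return ordsA[i] < ordsB[i] ? -1 : 1;
        }
      }
      return ordsA.length - ordsB.length;
    }

    public static void main(String[] args) {
      int[] doc1 = {3, 7};
      int[] doc2 = {3, 9};
      System.out.println(compare(doc1, doc2));  // negative: doc1 sorts first
    }
  }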

   That actually gave us the motivation to see if the old API could handle the
segment-level changes that were made in 2.9 (which in my opinion is the best
thing in Lucene since payloads :) )

   So after some investigation, with code and big-O analysis, and
discussions with Mike and Yonik, on our end we feel that, given the performance
numbers, it is unnecessary to go with the more complicated API.

Thanks

-John



On Tue, Oct 20, 2009 at 6:00 AM, Mark Miller  wrote:

> Actually though - how are we supposed to get back there? I don't think
> it's as simple as just not removing the deprecated APIs. Doesn't even
> seem close to that simple. It's another nightmare. It would have to be
> some serious wins to go through that pain starting at a 3.0 release
> wouldn't it? We just had a ton of people switch. We would have to
> deprecate a bunch of stuff. Hard to imagine wanting to switch now - the
> new API is certainly not that bad.
>
> -
> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-dev-h...@lucene.apache.org
>
>


Re: lucene 2.9 sorting algorithm

2009-10-20 Thread Michael McCandless
On Tue, Oct 20, 2009 at 10:49 AM, Mark Miller  wrote:
> bq. One trivial thing that could be improved is to perhaps move all of
> the methods to the top of the class?
>
> +1 - I think Mike and I silently fought on that one once in the patches :)
> Though I don't know how conscious it was. I prefer the methods at the
> top myself.

+1

I didn't mean to fight that :)

Mike

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1997) Explore performance of multi-PQ vs single-PQ sorting API

2009-10-20 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-1997:
---

Attachment: LUCENE-1997.patch

Attached patch.

Note that patch is based on 2.9.x branch, so first checkout 2.9.x,
apply the patch, then:

  cd contrib/benchmark
  ant compile
  
  python -u sortBench.py -run results
  python -u sortBench.py -report results

The important constants are INDEX_DIR_BASE (where created indexes are
stored), WIKI_FILE (points to .tar.bz2 or .tar export of wikipedia; if
this file can't be found the script just skips the wikipedia tests).
You can also change INDEX_NUM_DOCS and INDEX_NUM_THREADS.

If you don't have the wiki export downloaded, that's fine... the
script should just run the tests based on the random index.


> Explore performance of multi-PQ vs single-PQ sorting API
> 
>
> Key: LUCENE-1997
> URL: https://issues.apache.org/jira/browse/LUCENE-1997
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Affects Versions: 2.9
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Attachments: LUCENE-1997.patch
>
>
> Spinoff from recent "lucene 2.9 sorting algorithm" thread on java-dev,
> where a simpler (non-segment-based) comparator API is proposed that
> gathers results into multiple PQs (one per segment) and then merges
> them in the end.
> I started from John's multi-PQ code and worked it into
> contrib/benchmark so that we could run perf tests.  Then I generified
> the Python script I use for running search benchmarks (in
> contrib/benchmark/sortBench.py).
> The script first creates indexes with 1M docs (based on
> SortableSingleDocSource, and based on wikipedia, if available).  Then
> it runs various combinations:
>   * Index with 20 balanced segments vs index with the "normal" log
> segment size
>   * Queries with different numbers of hits (only for wikipedia index)
>   * Different top N
>   * Different sorts (by title, for wikipedia, and by random string,
> random int, and country for the random index)
> For each test, 7 search rounds are run and the best QPS is kept.  The
> script runs singlePQ then multiPQ, and records the resulting best QPS
> for each and produces table (in Jira format) as output.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Created: (LUCENE-1997) Explore performance of multi-PQ vs single-PQ sorting API

2009-10-20 Thread Michael McCandless (JIRA)
Explore performance of multi-PQ vs single-PQ sorting API


 Key: LUCENE-1997
 URL: https://issues.apache.org/jira/browse/LUCENE-1997
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Affects Versions: 2.9
Reporter: Michael McCandless
Assignee: Michael McCandless


Spinoff from recent "lucene 2.9 sorting algorithm" thread on java-dev,
where a simpler (non-segment-based) comparator API is proposed that
gathers results into multiple PQs (one per segment) and then merges
them in the end.

I started from John's multi-PQ code and worked it into
contrib/benchmark so that we could run perf tests.  Then I generified
the Python script I use for running search benchmarks (in
contrib/benchmark/sortBench.py).

The script first creates indexes with 1M docs (based on
SortableSingleDocSource, and based on wikipedia, if available).  Then
it runs various combinations:

  * Index with 20 balanced segments vs index with the "normal" log
segment size

  * Queries with different numbers of hits (only for wikipedia index)

  * Different top N

  * Different sorts (by title, for wikipedia, and by random string,
random int, and country for the random index)

For each test, 7 search rounds are run and the best QPS is kept.  The
script runs singlePQ then multiPQ, and records the resulting best QPS
for each and produces table (in Jira format) as output.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1183) TRStringDistance uses way too much memory (with patch)

2009-10-20 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12767820#action_12767820
 ] 

Michael McCandless commented on LUCENE-1183:


Cédrik, could you update the patch to trunk?  It sounds like a compelling 
improvement.  We should get it in.

> TRStringDistance uses way too much memory (with patch)
> --
>
> Key: LUCENE-1183
> URL: https://issues.apache.org/jira/browse/LUCENE-1183
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/*
>Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3
>Reporter: Cédrik LIME
>Assignee: Otis Gospodnetic
>Priority: Minor
> Attachments: FuzzyTermEnum.patch, TRStringDistance.java, 
> TRStringDistance.patch
>
>   Original Estimate: 0.17h
>  Remaining Estimate: 0.17h
>
> The implementation of TRStringDistance is based on version 2.1 of 
> org.apache.commons.lang.StringUtils#getLevenshteinDistance(String, String), 
> which uses an un-optimized implementation of the Levenshtein Distance 
> algorithm (it uses way too much memory). Please see Bug 38911 
> (http://issues.apache.org/bugzilla/show_bug.cgi?id=38911) for more 
> information.
> The commons-lang implementation has been heavily optimized as of version 2.2 
> (3x speed-up). I have ported the new implementation to TRStringDistance.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: lucene 2.9 sorting algorithm

2009-10-20 Thread Mark Miller
bq. One trivial thing that could be improved is to perhaps move all of
the methods to the top of the class?

+1 - I think Mike and I silently fought on that one once in the patches :)
Though I don't know how conscious it was. I prefer the methods at the
top myself.

Yonik Seeley wrote:
> On Tue, Oct 20, 2009 at 9:31 AM, Uwe Schindler  wrote:
>   
>> It is not bad, only harder to understand (for some people).
>> 
>
> The Javadoc is much improved since I made the switch.
>
> Right now, if I go and browse FieldComparator, I'm immediately hit
> with inner classes... and it takes time to find the actual methods one
> must override to create their own FieldComparator, and that make the
> class seem more complex at first blush (but maybe that's just me
> too...)
>
>
> -Yonik
> http://www.lucidimagination.com
>
> -
> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-dev-h...@lucene.apache.org
>
>   


-- 
- Mark

http://www.lucidimagination.com




-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: lucene 2.9 sorting algorithm

2009-10-20 Thread Earwin Burrfoot
There are some advanced things that are plain impossible with stock new API.
Like having more than one HitQueue in your Collector, and stashing
overflowing values from one of them into another. Once you cross the
segment border - BOOM!

Otherwise it may look intimidating, but is pretty simple in fact.

On Tue, Oct 20, 2009 at 17:41, Yonik Seeley  wrote:
> On Tue, Oct 20, 2009 at 9:31 AM, Uwe Schindler  wrote:
>> It is not bad, only harder to understand (for some people).
>
> The Javadoc is much improved since I made the switch.
> One trivial thing that could be improved is to perhaps move all of the
> methods to the top of the class?
> Right now, if I go and browse FieldComparator, I'm immediately hit
> with inner classes... and it takes time to find the actual methods one
> must override to create their own FieldComparator, and that make the
> class seem more complex at first blush (but maybe that's just me
> too...)
>
>
> -Yonik
> http://www.lucidimagination.com
>
> -
> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-dev-h...@lucene.apache.org


-- 
Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
ICQ: 104465785

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



RE: Compile failure in 2.9.1 Highlighter

2009-10-20 Thread Uwe Schindler
Lucene 2.9 is Java 1.4 only (in build script), so autoboxing does not work.
With trunk it works, but not with 2.9.
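
A tiny illustration of the difference (hypothetical code, not the Highlighter
test itself): under a 1.4 source level the int-to-Integer conversion has to be
spelled out, while a 1.5 source level autoboxes it.

  public class BoxingExample {
    static void takesInteger(Integer value) {
      System.out.println(value);
    }

    public static void main(String[] args) {
      takesInteger(new Integer(2)); // compiles with -source 1.4 and later
      takesInteger(2);              // autoboxing: compiles only with -source 1.5+
    }
  }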

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de


> -Original Message-
> From: ysee...@gmail.com [mailto:ysee...@gmail.com] On Behalf Of Yonik
> Seeley
> Sent: Tuesday, October 20, 2009 4:19 PM
> To: java-dev@lucene.apache.org
> Subject: Re: Compile failure in 2.9.1 Highlighter
> 
> Not sure... but could it just be that you are relying on autoboxing,
> which is a Java5 feature, and the core is still marked for Java1.4
> syntax?
> 
> -Yonik
> http://www.lucidimagination.com
> 
> 
> 
> On Tue, Oct 20, 2009 at 10:02 AM, Mark Miller 
> wrote:
> > Can someone help me figure this out. The Highlighter test runs and
> > passes for me in Eclipse. It obviously compiles too.
> >
> > But when I try and compile with the ant build scripts, I get:
> >
> >    [javac]
> >
> /home/mark/workspace/lucene_2_9/contrib/highlighter/src/test/org/apache/lu
> cene/search/highlight/HighlighterTest.java:266:
> > cannot find symbol
> >    [javac] symbol  : method
> > newIntRange(java.lang.String,int,int,boolean,boolean)
> >    [javac] location: class org.apache.lucene.search.NumericRangeQuery
> >    [javac]     query =
> > NumericRangeQuery.newIntRange(NUMERIC_FIELD_NAME, 2, 6, true, true);
> >
> > But that method appears to exist - I'm momentarily at a loss - someone
> > see the issue?
> >
> > --
> > - Mark
> >
> > http://www.lucidimagination.com
> >
> >
> >
> >
> > -
> > To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
> > For additional commands, e-mail: java-dev-h...@lucene.apache.org
> >
> >
> 
> -
> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-dev-h...@lucene.apache.org



-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: Compile failure in 2.9.1 Highlighter

2009-10-20 Thread Mark Miller
Thank you, thank you, thank you ... I could have run around that for years.

Yonik Seeley wrote:
> Not sure... but could it just be that you are relying on autoboxing,
> which is a Java5 feature, and the core is still marked for Java1.4
> syntax?
>
> -Yonik
> http://www.lucidimagination.com
>
>
>
> On Tue, Oct 20, 2009 at 10:02 AM, Mark Miller  wrote:
>   
>> Can someone help me figure this out. The Highlighter test runs and
>> passes for me in Eclipse. It obviously compiles too.
>>
>> But when I try and compile with the ant build scripts, I get:
>>
>>[javac]
>> /home/mark/workspace/lucene_2_9/contrib/highlighter/src/test/org/apache/lucene/search/highlight/HighlighterTest.java:266:
>> cannot find symbol
>>[javac] symbol  : method
>> newIntRange(java.lang.String,int,int,boolean,boolean)
>>[javac] location: class org.apache.lucene.search.NumericRangeQuery
>>[javac] query =
>> NumericRangeQuery.newIntRange(NUMERIC_FIELD_NAME, 2, 6, true, true);
>>
>> But that method appears to exist - I'm momentarily at a loss - someone
>> see the issue?
>>
>> --
>> - Mark
>>
>> http://www.lucidimagination.com
>>
>>
>>
>>
>> -
>> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-dev-h...@lucene.apache.org
>>
>>
>> 
>
> -
> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-dev-h...@lucene.apache.org
>
>   


-- 
- Mark

http://www.lucidimagination.com




-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: Compile failure in 2.9.1 Highlighter

2009-10-20 Thread Yonik Seeley
Not sure... but could it just be that you are relying on autoboxing,
which is a Java5 feature, and the core is still marked for Java1.4
syntax?

-Yonik
http://www.lucidimagination.com



On Tue, Oct 20, 2009 at 10:02 AM, Mark Miller  wrote:
> Can someone help me figure this out. The Highlighter test runs and
> passes for me in Eclipse. It obviously compiles too.
>
> But when I try and compile with the ant build scripts, I get:
>
>    [javac]
> /home/mark/workspace/lucene_2_9/contrib/highlighter/src/test/org/apache/lucene/search/highlight/HighlighterTest.java:266:
> cannot find symbol
>    [javac] symbol  : method
> newIntRange(java.lang.String,int,int,boolean,boolean)
>    [javac] location: class org.apache.lucene.search.NumericRangeQuery
>    [javac]     query =
> NumericRangeQuery.newIntRange(NUMERIC_FIELD_NAME, 2, 6, true, true);
>
> But that method appears to exist - I'm momentarily at a loss - someone
> see the issue?
>
> --
> - Mark
>
> http://www.lucidimagination.com
>
>
>
>
> -
> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-dev-h...@lucene.apache.org
>
>

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Compile failure in 2.9.1 Highlighter

2009-10-20 Thread Mark Miller
Can someone help me figure this out. The Highlighter test runs and
passes for me in Eclipse. It obviously compiles too.

But when I try and compile with the ant build scripts, I get:

[javac]
/home/mark/workspace/lucene_2_9/contrib/highlighter/src/test/org/apache/lucene/search/highlight/HighlighterTest.java:266:
cannot find symbol
[javac] symbol  : method
newIntRange(java.lang.String,int,int,boolean,boolean)
[javac] location: class org.apache.lucene.search.NumericRangeQuery
[javac] query =
NumericRangeQuery.newIntRange(NUMERIC_FIELD_NAME, 2, 6, true, true);

But that method appears to exist - I'm momentarily at a loss - someone
see the issue?

-- 
- Mark

http://www.lucidimagination.com




-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: lucene 2.9 sorting algorithm

2009-10-20 Thread Yonik Seeley
On Tue, Oct 20, 2009 at 9:31 AM, Uwe Schindler  wrote:
> It is not bad, only harder to understand (for some people).

The Javadoc is much improved since I made the switch.
One trivial thing that could be improved is to perhaps move all of the
methods to the top of the class?
Right now, if I go and browse FieldComparator, I'm immediately hit
with inner classes... and it takes time to find the actual methods one
must override to create their own FieldComparator, and that make the
class seem more complex at first blush (but maybe that's just me
too...)


-Yonik
http://www.lucidimagination.com

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



RE: lucene 2.9 sorting algorithm

2009-10-20 Thread Uwe Schindler
> Actually though - how are we supposed to get back there? I don't think
> it's as simple as just not removing the deprecated APIs. Doesn't even
> seem close to that simple. It's another nightmare. It would have to be
> some serious wins to go through that pain starting at a 3.0 release
> wouldn't it? We just had a ton of people switch. We would have to
> deprecate a bunch of stuff. Hard to imagine wanting to switch now - the
> new API is certainly not that bad.

It is not bad, only harder to understand (for some people). As noted in my
previous mail, we could start the same discussion like with the Collectors:

Should we provide some easy-to-use FieldComparators that can be used as a
starting point? Maybe something that looks like the comparators used for
byte/int/short and so on, but has an abstract method
getComparable(IndexReader reader, int doc) [relative to the current reader] that
returns a Comparable. All other methods and arrays would use the datatype
Comparable and implement just the compare functions (on inner slots and the
bottom). Or maybe another wrapper that takes a java.lang.Comparator and a
method T getSortValue(IndexReader reader, int doc) and implements the
comparison. We had something like this in the old API, too. It could be
implemented by a wrapper with this easy API.

These wrapper classes may be a lot slower than native FieldComparators
working directly on the native types, but they would help to create simple
sorting for things like geo-distances, where the calculation of the field
values/Comparables takes most of the time.
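
A rough sketch of the first variant (assuming the 2.9 FieldComparator method set
of compare/setBottom/compareBottom/copy/setNextReader/value; the class name and
this outline are made up here, untested, and not a proposed patch):

  import java.io.IOException;
  import org.apache.lucene.index.IndexReader;
  import org.apache.lucene.search.FieldComparator;

  public abstract class ComparableComparator extends FieldComparator {
    private final Comparable[] values;  // one slot per queue entry
    private Comparable bottom;
    private IndexReader currentReader;

    protected ComparableComparator(int numHits) {
      values = new Comparable[numHits];
    }

    /** Subclasses return the sort value for a doc, relative to the current reader. */
    protected abstract Comparable getComparable(IndexReader reader, int doc)
        throws IOException;

    public int compare(int slot1, int slot2) {
      return values[slot1].compareTo(values[slot2]);
    }

    public int compareBottom(int doc) throws IOException {
      return bottom.compareTo(getComparable(currentReader, doc));
    }

    public void copy(int slot, int doc) throws IOException {
      values[slot] = getComparable(currentReader, doc);
    }

    public void setBottom(int slot) {
      bottom = values[slot];
    }

    public void setNextReader(IndexReader reader, int docBase) {
      currentReader = reader;
    }

    public Comparable value(int slot) {
      return values[slot];
    }
  }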

So some easy-access or sample comparators like this in contrib would be
fine.

Uwe


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1183) TRStringDistance uses way too much memory (with patch)

2009-10-20 Thread JIRA

[ 
https://issues.apache.org/jira/browse/LUCENE-1183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12767805#action_12767805
 ] 

Cédrik LIME commented on LUCENE-1183:
-

Any news on the landing of this patch?
Now that Lucene 2.9 is out, the vastly better memory usage and speed-up would 
be a welcome addition to Lucene 3.0's fuzzy search!

> TRStringDistance uses way too much memory (with patch)
> --
>
> Key: LUCENE-1183
> URL: https://issues.apache.org/jira/browse/LUCENE-1183
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/*
>Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3
>Reporter: Cédrik LIME
>Assignee: Otis Gospodnetic
>Priority: Minor
> Attachments: FuzzyTermEnum.patch, TRStringDistance.java, 
> TRStringDistance.patch
>
>   Original Estimate: 0.17h
>  Remaining Estimate: 0.17h
>
> The implementation of TRStringDistance is based on version 2.1 of 
> org.apache.commons.lang.StringUtils#getLevenshteinDistance(String, String), 
> which uses an un-optimized implementation of the Levenshtein Distance 
> algorithm (it uses way too much memory). Please see Bug 38911 
> (http://issues.apache.org/bugzilla/show_bug.cgi?id=38911) for more 
> information.
> The commons-lang implementation has been heavily optimized as of version 2.2 
> (3x speed-up). I have ported the new implementation to TRStringDistance.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-666) TERM1 OR NOT TERM2 does not perform as expected

2009-10-20 Thread Siddharth Gargate (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12767802#action_12767802
 ] 

Siddharth Gargate commented on LUCENE-666:
--

Can we rewrite the query (A OR NOT B) to NOT(NOT(A) AND B) to solve this issue?

> TERM1 OR NOT TERM2 does not perform as expected
> ---
>
> Key: LUCENE-666
> URL: https://issues.apache.org/jira/browse/LUCENE-666
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: QueryParser
>Affects Versions: 2.0.0
> Environment: Windows XP, JavaCC 4.0, JDK 1.5
>Reporter: Dejan Nenov
> Attachments: TestAornotB.java
>
>
> test:
> [junit] Testsuite: org.apache.lucene.search.TestAornotB
> [junit] Tests run: 3, Failures: 1, Errors: 0, Time elapsed: 0.39 sec
> [junit] - Standard Output ---
> [junit] Doc1 = A B C
> [junit] Doc2 = A B C D
> [junit] Doc3 = A   C D
> [junit] Doc4 =   B C D
> [junit] Doc5 = C D
> [junit] -
> [junit] With query "A OR NOT B" we expect to hit
> [junit] all documents EXCEPT Doc4, instead we only match on Doc3.
> [junit] While LUCENE currently explicitly does not support queries of
> [junit] the type "find docs that do not contain TERM" - this explains
> [junit] not finding Doc5, but does not justify elimnating Doc1 and Doc2
> [junit] -
> [junit]  the fix shoould likely require a modification to QueryParser.jj
> [junit]  around the method:
> [junit]  protected void addClause(Vector clauses, int conj, int mods, 
> Query q)
> [junit] Query:c:a -c:b hits.length=1
> [junit] Query Found:Doc[0]= A C D
> [junit] 0.0 = (NON-MATCH) Failure to meet condition(s) of 
> required/prohibited clause(s)
> [junit]   0.6115718 = (MATCH) fieldWeight(c:a in 1), product of:
> [junit] 1.0 = tf(termFreq(c:a)=1)
> [junit] 1.2231436 = idf(docFreq=3)
> [junit] 0.5 = fieldNorm(field=c, doc=1)
> [junit]   0.0 = match on prohibited clause (c:b)
> [junit] 0.6115718 = (MATCH) fieldWeight(c:b in 1), product of:
> [junit]   1.0 = tf(termFreq(c:b)=1)
> [junit]   1.2231436 = idf(docFreq=3)
> [junit]   0.5 = fieldNorm(field=c, doc=1)
> [junit] 0.6115718 = (MATCH) sum of:
> [junit]   0.6115718 = (MATCH) fieldWeight(c:a in 2), product of:
> [junit] 1.0 = tf(termFreq(c:a)=1)
> [junit] 1.2231436 = idf(docFreq=3)
> [junit] 0.5 = fieldNorm(field=c, doc=2)
> [junit] 0.0 = (NON-MATCH) Failure to meet condition(s) of 
> required/prohibited clause(s)
> [junit]   0.0 = match on prohibited clause (c:b)
> [junit] 0.6115718 = (MATCH) fieldWeight(c:b in 3), product of:
> [junit]   1.0 = tf(termFreq(c:b)=1)
> [junit]   1.2231436 = idf(docFreq=3)
> [junit]   0.5 = fieldNorm(field=c, doc=3)
> [junit] Query:c:a (-c:b) hits.length=3
> [junit] Query Found:Doc[0]= A B C
> [junit] Query Found:Doc[1]= A B C D
> [junit] Query Found:Doc[2]= A C D
> [junit] 0.3057859 = (MATCH) product of:
> [junit]   0.6115718 = (MATCH) sum of:
> [junit] 0.6115718 = (MATCH) fieldWeight(c:a in 1), product of:
> [junit]   1.0 = tf(termFreq(c:a)=1)
> [junit]   1.2231436 = idf(docFreq=3)
> [junit]   0.5 = fieldNorm(field=c, doc=1)
> [junit]   0.5 = coord(1/2)
> [junit] 0.3057859 = (MATCH) product of:
> [junit]   0.6115718 = (MATCH) sum of:
> [junit] 0.6115718 = (MATCH) fieldWeight(c:a in 2), product of:
> [junit]   1.0 = tf(termFreq(c:a)=1)
> [junit]   1.2231436 = idf(docFreq=3)
> [junit]   0.5 = fieldNorm(field=c, doc=2)
> [junit]   0.5 = coord(1/2)
> [junit] 0.0 = (NON-MATCH) product of:
> [junit]   0.0 = (NON-MATCH) sum of:
> [junit]   0.0 = coord(0/2)
> [junit] -  ---
> [junit] Testcase: testFAIL(org.apache.lucene.search.TestAornotB):   FAILED
> [junit] resultDocs =A C D expected:<3> but was:<1>
> [junit] junit.framework.AssertionFailedError: resultDocs =A C D 
> expected:<3> but was:<1>
> [junit] at 
> org.apache.lucene.search.TestAornotB.testFAIL(TestAornotB.java:137)
> [junit] Test org.apache.lucene.search.TestAornotB FAILED

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: lucene 2.9 sorting algorithm

2009-10-20 Thread Mark Miller
Actually though - how are we supposed to get back there? I don't think
it's as simple as just not removing the deprecated APIs. Doesn't even
seem close to that simple. It's another nightmare. It would have to be
some serious wins to go through that pain starting at a 3.0 release
wouldn't it? We just had a ton of people switch. We would have to
deprecate a bunch of stuff. Hard to imagine wanting to switch now - the
new API is certainly not that bad.

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: lucene 2.9 sorting algorithm

2009-10-20 Thread Mark Miller
Uwe Schindler wrote:
>> On Tue, Oct 20, 2009 at 8:08 AM, Mark Miller 
>> wrote:
>> 
>>> Hmm - perhaps I'm not remembering right. Or perhaps we had different
>>> motivations ;) I never did anything in 1483 based on search perf - and I
>>> took your tests as testing that we didn't lose perf, not that we gained
>>> any. The fact that there were some wins was just a nice surprise from my
>>> perspective.
>>>
>>> A quote from you in that issue:
>>>
>>> "I didn't expect such performance gain (I was hoping for not much
>>> performance loss, actually). I think it may be that although the
>>> initial value copy adds some cost, the within-queue comparisons are
>>> then faster because you don't have to deref back to the fieldcache
>>> array. It seems we keep accidentally discovering performance gains
>>> here"
>>>
>>> My whole memory of that issue is that we didn't do anything for
>>> performance gains. We just happened to measure a few. It was just to get
>>> to per segment. Was a long time ago though.
>>>   
>> Right, our original motivation was fast reopen time, by doing
>> searching (and collection) per segment so that the field cache is only used
>> at the segment level.
>>
>> But, that required cutting over field sorting, which was tricky.
>>
>> Our first go at it was the multi PQ approach (copying MultiSearcher),
>> but I believe that showed poor performance.  I remember being
>> depressed about it :)  So that poor performance pushed us to work out
>> the new comparator API that uses a single PQ, and, after much
>> iterating, we saw better performance net/net.
>> 
>
> And the new sorting API is in line with the new Collector API! You have a
> setNextReader() method where you e.g. load the FieldCache for the next
> segment and provide the compare functions, and you can also get the scorer. My
> question: what is so hard about using this API? OK, it's more work when
> implementing the Comparator, but it is more intuitive for me if you think in
> terms of per-segment searches. For new users only the bottom comparison and
> so on is strange; the rest is straightforward.
>
> I do not know how this flexibility could be implemented with the old API
> (scorer, reader switch). If we want to switch back to a simpler API,
> we should not switch back to the strange old one (I never understood it
> completely; the new one I understand)! Maybe we can provide an easy-to-use
> default implementation for Comparables in addition to custom sort, which may
> help lots of people who used Comparables with the old API. This impl may be
> slower and more memory intensive than directly implementing the new API, but
> may help.
>
> Uwe
>
>
>
> -
> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-dev-h...@lucene.apache.org
>
>   
That wording had me caught up too, Uwe - "switch back to the old API"
brings a lot of baggage with it :) More accurately, it means switching to
multiple p-queues rather than a single p-queue. The switch to a single
p-queue is why we had to bring in setNextReader and all of that to begin
with. The original approach was actually to work the way MultiSearcher
works and do a merge after using a p-queue for each segment. So if we went
back to that, many of the "new APIs" wouldn't be needed anymore (roughly
sketched below).
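
A rough sketch of that MultiSearcher-style shape, for illustration only: all
class and method names below are invented, this is not the LUCENE-1483 code.
One small queue is filled per segment while collecting, then the queues are
merged into a single top-N at the end.

  import java.util.ArrayList;
  import java.util.Collections;
  import java.util.Comparator;
  import java.util.List;
  import java.util.PriorityQueue;

  public class MultiQueueMergeSketch {

    public static class Hit {
      public final int globalDoc;   // segment docBase + local doc
      public final float sortValue;
      public Hit(int globalDoc, float sortValue) {
        this.globalDoc = globalDoc;
        this.sortValue = sortValue;
      }
    }

    // smallest sortValue = weakest hit, so a min-queue keeps the best n around
    private static final Comparator<Hit> BY_VALUE = new Comparator<Hit>() {
      public int compare(Hit a, Hit b) {
        return Float.compare(a.sortValue, b.sortValue);
      }
    };

    /** Merge the per-segment queues and return the best n hits, best first. */
    public static List<Hit> mergeTopN(List<PriorityQueue<Hit>> perSegmentQueues, int n) {
      PriorityQueue<Hit> merged = new PriorityQueue<Hit>(n, BY_VALUE);
      for (PriorityQueue<Hit> segmentQueue : perSegmentQueues) {
        for (Hit hit : segmentQueue) {
          merged.offer(hit);
          if (merged.size() > n) {
            merged.poll(); // drop the weakest so only the global top n remain
          }
        }
      }
      List<Hit> result = new ArrayList<Hit>(merged);
      Collections.sort(result, Collections.reverseOrder(BY_VALUE)); // best first
      return result;
    }
  }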

I agree that the new API is not too dreadful, but I think part of that
may come from being a more advanced user. Let's just not go back to caching
comparators :)

-- 
- Mark

http://www.lucidimagination.com




-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



RE: lucene 2.9 sorting algorithm

2009-10-20 Thread Uwe Schindler
> On Tue, Oct 20, 2009 at 8:08 AM, Mark Miller 
> wrote:
> > Hmm - perhaps I'm not remembering right. Or perhaps we had different
> > motivations ;) I never did anything in 1483 based on search perf - and I
> > took your tests as testing that we didn't lose perf, not that we gained
> > any. The fact that there were some wins was just a nice surprise from my
> > perspective.
> >
> > A quote from you in that issue:
> >
> > "I didn't expect such performance gain (I was hoping for not much
> > performance loss, actually). I think it may be that although the
> > initial value copy adds some cost, the within-queue comparisons are
> > then faster because you don't have to deref back to the fieldcache
> > array. It seems we keep accidentally discovering performance gains
> > here"
> >
> > My whole memory of that issue is that we didn't do anything for
> > performance gains. We just happened to measure a few. It was just to get
> > to per segment. Was a long time ago though.
> 
> Right, our original motivation was fast reopen time, by doing
> searching (and collection) per segment so that the field cache is only used
> at the segment level.
> 
> But, that required cutting over field sorting, which was tricky.
> 
> Our first go at it was the multi PQ approach (copying MultiSearcher),
> but I believe that showed poor performance.  I remember being
> depressed about it :)  So that poor performance pushed us to work out
> the new comparator API that uses a single PQ, and, after much
> iterating, we saw better performance net/net.

And the new sorting API is in line with the new Collector API! You have a
setNextReader() method where you e.g. load the FieldCache for the next
segment and provide the compare functions, and you can also get the scorer. My
question: what is so hard about using this API? OK, it's more work when
implementing the Comparator, but it is more intuitive for me if you think in
terms of per-segment searches. For new users only the bottom comparison and
so on is strange; the rest is straightforward.

I do not know how this flexibility could be implemented with the old API
(scorer, reader switch). If we want to switch back to a simpler API,
we should not switch back to the strange old one (I never understood it
completely; the new one I understand)! Maybe we can provide an easy-to-use
default implementation for Comparables in addition to custom sort, which may
help lots of people who used Comparables with the old API. This impl may be
slower and more memory intensive than directly implementing the new API, but
may help.

Uwe
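
For illustration, a minimal sketch of the per-segment comparator shape
described above. The method names and signatures (compare, setBottom,
compareBottom, copy, setNextReader, value) are recalled from the 2.9-era
FieldComparator and should be treated as assumptions, not checked API; this
is a sketch, not the shipped IntComparator.

  import java.io.IOException;
  import org.apache.lucene.index.IndexReader;
  import org.apache.lucene.search.FieldCache;
  import org.apache.lucene.search.FieldComparator;

  public class SimpleIntComparator extends FieldComparator {

    private final String field;
    private final int[] values;        // one slot per queue entry, holds copied values
    private int[] currentReaderValues; // FieldCache array for the current segment
    private int bottom;                // value of the weakest entry in the queue

    public SimpleIntComparator(int numHits, String field) {
      this.values = new int[numHits];
      this.field = field;
    }

    public int compare(int slot1, int slot2) {
      // within-queue comparison uses the copied values, no FieldCache deref
      int v1 = values[slot1], v2 = values[slot2];
      return v1 < v2 ? -1 : (v1 > v2 ? 1 : 0);
    }

    public void setBottom(int slot) {
      bottom = values[slot];
    }

    public int compareBottom(int doc) {
      int v = currentReaderValues[doc];
      return bottom < v ? -1 : (bottom > v ? 1 : 0);
    }

    public void copy(int slot, int doc) {
      values[slot] = currentReaderValues[doc];
    }

    public void setNextReader(IndexReader reader, int docBase) throws IOException {
      // load the FieldCache for the next segment, as described above
      currentReaderValues = FieldCache.DEFAULT.getInts(reader, field);
    }

    public Comparable value(int slot) {
      return Integer.valueOf(values[slot]);
    }
  }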



-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: lucene 2.9 sorting algorithm

2009-10-20 Thread Michael McCandless
On Tue, Oct 20, 2009 at 8:21 AM, Mark Miller  wrote:
> Ahhh - I see - way at the top. Man that was early. Had forgotten about
> that stuff even before the issue was finished.

Tell me about it -- impossible to remember these things :)  I wish I
could upgrade the RAM in my brain the way I can in my computers...

Mike

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: lucene 2.9 sorting algorithm

2009-10-20 Thread Michael McCandless
On Tue, Oct 20, 2009 at 8:08 AM, Mark Miller  wrote:
> Hmm - perhaps I'm not remembering right. Or perhaps we had different
> motivations ;) I never did anything in 1483 based on search perf - and I
> took your tests as testing that we didn't lose perf, not that we gained
> any. The fact that there were some wins was just a nice surprise from my
> perspective.
>
> A quote from you in that issue:
>
> "I didn't expect such performance gain (I was hoping for not much
> performance loss, actually). I think it may be that although the
> initial value copy adds some cost, the within-queue comparisons are
> then faster because you don't have to deref back to the fieldcache
> array. It seems we keep accidentally discovering performance gains
> here"
>
> My whole memory of that issue is that we didn't do anything for
> performance gains. We just happened to measure a few. It was just to get
> to per segment. Was a long time ago though.

Right, our original motivation was fast reopen time, by doing
searching (and collection) per segment so that the field cache is only used
at the segment level.

But, that required cutting over field sorting, which was tricky.

Our first go at it was the multi PQ approach (copying MultiSearcher),
but I believe that showed poor performance.  I remember being
depressed about it :)  So that poor performance pushed us to work out
the new comparator API that uses a single PQ, and, after much
iterating, we saw better performance net/net.

Mike

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: lucene 2.9 sorting algorithm

2009-10-20 Thread Mark Miller
Ahhh - I see - way at the top. Man that was early. Had forgotten about
that stuff even before the issue was finished.

Mark Miller wrote:
> Hmm - perhaps I'm not remembering right. Or perhaps we had different
> motivations ;) I never did anything in 1483 based on search perf - and I
> took your tests as testing that we didn't lose perf, not that we gained
> any. The fact that there were some wins was just a nice surprise from my
> perspective.
>
> A quote from you in that issue:
>
> "I didn't expect such performance gain (I was hoping for not much
> performance loss, actually). I think it may be that although the
> initial value copy adds some cost, the within-queue comparisons are
> then faster because you don't have to deref back to the fieldcache
> array. It seems we keep accidentally discovering performance gains
> here"
>
> My whole memory of that issue is that we didn't do anything for
> performance gains. We just happened to measure a few. It was just to get
> to per segment. Was a long time ago though.
>
> Michael McCandless wrote:
>   
>> On Tue, Oct 20, 2009 at 6:51 AM, Mark Miller  wrote:
>>   
>> 
>>> I didn't really follow that thread either - but we didn't move to the new
>>> Comp API because of its performance vs the old.
>>> 
>>>   
>> We did (LUCENE-1483), but those perf tests mixed in a number of other
>> improvements (eg, searching by segment avoids the 2nd pass of
>> MultiTermDocs.read(int[], int[])), whereas John's tests more
>> specifically test the perf difference between single-PQ vs
>> multi-PQ-then-merge (much simpler comparator API).
>>
>> So we are re-scrutinizing that difference... and if the perf gains are
>> minimal or non-existent I think we should still consider going back to
>> the simpler API.
>>
>> I'm working now to set up a full benchmark across real (wikipedia) /
>> synthetic source, different queries, different sorts, balanced vs
>> unbalanced segment sizes, etc.
>>
>> Mike
>>
>> -
>> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-dev-h...@lucene.apache.org
>>
>>   
>> 
>
>
>   


-- 
- Mark

http://www.lucidimagination.com




-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: lucene 2.9 sorting algorithm

2009-10-20 Thread Mark Miller
Hmm - perhaps I'm not remembering right. Or perhaps we had different
motivations ;) I never did anything in 1483 based on search perf - and I
took your tests as testing that we didn't lose perf, not that we gained
any. The fact that there were some wins was just a nice surprise from my
perspective.

A quote from you in that issue:

"I didn't expect such performance gain (I was hoping for not much
performance loss, actually). I think it may be that although the
initial value copy adds some cost, the within-queue comparisons are
then faster because you don't have to deref back to the fieldcache
array. It seems we keep accidentally discovering performance gains
here"

My whole memory of that issue is that we didn't do anything for
performance gains. We just happened to measure a few. It was just to get
to per segment. Was a long time ago though.

Michael McCandless wrote:
> On Tue, Oct 20, 2009 at 6:51 AM, Mark Miller  wrote:
>   
>> I didn't really follow that thread either - but we didn't move to the new
>> Comp API because of its performance vs the old.
>> 
>
> We did (LUCENE-1483), but those perf tests mixed in a number of other
> improvements (eg, searching by segment avoids the 2nd pass of
> MultiTermDocs.read(int[], int[])), whereas John's tests more
> specifically test the perf difference between single-PQ vs
> multi-PQ-then-merge (much simpler comparator API).
>
> So we are re-scrutinizing that difference... and if the perf gains are
> minimal or non-existent I think we should still consider going back to
> the simpler API.
>
> I'm working now to set up a full benchmark across real (wikipedia) /
> synthetic source, different queries, different sorts, balanced vs
> unbalanced segment sizes, etc.
>
> Mike
>
> -
> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-dev-h...@lucene.apache.org
>
>   


-- 
- Mark

http://www.lucidimagination.com




-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Resolved: (LUCENE-1993) MoreLikeThis - allow to exclude terms that appear in too many documents (patch included)

2009-10-20 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless resolved LUCENE-1993.


   Resolution: Fixed
Fix Version/s: 3.0

Thanks Christian!

> MoreLikeThis - allow to exclude terms that appear in too many documents 
> (patch included)
> 
>
> Key: LUCENE-1993
> URL: https://issues.apache.org/jira/browse/LUCENE-1993
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/*
>Affects Versions: 2.9
>Reporter: Christian Steinert
>Assignee: Michael McCandless
> Fix For: 3.0
>
> Attachments: MoreLikeThis.java.patch
>
>   Original Estimate: 0.17h
>  Remaining Estimate: 0.17h
>
> The MoreLikeThis class makes it possible to generate a likeness query based on a given 
> document. So far, there is no way to suppress words that appear in almost all 
> documents from the likeness query, which makes it necessary to use extensive 
> lists of stop words.
> Therefore I suggest allowing words to be excluded once a certain absolute 
> document count or a certain percentage of documents is exceeded. Depending on 
> the corpus of text, words that appear in more than 50 or even 70% of 
> documents can usually be considered insignificant for classifying a document. 
>  
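
For readers of the archive, a hedged sketch of how the proposed knob might be
used from the caller's side; the setter name setMaxDocFreqPct is an assumption
taken from this discussion, not confirmed API, and the field name "body" is
arbitrary.

  import java.io.IOException;
  import org.apache.lucene.index.IndexReader;
  import org.apache.lucene.search.Query;
  import org.apache.lucene.search.similar.MoreLikeThis;

  public class MoreLikeThisSketch {
    public static Query likeDoc(IndexReader reader, int docId) throws IOException {
      MoreLikeThis mlt = new MoreLikeThis(reader);
      mlt.setFieldNames(new String[] { "body" });
      // skip terms that occur in more than half of all documents,
      // instead of maintaining a long stop-word list (assumed setter name)
      mlt.setMaxDocFreqPct(50);
      return mlt.like(docId);
    }
  }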

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1993) MoreLikeThis - allow to exclude terms that appear in too many documents (patch included)

2009-10-20 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12767781#action_12767781
 ] 

Michael McCandless commented on LUCENE-1993:


Patch looks good... I'll commit shortly.

> MoreLikeThis - allow to exclude terms that appear in too many documents 
> (patch included)
> 
>
> Key: LUCENE-1993
> URL: https://issues.apache.org/jira/browse/LUCENE-1993
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/*
>Affects Versions: 2.9
>Reporter: Christian Steinert
>Assignee: Michael McCandless
> Attachments: MoreLikeThis.java.patch
>
>   Original Estimate: 0.17h
>  Remaining Estimate: 0.17h
>
> The MoreLikeThis class makes it possible to generate a likeness query based on a given 
> document. So far, there is no way to suppress words that appear in almost all 
> documents from the likeness query, which makes it necessary to use extensive 
> lists of stop words.
> Therefore I suggest allowing words to be excluded once a certain absolute 
> document count or a certain percentage of documents is exceeded. Depending on 
> the corpus of text, words that appear in more than 50 or even 70% of 
> documents can usually be considered insignificant for classifying a document. 
>  

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Assigned: (LUCENE-1993) MoreLikeThis - allow to exclude terms that appear in too many documents (patch included)

2009-10-20 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless reassigned LUCENE-1993:
--

Assignee: Michael McCandless

> MoreLikeThis - allow to exclude terms that appear in too many documents 
> (patch included)
> 
>
> Key: LUCENE-1993
> URL: https://issues.apache.org/jira/browse/LUCENE-1993
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/*
>Affects Versions: 2.9
>Reporter: Christian Steinert
>Assignee: Michael McCandless
> Attachments: MoreLikeThis.java.patch
>
>   Original Estimate: 0.17h
>  Remaining Estimate: 0.17h
>
> The MoreLikeThis class makes it possible to generate a likeness query based on a given 
> document. So far, there is no way to suppress words that appear in almost all 
> documents from the likeness query, which makes it necessary to use extensive 
> lists of stop words.
> Therefore I suggest allowing words to be excluded once a certain absolute 
> document count or a certain percentage of documents is exceeded. Depending on 
> the corpus of text, words that appear in more than 50 or even 70% of 
> documents can usually be considered insignificant for classifying a document. 
>  

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1987) Remove rest of analysis deprecations (Token, CharacterCache)

2009-10-20 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12767779#action_12767779
 ] 

Michael McCandless commented on LUCENE-1987:


bq. How to handle the problem with LUCENE_29 setting and the posIncr of 
stopwords together with QueryParser that has a default setting of ignoring 
posIncr?

How about adding a required Version to the QP ctor?
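
For illustration, a hypothetical sketch of that suggestion from the caller's
side; the Version-taking QueryParser constructor shown here does not exist at
the time of this thread and is only meant to show the shape of the proposal.

  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.queryParser.QueryParser;
  import org.apache.lucene.util.Version;

  public class VersionedQueryParserSketch {
    public static QueryParser newParser() {
      StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_29);
      // Hypothetical ctor: with a required Version, the parser could honor
      // position increments by default for LUCENE_29+ and keep the old
      // behavior for earlier versions.
      return new QueryParser(Version.LUCENE_29, "body", analyzer);
    }
  }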

> Remove rest of analysis deprecations (Token, CharacterCache)
> 
>
> Key: LUCENE-1987
> URL: https://issues.apache.org/jira/browse/LUCENE-1987
> Project: Lucene - Java
>  Issue Type: Task
>  Components: Analysis
>Reporter: Uwe Schindler
>Assignee: Uwe Schindler
> Fix For: 2.9.1, 3.0
>
> Attachments: LUCENE-1987-StopFilter-backport29.patch, 
> LUCENE-1987-StopFilter-BW.patch, LUCENE-1987-StopFilter.patch, 
> LUCENE-1987-StopFilter.patch, LUCENE-1987-StopFilter.patch, 
> LUCENE-1987-StopFilter.patch, LUCENE-1987.patch, LUCENE-1987.patch, 
> LUCENE-1987.patch
>
>
> This removes the rest of the deprecations in the analysis package:
> - -Token's termText field-- (DONE)
> - -eventually un-deprecate ctors of Token taking Strings (they are still 
> useful) -> if yes remove deprec in 2.9.1- (DONE)
> - -remove CharacterCache and use Character.valueOf() from Java5- (DONE)
> - Stopwords lists
> - Remove the backwards-compatibility settings from analyzers (acronym, posIncr, ...). They 
> are deprecated, but we still have the VERSION constants. I do not know how to 
> proceed. Keep the settings alive for index compatibility? Or remove them 
> together with the version constants (which were undeprecated).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: lucene 2.9 sorting algorithm

2009-10-20 Thread Michael McCandless
On Tue, Oct 20, 2009 at 6:51 AM, Mark Miller  wrote:
> I didn't really follow that thread either - but we didn't move to the new
> Comp API because of its performance vs the old.

We did (LUCENE-1483), but those perf tests mixed in a number of other
improvements (eg, searching by segment avoids the 2nd pass of
MultiTermDocs.read(int[], int[])), whereas John's tests more
specifically test the perf difference between single-PQ vs
multi-PQ-then-merge (much simpler comparator API).

So we are re-scrutinizing that difference... and if the perf gains are
minimal or non-existent I think we should still consider going back to
the simpler API.

I'm working now to set up a full benchmark across real (wikipedia) /
synthetic source, different queries, different sorts, balanced vs
unbalanced segment sizes, etc.

Mike

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: lucene 2.9 sorting algorithm

2009-10-20 Thread Mark Miller
I didn't really follow that thread either - but we didn't move to the
new Comp API because of its performance vs the old.


- Mark

http://www.lucidimagination.com (mobile)

On Oct 20, 2009, at 4:22 AM, "Uwe Schindler"  wrote:

I did not follow the whole thread, but I do not understand what's bad with
the new API that justifies preserving the old one. The old API does not fit
very well with segment-based search, and a lot of ugly stuff was done around
it to make both APIs work the same.


For me it is not very complicated to create a new-style Comparator.  
The only difference is that you have to implement more methods for  
the comparison, but if you e.g. take the provided comparators for  
the basic data types as a base, it is easy to understand how it  
works and you can modify the examples.


And: as far as I know, the old API is not really segment-wise, so
reopen() cost is much higher and FieldCache gets slower, because the
top-level reader must be reloaded into the cache, not the segments.


Uwe
-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

From: Jake Mannix [mailto:jake.man...@gmail.com]
Sent: Tuesday, October 20, 2009 8:37 AM
To: java-dev@lucene.apache.org
Subject: Re: lucene 2.9 sorting algorithm

Given that this new API is pretty unwieldy, and seems to not
actually perform any better than the old one... are we going to  
consider revisiting that?


  -jake

On Mon, Oct 19, 2009 at 11:27 PM, Uwe Schindler   
wrote:

The old search API is already removed in trunk…

Uwe
-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

From: John Wang [mailto:john.w...@gmail.com]
Sent: Tuesday, October 20, 2009 3:28 AM
To: java-dev@lucene.apache.org
Subject: Re: lucene 2.9 sorting algorithm

Hi Michael:

 Was wondering if you got a chance to take a look at this.

 Since deprecated APIs are being removed in 3.0, I was wondering  
if/when we would decide on keeping the ScoreDocComparator API and  
thus would be kept for Lucene 3.0.


Thanks

-John

On Fri, Oct 16, 2009 at 9:53 AM, Michael McCandless > wrote:

Oh, no problem...

Mike

On Fri, Oct 16, 2009 at 12:33 PM, John Wang   
wrote:

> Mike, just a clarification on my first perf report email.
> The first section, numHits is incorrectly labeled, it should be 20  
instead

> of 50. Sorry about the possible confusion.
> Thanks
> -John
>
> On Fri, Oct 16, 2009 at 3:21 AM, Michael McCandless
>  wrote:
>>
>> Thanks John; I'll have a look.
>>
>> Mike
>>
>> On Fri, Oct 16, 2009 at 12:57 AM, John Wang   
wrote:

>> > Hi Michael:
>> > I added classes: ScoreDocComparatorQueue and  
OneSortNoScoreCollector

>> > as
>> > a more general case. I think keeping the old api for  
ScoreDocComparator

>> > and
>> > SortComparatorSource would work.
>> >   Please take a look.
>> > Thanks
>> > -John
>> >
>> > On Thu, Oct 15, 2009 at 6:52 PM, John Wang  
 wrote:

>> >>
>> >> Hi Michael:
>> >>  It is open, http://code.google.com/p/lucene-book/source/checkout
>> >>  I think I sent the https url instead, sorry.
>> >> The multi PQ sorting is fairly self-contained, I have 2  
versions, 1

>> >> for string and 1 for int, each are Collector impls.
>> >>  I shouldn't say the Multi Q is faster on int sort, it is  
within

>> >> the
>> >> error boundary. The diff is very very small, I would stay they  
are more

>> >> equal.
>> >>  If you think it is a good thing to go this way, (if not  
for the

>> >> perf,
>> >> just for the simpler api) I'd be happy to work on a patch.
>> >> Thanks
>> >> -John
>> >> On Thu, Oct 15, 2009 at 5:18 PM, Michael McCandless
>> >>  wrote:
>> >>>
>> >>> John, looks like this requires login -- any plans to open  
that up, or,

>> >>> post the code on an issue?
>> >>>
>> >>> How self-contained is your Multi PQ sorting?  EG is it a  
standalone

>> >>> Collector impl that I can test?
>> >>>
>> >>> Mike
>> >>>
>> >>> On Thu, Oct 15, 2009 at 6:33 PM, John Wang  


>> >>> wrote:
>> >>> > BTW, we are have a little sandbox for these experiments.  
And all my

>> >>> > testcode
>> >>> > are at. They are not very polished.
>> >>> >
>> >>> > https://lucene-book.googlecode.com/svn/trunk
>> >>> >
>> >>> > -John
>> >>> >
>> >>> > On Thu, Oct 15, 2009 at 3:29 PM, John Wang >

>> >>> > wrote:
>> >>> >>
>> >>> >> Numbers Mike requested for Int types:
>> >>> >>
>> >>> >> only the time/cputime are posted, others are all the same  
since the

>> >>> >> algorithm is the same.
>> >>> >>
>> >>> >> Lucene 2.9:
>> >>> >> numhits: 10
>> >>> >> time: 14619495
>> >>> >> cpu: 146126
>> >>> >>
>> >>> >> numhits: 20
>> >>> >> time: 14550568
>> >>> >> cpu: 163242
>> >>> >>
>> >>> >> numhits: 100
>> >>> >> time: 16467647
>> >>> >> cpu: 178379
>> >>> >>
>> >>> >>
>> >>> >> my test:
>> >>> >> numHits: 10
>> >>> >> time: 14101094
>> >>> >> cpu: 144715
>> >>> >>
>> >>> >> numHits: 20
>> >>> >> time: 14804821
>> >>> >> cpu: 151305
>> >>> >>
>> >>> >> nu

Re: [Lucene-java Wiki] Update of "LuceneAtApacheConUs2009" by HossMan

2009-10-20 Thread Karl Wettin


20 okt 2009 kl. 07.15 skrev Apache Wiki:

+ There will be a Lucene/Search !MeetUp on Tuesday night at 8PM.   
'This event is open to anyone who wants to come, even if you are  
not registered for the conference'.


That is a really nice thing, and completely new if I'm not mistaken.
Perhaps even worth advertising as news on the front page.



   karl

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Resolved: (LUCENE-1995) ArrayIndexOutOfBoundsException during indexing

2009-10-20 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless resolved LUCENE-1995.


Resolution: Fixed

Thanks Aaron!  Maybe someday Lucene will allow a larger RAM buffer than 2GB...

> ArrayIndexOutOfBoundsException during indexing
> --
>
> Key: LUCENE-1995
> URL: https://issues.apache.org/jira/browse/LUCENE-1995
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Index
>Affects Versions: 2.9
>Reporter: Yonik Seeley
>Assignee: Michael McCandless
> Fix For: 2.9.1
>
>
> http://search.lucidimagination.com/search/document/f29fc52348ab9b63/arrayindexoutofboundsexception_during_indexing

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: lucene 2.9 sorting algorithm

2009-10-20 Thread Michael McCandless
Sorry, I have been digging into it, just didn't get far enough to post
patch/results.  I'll try to do so today.

I did find one bug in OneSortNoScoreCollector, in the inner compare() method
of getTop(): to break ties it should be:

  if (v==0) {
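    // break the tie on absolute docID: each hit adds its own segment base to its local doc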
v = o1.doc + o1.comparatorQueue._base - o2.doc - o2.comparatorQueue._base;
  }

not

  if (v==0) {
v=o1.doc+o2.comparatorQueue._base-o2.doc+o2.comparatorQueue._base;
  }

I've folded the "multi PQ" approach into contrib/benchmark so we can
run our tests...

Mike

On Mon, Oct 19, 2009 at 9:28 PM, John Wang  wrote:
> Hi Michael:
>      Was wondering if you got a chance to take a look at this.
>      Since deprecated APIs are being removed in 3.0, I was wondering if/when
> we would decide on keeping the ScoreDocComparator API and thus would be kept
> for Lucene 3.0.
> Thanks
> -John
>
> On Fri, Oct 16, 2009 at 9:53 AM, Michael McCandless
>  wrote:
>>
>> Oh, no problem...
>>
>> Mike
>>
>> On Fri, Oct 16, 2009 at 12:33 PM, John Wang  wrote:
>> > Mike, just a clarification on my first perf report email.
>> > The first section, numHits is incorrectly labeled, it should be 20
>> > instead
>> > of 50. Sorry about the possible confusion.
>> > Thanks
>> > -John
>> >
>> > On Fri, Oct 16, 2009 at 3:21 AM, Michael McCandless
>> >  wrote:
>> >>
>> >> Thanks John; I'll have a look.
>> >>
>> >> Mike
>> >>
>> >> On Fri, Oct 16, 2009 at 12:57 AM, John Wang 
>> >> wrote:
>> >> > Hi Michael:
>> >> >     I added classes: ScoreDocComparatorQueue
>> >> > and OneSortNoScoreCollector
>> >> > as
>> >> > a more general case. I think keeping the old api for
>> >> > ScoreDocComparator
>> >> > and
>> >> > SortComparatorSource would work.
>> >> >   Please take a look.
>> >> > Thanks
>> >> > -John
>> >> >
>> >> > On Thu, Oct 15, 2009 at 6:52 PM, John Wang 
>> >> > wrote:
>> >> >>
>> >> >> Hi Michael:
>> >> >>      It is
>> >> >> open, http://code.google.com/p/lucene-book/source/checkout
>> >> >>      I think I sent the https url instead, sorry.
>> >> >>     The multi PQ sorting is fairly self-contained, I have 2
>> >> >> versions, 1
>> >> >> for string and 1 for int, each are Collector impls.
>> >> >>      I shouldn't say the Multi Q is faster on int sort, it is within
>> >> >> the
>> >> >> error boundary. The diff is very very small, I would stay they are
>> >> >> more
>> >> >> equal.
>> >> >>      If you think it is a good thing to go this way, (if not for the
>> >> >> perf,
>> >> >> just for the simpler api) I'd be happy to work on a patch.
>> >> >> Thanks
>> >> >> -John
>> >> >> On Thu, Oct 15, 2009 at 5:18 PM, Michael McCandless
>> >> >>  wrote:
>> >> >>>
>> >> >>> John, looks like this requires login -- any plans to open that up,
>> >> >>> or,
>> >> >>> post the code on an issue?
>> >> >>>
>> >> >>> How self-contained is your Multi PQ sorting?  EG is it a standalone
>> >> >>> Collector impl that I can test?
>> >> >>>
>> >> >>> Mike
>> >> >>>
>> >> >>> On Thu, Oct 15, 2009 at 6:33 PM, John Wang 
>> >> >>> wrote:
>> >> >>> > BTW, we are have a little sandbox for these experiments. And all
>> >> >>> > my
>> >> >>> > testcode
>> >> >>> > are at. They are not very polished.
>> >> >>> >
>> >> >>> > https://lucene-book.googlecode.com/svn/trunk
>> >> >>> >
>> >> >>> > -John
>> >> >>> >
>> >> >>> > On Thu, Oct 15, 2009 at 3:29 PM, John Wang 
>> >> >>> > wrote:
>> >> >>> >>
>> >> >>> >> Numbers Mike requested for Int types:
>> >> >>> >>
>> >> >>> >> only the time/cputime are posted, others are all the same since
>> >> >>> >> the
>> >> >>> >> algorithm is the same.
>> >> >>> >>
>> >> >>> >> Lucene 2.9:
>> >> >>> >> numhits: 10
>> >> >>> >> time: 14619495
>> >> >>> >> cpu: 146126
>> >> >>> >>
>> >> >>> >> numhits: 20
>> >> >>> >> time: 14550568
>> >> >>> >> cpu: 163242
>> >> >>> >>
>> >> >>> >> numhits: 100
>> >> >>> >> time: 16467647
>> >> >>> >> cpu: 178379
>> >> >>> >>
>> >> >>> >>
>> >> >>> >> my test:
>> >> >>> >> numHits: 10
>> >> >>> >> time: 14101094
>> >> >>> >> cpu: 144715
>> >> >>> >>
>> >> >>> >> numHits: 20
>> >> >>> >> time: 14804821
>> >> >>> >> cpu: 151305
>> >> >>> >>
>> >> >>> >> numHits: 100
>> >> >>> >> time: 15372157
>> >> >>> >> cpu time: 158842
>> >> >>> >>
>> >> >>> >> Conclusions:
>> >> >>> >> The are very similar, the differences are all within error
>> >> >>> >> bounds,
>> >> >>> >> especially with lower PQ sizes, which second sort alg again
>> >> >>> >> slightly
>> >> >>> >> faster.
>> >> >>> >>
>> >> >>> >> Hope this helps.
>> >> >>> >>
>> >> >>> >> -John
>> >> >>> >>
>> >> >>> >>
>> >> >>> >> On Thu, Oct 15, 2009 at 3:04 PM, Yonik Seeley
>> >> >>> >> 
>> >> >>> >> wrote:
>> >> >>> >>>
>> >> >>> >>> On Thu, Oct 15, 2009 at 5:33 PM, Michael McCandless
>> >> >>> >>>  wrote:
>> >> >>> >>> > Though it'd be odd if the switch to searching by segment
>> >> >>> >>> > really was most of the gains here.
>> >> >>> >>>
>> >> >>> >>> I had assumed that much of the improvement was due to ditching
>> >> >>> >>> MultiTermEnum/MultiTermDocs.
>> >> >>> >>> Note that LUCENE-1

RE: lucene 2.9 sorting algorithm

2009-10-20 Thread Uwe Schindler
I did not follow the whole thread, but I do not understand what's bad with
the new API that justifies preserving the old one. The old API does not fit
very well with segment-based search, and a lot of ugly stuff was done around
it to make both APIs work the same.

 

For me it is not very complicated to create a new-style Comparator. The only
difference is that you have to implement more methods for the comparison,
but if you e.g. take the provided comparators for the basic data types as a
base, it is easy to understand how it works and you can modify the examples.

 

And: as far as I know, the old API is not really segment-wise, so reopen()
cost is much higher and FieldCache gets slower, because the top-level reader
must be reloaded into the cache, not the segments.

 

Uwe

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

  _  

From: Jake Mannix [mailto:jake.man...@gmail.com] 
Sent: Tuesday, October 20, 2009 8:37 AM
To: java-dev@lucene.apache.org
Subject: Re: lucene 2.9 sorting algorithm

 

Given that this new API is pretty unwieldy, and seems to not actually
perform any better than the old one... are we going to consider revisiting
that?

  -jake

On Mon, Oct 19, 2009 at 11:27 PM, Uwe Schindler  wrote:

The old search API is already removed in trunk.

 

Uwe

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

  _  

From: John Wang [mailto:john.w...@gmail.com] 
Sent: Tuesday, October 20, 2009 3:28 AM
To: java-dev@lucene.apache.org
Subject: Re: lucene 2.9 sorting algorithm

 

Hi Michael:

 

 Was wondering if you got a chance to take a look at this.

 

 Since deprecated APIs are being removed in 3.0, I was wondering if/when
we would decide on keeping the ScoreDocComparator API and thus would be kept
for Lucene 3.0.

 

Thanks

 

-John

On Fri, Oct 16, 2009 at 9:53 AM, Michael McCandless
 wrote:

Oh, no problem...

Mike


On Fri, Oct 16, 2009 at 12:33 PM, John Wang  wrote:
> Mike, just a clarification on my first perf report email.
> The first section, numHits is incorrectly labeled, it should be 20 instead
> of 50. Sorry about the possible confusion.
> Thanks
> -John
>
> On Fri, Oct 16, 2009 at 3:21 AM, Michael McCandless
>  wrote:
>>
>> Thanks John; I'll have a look.
>>
>> Mike
>>
>> On Fri, Oct 16, 2009 at 12:57 AM, John Wang  wrote:
>> > Hi Michael:
>> > I added classes: ScoreDocComparatorQueue and
OneSortNoScoreCollector
>> > as
>> > a more general case. I think keeping the old api for ScoreDocComparator
>> > and
>> > SortComparatorSource would work.
>> >   Please take a look.
>> > Thanks
>> > -John
>> >
>> > On Thu, Oct 15, 2009 at 6:52 PM, John Wang  wrote:
>> >>
>> >> Hi Michael:
>> >>  It is open, http://code.google.com/p/lucene-book/source/checkout
>> >>  I think I sent the https url instead, sorry.
>> >> The multi PQ sorting is fairly self-contained, I have 2 versions,
1
>> >> for string and 1 for int, each are Collector impls.
>> >>  I shouldn't say the Multi Q is faster on int sort, it is within
>> >> the
>> >> error boundary. The diff is very very small, I would stay they are
more
>> >> equal.
>> >>  If you think it is a good thing to go this way, (if not for the
>> >> perf,
>> >> just for the simpler api) I'd be happy to work on a patch.
>> >> Thanks
>> >> -John
>> >> On Thu, Oct 15, 2009 at 5:18 PM, Michael McCandless
>> >>  wrote:
>> >>>
>> >>> John, looks like this requires login -- any plans to open that up,
or,
>> >>> post the code on an issue?
>> >>>
>> >>> How self-contained is your Multi PQ sorting?  EG is it a standalone
>> >>> Collector impl that I can test?
>> >>>
>> >>> Mike
>> >>>
>> >>> On Thu, Oct 15, 2009 at 6:33 PM, John Wang 
>> >>> wrote:
>> >>> > BTW, we are have a little sandbox for these experiments. And all my
>> >>> > testcode
>> >>> > are at. They are not very polished.
>> >>> >
>> >>> > https://lucene-book.googlecode.com/svn/trunk
>> >>> >
>> >>> > -John
>> >>> >
>> >>> > On Thu, Oct 15, 2009 at 3:29 PM, John Wang 
>> >>> > wrote:
>> >>> >>
>> >>> >> Numbers Mike requested for Int types:
>> >>> >>
>> >>> >> only the time/cputime are posted, others are all the same since
the
>> >>> >> algorithm is the same.
>> >>> >>
>> >>> >> Lucene 2.9:
>> >>> >> numhits: 10
>> >>> >> time: 14619495
>> >>> >> cpu: 146126
>> >>> >>
>> >>> >> numhits: 20
>> >>> >> time: 14550568
>> >>> >> cpu: 163242
>> >>> >>
>> >>> >> numhits: 100
>> >>> >> time: 16467647
>> >>> >> cpu: 178379
>> >>> >>
>> >>> >>
>> >>> >> my test:
>> >>> >> numHits: 10
>> >>> >> time: 14101094
>> >>> >> cpu: 144715
>> >>> >>
>> >>> >> numHits: 20
>> >>> >> time: 14804821
>> >>> >> cpu: 151305
>> >>> >>
>> >>> >> numHits: 100
>> >>> >> time: 15372157
>> >>> >> cpu time: 158842
>> >>> >>
>> >>> >> Conclusions:
>> >>> >> The are very similar, the differences are all within error bounds,
>> >>> >> especially with lower PQ sizes, which second sort alg again
>> >>> >> s