[jira] Commented: (LUCENE-1410) PFOR implementation

2009-03-23 Thread Eks Dev (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12688284#action_12688284
 ] 

Eks Dev commented on LUCENE-1410:
-

It looks like Google went there as well (block encoding); see:

Blog: http://blogs.sun.com/searchguy/entry/google_s_postings_format
Slides: http://research.google.com/people/jeff/WSDM09-keynote.pdf (slides 47-63)



 PFOR implementation
 ---

 Key: LUCENE-1410
 URL: https://issues.apache.org/jira/browse/LUCENE-1410
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Other
Reporter: Paul Elschot
Priority: Minor
 Attachments: autogen.tgz, LUCENE-1410b.patch, LUCENE-1410c.patch, 
 LUCENE-1410d.patch, LUCENE-1410e.patch, TermQueryTests.tgz, TestPFor2.java, 
 TestPFor2.java, TestPFor2.java

   Original Estimate: 21840h
  Remaining Estimate: 21840h

 Implementation of Patched Frame of Reference.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: Is TopDocCollector's collect() implementation correct?

2009-03-23 Thread Michael McCandless
 If we're already creating a new TopScoreDocCollector (when was it
 added?  I must have been dozing off while this happened...)

This was LUCENE-1483.

 How about if we introduce an abstract ScoringCollector (about the
 name later) which implements topDocs() and getTotalHits() and there
 will be several implementations of it, such as:
 TopScoreDocCollector, which sorts the documents by their score, in
 descending order only, TopFieldDocCollector - for sorting by fields,
 and additional sort-by collectors.

This sounds good... but the challenge is we also need to get both
HitCollector and MultiReaderHitCollector in there.

HitCollector is the simplest way to create a custom collector.
MultiReaderHitCollector (added with LUCENE-1483) is the more
performant way, since it lets your collector operate per-segment.  All
non-deprecated core collectors in Lucene now subclass
MultiReaderHitCollector.

So would we make separate subclasses for each of them to add
getTotalHits() / topDocs()?  EG TopDocsHitCollector and
TopDocsMultiReaderHitCollector?  It's getting confusing.

Or maybe we just add totalHits() and topDocs() to HitCollector even
though for advanced case (non-top-N-collection) the methods would not
be used?

Or... maybe this is a time when an interface is the lesser evil: we
could make a TopDocs interface that the necessary classes implement?
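
Roughly, the interface option might look like this (a sketch only; the name is
hypothetical, chosen to avoid clashing with the existing TopDocs result class):

import org.apache.lucene.search.TopDocs;

/** Hypothetical sketch: implemented only by the TopXXXCollector classes,
 *  whichever of HitCollector / MultiReaderHitCollector they extend. */
public interface TopDocsSource {
  /** Total number of hits seen during collection. */
  int getTotalHits();

  /** The collected top N hits. */
  TopDocs topDocs();
}

Callers that just want a custom HitCollector are unaffected; only the top-N
collectors would implement it.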

Mike

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: Modularization

2009-03-23 Thread Michael McCandless
Michael Busch busch...@gmail.com wrote:

 And I don't think the sudden separation of core vs contrib
 should be so prominent (or even visible); it's really a detail of
 how we manage source control.

 When looking at the website I'd like read that Lucene can do hit
 highlighting, powerful query parsing, spell checking, analyze
 different languages, etc.  I could care less that some of these
 happen to live under a contrib subdirectory somewhere in the
 source control system.

 OK, so I think we all agree about the packaging. But I believe it is
 also important how the source code is organized. Maybe Lucene
 consumers don't care too much, however, Lucene is an open source
 project. So we also want to attract possible contributors with a
 nicely organized code base. If there is a clear separation between
 the different components on a source code level, becoming familiar
 with Lucene as a contributor might not be so overwhelming.

+1

We want the source code to be well organized: consumability by Lucene
developers (not just Lucene users) is also important for Lucene's
future growth.

 Besides that, I think a one-to-one mapping between the packaging and
 the source code has no disadvantages. (and it would certainly make
 the build scripts easier!)

Right.

So, towards that... why even break out contrib vs core, in source
control?  Can't we simply migrate contrib/* into core, in the right
places?

 Could we, instead, adopt some standard way (in the package
 javadocs) of stating the maturity/activity/back compat policies/etc
 of a given package?

 This makes sense; e.g. we could release new modules as beta versions
 (= use at own risk, no backwards-compatibility).

In fact we already have a 2.9 Jira issue opened to better document the
back-compat/JDK version requirements of all packages.

I think, like we've done with core lately when a new feature is added,
we could have the default assumption be full back compatibility, but
then those classes/methods/packages that are very new and may change
simply say so clearly in their javadocs.

 And if we start a new module (e.g. a GSoC project) we could exclude
 it from a release easily if it's truly experimental and not in a
 release-able state.

Right.

 So I think the beginnings of a rough proposal is taking shape, for
3.0:

   1. Fix web site to give a better intro to Lucene's features,
   without exposing the false (to the Lucene consumer) core
   vs. contrib distinction

   2. When releasing, we make a single JAR holding core & contrib
   classes for a given area.  The final JAR files don't contain a
   core vs contrib distinction.

   3. We create a bundled JAR that has the common packages
   typically needed (index/search core, analyzers, queries,
   highlighter, spellchecker)

 +1 to all three points.

OK.

So I guess I'm proposing adding:

   4. Move contrib/* under src/java/*, updating the javadocs to state
   back compatibility promises per class/package.

I think net/net this'd be a great simplification?

Mike

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: Modularization

2009-03-23 Thread Yonik Seeley
On Mon, Mar 23, 2009 at 11:10 AM, Michael McCandless
luc...@mikemccandless.com wrote:
   4. Move contrib/* under src/java/*, updating the javadocs to state
       back compatibility promises per class/package.

- contrib has always had a lower bar and stuff was committed under
that lower bar - there should be no blanket promotion.
- contrib items may have different dependencies... putting it all
under the same source root can make a developers job harder
- many contrib items are less related to lucene-java core indexing and
searching... if there is no contrib, then they don't belong in the
lucene-java project at all.
- right now it's clear - core can't have dependencies on non-core
classes.  If everything is stuck in the same source tree, that goes
away.

I think there are a lot of benefits to continue considering very
carefully if something is core or not.

-Yonik

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Resolved: (LUCENE-1555) Deadlock while optimize

2009-03-23 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless resolved LUCENE-1555.


Resolution: Incomplete

Need more details here.

 Deadlock while optimize
 ---

 Key: LUCENE-1555
 URL: https://issues.apache.org/jira/browse/LUCENE-1555
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Affects Versions: 2.4
 Environment: ubuntu 8.04, java 1.6 update 07, Lucene 2.4.0
Reporter: Stefan Heidrich
Assignee: Michael McCandless

 Sometimes after starting the indexer thread, it hangs; the following thread 
 dumps show where.
 Thread [Lucene Merge Thread #0] (Suspended)  
   IndexWriter.commitMerge(MergePolicy$OneMerge, SegmentMerger, int) Line: 
 3751
   IndexWriter.mergeMiddle(MergePolicy$OneMerge) Line: 4240
   IndexWriter.merge(MergePolicy$OneMerge) Line: 3877  
   ConcurrentMergeScheduler.doMerge(MergePolicy$OneMerge) Line: 205
   ConcurrentMergeScheduler$MergeThread.run() Line: 260
 Thread [Indexer] (Suspended) 
   Object.wait(long) Line: not available [native method]  
   IndexWriter.doWait() Line: 4491 
   IndexWriter.optimize(int, boolean) Line: 2268   
   IndexWriter.optimize(boolean) Line: 2203
   IndexWriter.optimize() Line: 2183   
   Indexer.run() Line: 263 
 If you need more information, please let me know.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: Is TopDocCollector's collect() implementation correct?

2009-03-23 Thread Shai Erera
OK, I missed LUCENE-1483 completely.

As a side comment, why not add setNextReader to HitCollector and then a
getDocId(int doc) method which will do the doc + base arithmetic? I think
it's very easy for someone to forget to add that (+ base) to doc. You could
then just change TopDocCollector to call getDocId() instead of duplicating
it into TopScoreDocCollector.

Isn't that something you'd want all HitCollector implementations to use? I
consider some extensions of HitCollector we have - we now will probably want
to change them to extend MultiReaderHitCollector, but we'll have to remember
to do that +base arithmetic everywhere, instead of calling getDocId(). I
understand that changing the call to getDocId is the same as adding +
base, from an effort perspective, but I think it's better this way. It does
involve an additional method call, but I wonder how good compilers will
handle that.
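
To make the suggestion concrete, here is a rough sketch (illustrative names
and signatures, not the actual API):

import org.apache.lucene.index.IndexReader;

public abstract class BaseTrackingCollector {
  private int docBase;

  /** Called before collecting each segment. */
  public void setNextReader(IndexReader reader, int docBase) {
    this.docBase = docBase;
  }

  /** Centralizes the doc + base arithmetic so subclasses can't forget it. */
  protected final int getDocId(int doc) {
    return docBase + doc;
  }

  /** Subclasses still receive the segment-relative doc id. */
  public abstract void collect(int doc, float score);
}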

Anyway, I don't want to add topDocs and getTotalHits to HitCollector, it
will destroy its generic purpose. An interface is also problematic, as it
just means all of these collectors have these methods declared, but they
need to implement them. An abstract class gives you both.

So in case you agree that the logic of MultiReaderHitCollector can (and
should?) be in HitCollector, we can create an abstract class called
ScoringCollector (or if nobody objects TopDocsCollector) which will
implement these two methods.
In case you disagree, we can have that abstract class extend
MultiReaderHitCollector instead.

I'm in favor of the first option since, at least as it looks in the code,
HitCollector is not extended by any class anymore, except TopDocCollector
which is marked as deprecated, and 3 anonymous implementations. So it looks
like HitCollector itself is deprecated as far as the Lucene core code sees
it.

What do you think?

Shai

On Mon, Mar 23, 2009 at 4:43 PM, Michael McCandless 
luc...@mikemccandless.com wrote:

  If we're already creating a new TopScoreDocCollector (when was it
  added?  I must have been dozing off while this happened...)

 This was LUCENE-1483.

  How about if we introduce an abstract ScoringCollector (about the
  name later) which implements topDocs() and getTotalHits() and there
  will be several implementations of it, such as:
  TopScoreDocCollector, which sorts the documents by their score, in
  descending order only, TopFieldDocCollector - for sorting by fields,
  and additional sort-by collectors.

 This sounds good... but the challenge is we also need to get both
 HitCollector and MultiReaderHitCollector in there.

 HitCollector is the simplest way to create a custom collector.
 MultiReaderHitCollector (added with LUCENE-1483) is the more
 performant way, since it lets your collector operate per-segment.  All
 non-deprecated core collectors in Lucene now subclass
 MultiReaderHitCollector.

 So would we make separate subclasses for each of them to add
 getTotalHits() / topDocs()?  EG TopDocsHitCollector and
 TopDocsMultiReaderHitCollector?  It's getting confusing.

 Or maybe we just add totalHits() and topDocs() to HitCollector even
 though for advanced case (non-top-N-collection) the methods would not
 be used?

 Or... maybe this is a time when an interface is the lesser evil: we
 could make a TopDocs interface that the necessary classes implement?

 Mike

 -
 To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-dev-h...@lucene.apache.org




Re: Is TopDocCollector's collect() implementation correct?

2009-03-23 Thread Michael McCandless
Shai Erera ser...@gmail.com wrote:

 As a side comment, why not add setNextReader to HitCollector and
 then a getDocId(int doc) method which will do the doc + base
 arithmetic?

One problem is this breaks back compatibility on any current
subclasses of HitCollector.

Another problem is: not all collectors would need to add the base on
each doc.  EG a collector that puts hits into separate pqueues per
segment could defer the addition until the end when only the top
results are pulled out of each pqueue.

Also, I am concerned about the method call overhead.  This is the
absolute ultimate hot spot for Lucene and we should worry about
causing even a single added instruction in this path.

That said... I would like to [eventually] change the collection API
along the lines of what Marvin proposed for Matcher in Lucy, here:

  http://markmail.org/message/jxshhiqr6wvq77xu

Specifically, I think it should be the collector's job to ask for the
score for this doc, rather than Lucene's job to pre-compute it, so
that collectors that don't need the score won't waste CPU.  EG, if you
are sorting by field (and don't present the relevance score) you
shouldn't compute it.

Then, we could add other somewhat expensive things you might
retrieve, such as a way to ask which terms participated in the match
(discussed today on java-user), and/or all term positions that
participated (discussed in LUCENE-1522).  EG, a top doc collector
could choose to call these methods only when the doc was competitive.
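
As a very rough sketch of that pull model (hypothetical names, loosely
following the Matcher idea above, not an actual Lucene API):

import java.io.IOException;

public abstract class PullCollector {

  /** Something the collector can lazily ask for the current doc's score
   *  (and, eventually, matching terms/positions). */
  public interface DocScorer {
    float score() throws IOException;
  }

  protected DocScorer scorer;

  /** Set once per segment/scorer before collection starts. */
  public void setScorer(DocScorer scorer) {
    this.scorer = scorer;
  }

  /** A sort-by-field collector can ignore the scorer entirely; a top-N
   *  by-score collector calls scorer.score() only for competitive docs. */
  public abstract void collect(int doc) throws IOException;
}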

 Anyway, I don't want to add topDocs and getTotalHits to
 HitCollector, it will destroy its generic purpose.

I agree.

 An interface is also problematic, as it just means all of these
 collectors have these methods declared, but they need to implement
 them. An abstract class grants you w/ both.

I'm confused on this objection -- only collectors that do let you ask
for the top N set of docs would implement this interface?  (Ie it'd
only be the TopXXXCollector's that'd implement the interface).  While
interfaces clearly have the future problem of back-compatibility, this
case may be simple enough to make an exception.

 So it looks like HitCollector itself is deprecated as far as the
 Lucene core code sees it.

I think HitCollector has a purpose, which is to be the simplest way to
make a custom collector.  Ie I think it makes sense to offer a simple
way and a high performance way.

Mike

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: Modularization

2009-03-23 Thread Mark Miller
Are you arguing for no change, Yonik? I agree with all of your points in 
any case.


What appeals to me most so far is:

Take the best of contrib and up its status to something like modules. 
Equal to core, different requirements, dependencies, etc. Perhaps take 
queryparser out of core, but frankly I wouldn't mind just leaving core 
as it is.


Reintroduce the sandbox (I believe contrib was the sandbox, part of the lower 
bar history) and put lesser contrib there and new stuff that's unproven. 
Contrib doesn't appeal to me as a name anyway.


That would give core, modules, and the sandbox (perhaps sandbox is a 
module?). Things could move from sandbox to core or the modules. Modules 
get new requirements similar to core - back compat guarantees and 
changes.txt per module.



Yonik Seeley wrote:

On Mon, Mar 23, 2009 at 11:10 AM, Michael McCandless
luc...@mikemccandless.com wrote:
  

  4. Move contrib/* under src/java/*, updating the javadocs to state
  back compatibility promises per class/package.



- contrib has always had a lower bar and stuff was committed under
that lower bar - there should be no blanket promotion.
- contrib items may have different dependencies... putting it all
under the same source root can make a developers job harder
- many contrib items are less related to lucene-java core indexing and
searching... if there is no contrib, then they don't belong in the
lucene-java project at all.
- right now it's clear - core can't have dependencies on non-core
classes.  If everything is stuck in the same source tree, that goes
away.

I think there are a lot of benefits to continue considering very
carefully if something is core or not.

-Yonik

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org

  



--
- Mark

http://www.lucidimagination.com




-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: Modularization

2009-03-23 Thread Earwin Burrfoot
 - contrib has always had a lower bar and stuff was committed under
 that lower bar - there should be no blanket promotion.
 - contrib items may have different dependencies... putting it all
 under the same source root can make a developers job harder
 - many contrib items are less related to lucene-java core indexing and
 searching... if there is no contrib, then they don't belong in the
 lucene-java project at all.
 - right now it's clear - core can't have dependencies on non-core
 classes.  If everything is stuck in the same source tree, that goes
 away.
Adding to this, afaik contribs have no java 1.4 restriction. If you
merge them into the core, you must either enforce it for contribs, or
lift it from the core. I think both variants may be a reason for
several heart attacks :)
One could argue that five years after 1.5 was released Lucene is going
to use it, so the point is no longer relevant. Sorry, 1.7 is just
behind the door.

-- 
Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
ICQ: 104465785

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: Modularization

2009-03-23 Thread Mark Miller

Earwin Burrfoot wrote:

- contrib has always had a lower bar and stuff was committed under
that lower bar - there should be no blanket promotion.
- contrib items may have different dependencies... putting it all
under the same source root can make a developers job harder
- many contrib items are less related to lucene-java core indexing and
searching... if there is no contrib, then they don't belong in the
lucene-java project at all.
- right now it's clear - core can't have dependencies on non-core
classes.  If everything is stuck in the same source tree, that goes
away.


Adding to this, afaik contribs have no java 1.4 restriction. If you
merge them into the core, you must either enforce it for contribs, or
lift it from the core. I think both variants may be a reason for
several heart attacks :)
One could argue that five years after 1.5 was released Lucene is going
to use it, so the point is no longer relevant. Sorry, 1.7 is just
behind the door.

  
I think we are considering this for Lucene 3.0 (should be the release 
after next) which will allow Java 1.5.


- Mark

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: Modularization

2009-03-23 Thread Earwin Burrfoot
On Mon, Mar 23, 2009 at 22:13, Mark Miller markrmil...@gmail.com wrote:
 Earwin Burrfoot wrote:

 - contrib has always had a lower bar and stuff was committed under
 that lower bar - there should be no blanket promotion.
 - contrib items may have different dependencies... putting it all
 under the same source root can make a developers job harder
 - many contrib items are less related to lucene-java core indexing and
 searching... if there is no contrib, then they don't belong in the
 lucene-java project at all.
 - right now it's clear - core can't have dependencies on non-core
 classes.  If everything is stuck in the same source tree, that goes
 away.


 Adding to this, afaik contribs have no java 1.4 restriction. If you
 merge them into the core, you must either enforce it for contribs, or
 lift it from the core. I think both variants may be a reason for
 several heart attacks :)
 One could argue that five years after 1.5 was released Lucene is going
 to use it, so the point is no longer relevant. Sorry, 1.7 is just
 behind the door.



 I think we are considering this for Lucene 3.0 (should be the release after
 next) which will allow Java 1.5.

So where are you going to put 1.6 and 1.7 contribs?

-- 
Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
ICQ: 104465785

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1561) Maybe rename Field.omitTf, and strengthen the javadocs

2009-03-23 Thread Otis Gospodnetic (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12688385#action_12688385
 ] 

Otis Gospodnetic commented on LUCENE-1561:
--

Might be good to keep a consistent name across Lucene/Solr.
More info coming up in SOLR-1079.


 Maybe rename Field.omitTf, and strengthen the javadocs
 --

 Key: LUCENE-1561
 URL: https://issues.apache.org/jira/browse/LUCENE-1561
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: 2.4.1
Reporter: Michael McCandless
Assignee: Michael McCandless
 Fix For: 2.9

 Attachments: LUCENE-1561.patch


 Spinoff from here:
   
 http://www.nabble.com/search-problem-when-indexed-using-Field.setOmitTf()-td22456141.html
 Maybe rename omitTf to something like omitTermPositions, and make it clear 
 what queries will silently fail to work as a result.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1561) Maybe rename Field.omitTf, and strengthen the javadocs

2009-03-23 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12688387#action_12688387
 ] 

Michael McCandless commented on LUCENE-1561:


Naming is the hardest part!!

 Maybe rename Field.omitTf, and strengthen the javadocs
 --

 Key: LUCENE-1561
 URL: https://issues.apache.org/jira/browse/LUCENE-1561
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: 2.4.1
Reporter: Michael McCandless
Assignee: Michael McCandless
 Fix For: 2.9

 Attachments: LUCENE-1561.patch


 Spinoff from here:
   
 http://www.nabble.com/search-problem-when-indexed-using-Field.setOmitTf()-td22456141.html
 Maybe rename omitTf to something like omitTermPositions, and make it clear 
 what queries will silently fail to work as a result.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1522) another highlighter

2009-03-23 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12688408#action_12688408
 ] 

Michael McCandless commented on LUCENE-1522:


Randomly searching in Google I came across this:


http://stackoverflow.com/questions/82151/is-there-a-fast-accurate-highlighter-for-lucene

...which emphasizes how important it is that the highlighter only highlight 
matching fragdocs when possible.

(Meaning, if you were to copy & paste the full excerpt you are looking at, 
index it as a document, would your current search match it).

 another highlighter
 ---

 Key: LUCENE-1522
 URL: https://issues.apache.org/jira/browse/LUCENE-1522
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/highlighter
Reporter: Koji Sekiguchi
Assignee: Michael McCandless
Priority: Minor
 Fix For: 2.9

 Attachments: colored-tag-sample.png, LUCENE-1522.patch, 
 LUCENE-1522.patch


 I've written this highlighter for my project to support bi-gram token stream 
 (general token stream (e.g. WhitespaceTokenizer) also supported. see test 
 code in patch). The idea was inherited from my previous project with my 
 colleague and LUCENE-644. This approach needs highlight fields to be 
 TermVector.WITH_POSITIONS_OFFSETS, but is fast and can support N-grams. This 
 depends on LUCENE-1448 to get refined term offsets.
 usage:
 {code:java}
 TopDocs docs = searcher.search( query, 10 );
 Highlighter h = new Highlighter();
 FieldQuery fq = h.getFieldQuery( query );
 for( ScoreDoc scoreDoc : docs.scoreDocs ){
   // fieldName=content, fragCharSize=100, numFragments=3
   String[] fragments = h.getBestFragments( fq, reader, scoreDoc.doc,
       "content", 100, 3 );
   if( fragments != null ){
     for( String fragment : fragments )
       System.out.println( fragment );
   }
 }
 {code}
 features:
 - fast for large docs
 - supports not only whitespace-based token stream, but also fixed size 
 N-gram (e.g. (2,2), not (1,3)) (can solve LUCENE-1489)
 - supports PhraseQuery, phrase-unit highlighting with slops
 {noformat}
 q="w1 w2"
 <b>w1 w2</b>
 ---
 q="w1 w2"~1
 <b>w1</b> w3 <b>w2</b> w3 <b>w1 w2</b>
 {noformat}
 - highlight fields need to be TermVector.WITH_POSITIONS_OFFSETS
 - easy to apply patch due to independent package (contrib/highlighter2)
 - uses Java 1.5
 - looks query boost to score fragments (currently doesn't see idf, but it 
 should be possible)
 - pluggable FragListBuilder
 - pluggable FragmentsBuilder
 to do:
 - term positions can be unnecessary when phraseHighlight==false
 - collects performance numbers

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1410) PFOR implementation

2009-03-23 Thread Paul Elschot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12688409#action_12688409
 ] 

Paul Elschot commented on LUCENE-1410:
--

The encoding in the Google research slides is another one.
They use 2 bits prefixing the first byte to indicate the number of bytes 
used for the encoded number (1-4), and then they group 4 of those prefixes 
together to get a single byte of 4 prefixes followed by the non-prefixed bytes 
of the 4 encoded numbers.
This requires a 256-way switch (indexed jump) for every 4 encoded numbers, and 
I would expect that jump to limit performance somewhat when compared to PFOR, 
which has a 32-way switch for 32/64/128 encoded numbers.
But since the prefixes only indicate the numbers of bytes used for the encoded 
numbers, no shifts and masks are needed, only byte moves.
So it could well be worthwhile to give this encoding a try, too, especially for 
lists of numbers shorter than 16 or 32.
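
To make the scheme concrete, here is a rough sketch of a decoder for that 
layout (my own illustration of the encoding as described in the slides, not 
code from any patch; the byte order is an assumption, and a real implementation 
would replace the inner loop with the 256-way switch / table lookup mentioned 
above, leaving only byte moves per case):

{code:java}
public final class GroupDecode {
  // One group = a tag byte holding four 2-bit lengths (1-4 bytes each),
  // followed by the payload bytes of the four encoded numbers.
  public static int decodeGroup(byte[] in, int pos, int[] out, int outPos) {
    int tag = in[pos++] & 0xFF;
    for (int i = 0; i < 4; i++) {
      int numBytes = ((tag >>> (2 * i)) & 0x3) + 1;  // 1..4 bytes for number i
      int value = 0;
      for (int b = 0; b < numBytes; b++) {
        value |= (in[pos++] & 0xFF) << (8 * b);      // assemble payload bytes
      }
      out[outPos + i] = value;
    }
    return pos;  // start of the next group's tag byte
  }
}
{code}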

 PFOR implementation
 ---

 Key: LUCENE-1410
 URL: https://issues.apache.org/jira/browse/LUCENE-1410
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Other
Reporter: Paul Elschot
Priority: Minor
 Attachments: autogen.tgz, LUCENE-1410b.patch, LUCENE-1410c.patch, 
 LUCENE-1410d.patch, LUCENE-1410e.patch, TermQueryTests.tgz, TestPFor2.java, 
 TestPFor2.java, TestPFor2.java

   Original Estimate: 21840h
  Remaining Estimate: 21840h

 Implementation of Patched Frame of Reference.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1522) another highlighter

2009-03-23 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12688419#action_12688419
 ] 

Mark Miller commented on LUCENE-1522:
-

I think you are reading more into that than I see - that guy is just frustrated 
that PhraseQueries don't highlight correctly. That was/is a common occurrence 
and you can find tons of examples. There are one or two JIRA highlighters that 
address it, and then there is the Span highlighter (more interestingly, there is 
a link to the birth of the Span highlighter idea on that page - thanks M. 
Harwood).

When users see the PhraseQuery look right, I haven't seen any other repeated 
complaints really. While it would be nice to match boolean logic fully, I 
almost don't think it's worth the effort. You likely have an interest in those 
terms anyway - it's not a given that the terms that caused the match 
(non-positional) matter. I have not seen a complaint on that one - mostly just 
positional type stuff. And I think we have positional solved fairly well with 
the current API - it's just too darn slow. Not that I am against things being 
sweet and perfect, and getting exact matches, but there has been lots of talk 
in the past about integrating the highlighter into core and making things 
really fast and efficient - and generally it comes down to what work actually 
gets done (and all this stuff ends up at the hard end of the pool).

When I wrote the SpanScorer, many times it was discussed how things should 
*really* be done. Most methods involved working with core - but what has been 
there for a couple years now is the SpanScorer that plugs into the current 
highlighter API and nothing else has made any progress. Not really an argument, 
just kind of thinking out loud at this point...

I'm all for improving the speed and accuracy of the highlighter at the end of 
the day, but it's a tall order considering how much attention the Highlighter 
has managed to receive in the past. It's large on ideas and low on sweat.

 another highlighter
 ---

 Key: LUCENE-1522
 URL: https://issues.apache.org/jira/browse/LUCENE-1522
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/highlighter
Reporter: Koji Sekiguchi
Assignee: Michael McCandless
Priority: Minor
 Fix For: 2.9

 Attachments: colored-tag-sample.png, LUCENE-1522.patch, 
 LUCENE-1522.patch


 I've written this highlighter for my project to support bi-gram token stream 
 (general token stream (e.g. WhitespaceTokenizer) also supported. see test 
 code in patch). The idea was inherited from my previous project with my 
 colleague and LUCENE-644. This approach needs highlight fields to be 
 TermVector.WITH_POSITIONS_OFFSETS, but is fast and can support N-grams. This 
 depends on LUCENE-1448 to get refined term offsets.
 usage:
 {code:java}
 TopDocs docs = searcher.search( query, 10 );
 Highlighter h = new Highlighter();
 FieldQuery fq = h.getFieldQuery( query );
 for( ScoreDoc scoreDoc : docs.scoreDocs ){
   // fieldName=content, fragCharSize=100, numFragments=3
   String[] fragments = h.getBestFragments( fq, reader, scoreDoc.doc,
       "content", 100, 3 );
   if( fragments != null ){
     for( String fragment : fragments )
       System.out.println( fragment );
   }
 }
 {code}
 features:
 - fast for large docs
 - supports not only whitespace-based token stream, but also fixed size 
 N-gram (e.g. (2,2), not (1,3)) (can solve LUCENE-1489)
 - supports PhraseQuery, phrase-unit highlighting with slops
 {noformat}
 q="w1 w2"
 <b>w1 w2</b>
 ---
 q="w1 w2"~1
 <b>w1</b> w3 <b>w2</b> w3 <b>w1 w2</b>
 {noformat}
 - highlight fields need to be TermVector.WITH_POSITIONS_OFFSETS
 - easy to apply patch due to independent package (contrib/highlighter2)
 - uses Java 1.5
 - looks query boost to score fragments (currently doesn't see idf, but it 
 should be possible)
 - pluggable FragListBuilder
 - pluggable FragmentsBuilder
 to do:
 - term positions can be unnecessary when phraseHighlight==false
 - collects performance numbers

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Issue Comment Edited: (LUCENE-1522) another highlighter

2009-03-23 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12688419#action_12688419
 ] 

Mark Miller edited comment on LUCENE-1522 at 3/23/09 2:12 PM:
--

I think you are reading more into that than I see - that guy is just frustrated 
that PhraseQueries don't highlight correctly. That was/is a common occurrence 
and you can find tons of examples. There are one or two JIRA highlighters that 
address it, and then there is the Span highlighter (more interestingly, there is 
a link to the birth of the Span highlighter idea on that page - thanks M. 
Harwood).

When users see the PhraseQuery look right, I haven't seen any other repeated 
complaints really. While it would be nice to match boolean logic fully, I 
almost don't think it's worth the effort. You likely have an interest in those 
terms anyway - it's not a given that the terms that caused the match 
(non-positional) matter. I have not seen a complaint on that one - mostly just 
positional type stuff. And I think we have positional solved fairly well with 
the current API - it's just too darn slow. Not that I am against things being 
sweet and perfect, and getting exact matches, but there has been lots of talk 
in the past about integrating the highlighter into core and making things 
really fast and efficient - and generally it comes down to what work actually 
gets done (and all this stuff ends up at the hard end of the pool).

When I wrote the SpanScorer, many times it was discussed how things should 
*really* be done. Most methods involved working with core - but what has been 
there for a couple years now is the SpanScorer that plugs into the current 
highlighter API and nothing else has made any progress. Not really an argument, 
just kind of thinking out loud at this point...

I'm all for improving the speed and accuracy of the highlighter at the end of 
the day, but it's a tall order considering how much attention the Highlighter 
has managed to receive in the past. It's large on ideas and low on sweat.

*edit*
A lot of the sweat that is given has been fragmented by the 3 or 4 alternate 
highlighters.

 another highlighter
 ---

 Key: LUCENE-1522
 URL: https://issues.apache.org/jira/browse/LUCENE-1522
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/highlighter
Reporter: Koji Sekiguchi
Assignee: Michael McCandless
Priority: Minor
 Fix For: 2.9

 Attachments: colored-tag-sample.png, LUCENE-1522.patch, 
 LUCENE-1522.patch


 I've written this highlighter for my project to support bi-gram token stream 
 (general token stream (e.g. WhitespaceTokenizer) also supported. see test 
 code in patch). The idea was inherited from my previous project with my 
 colleague and LUCENE-644. This approach needs highlight fields to be 
 TermVector.WITH_POSITIONS_OFFSETS, but is fast and can support N-grams. This 
 depends on 

[jira] Commented: (LUCENE-1561) Maybe rename Field.omitTf, and strengthen the javadocs

2009-03-23 Thread Eks Dev (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12688429#action_12688429
 ] 

Eks Dev commented on LUCENE-1561:
-

maybe something along these lines:

usePureBooleanPostings()
minimalInvertedList()




 Maybe rename Field.omitTf, and strengthen the javadocs
 --

 Key: LUCENE-1561
 URL: https://issues.apache.org/jira/browse/LUCENE-1561
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: 2.4.1
Reporter: Michael McCandless
Assignee: Michael McCandless
 Fix For: 2.9

 Attachments: LUCENE-1561.patch


 Spinoff from here:
   
 http://www.nabble.com/search-problem-when-indexed-using-Field.setOmitTf()-td22456141.html
 Maybe rename omitTf to something like omitTermPositions, and make it clear 
 what queries will silently fail to work as a result.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: Modularization

2009-03-23 Thread Michael McCandless
 I think we are considering this for Lucene 3.0 (should be the
 release after next) which will allow Java 1.5.

 So where are you going to put 1.6 and 1.7 contribs?

This is a good point: core Lucene must remain on old JREs, but we
should not force all contrib packages to do so.

 - contrib has always had a lower bar and stuff was committed under
 that lower bar - there should be no blanket promotion.

OK so that was the past, and I agree.

I assume by this you're also advocating that going forward this is an
ongoing reason to put something into contrib?  I agree with that. Ie,
if a contribution is made, but it's not clear the quality is up to
core's standards, I would much rather have some place to commit it
(contrib) than to reject it, because once it has a home here, it has a
chance to gain interest, grow, improve, etc.

But: do you think, for this reason, the web site should continue to
present the dichotomy?

 - contrib items may have different dependencies... putting it all
 under the same source root can make a developers job harder

That's a good point & criterion for leaving something in contrib.

 - many contrib items are less related to lucene-java core indexing
 and searching... if there is no contrib, then they don't belong in
 the lucene-java project at all.

But most contrib packages are very related to Lucene.

Though I agree some contrib packages likely have very narrow
appeal/usage (eg, contrib/db, for using BDB as the raw store for an
index).

And I agree (as above): I would like to have somewhere for
contributions to go, rather than reject them.

 - right now it's clear - core can't have dependencies on non-core
 classes.  If everything is stuck in the same source tree, that goes
 away.

Well... this gets to Hoss's motivation, which I appreciate, to keep
the core tiny.

But that's just good software design and you don't need a divorced
directory structure to achieve that.

 I think there are a lot of benefits to continue considering very
 carefully if something is core or not.

I agree, but at least we need some clear criteria so the future
decision process is more straightforward.  Towards that... it seems
like there are good reasons why something should be put into contrib:

  * It uses a version of JDK higher than what core can allow

  * It has external dependencies

  * Its quality is debatable (or at least not proven)

  * It's of somewhat narrow usage/interest (eg: contrib/bdb)

But I don't think "it doesn't have to be in core" (the software
modularity goal) is the right reason to put something in contrib.

Getting back to the original topic: Trie(Numeric)RangeFilter runs on
JDK 1.4, has no external dependencies, looks to be high quality, and
likely will have wide appeal.  Doesn't it belong in core?

Mike

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1522) another highlighter

2009-03-23 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12688439#action_12688439
 ] 

Michael McCandless commented on LUCENE-1522:


bq. I think you are reading more into that than I see - that guy is just 
frustrated that PhraseQueries don't highlight correctly

But that's really quite a serious problem; it's the kind that
immediately erodes user's trust.  Though if this user had used
SpanScorer it would have been fixed (right?).

Is there any reason not to use SpanScorer (vs QueryScorer)?

The final inch (search UI) is exceptionally important!

bq. When users see the PhraseQuery look right, I havn't seen any other repeated 
complaints really.

OK.

bq. And I think we have positional solved fairly well with the current API - 
its just too darn slow.

Well... I'd still like to explore some way to better integrate w/ core
(just don't have enough time, but maybe if I keep talking about it
here, someone else will get the itch + time ;).

I think an IndexReader impl around loaded TermVectors can get us OK
performance (no re-analysis nor linear scan of resynthesized
TokenStream).
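
Just to illustrate the raw material such an impl would work from (a sketch
against the existing term vector API, not the proposed reader itself):

{code:java}
import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.TermFreqVector;
import org.apache.lucene.index.TermPositionVector;
import org.apache.lucene.index.TermVectorOffsetInfo;

// Dumps the positions/offsets already stored in a doc's term vector; a
// TermVector-backed reader would expose this same data through the normal
// TermDocs/TermPositions API instead of re-analyzing the stored text.
public class TermVectorDump {
  public static void dump(IndexReader reader, int doc, String field) throws IOException {
    TermFreqVector tfv = reader.getTermFreqVector(doc, field);
    if (!(tfv instanceof TermPositionVector)) {
      return;  // field must be indexed with TermVector.WITH_POSITIONS_OFFSETS
    }
    TermPositionVector tpv = (TermPositionVector) tfv;
    String[] terms = tpv.getTerms();
    for (int i = 0; i < terms.length; i++) {
      int[] positions = tpv.getTermPositions(i);
      TermVectorOffsetInfo[] offsets = tpv.getOffsets(i);
      System.out.println(terms[i] + ": " + positions.length + " positions, "
          + (offsets == null ? 0 : offsets.length) + " offsets");
    }
  }
}
{code}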

bq. Not that I am against things being sweet and perfect, and getting exact 
matches, but there has been lots of talk in the past about integrating the 
highlighter into core and making things really fast and efficient - and 
generally it comes down to what work actually gets done (and all this stuff 
ends up at the hard end of the pool).

Well this is open source after all.  Things get naturally
prioritized.

bq. A lot of the sweat that is given has been fragmented by the 3 or 4 
alternate highlighters.

Yeah also another common theme in open-source development, though it's
in good company: evolution and capitalism share the same flaw.


 another highlighter
 ---

 Key: LUCENE-1522
 URL: https://issues.apache.org/jira/browse/LUCENE-1522
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/highlighter
Reporter: Koji Sekiguchi
Assignee: Michael McCandless
Priority: Minor
 Fix For: 2.9

 Attachments: colored-tag-sample.png, LUCENE-1522.patch, 
 LUCENE-1522.patch


 I've written this highlighter for my project to support bi-gram token stream 
 (general token stream (e.g. WhitespaceTokenizer) also supported. see test 
 code in patch). The idea was inherited from my previous project with my 
 colleague and LUCENE-644. This approach needs highlight fields to be 
 TermVector.WITH_POSITIONS_OFFSETS, but is fast and can support N-grams. This 
 depends on LUCENE-1448 to get refined term offsets.
 usage:
 {code:java}
 TopDocs docs = searcher.search( query, 10 );
 Highlighter h = new Highlighter();
 FieldQuery fq = h.getFieldQuery( query );
 for( ScoreDoc scoreDoc : docs.scoreDocs ){
   // fieldName=content, fragCharSize=100, numFragments=3
   String[] fragments = h.getBestFragments( fq, reader, scoreDoc.doc,
       "content", 100, 3 );
   if( fragments != null ){
     for( String fragment : fragments )
       System.out.println( fragment );
   }
 }
 {code}
 features:
 - fast for large docs
 - supports not only whitespace-based token stream, but also fixed size 
 N-gram (e.g. (2,2), not (1,3)) (can solve LUCENE-1489)
 - supports PhraseQuery, phrase-unit highlighting with slops
 {noformat}
 q="w1 w2"
 <b>w1 w2</b>
 ---
 q="w1 w2"~1
 <b>w1</b> w3 <b>w2</b> w3 <b>w1 w2</b>
 {noformat}
 - highlight fields need to be TermVector.WITH_POSITIONS_OFFSETS
 - easy to apply patch due to independent package (contrib/highlighter2)
 - uses Java 1.5
 - looks query boost to score fragments (currently doesn't see idf, but it 
 should be possible)
 - pluggable FragListBuilder
 - pluggable FragmentsBuilder
 to do:
 - term positions can be unnecessary when phraseHighlight==false
 - collects performance numbers

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: Modularization

2009-03-23 Thread Mike Klaas


On 23-Mar-09, at 2:41 PM, Michael McCandless wrote:


I agree, but at least we need some clear criteria so the future
decision process is more straightforward.  Towards that... it seems
like there are good reasons why something should be put into contrib:

 * It uses a version of JDK higher than what core can allow

 * It has external dependencies

 * Its quality is debatable (or at least not proven)

 * It's of somewhat narrow usage/interest (eg: contrib/bdb)

But I don't think "it doesn't have to be in core" (the software
modularity goal) is the right reason to put something in contrib.


Agreed.  I don't think that building on the existing 'contrib' is the  
way to go.  Frequently-used, high-quality components should be more  
properly part of Lucene, whether that means that they move to core,  
or in a new blessed modules section.



Getting back to the original topic: Trie(Numeric)RangeFilter runs on
JDK 1.4, has no external dependencies, looks to be high quality, and
likely will have wide appeal.  Doesn't it belong in core?


+1.  It is important that Lucene come blessed with very good quality  
defaults.  Fast range queries are a common requirement.  Similarly, I  
wouldn't be happy to have a new, wicked QueryParser be relegated to  
contrib where it is unlikely to be found by non-savvy users.  At the  
very least, I agree with Michael that it should be findable in the  
same place.


It does make sense to separate the machinery/building blocks (base  
Query, Weight, Scorer, Filter classes, Similarity interface, etc.)  
from the Query/Filter implementations that use them.  But whether this  
is done by putting them in separate directories or via a global  
core/modules distinction seems unimportant.


-Mike

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1522) another highlighter

2009-03-23 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12688448#action_12688448
 ] 

Mark Miller commented on LUCENE-1522:
-

{quote}But that's really quite a serious problem; it's the kind that
immediately erodes user's trust. Though if this user had used
SpanScorer it would have been fixed (right?).{quote}

Right - my point was more that it was a common complaint and has been solved in 
one way or another for a long time. Even back when that post occurred, there was 
a JIRA highlighter that worked with phrase queries I think. There have been at 
least one or two besides the SpanScorer.

{quote}Is there any reason not to use SpanScorer (vs QueryScorer)?{quote}

It is slower when working with position-sensitive clauses - because it actually 
does some work. For non-position-sensitive terms, it's the same speed as the 
standard. Makes sense to me to always use it, but if you don't care and want 
every term highlighted, why pay the price I guess...

{quote}
Well... I'd still like to explore some way to better integrate w/ core
(just don't have enough time, but maybe if I keep talking about it
here, someone else will get the itch + time .
{quote}

Right - don't get me wrong - I was just getting thoughts in my head down. These 
types of brain dumps you higher-level guys do definitely lead to work getting done - 
the SpanScorer came directly from these types of discussions, and quite a bit 
later - the original discussion happened before my time.

{quote}
Well this is open source after all. Things get naturally
prioritized.

A lot of the sweat that is given has been fragmented by the 3 or 4 
alternate highlighters.

Yeah also another common theme in open-source development, though it's
in good company: evolution and capitalism share the same flaw.
{quote}

Right. I suppose I was just suggesting that something more practical might make 
more sense (more musing than suggesting). And practical in terms of how much 
activity we have seen in the highlighter area (fairly low, and not usually to 
the extent needed to get something committed and in use).

And the split work on the highlighters is fine - but if we had the right 
highlighter base, more work could have been concentrated on the highlighter 
that's most used. Not really a complaint, but an idea for the future. If we can get 
something better going, perhaps we can get to the point where people work with 
the current implementation rather than creating a new one every time.

 another highlighter
 ---

 Key: LUCENE-1522
 URL: https://issues.apache.org/jira/browse/LUCENE-1522
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/highlighter
Reporter: Koji Sekiguchi
Assignee: Michael McCandless
Priority: Minor
 Fix For: 2.9

 Attachments: colored-tag-sample.png, LUCENE-1522.patch, 
 LUCENE-1522.patch


 I've written this highlighter for my project to support bi-gram token stream 
 (general token stream (e.g. WhitespaceTokenizer) also supported. see test 
 code in patch). The idea was inherited from my previous project with my 
 colleague and LUCENE-644. This approach needs highlight fields to be 
 TermVector.WITH_POSITIONS_OFFSETS, but is fast and can support N-grams. This 
 depends on LUCENE-1448 to get refined term offsets.
 usage:
 {code:java}
 TopDocs docs = searcher.search( query, 10 );
 Highlighter h = new Highlighter();
 FieldQuery fq = h.getFieldQuery( query );
 for( ScoreDoc scoreDoc : docs.scoreDocs ){
   // fieldName=content, fragCharSize=100, numFragments=3
   String[] fragments = h.getBestFragments( fq, reader, scoreDoc.doc,
       "content", 100, 3 );
   if( fragments != null ){
     for( String fragment : fragments )
       System.out.println( fragment );
   }
 }
 {code}
 features:
 - fast for large docs
 - supports not only whitespace-based token stream, but also fixed size 
 N-gram (e.g. (2,2), not (1,3)) (can solve LUCENE-1489)
 - supports PhraseQuery, phrase-unit highlighting with slops
 {noformat}
 q="w1 w2"
 <b>w1 w2</b>
 ---
 q="w1 w2"~1
 <b>w1</b> w3 <b>w2</b> w3 <b>w1 w2</b>
 {noformat}
 - highlight fields need to be TermVector.WITH_POSITIONS_OFFSETS
 - easy to apply patch due to independent package (contrib/highlighter2)
 - uses Java 1.5
 - looks query boost to score fragments (currently doesn't see idf, but it 
 should be possible)
 - pluggable FragListBuilder
 - pluggable FragmentsBuilder
 to do:
 - term positions can be unnecessary when phraseHighlight==false
 - collects performance numbers

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: 

[jira] Commented: (LUCENE-1561) Maybe rename Field.omitTf, and strengthen the javadocs

2009-03-23 Thread Mike Klaas (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12688449#action_12688449
 ] 

Mike Klaas commented on LUCENE-1561:


I agree that it is going to be almost impossible to convey that phrase queries 
don't work by renaming the flag.  I agree with Eks Dev that a positive 
formulation is the only chance, although this deviates from the current omit* 
flags.

termPresenceOnly()
trackTermPresenceOnly()
onlyTermPresence()
omitEverythingButTermPresence() // just kidding


 Maybe rename Field.omitTf, and strengthen the javadocs
 --

 Key: LUCENE-1561
 URL: https://issues.apache.org/jira/browse/LUCENE-1561
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: 2.4.1
Reporter: Michael McCandless
Assignee: Michael McCandless
 Fix For: 2.9

 Attachments: LUCENE-1561.patch


 Spinoff from here:
   
 http://www.nabble.com/search-problem-when-indexed-using-Field.setOmitTf()-td22456141.html
 Maybe rename omitTf to something like omitTermPositions, and make it clear 
 what queries will silently fail to work as a result.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1522) another highlighter

2009-03-23 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12688451#action_12688451
 ] 

Michael Busch commented on LUCENE-1522:
---

{quote}
(Meaning, if you were to copy & paste the full excerpt you are looking at, 
index it as a document, would your current search match it).
{quote}

I think this is an unrealistic requirement in some cases (e.g. AND queries). I 
agree it makes sense for phrases to show them entirely in a fragment (even if 
that means not to show the beginning of a sentence). But often you have only 
one or two lines of text to display an extract. Then it might be a better 
choice to show two decently sized fragments with some context around the 
highlighted terms, rather than showing e.g. 4 short fragments just to show all 
4 highlighted query terms (e.g. for query '+a +b +c +d')



 another highlighter
 ---

 Key: LUCENE-1522
 URL: https://issues.apache.org/jira/browse/LUCENE-1522
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/highlighter
Reporter: Koji Sekiguchi
Assignee: Michael McCandless
Priority: Minor
 Fix For: 2.9

 Attachments: colored-tag-sample.png, LUCENE-1522.patch, 
 LUCENE-1522.patch


 I've written this highlighter for my project to support bi-gram token stream 
 (general token stream (e.g. WhitespaceTokenizer) also supported. see test 
 code in patch). The idea was inherited from my previous project with my 
 colleague and LUCENE-644. This approach needs highlight fields to be 
 TermVector.WITH_POSITIONS_OFFSETS, but is fast and can support N-grams. This 
 depends on LUCENE-1448 to get refined term offsets.
 usage:
 {code:java}
 TopDocs docs = searcher.search( query, 10 );
 Highlighter h = new Highlighter();
 FieldQuery fq = h.getFieldQuery( query );
 for( ScoreDoc scoreDoc : docs.scoreDocs ){
   // fieldName=content, fragCharSize=100, numFragments=3
   String[] fragments = h.getBestFragments( fq, reader, scoreDoc.doc,
       "content", 100, 3 );
   if( fragments != null ){
     for( String fragment : fragments )
       System.out.println( fragment );
   }
 }
 {code}
 features:
 - fast for large docs
 - supports not only whitespace-based token stream, but also fixed size 
 N-gram (e.g. (2,2), not (1,3)) (can solve LUCENE-1489)
 - supports PhraseQuery, phrase-unit highlighting with slops
 {noformat}
 q="w1 w2"
 <b>w1 w2</b>
 ---
 q="w1 w2"~1
 <b>w1</b> w3 <b>w2</b> w3 <b>w1 w2</b>
 {noformat}
 - highlight fields need to be TermVector.WITH_POSITIONS_OFFSETS
 - easy to apply patch due to independent package (contrib/highlighter2)
 - uses Java 1.5
 - takes query boost into account when scoring fragments (currently doesn't 
 consider idf, but that should be possible)
 - pluggable FragListBuilder
 - pluggable FragmentsBuilder
 to do:
 - term positions may be unnecessary when phraseHighlight==false
 - collect performance numbers

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1522) another highlighter

2009-03-23 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12688454#action_12688454
 ] 

Michael McCandless commented on LUCENE-1522:


bq. I think this is an unrealistic requirement in some cases (e.g. AND queries).

I agree.

 another highlighter
 ---

 Key: LUCENE-1522
 URL: https://issues.apache.org/jira/browse/LUCENE-1522
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/highlighter
Reporter: Koji Sekiguchi
Assignee: Michael McCandless
Priority: Minor
 Fix For: 2.9

 Attachments: colored-tag-sample.png, LUCENE-1522.patch, 
 LUCENE-1522.patch


 I've written this highlighter for my project to support bi-gram token streams 
 (general token streams, e.g. WhitespaceTokenizer, are also supported; see the 
 test code in the patch). The idea was inherited from my previous project with 
 my colleague and from LUCENE-644. This approach needs highlight fields to be 
 indexed with TermVector.WITH_POSITIONS_OFFSETS, but it is fast and can support 
 N-grams. It depends on LUCENE-1448 to get refined term offsets.
 usage:
 {code:java}
 TopDocs docs = searcher.search( query, 10 );
 Highlighter h = new Highlighter();
 FieldQuery fq = h.getFieldQuery( query );
 for( ScoreDoc scoreDoc : docs.scoreDocs ){
   // fieldName="content", fragCharSize=100, numFragments=3
   String[] fragments = h.getBestFragments( fq, reader, scoreDoc.doc,
       "content", 100, 3 );
   if( fragments != null ){
     for( String fragment : fragments )
       System.out.println( fragment );
   }
 }
 {code}
 features:
 - fast for large docs
 - supports not only whitespace-based token streams, but also fixed-size 
 N-grams (e.g. (2,2), not (1,3)) (can solve LUCENE-1489)
 - supports PhraseQuery, phrase-unit highlighting with slops
 {noformat}
 q=w1 w2
 <b>w1 w2</b>
 ---
 q=w1 w2~1
 <b>w1</b> w3 <b>w2</b> w3 <b>w1 w2</b>
 {noformat}
 - highlight fields need to be TermVector.WITH_POSITIONS_OFFSETS
 - easy to apply patch due to independent package (contrib/highlighter2)
 - uses Java 1.5
 - takes query boost into account when scoring fragments (currently doesn't 
 consider idf, but that should be possible)
 - pluggable FragListBuilder
 - pluggable FragmentsBuilder
 to do:
 - term positions may be unnecessary when phraseHighlight==false
 - collect performance numbers

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Improve worst-case performance of TrieRange queries

2009-03-23 Thread Michael Busch

Let me give an example to explain my idea - I'm using dates in my
example, because it's easier to imagine :)

Let's say we have the following posting lists. There are 20 docs in the
index and an X means that a doc contains the corresponding term:

Jan X   X
Feb XX  X
Mar  X
Apr XX
May X
Jun
Jul   XX
Aug   X  X
Sep   X
Oct   X
Nov  X  X
Dec X X

Then we index another term 'ALL'. It gets added for any document that 
has a numeric value in this bucket:


All X XX

If the query is [Jun TO Jul], then we process the query normally (ORing 
terms Jun and Jul). If the query is [Feb TO Nov], then we basically 
translate that into All AND NOT (Jan OR Dec).


Since you only have to evaluate the complement of the range's terms, you can 
(almost) double the worst-case performance (i.e., roughly halve the number of 
terms that have to be processed).
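
In code, the idea looks something like this (just a sketch using 
java.util.BitSet; the "month"/"ALL" terms stand in for the real trie bucket 
terms):

import java.io.IOException;
import java.util.BitSet;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;

public class ComplementRangeSketch {

  // Collect the docs containing a single term into a BitSet.
  static BitSet bitsFor(IndexReader reader, Term term) throws IOException {
    BitSet bits = new BitSet(reader.maxDoc());
    TermDocs td = reader.termDocs(term);
    try {
      while (td.next()) {
        bits.set(td.doc());
      }
    } finally {
      td.close();
    }
    return bits;
  }

  // [Feb TO Nov]  ==  All AND NOT (Jan OR Dec):
  // only two month terms are evaluated instead of ten.
  static BitSet febToNov(IndexReader reader) throws IOException {
    BitSet result = bitsFor(reader, new Term("month", "ALL"));
    BitSet excluded = bitsFor(reader, new Term("month", "Jan"));
    excluded.or(bitsFor(reader, new Term("month", "Dec")));
    result.andNot(excluded);
    return result;
  }
}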


Downsides:
- you need another BitSet in memory to perform the AND NOT operation, so it 
uses more memory
- this complement approach is only this simple for numeric fields where each 
document has a single value; similar things are doable for multi-valued 
numeric fields, but they are more complex and probably yield a smaller 
performance gain
- you need to index an additional term per bucket, so the index size 
increases slightly


Does this make sense? Maybe this has even been discussed already?

-Michael

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



LocalLucene sorting issue

2009-03-23 Thread Ryan McKinley
In order to get spatial lucene into solr, we need to figure out how to  
fix the memory leak described in:

https://issues.apache.org/jira/browse/LUCENE-1304

Reading the posts on LUCENE-1304, it seems to point to LUCENE-1483 as  
the _real_ solution while LUCENE-1304 would just be a deprecated band- 
aid (for the record, band-aids are quite useful).


Before delving into this again, it looks like LUCENE-1483 is finished,  
but I don't understand how it fixes the CustomSort stuff.  Also, I  
don't see what the deprecated sorting stuff should be replaced with...


thanks for any pointers

ryan

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1570) QueryParser.setAllowLeadingWildcard could provide finer granularity

2009-03-23 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12688482#action_12688482
 ] 

Yonik Seeley commented on LUCENE-1570:
--

This is pretty easy to implement by overriding QueryParser.getWildcardQuery().
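
For example (an untested sketch against the 2.4 QueryParser; the class name and 
the field whitelist are made up):

{code:java}
import java.util.Set;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;

public class PerFieldLeadingWildcardQueryParser extends QueryParser {

  private final Set<String> leadingWildcardFields;

  public PerFieldLeadingWildcardQueryParser(String defaultField, Analyzer analyzer,
      Set<String> leadingWildcardFields) {
    super(defaultField, analyzer);
    this.leadingWildcardFields = leadingWildcardFields;
    // Enable leading wildcards globally, then restrict them per field below.
    setAllowLeadingWildcard(true);
  }

  protected Query getWildcardQuery(String field, String termStr) throws ParseException {
    boolean leading = termStr.startsWith("*") || termStr.startsWith("?");
    if (leading && !leadingWildcardFields.contains(field)) {
      throw new ParseException("Leading wildcard not allowed for field '" + field + "'");
    }
    return super.getWildcardQuery(field, termStr);
  }
}
{code}

e.g. new PerFieldLeadingWildcardQueryParser("content", analyzer, new 
HashSet<String>(Arrays.asList("path", "title"))) for the wiki case described in 
this issue.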

 QueryParser.setAllowLeadingWildcard could provide finer granularity
 ---

 Key: LUCENE-1570
 URL: https://issues.apache.org/jira/browse/LUCENE-1570
 Project: Lucene - Java
  Issue Type: Improvement
  Components: QueryParser
Affects Versions: 2.4.1
Reporter: Jonathan Watt

 It's great that Lucene now allows support for leading wildcards to be turned 
 on. However, leading wildcard searches are more expensive, so it would be 
 useful to be able to turn it on only for certain search fields. I'm 
 specifically thinking of wiki searches where it may be too expensive to allow 
 leading wildcards in the 'content:' field, but it would still be very useful 
 to be able to selectively turn on support for 'path:' and perhaps other 
 fields such as 'title:'. Would this be possible?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1570) QueryParser.setAllowLeadingWildcard could provide finer granularity

2009-03-23 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12688486#action_12688486
 ] 

Mark Miller commented on LUCENE-1570:
-

I've wanted this in the past. It's certainly possible, but I am not sure how 
easy it would be to do with the current queryparser (it's been a long time 
since I have been in there). There appears to be a new parser on the horizon 
though, and it sounds as if it will allow these types of additions much more 
elegantly (the current queryparser does not use a syntax tree representation, 
and it's kind of hairy to build on).

If I remember right, the current QueryParser simply attaches semantic actions 
to grammar production rules - difficult to read, edit, and maintain - and it 
has not been super friendly to build upon.

Also, if I remember right, I think this new parser will use abstract syntax 
trees, which let you split up syntax and semantics and also keep things a bit 
more modular - you can do things like have a pluggable syntax reader that feeds 
a pluggable query output writer. At least for the basics - it sounds like these 
guys have made something pretty cool, but I have not seen the code yet and have 
only a brief memory of its description.

Point being, it can be done and I think it's useful, but it might make sense to 
see how much easier it can be done with this new parser.

 QueryParser.setAllowLeadingWildcard could provide finer granularity
 ---

 Key: LUCENE-1570
 URL: https://issues.apache.org/jira/browse/LUCENE-1570
 Project: Lucene - Java
  Issue Type: Improvement
  Components: QueryParser
Affects Versions: 2.4.1
Reporter: Jonathan Watt

 It's great that Lucene now allows support for leading wildcards to be turned 
 on. However, leading wildcard searches are more expensive, so it would be 
 useful to be able to turn it on only for certain search fields. I'm 
 specifically thinking of wiki searches where it may be too expensive to allow 
 leading wildcards in the 'content:' field, but it would still be very useful 
 to be able to selectively turn on support for 'path:' and perhaps other 
 fields such as 'title:'. Would this be possible?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1570) QueryParser.setAllowLeadingWildcard could provide finer granularity

2009-03-23 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12688494#action_12688494
 ] 

Mark Miller commented on LUCENE-1570:
-

Yonik spit out a bit of a better answer while I typed - right, you do have 
access to the field in getWildcardQuery, and the leading-wildcard check happens 
there, so you can override it. My brain always runs towards building the 
support in, but in this case it may be cleaner to leave it out anyway. It's 
somewhat of a niche concern. I just had the new QueryParser on my mind.

 QueryParser.setAllowLeadingWildcard could provide finer granularity
 ---

 Key: LUCENE-1570
 URL: https://issues.apache.org/jira/browse/LUCENE-1570
 Project: Lucene - Java
  Issue Type: Improvement
  Components: QueryParser
Affects Versions: 2.4.1
Reporter: Jonathan Watt

 It's great that Lucene now allows support for leading wildcards to be turned 
 on. However, leading wildcard searches are more expensive, so it would be 
 useful to be able to turn it on only for certain search fields. I'm 
 specifically thinking of wiki searches where it may be too expensive to allow 
 leading wildcards in the 'content:' field, but it would still be very useful 
 to be able to selectively turn on support for 'path:' and perhaps other 
 fields such as 'title:'. Would this be possible?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Created: (LUCENE-1570) QueryParser.setAllowLeadingWildcard could provide finer granularity

2009-03-23 Thread Jonathan Watt (JIRA)
QueryParser.setAllowLeadingWildcard could provide finer granularity
---

 Key: LUCENE-1570
 URL: https://issues.apache.org/jira/browse/LUCENE-1570
 Project: Lucene - Java
  Issue Type: Improvement
  Components: QueryParser
Affects Versions: 2.4.1
Reporter: Jonathan Watt


It's great that Lucene now allows support for leading wildcards to be turned 
on. However, leading wildcard searches are more expensive, so it would be 
useful to be able to turn it on only for certain search fields. I'm 
specifically thinking of wiki searches where it may be too expensive to allow 
leading wildcards in the 'content:' field, but it would still be very useful to 
be able to selectively turn on support for 'path:' and perhaps other fields 
such as 'title:'. Would this be possible?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: LocalLucene sorting issue

2009-03-23 Thread Mark Miller

Ryan McKinley wrote:
In order to get spatial lucene into solr, we need to figure out how to 
fix the memory leak described in:

https://issues.apache.org/jira/browse/LUCENE-1304

Reading the posts on LUCENE-1304, it seems to point to LUCENE-1483 as 
the _real_ solution while LUCENE-1304 would just be a deprecated 
band-aid (for the record, band-aids are quite useful).


Before delving into this again, it looks like LUCENE-1483 is finished, 
but I don't understand how it fixes the CustomSort stuff.  Also, I 
don't see what the deprecated sorting stuff should be replaced with...
The fix is that, with LUCENE-1483, comparators are no longer cached, as long 
as you use the new API. The new API is FieldComparator, and you supply one 
via a FieldComparatorSource. FieldComparator may look a little complicated, 
but it's fairly straightforward for the primitive (non-String) types - you 
should be able to roughly copy one of the existing ones.


org.apache.lucene.search.FieldComparator

There is a new SortField constructor that takes a FieldComparatorSource.
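
For instance, a rough sketch of a simple int-valued comparator (the method 
signatures follow the FieldComparator API as committed for 2.9, so double-check 
them against the trunk revision you are on; the field and values are just 
placeholders):

import java.io.IOException;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.FieldCache;
import org.apache.lucene.search.FieldComparator;
import org.apache.lucene.search.FieldComparatorSource;

public class IntComparatorSource extends FieldComparatorSource {

  public FieldComparator newComparator(String fieldname, int numHits,
      int sortPos, boolean reversed) throws IOException {
    return new IntComparator(numHits, fieldname);
  }

  static class IntComparator extends FieldComparator {
    private final int[] slotValues;  // values of the hits currently in the queue
    private final String field;
    private int[] currentValues;     // per-segment values from the FieldCache
    private int bottom;              // value of the weakest hit in the queue

    IntComparator(int numHits, String field) {
      this.slotValues = new int[numHits];
      this.field = field;
    }

    public int compare(int slot1, int slot2) {
      int v1 = slotValues[slot1], v2 = slotValues[slot2];
      return v1 < v2 ? -1 : (v1 > v2 ? 1 : 0);
    }

    public int compareBottom(int doc) {
      int v = currentValues[doc];
      return bottom < v ? -1 : (bottom > v ? 1 : 0);
    }

    public void copy(int slot, int doc) {
      slotValues[slot] = currentValues[doc];
    }

    public void setBottom(int slot) {
      bottom = slotValues[slot];
    }

    public void setNextReader(IndexReader reader, int docBase) throws IOException {
      // Called once per segment, so nothing is cached across readers.
      currentValues = FieldCache.DEFAULT.getInts(reader, field);
    }

    public Comparable value(int slot) {
      return new Integer(slotValues[slot]);
    }
  }
}

The sort itself would then be built with something like 
new Sort(new SortField("distance", new IntComparatorSource())).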


thanks for any pointers

ryan

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org




--
- Mark

http://www.lucidimagination.com




-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org