Re: extending the query parser

2009-03-12 Thread Earwin Burrfoot
Take ANTLR and roll your own query parser from scratch? It's pretty easy.

On Thu, Mar 12, 2009 at 04:24, Candide Kemmler cand...@palacehotel.org wrote:
 Hello,

 I'm looking for a way to extend the Lucene query parser to allow for semantic
 computations in IEML space (see http://ieml.org). What I'd like to know is:
 how difficult would it be to add clauses to a query like: ... AND
 (some_IEML_expression) AND ...

 some_IEML_expression would involve a reference to some field that would
 contain metadata expressed in that format.

 Thanks in advance for your insights.

 Candide

 -
 To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-dev-h...@lucene.apache.org





-- 
Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
ICQ: 104465785
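
For a sense of scale, here is a minimal sketch of what "rolling your own" might look like for the kind of clause asked about above. It is a hand-written recursive-descent parser rather than an ANTLR grammar, and everything in it is an assumption made for illustration: the class name TinyQueryParser, the ieml( ... ) clause syntax, and the string form of the parse result are hypothetical and come from neither Lucene nor IEML.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical toy parser: TERM, parenthesized groups, AND, and an ieml( ... ) clause. */
class TinyQueryParser {
    private final List<String> tokens = new ArrayList<>();
    private int pos = 0;

    TinyQueryParser(String input) {
        // Pad parentheses with spaces so a plain whitespace split tokenizes them.
        String padded = input.replace("(", " ( ").replace(")", " ) ").trim();
        for (String t : padded.split("\\s+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
    }

    /** query := clause ( "AND" clause )* */
    String parseQuery() {
        List<String> clauses = new ArrayList<>();
        clauses.add(parseClause());
        while (pos < tokens.size() && tokens.get(pos).equals("AND")) {
            pos++; // consume AND
            clauses.add(parseClause());
        }
        return clauses.size() == 1 ? clauses.get(0) : "AND" + clauses;
    }

    /** clause := "(" query ")" | "ieml" "(" TERM* ")" | TERM */
    private String parseClause() {
        String t = next();
        if (t.equals("(")) {
            String inner = parseQuery();
            expect(")");
            return inner;
        }
        if (t.equals("ieml")) {
            // Assumed syntax: collect everything up to the closing parenthesis verbatim.
            expect("(");
            StringBuilder expr = new StringBuilder();
            while (pos < tokens.size() && !tokens.get(pos).equals(")")) {
                if (expr.length() > 0) expr.append(' ');
                expr.append(next());
            }
            expect(")");
            return "IEML{" + expr + "}";
        }
        return "TERM{" + t + "}";
    }

    private String next() {
        if (pos >= tokens.size()) throw new IllegalArgumentException("unexpected end of query");
        return tokens.get(pos++);
    }

    private void expect(String s) {
        String t = next();
        if (!t.equals(s)) throw new IllegalArgumentException("expected " + s + " but got " + t);
    }
}
```

In a real parser the IEML{...} and TERM{...} strings would presumably be replaced by code that builds Lucene Query objects against the relevant field; that part is left out here because it depends on how the IEML metadata is indexed.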




[jira] Commented: (LUCENE-1522) another highlighter

2009-03-12 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12681249#action_12681249
 ] 

Michael McCandless commented on LUCENE-1522:



This highlighter looks very interesting!  I love the colored tags, and
the fast performance on large docs, and the extensive unit tests.

When I applied the patch to current trunk, I saw some tests failing,
eg:

{code}
[junit] Testcase: 
test1PhraseLongMVB(org.apache.lucene.search.highlight2.FieldPhraseListTest):
  FAILED
[junit] expected:sppd(1.0)((8[8,93])) but 
was:sppd(1.0)((8[7,92]))
[junit] junit.framework.ComparisonFailure: 
expected:sppd(1.0)((8[8,93])) but was:sppd(1.0)((8[7,92]))
[junit] at 
org.apache.lucene.search.highlight2.FieldPhraseListTest.test1PhraseLongMVB(FieldPhraseListTest.java:175)
{code}

Is this approach guaranteed to only highlight term occurrences that
actually contribute to the document match?  Can it handle all /
arbitrary Query subclasses?  How does it score fragments?

I also like that you first generate hits in the document, and from
those hits you generate fragments (if I'm reading the code correctly);
this is a nicely scalable approach.
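
A rough sketch of that hits-then-fragments flow, as I understand it from the description. The names (Hit, Fragment, buildFragments) and the greedy windowing rule are hypothetical and are not taken from the patch; hits are assumed to be sorted by start offset and non-overlapping.

```java
import java.util.ArrayList;
import java.util.List;

class FragmentSketch {
    /** A matched term occurrence, as character offsets into the field text. */
    static final class Hit {
        final int start, end;
        Hit(int start, int end) { this.start = start; this.end = end; }
    }

    /** A window of text plus the number of hits that fell inside it. */
    static final class Fragment {
        final int start, end, hitCount;
        Fragment(int start, int end, int hitCount) {
            this.start = start; this.end = end; this.hitCount = hitCount;
        }
    }

    /** Greedily packs sorted hits into windows of at most fragCharSize characters. */
    static List<Fragment> buildFragments(List<Hit> hits, int fragCharSize, int textLength) {
        List<Fragment> out = new ArrayList<>();
        int i = 0;
        while (i < hits.size()) {
            int fragStart = hits.get(i).start;
            int fragEnd = Math.min(fragStart + fragCharSize, textLength);
            int count = 0;
            // Absorb every following hit that fits entirely inside the window.
            while (i < hits.size() && hits.get(i).end <= fragEnd) {
                count++;
                i++;
            }
            if (count == 0) {
                // The hit is longer than the window: give it a fragment of its own.
                fragEnd = hits.get(i).end;
                count = 1;
                i++;
            }
            out.add(new Fragment(fragStart, fragEnd, count));
        }
        return out;
    }
}
```

Because the work is proportional to the number of hits rather than to the document length, a scheme like this stays cheap on large documents, which is presumably what makes the approach scale.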


 another highlighter
 ---

 Key: LUCENE-1522
 URL: https://issues.apache.org/jira/browse/LUCENE-1522
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/highlighter
Reporter: Koji Sekiguchi
Priority: Minor
 Attachments: colored-tag-sample.png, LUCENE-1522.patch, 
 LUCENE-1522.patch


 I've written this highlighter for my project to support bi-gram token streams 
 (general token streams, e.g. WhitespaceTokenizer, are also supported; see the test 
 code in the patch). The idea was inherited from my previous project with my 
 colleague and LUCENE-644. This approach needs highlight fields to be 
 TermVector.WITH_POSITIONS_OFFSETS, but is fast and can support N-grams. This 
 depends on LUCENE-1448 to get refined term offsets.
 usage:
 {code:java}
 TopDocs docs = searcher.search( query, 10 );
 Highlighter h = new Highlighter();
 FieldQuery fq = h.getFieldQuery( query );
 for( ScoreDoc scoreDoc : docs.scoreDocs ){
   // fieldName="content", fragCharSize=100, numFragments=3
   String[] fragments = h.getBestFragments( fq, reader, scoreDoc.doc, "content", 100, 3 );
   if( fragments != null ){
     for( String fragment : fragments )
       System.out.println( fragment );
   }
 }
 {code}
 features:
 - fast for large docs
 - supports not only whitespace-based token stream, but also fixed size 
 N-gram (e.g. (2,2), not (1,3)) (can solve LUCENE-1489)
 - supports PhraseQuery, phrase-unit highlighting with slops
 {noformat}
 q="w1 w2"
 <b>w1 w2</b>
 ---
 q="w1 w2"~1
 <b>w1</b> w3 <b>w2</b> w3 <b>w1 w2</b>
 {noformat}
 - highlight fields need to be TermVector.WITH_POSITIONS_OFFSETS
 - easy to apply patch due to independent package (contrib/highlighter2)
 - uses Java 1.5
 - uses query boost to score fragments (currently doesn't use idf, but it 
 should be possible)
 - pluggable FragListBuilder
 - pluggable FragmentsBuilder
 to do:
 - term positions can be unnecessary when phraseHighlight==false
 - collect performance numbers
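
 The phrase-with-slop example in the feature list can be illustrated with a simplified sketch. It assumes a two-term, in-order phrase where the slop is just the number of tokens allowed between the two terms; Lucene's real slop semantics are more general, and the names SlopSketch and matchPhrase are hypothetical rather than part of the patch.

```java
import java.util.ArrayList;
import java.util.List;

class SlopSketch {
    /**
     * Returns {firstPos, secondPos} pairs where `second` follows `first`
     * with at most `slop` other tokens in between.
     */
    static List<int[]> matchPhrase(String[] tokens, String first, String second, int slop) {
        List<int[]> matches = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            if (!tokens[i].equals(first)) continue;
            // The second term may sit at most slop + 1 positions after the first.
            int limit = Math.min(tokens.length - 1, i + 1 + slop);
            for (int j = i + 1; j <= limit; j++) {
                if (tokens[j].equals(second)) {
                    matches.add(new int[] { i, j });
                    break;
                }
            }
        }
        return matches;
    }
}
```

 On the token sequence w1 w3 w2 w3 w1 w2, a slop of 1 yields the pairs (0, 2) and (4, 5), matching the highlighted spans in the example, while a slop of 0 yields only (4, 5).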

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.





[jira] Commented: (LUCENE-1522) another highlighter

2009-03-12 Thread Koji Sekiguchi (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12681264#action_12681264
 ] 

Koji Sekiguchi commented on LUCENE-1522:


{quote}
This highlighter looks very interesting! I love the colored tags, and
the fast performance on large docs, and the extensive unit tests.
{quote}

Thank you for paying attention to this issue, Mike!

bq. When I applied the patch to current trunk, I see some tests failing,

Note that this issue depends on LUCENE-1448, so you need to apply LUCENE-1448.patch 
first, then apply LUCENE-1522.patch.

{noformat}
# To apply LUCENE-1448.patch, check out revision 713975!!!
$ svn co -r713975 http://svn.apache.org/repos/asf/lucene/java/trunk
$ cd trunk
$ patch -p0 < LUCENE-1448.patch
$ patch -p0 < LUCENE-1522.patch
{noformat}

I'll post a comment later for the rest of your questions. :)




RE: FIPS compliance?

2009-03-12 Thread Digy
Or a home made md5 (without using
System.Security.Cryptography.MD5/java.security.MessageDigest) ?

DIGY

-Original Message-
From: Michael McCandless [mailto:luc...@mikemccandless.com] 
Sent: Wednesday, March 11, 2009 11:08 PM
To: java-dev@lucene.apache.org
Subject: Re: FIPS compliance?


So... I think this is a .NET specific issue at this point?

Or.. if we could find some common digest that is *not* used for crypto  
(so .NET won't reject it as insecure), but still has low risk of  
collision, that seems best.  Maybe just CRC32?

Mike
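
To make the trade-off concrete, here is a small sketch of deriving a lock file name from a directory path with either digest. It is not the FSDirectory code: the class name LockIdSketch, the "lucene-" prefix, and the hex formatting are assumptions for illustration. Only the JDK calls (java.security.MessageDigest and java.util.zip.CRC32) are real.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.zip.CRC32;

class LockIdSketch {
    /** Lock ID from MD5: a 128-bit crypto digest that a FIPS policy may reject. */
    static String md5LockId(String dirName) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(dirName.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder("lucene-");
            for (byte b : digest) {
                sb.append(String.format("%02x", b & 0xff));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // Roughly the failure mode discussed in this thread: the runtime refuses MD5.
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    /** Lock ID from CRC32: a 32-bit checksum that is not a crypto algorithm at all. */
    static String crc32LockId(String dirName) {
        CRC32 crc = new CRC32();
        crc.update(dirName.getBytes(StandardCharsets.UTF_8));
        return "lucene-" + String.format("%08x", crc.getValue());
    }
}
```

One caveat on the CRC32 idea: with only 32 bits, accidental collisions become likely far sooner than with a 128-bit digest, so whether the risk counts as "low" depends on how many distinct directories might share a lock location.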

DIGY wrote:

 Thanks Mike.

 DIGY

 -Original Message-
 From: Michael McCandless [mailto:luc...@mikemccandless.com]
 Sent: Tuesday, March 10, 2009 10:43 PM
 To: java-dev@lucene.apache.org
 Subject: Re: FIPS compliance?


 Interesting... I wonder if any Java runtime ever rejects a
 known-insecure crypto digest algorithm.  I don't think that's come up on
 java-user/dev that I've seen.

 It's certainly possible, but it should be rare because we now simply
 default to write.lock in the index directory (getLockID is only used if
 you override the LockFactory).

 Really we want a digest that does not need to be secure here, but I don't
 think Java APIs differentiate.  (We don't care if someone can reverse the
 mapping between lock ID and directory name; we simply want low risk of
 collision).

 Do .NET APIs offer a "give me a digest and it doesn't have to be
 secure" option?  If so that's probably the best solution.

 That said... we could change this to SHA-1, to be safe, but then in
 another few years we'd probably be having this discussion again when
 SHA-1 is fully cracked ;)

 I don't think there's a back-compat issue since it's used only for the
 naming of the lock file, which is transient.
 Mike

 de...@ttnet wrote:

 Hi All,

 There is a discussion about FIPS compliance (using the MD5 hash algorithm
 in FSDirectory) in Lucene.Net.

 http://mail-archives.apache.org/mod_mbox/incubator-lucene-net-user/200903.mbox/%3c006101c99f4e$7bdd3590$7397a0...@rendelmann@gmx.net%3e
 https://issues.apache.org/jira/browse/LUCENENET-175

 In fact, if the system-wide policy
 (HKLM\System\CurrentControlSet\Control\Lsa\FIPSAlgorithmPolicy) is set,
 then trying to use MD5 (which is not FIPS compliant) to compute the hash
 causes an exception.

 So, is a change in Lucene possible, to use SHA1 in computing the hash for
 FIPS compliance (I can see the backward compatibility problems)? Or
 is this problem specific to Lucene.Net?

 What do you think?

 DIGY








[jira] Created: (LUCENE-1559) Highlighting not working in some instances even though indexsearcher returns result.

2009-03-12 Thread Amin Mohammed-Coleman (JIRA)
Highlighting not working in some instances even though indexsearcher returns 
result.


 Key: LUCENE-1559
 URL: https://issues.apache.org/jira/browse/LUCENE-1559
 Project: Lucene - Java
  Issue Type: Bug
Affects Versions: 2.4
 Environment: Mac OS 1.5
Eclipse 3.4

Reporter: Amin Mohammed-Coleman


In some instances highlighting does not return a result.  However, when you use 
a different term for the same document you get results.

Please see the attached test case and template file.




[jira] Updated: (LUCENE-1559) Highlighting not working in some instances even though indexsearcher returns result.

2009-03-12 Thread Amin Mohammed-Coleman (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Amin Mohammed-Coleman updated LUCENE-1559:
--

Attachment: HighLightingSummaryTest.java
AJiA CH 02.doc




[jira] Commented: (LUCENE-1522) another highlighter

2009-03-12 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12681282#action_12681282
 ] 

Michael McCandless commented on LUCENE-1522:


bq. Note that this issue depends on LUCENE-1448

Woops, right I had skipped that step.




[jira] Commented: (LUCENE-1559) Highlighting not working in some instances even though indexsearcher returns result.

2009-03-12 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12681287#action_12681287
 ] 

Mark Harwood commented on LUCENE-1559:
--

Sorry to be picky, but can you submit a self-contained test with no external 
dependencies other than Lucene+Highlighter+JUnit?

I don't want POI versions to be a factor here.

Cheers
Mark




Re: FIPS compliance?

2009-03-12 Thread Michael McCandless


That'd work too.

In which case, I think we should simply leave Lucene using the builtin MD5
(since JREs don't seem to reject it as insecure).


Mike

Digy wrote:


Or a home made md5 (without using
System.Security.Cryptography.MD5/java.security.MessageDigest) ?

DIGY



[jira] Updated: (LUCENE-1458) Further steps towards flexible indexing

2009-03-12 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-1458:
---

Fix Version/s: (was: 2.9)

Clearing fix version.

 Further steps towards flexible indexing
 ---

 Key: LUCENE-1458
 URL: https://issues.apache.org/jira/browse/LUCENE-1458
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Index
Affects Versions: 2.9
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
 Attachments: LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, 
 LUCENE-1458.patch, LUCENE-1458.patch


 I attached a very rough checkpoint of my current patch, to get early
 feedback.  All tests pass, though back compat tests don't pass due to
 changes to package-private APIs plus certain bugs in tests that
 happened to work (eg calling TermPositions.nextPosition() too many times,
 which the new API asserts against).
 [Aside: I think, when we commit changes to package-private APIs such
 that back-compat tests don't pass, we could go back, make a branch on
 the back-compat tag, commit changes to the tests to use the new
 package private APIs on that branch, then fix nightly build to use the
 tip of that branch?]
 There's still plenty to do before this is committable! This is a
 rather large change:
   * Switches to a new more efficient terms dict format.  This still
 uses tii/tis files, but the tii only stores term and long offset
 (not a TermInfo).  At seek points, tis encodes term and freq/prox
 offsets absolutely instead of with deltas.  Also, tis/tii
 are structured by field, so we don't have to record field number
 in every term.
 .
 On first 1 M docs of Wikipedia, tii file is 36% smaller (0.99 MB
 -> 0.64 MB) and tis file is 9% smaller (75.5 MB -> 68.5 MB).
 .
 RAM usage when loading terms dict index is significantly less
 since we only load an array of offsets and an array of String (no
 more TermInfo array).  It should be faster to init too.
 .
 This part is basically done.
   * Introduces modular reader codec that strongly decouples terms dict
 from docs/positions readers.  EG there is no more TermInfo used
 when reading the new format.
 .
 There's nice symmetry now between reading and writing in the codec
 chain -- the current docs/prox format is captured in:
 {code}
 FormatPostingsTermsDictWriter/Reader
 FormatPostingsDocsWriter/Reader (.frq file) and
 FormatPostingsPositionsWriter/Reader (.prx file).
 {code}
 This part is basically done.
   * Introduces a new flex API for iterating through the fields,
 terms, docs and positions:
 {code}
 FieldProducer -> TermsEnum -> DocsEnum -> PostingsEnum
 {code}
 This replaces TermEnum/Docs/Positions.  SegmentReader emulates the
 old API on top of the new API to keep back-compat.
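
 The "array of offsets and an array of String" terms index described above could look roughly like the sketch below. The class name TermsIndexSketch and the method seekOffset are hypothetical and are not taken from the patch, and plain String.compareTo ordering is assumed for the term sort order, which may differ from the order Lucene actually uses.

```java
import java.util.Arrays;

class TermsIndexSketch {
    private final String[] indexTerms; // sampled terms (every Nth term), in sorted order
    private final long[] offsets;      // file offset of each sampled term in the main terms file

    TermsIndexSketch(String[] indexTerms, long[] offsets) {
        if (indexTerms.length != offsets.length) {
            throw new IllegalArgumentException("arrays must be parallel");
        }
        this.indexTerms = indexTerms;
        this.offsets = offsets;
    }

    /**
     * Returns the offset to start scanning from when looking for target:
     * the offset of the greatest sampled term that is <= target,
     * or -1 if target sorts before every sampled term.
     */
    long seekOffset(String target) {
        int pos = Arrays.binarySearch(indexTerms, target);
        if (pos < 0) {
            // Not found: binarySearch returns -(insertionPoint) - 1; step back one slot.
            pos = -pos - 2;
        }
        return pos < 0 ? -1L : offsets[pos];
    }
}
```

 Compared with an array of per-term info objects, two parallel arrays carry one long per sampled term and no per-entry object header, which is consistent with the lower RAM usage claimed above.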
 
 Next steps:
   * Plug in new codecs (pulsing, pfor) to exercise the modularity /
 fix any hidden assumptions.
   * Expose new API out of IndexReader, deprecate old API but emulate
 old API on top of new one, switch all core/contrib users to the
 new API.
   * Maybe switch to AttributeSources as the base class for TermsEnum,
 DocsEnum, PostingsEnum -- this would give readers API flexibility
 (not just index-file-format flexibility).  EG if someone wanted
 to store payload at the term-doc level instead of
 term-doc-position level, you could just add a new attribute.
   * Test performance and iterate.
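
 As an illustration of the seek-point encoding mentioned in the terms dict bullet (offsets stored absolutely at seek points, as deltas elsewhere), here is a minimal sketch. The class and method names are hypothetical and the real file format is certainly more involved; the point is only that decoding can begin at any seek point without reading earlier entries.

```java
class SeekPointEncoding {
    /** Stores every interval-th offset absolutely and the others as deltas from the previous one. */
    static long[] encode(long[] offsets, int interval) {
        long[] out = new long[offsets.length];
        for (int i = 0; i < offsets.length; i++) {
            out[i] = (i % interval == 0) ? offsets[i] : offsets[i] - offsets[i - 1];
        }
        return out;
    }

    /** Recovers the offset at index by starting from the nearest preceding seek point. */
    static long decodeAt(long[] encoded, int interval, int index) {
        int seek = (index / interval) * interval; // nearest seek point at or before index
        long value = encoded[seek];               // absolute value, no history needed
        for (int i = seek + 1; i <= index; i++) {
            value += encoded[i];                  // apply the deltas up to index
        }
        return value;
    }
}
```

 With a pure delta encoding, a reader that jumps into the middle of the file would need the running total from everything before it; making the seek points absolute removes that dependency at the cost of a few larger values.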




[jira] Updated: (LUCENE-1522) another highlighter

2009-03-12 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-1522:
---

Fix Version/s: 2.9

 another highlighter
 ---

 Key: LUCENE-1522
 URL: https://issues.apache.org/jira/browse/LUCENE-1522
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/highlighter
Reporter: Koji Sekiguchi
Assignee: Michael McCandless
Priority: Minor
 Fix For: 2.9

 Attachments: colored-tag-sample.png, LUCENE-1522.patch, 
 LUCENE-1522.patch





[jira] Assigned: (LUCENE-1522) another highlighter

2009-03-12 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless reassigned LUCENE-1522:
--

Assignee: Michael McCandless




[jira] Commented: (LUCENE-979) Remove Deprecated Benchmarking Utilities from contrib/benchmark

2009-03-12 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12681308#action_12681308
 ] 

Grant Ingersoll commented on LUCENE-979:


I see no reason why it can't happen w/ any release.  Contribs don't need to 
have the same back compat, and I seriously doubt anyone is using the old way.

 Remove Deprecated Benchmarking Utilities from contrib/benchmark
 ---

 Key: LUCENE-979
 URL: https://issues.apache.org/jira/browse/LUCENE-979
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/benchmark
Reporter: Grant Ingersoll
Priority: Minor
 Fix For: 3.0


 The old Benchmark utilities in contrib/benchmark have been deprecated and 
 should be removed in 2.9 of Lucene.




[jira] Updated: (LUCENE-979) Remove Deprecated Benchmarking Utilities from contrib/benchmark

2009-03-12 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-979:
--

Fix Version/s: (was: 3.0)
   2.9

OK, moving back to 2.9.

 Remove Deprecated Benchmarking Utilities from contrib/benchmark
 ---

 Key: LUCENE-979
 URL: https://issues.apache.org/jira/browse/LUCENE-979
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/benchmark
Reporter: Grant Ingersoll
Priority: Minor
 Fix For: 2.9


 The old Benchmark utilities in contrib/benchmark have been deprecated and 
 should be removed in 2.9 of Lucene.




[jira] Commented: (LUCENE-1559) Highlighting not working in some instances even though indexsearcher returns result.

2009-03-12 Thread Amin Mohammed-Coleman (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12681310#action_12681310
 ] 

Amin Mohammed-Coleman commented on LUCENE-1559:
---

This problem occurs when using this exact document and another document, which is 
a PDF.  I'm not sure the test will be valid if I just use a normal test file.  
The version of POI I am currently using is:

3.1-Final
poi-scratchpad-3.1-final

I can try to extract the test with no other libraries but I'm not sure if it 
will work.  

 Highlighting not working in some instances even though indexsearcher returns 
 result.
 

 Key: LUCENE-1559
 URL: https://issues.apache.org/jira/browse/LUCENE-1559
 Project: Lucene - Java
  Issue Type: Bug
Affects Versions: 2.4
 Environment: Mac OS 1.5
 Eclipse 3.4
Reporter: Amin Mohammed-Coleman
 Attachments: AJiA CH 02.doc, HighLightingSummaryTest.java


 In some instances highlighting does not return a result.  However, when you 
 use a different term for the same document you get results.
 Please see the attached test case and template file.




[jira] Updated: (LUCENE-1559) Highlighting not working in some instances even though indexsearcher returns result.

2009-03-12 Thread Amin Mohammed-Coleman (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Amin Mohammed-Coleman updated LUCENE-1559:
--

Attachment: HighLightingSummaryTest(2).java

Updated test case with no external dependencies

 Highlighting not working in some instances even though indexsearcher returns 
 result.
 

 Key: LUCENE-1559
 URL: https://issues.apache.org/jira/browse/LUCENE-1559
 Project: Lucene - Java
  Issue Type: Bug
Affects Versions: 2.4
 Environment: Mac OS 1.5
 Eclipse 3.4
Reporter: Amin Mohammed-Coleman
 Attachments: AJiA CH 02.doc, HighLightingSummaryTest(2).java, 
 HighLightingSummaryTest.java


 In some instances highlighting does not return a result.  However, when you 
 use a different term for the same document you get results.
 Please see the attached test case and template file.




[jira] Issue Comment Edited: (LUCENE-1559) Highlighting not working in some instances even though indexsearcher returns result.

2009-03-12 Thread Amin Mohammed-Coleman (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12681315#action_12681315
 ] 

Amin Mohammed-Coleman edited comment on LUCENE-1559 at 3/12/09 6:43 AM:


Updated test case with no external dependencies

 HighLightingSummaryTest(2).java

  was (Author: amin):
Updated test case with no external dependencies
  



[jira] Commented: (LUCENE-1559) Highlighting not working in some instances even though indexsearcher returns result.

2009-03-12 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12681323#action_12681323
 ] 

Mark Harwood commented on LUCENE-1559:
--

Your code still imports POI and is now importing a .doc file without parsing 
it, producing garbage.

You'll need to supply an example JUnit test which illustrates this problem with 
plain text before we can look at it.

You should be able to turn the .doc into text at your end using POI and then 
supply the file.

Are you sure there isn't a problem with POI failing to parse the file 
correctly? 





[jira] Commented: (LUCENE-1559) Highlighting not working in some instances even though indexsearcher returns result.

2009-03-12 Thread Amin Mohammed-Coleman (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12681329#action_12681329
 ] 

Amin Mohammed-Coleman commented on LUCENE-1559:
---

I don't think there is an error with POI parsing the document, as a summary is 
generated when I use the term "aspectj".  I will modify the code to use an RTF 
file and see if this problem still occurs.




[jira] Issue Comment Edited: (LUCENE-1559) Highlighting not working in some instances even though indexsearcher returns result.

2009-03-12 Thread Amin Mohammed-Coleman (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12681329#action_12681329
 ] 

Amin Mohammed-Coleman edited comment on LUCENE-1559 at 3/12/09 7:34 AM:


OK.  So it looks like there is an issue when POI extracts the text.  

I don't understand this, to be honest.  When indexing, I am obviously indexing 
the Word document, and when I perform the search with the term "document" I 
get the correct result.  

It seems strange that I cannot have the term "document" in the file.  This also 
happens for a PDF file, which makes it even more confusing.

  was (Author: amin):
I don't think there is an error with POI parsing the document, as a summary is 
generated when I use the term "aspectj".  I will modify the code to use an RTF 
file and see if this problem still occurs.
  



[jira] Commented: (LUCENE-1559) Highlighting not working in some instances even though indexsearcher returns result.

2009-03-12 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12681336#action_12681336
 ] 

Mark Harwood commented on LUCENE-1559:
--

Can I close this then as it appears to be an issue with your parser, not Lucene?




[jira] Commented: (LUCENE-1559) Highlighting not working in some instances even though indexsearcher returns result.

2009-03-12 Thread Amin Mohammed-Coleman (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12681339#action_12681339
 ] 

Amin Mohammed-Coleman commented on LUCENE-1559:
---

Yep.  I'm still confused: I don't understand how Lucene indexes the term 
"document" and I can perform the search.  The content of the file is stored in 
the document compressed (I'm not reparsing the file for highlighting).  The 
term must be in the Lucene document, otherwise I would not be able to find 
the document from the search. 

Sorry...I don't know what I should do at this stage.  As I mentioned earlier, 
it's also happening with a certain PDF document (unless something is being 
chopped off during compression).






[jira] Commented: (LUCENE-1559) Highlighting not working in some instances even though indexsearcher returns result.

2009-03-12 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12681344#action_12681344
 ] 

Mark Harwood commented on LUCENE-1559:
--

> Sorry...I don't know what I should do at this stage

Give us a JUnit example of your problem code when working with plain text (not 
PDF, Word or .doc) that clearly demonstrates where Lucene fails to index, search 
or highlight this text correctly.




[jira] Commented: (LUCENE-1559) Highlighting not working in some instances even though indexsearcher returns result.

2009-03-12 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12681347#action_12681347
 ] 

Uwe Schindler commented on LUCENE-1559:
---

The problems with POI often come from the fact that POI does not filter the 
characters it outputs and sometimes even generates char values that do not 
conform to Unicode (0xd000). E.g., you sometimes get non-breaking spaces instead 
of normal spaces, or other things. Depending on the Lucene Analyzer you use, 
there may be problems. E.g., TIKA uses a filter that maps all incorrect 
characters coming from POI according to the allowed chars in XML (because it 
generates XHTML from the docs that can be indexed using TikaAnalyzer).
I think your problem is invalid plain text content coming from POI.
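
A minimal sketch of the kind of clean-up described above, written against plain
Java strings rather than POI or TIKA. The names TextSanitizer, isXmlChar and
sanitize are hypothetical, and replacing every rejected character with a single
space is an assumption made here so that character offsets stay unchanged; it is
not a description of what TIKA's filter actually does.

```java
final class TextSanitizer {
    private TextSanitizer() {}

    /** True if the code point is allowed by the XML 1.0 "Char" production. */
    static boolean isXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /**
     * Replaces non-breaking spaces (U+00A0) and any code point that is not a
     * legal XML 1.0 character (control chars, lone surrogates, U+FFFE/U+FFFF)
     * with a plain space. Everything else is copied through unchanged.
     */
    static String sanitize(String raw) {
        StringBuilder out = new StringBuilder(raw.length());
        for (int i = 0; i < raw.length(); ) {
            int cp = raw.codePointAt(i);   // a lone surrogate comes back as itself
            i += Character.charCount(cp);
            if (cp == 0x00A0 || !isXmlChar(cp)) {
                out.append(' ');
            } else {
                out.appendCodePoint(cp);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String extracted = "AspectJ" + (char) 0x00A0 + "in" + (char) 0x0001 + "Action";
        System.out.println(sanitize(extracted));
    }
}
```

Running the extracted text through something like sanitize() before both
indexing and storing it is one way to make sure the analyzer and the
highlighter see the same characters.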




[jira] Commented: (LUCENE-1559) Highlighting not working in some instances even though indexsearcher returns result.

2009-03-12 Thread Amin Mohammed-Coleman (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12681351#action_12681351
 ] 

Amin Mohammed-Coleman commented on LUCENE-1559:
---

Seems to make sense.   I am using the StandardAnalyzer when indexing.  I can 
understand that there may be an issue with POI; my only concern is how come 
Lucene managed to index the term "document" in the first place?  The term 
"document" is in the content of the Word document.  If there was a problem as 
you mentioned, then I would expect that the document would not be indexed.

I am toying with the idea of using TIKA, however I can't find an example from 
which I could work.  I know the new Lucene in Action book uses TIKA; does 
anyone have some sample code that I could look at?

I presume I should bring this up on the Lucene mailing list rather than adding 
to the JIRA.

Cheers




Re: extending the query parser

2009-03-12 Thread Candide Kemmler


On 11 Mar 2009, at 23:21, Earwin Burrfoot wrote:

> Take ANTLR and roll your own query parser from scratch? It's pretty easy.




Hi Earwin,

That would be fantastic, since our parser is already specified as an
ANTLR grammar. However, I can't seem to find an ANTLR grammar in the
Lucene source. Obviously what we want is to extend the existing query
support, not just create a new one from scratch.


Regards,

Candide




[jira] Updated: (LUCENE-1559) Highlighting not working in some instances even though indexsearcher returns result.

2009-03-12 Thread Amin Mohammed-Coleman (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Amin Mohammed-Coleman updated LUCENE-1559:
--

Attachment: HighLightingSummaryTestV3.java
fileToSearch.txt

Updated test case with no external dependencies except for lucene and junit.  

 Highlighting not working in some instances even though indexsearcher returns 
 result.
 

 Key: LUCENE-1559
 URL: https://issues.apache.org/jira/browse/LUCENE-1559
 Project: Lucene - Java
  Issue Type: Bug
Affects Versions: 2.4
 Environment: Mac OS 1.5
 Eclipse 3.4
Reporter: Amin Mohammed-Coleman
 Attachments: AJiA CH 02.doc, fileToSearch.txt, 
 HighLightingSummaryTest(2).java, HighLightingSummaryTest.java, 
 HighLightingSummaryTestV3.java


 In some instances highlighting does not return a result.  However, when you 
 use a different term for the same document you get results.
 Please see the attached test case and template file.




[jira] Updated: (LUCENE-1539) Improve Benchmark

2009-03-12 Thread Jason Rutherglen (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Rutherglen updated LUCENE-1539:
-

Attachment: LUCENE-1539.patch

* Added deletepercent.alg as an example of these tasks
* CommitIndexTask commits an IndexWriter using a commit name
* OpenReaderTask opens a specific commit point by name
* FlushReaderTask flushes a reader using a commit name
* DeleteByPercentTask deletes a percentage of reader documents


 Improve Benchmark
 -

 Key: LUCENE-1539
 URL: https://issues.apache.org/jira/browse/LUCENE-1539
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/benchmark
Affects Versions: 2.4
Reporter: Jason Rutherglen
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1539.patch, LUCENE-1539.patch, sortBench2.py, 
 sortCollate2.py

   Original Estimate: 336h
  Remaining Estimate: 336h

 Benchmark can be improved by incorporating recent suggestions posted
 on java-dev. M. McCandless' Python scripts that execute multiple
 rounds of tests can either be incorporated into the codebase or
 converted to Java.




Re: extending the query parser

2009-03-12 Thread Earwin Burrfoot
On Thu, Mar 12, 2009 at 21:16, Candide Kemmler cand...@palacehotel.org wrote:

 On 11 Mar 2009, at 23:21, Earwin Burrfoot wrote:

 Take ANTLR and roll your own query parser from scratch? It's pretty easy.


 Hi Earwin,

 That would be fantastic, since our parser is already specified as an ANTLR
 grammar. However, I can't seem to find an antlr grammar in the lucene
 source. Obviously what we want is to extend the existing query support, not
 just create a new one from scratch.

Lucene's default QueryParser uses JavaCC, if I'm not mistaken, and I
don't see any way to extend it except by patching it and using the
modified version.
If you want to explore some existing alternatives, Mark has an article
here: http://www.lucidimagination.com/blog/2009/02/22/exploring-query-parsers/
My personal opinion is that the default parser is only suitable for
something that isn't going to see real-world use.
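
To give a feel for what rolling your own parser involves, here is a small
hand-written sketch in plain Java for queries of the shape
"... AND (some_expression) AND ...". It uses neither ANTLR nor JavaCC and is not
how Lucene's QueryParser works. The class name TinyQueryParser and the
TERM[...] / CUSTOM[...] output strings are hypothetical placeholders; in a real
parser the CUSTOM branch is where an IEML evaluator would be called and Lucene
Query objects would be built. Nested parentheses are deliberately not handled.

```java
import java.util.ArrayList;
import java.util.List;

final class TinyQueryParser {
    private final String input;
    private int pos = 0;

    private TinyQueryParser(String input) {
        this.input = input;
    }

    /** Parses "clause AND clause AND ..." and returns the clauses in order. */
    static List<String> parse(String query) {
        TinyQueryParser p = new TinyQueryParser(query);
        List<String> clauses = new ArrayList<>();
        clauses.add(p.clause());
        while (p.tryKeyword("AND")) {
            clauses.add(p.clause());
        }
        p.skipSpaces();
        if (p.pos != p.input.length()) {
            throw new IllegalArgumentException("Unexpected input at " + p.pos);
        }
        return clauses;
    }

    /** A clause is either "( raw text )" or a run of non-space characters. */
    private String clause() {
        skipSpaces();
        if (pos < input.length() && input.charAt(pos) == '(') {
            int close = input.indexOf(')', pos);
            if (close < 0) {
                throw new IllegalArgumentException("Unclosed ( at " + pos);
            }
            String expr = input.substring(pos + 1, close).trim();
            pos = close + 1;
            return "CUSTOM[" + expr + "]";   // hand-off point for a custom evaluator
        }
        int start = pos;
        while (pos < input.length()
                && !Character.isWhitespace(input.charAt(pos))
                && input.charAt(pos) != '(') {
            pos++;
        }
        if (start == pos) {
            throw new IllegalArgumentException("Expected a clause at " + pos);
        }
        return "TERM[" + input.substring(start, pos) + "]";
    }

    /** Consumes the keyword if it appears next as a whole word. */
    private boolean tryKeyword(String kw) {
        skipSpaces();
        if (input.startsWith(kw, pos)) {
            int end = pos + kw.length();
            if (end == input.length()
                    || Character.isWhitespace(input.charAt(end))
                    || input.charAt(end) == '(') {
                pos = end;
                return true;
            }
        }
        return false;
    }

    private void skipSpaces() {
        while (pos < input.length() && Character.isWhitespace(input.charAt(pos))) {
            pos++;
        }
    }

    public static void main(String[] args) {
        System.out.println(parse("title:lucene AND (some_IEML_expression) AND body:parser"));
    }
}
```

An ANTLR grammar would replace the hand-written clause() and tryKeyword()
methods with generated code, but the overall shape stays the same: recognise
the custom clause, hand its text to your own code, and combine the results.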







-- 
Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
ICQ: 104465785




Nightly log files

2009-03-12 Thread Grant Ingersoll
The log files for the nightly checkout are now stored in
/tmp/lucene-nightly.log


The crontab now looks like:
03 6 * * * /home/gsingers/bin/exportLuceneDocs.sh > /tmp/lucene-nightly.log 2>&1


Thanks to Otis for pointing out that the nightly was not checking out.




Re: Nightly log files

2009-03-12 Thread Michael McCandless


Can you update the wiki with that?

http://wiki.apache.org/lucene-java/HowToUpdateTheWebsite

Thanks.

Mike



Re: extending the query parser

2009-03-12 Thread Candide Kemmler

OK great! I'll see what I can do from here.

Thanks!




[jira] Commented: (LUCENE-1559) Highlighting not working in some instances even though indexsearcher returns result.

2009-03-12 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12681466#action_12681466
 ] 

Mark Harwood commented on LUCENE-1559:
--

I ran a quick test and I don't think I could see "document" in the 
Token.termText() of any tokens in the TokenStream you provide to the 
Highlighter.

It's late and I need to be elsewhere, but if you have time to pursue this, check 
that the above statement is true.
If so, check that the body text retrieved from Document.get("body") in the 
search results is the same as the String you store at index time (just in case 
the act of storing/retrieving has altered the text somehow).

Will look into this more later.




[jira] Commented: (LUCENE-1559) Highlighting not working in some instances even though indexsearcher returns result.

2009-03-12 Thread Amin Mohammed-Coleman (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12681468#action_12681468
 ] 

Amin Mohammed-Coleman commented on LUCENE-1559:
---

Hi Mark

Thanks for looking into this, your help is much appreciated.  I compared the 
body of the file (value to be indexed) against doc.get("body") and they are 
both the same.

assertEquals(bodyToBeStored, bodyText);


Cheers




[jira] Issue Comment Edited: (LUCENE-1559) Highlighting not working in some instances even though indexsearcher returns result.

2009-03-12 Thread Amin Mohammed-Coleman (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12681468#action_12681468
 ] 

Amin Mohammed-Coleman edited comment on LUCENE-1559 at 3/12/09 1:22 PM:


Hi Mark

Thanks for looking into this, your help is much appreciated.  I compared the 
body of the file (value to be indexed) against doc.get("body") and they are 
both the same. 

assertEquals(bodyToBeStored, bodyText);

Also,

tokenText = text.substring(startOffset, endOffset); (line 240 of Highlighter) 
doesn't return "document"; all I get is "documentation".

Cheers

  was (Author: amin):
Hi Mark

Thanks for looking into this, your help is much appreciated.  I compared the 
body of the file (value to be indexed) against doc.get("body") and they are 
both the same.

assertEquals(bodyToBeStored, bodyText);


Cheers
  



[jira] Commented: (LUCENE-1559) Highlighting not working in some instances even though indexsearcher returns result.

2009-03-12 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12681507#action_12681507
 ] 

Mark Harwood commented on LUCENE-1559:
--

Ah. Try setting this:

highlighter.setMaxDocCharsToAnalyze(Integer.MAX_VALUE);
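
To illustrate why raising the limit can make a missing highlight appear, here is a minimal, self-contained sketch. It is not Lucene's Highlighter code: the class TruncatingHighlighterSketch, its highlight method, and the whitespace tokenization are assumptions made purely for illustration. Only the setter name setMaxDocCharsToAnalyze is taken from the thread. The sketch stops looking at tokens once their end offset passes the limit, so a term that occurs only late in the text is never seen.

```java
// Hypothetical illustration only; this is NOT Lucene's Highlighter.
class TruncatingHighlighterSketch {
    private int maxDocCharsToAnalyze;

    TruncatingHighlighterSketch(int maxDocCharsToAnalyze) {
        this.maxDocCharsToAnalyze = maxDocCharsToAnalyze;
    }

    void setMaxDocCharsToAnalyze(int max) {
        this.maxDocCharsToAnalyze = max;
    }

    /** Returns the text with matches wrapped in <b> tags, or null if no match was analyzed. */
    String highlight(String text, String term) {
        StringBuilder out = new StringBuilder();
        boolean found = false;
        int n = text.length();
        int pos = 0;
        int copiedUpTo = 0; // everything before this index is already in 'out'
        while (pos < n) {
            // Skip whitespace, then read one whitespace-delimited token.
            while (pos < n && Character.isWhitespace(text.charAt(pos))) pos++;
            int startOffset = pos;
            while (pos < n && !Character.isWhitespace(text.charAt(pos))) pos++;
            int endOffset = pos;
            if (startOffset == endOffset) break;           // only trailing whitespace left
            if (endOffset > maxDocCharsToAnalyze) break;   // limit reached: stop analyzing
            String tokenText = text.substring(startOffset, endOffset);
            if (tokenText.equalsIgnoreCase(term)) {
                out.append(text, copiedUpTo, startOffset)
                   .append("<b>").append(tokenText).append("</b>");
                copiedUpTo = endOffset;
                found = true;
            }
        }
        if (!found) return null;
        out.append(text, copiedUpTo, n);
        return out.toString();
    }
}
```

With a limit of 20 characters and the text "lucene documentation mentions the document", the sketch only ever analyzes "lucene" and "documentation", so a search for "document" yields no highlight. With Integer.MAX_VALUE the final token is reached and highlighted. The cost of an unlimited setting is that every token of a large document is analyzed.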





[jira] Closed: (LUCENE-1559) Highlighting not working in some instances even though indexsearcher returns result.

2009-03-12 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood closed LUCENE-1559.


Resolution: Invalid

Working as designed: this is a feature intended to prevent overly costly analysis.




[jira] Commented: (LUCENE-1559) Highlighting not working in some instances even though indexsearcher returns result.

2009-03-12 Thread Amin Mohammed-Coleman (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12681514#action_12681514
 ] 

Amin Mohammed-Coleman commented on LUCENE-1559:
---

That did the trick.  Thanks.




[jira] Commented: (LUCENE-1522) another highlighter

2009-03-12 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12681517#action_12681517
 ] 

Michael McCandless commented on LUCENE-1522:


Does this highlighter have a max tokens to analyze setting?  Or does it 
always visit all terms in each document?

 another highlighter
 ---

 Key: LUCENE-1522
 URL: https://issues.apache.org/jira/browse/LUCENE-1522
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/highlighter
Reporter: Koji Sekiguchi
Assignee: Michael McCandless
Priority: Minor
 Fix For: 2.9

 Attachments: colored-tag-sample.png, LUCENE-1522.patch, 
 LUCENE-1522.patch


 I've written this highlighter for my project to support bi-gram token streams 
 (general token streams, e.g. WhitespaceTokenizer, are also supported; see the 
 test code in the patch). The idea was inherited from my previous project with 
 my colleague and from LUCENE-644. This approach requires highlight fields to be 
 TermVector.WITH_POSITIONS_OFFSETS, but it is fast and can support N-grams. It 
 depends on LUCENE-1448 to get refined term offsets.
 usage:
 {code:java}
 TopDocs docs = searcher.search( query, 10 );
 Highlighter h = new Highlighter();
 FieldQuery fq = h.getFieldQuery( query );
 for( ScoreDoc scoreDoc : docs.scoreDocs ){
   // fieldName="content", fragCharSize=100, numFragments=3
   String[] fragments = h.getBestFragments( fq, reader, scoreDoc.doc, "content", 100, 3 );
   // getBestFragments may return null, so check before iterating
   if( fragments != null ){
     for( String fragment : fragments )
       System.out.println( fragment );
   }
 }
 {code}
 features:
 - fast for large docs
 - supports not only whitespace-based token stream, but also fixed size 
 N-gram (e.g. (2,2), not (1,3)) (can solve LUCENE-1489)
 - supports PhraseQuery, phrase-unit highlighting with slops
 {noformat}
 q="w1 w2"
 <b>w1 w2</b>
 ---
 q="w1 w2"~1
 <b>w1</b> w3 <b>w2</b> w3 <b>w1 w2</b>
 {noformat}
 - highlight fields need to be TermVector.WITH_POSITIONS_OFFSETS
 - easy to apply patch due to independent package (contrib/highlighter2)
 - uses Java 1.5
 - uses the query boost to score fragments (idf is not considered yet, but it 
 should be possible)
 - pluggable FragListBuilder
 - pluggable FragmentsBuilder
 to do:
 - term positions can be unnecessary when phraseHighlight==false
 - collect performance numbers
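
As a rough illustration of why stored offsets make this approach fast, the sketch below wraps highlight tags around character ranges that are already known, without re-tokenizing the text. It is not the code from the patch: the class OffsetHighlightSketch, the Offset type, and the markUp method are hypothetical names, and the offsets are supplied by hand where a real implementation would read them from a term vector stored WITH_POSITIONS_OFFSETS.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration only; not the contrib/highlighter2 code from the patch.
class OffsetHighlightSketch {

    /** A [start, end) character range, as a term vector with offsets would provide. */
    static final class Offset {
        final int start;
        final int end;

        Offset(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    /** Wraps each given range of the text in <b> tags; overlapping ranges are skipped. */
    static String markUp(String text, List<Offset> hits) {
        // Work on a sorted copy so the caller's list is left untouched.
        List<Offset> sorted = new ArrayList<>(hits);
        sorted.sort(Comparator.comparingInt(o -> o.start));

        StringBuilder out = new StringBuilder();
        int copiedUpTo = 0; // everything before this index is already in 'out'
        for (Offset o : sorted) {
            if (o.start < copiedUpTo) continue; // overlaps a range already emitted
            out.append(text, copiedUpTo, o.start)
               .append("<b>").append(text, o.start, o.end).append("</b>");
            copiedUpTo = o.end;
        }
        out.append(text, copiedUpTo, text.length());
        return out.toString();
    }
}
```

For the text "w1 w3 w2 w3 w1 w2" and the ranges [0,2), [6,8) and [12,17), markUp produces the same marked-up string as the slop example in the feature list. The work is proportional to the number of hits rather than to the document length, which is the assumed reason the approach scales well to large documents.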




[jira] Commented: (LUCENE-1522) another highlighter

2009-03-12 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12681531#action_12681531
 ] 

Mark Harwood commented on LUCENE-1522:
--

I'm guessing that's not an issue, given that it uses stored TermVectors rather 
than re-analyzing?

At some stage I hope to take a closer look at this contribution.  I'd be 
interested to see whether all the Highlighter1 JUnit tests could be adapted to 
work with Highlighter2, and to get some comparative benchmarks.
