Re: extending the query parser
Take ANTLR and roll your own query parser from scratch? It's pretty easy.

On Thu, Mar 12, 2009 at 04:24, Candide Kemmler cand...@palacehotel.org wrote:

Hello,

I'm looking for a way to extend the Lucene query parser to allow for semantic computations in IEML space (see http://ieml.org). What I'd like to know is how difficult it would be to add clauses to a query like:

... AND ( some_IEML_expression ) AND ...

some_IEML_expression would involve a reference to some field that would contain metadata expressed in that format.

Thanks in advance for your insights.

Candide

--
Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
ICQ: 104465785

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1522) another highlighter
[ https://issues.apache.org/jira/browse/LUCENE-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12681249#action_12681249 ]

Michael McCandless commented on LUCENE-1522:

This highlighter looks very interesting! I love the colored tags, the fast performance on large docs, and the extensive unit tests.

When I applied the patch to current trunk, I see some tests failing, e.g.:

{code}
[junit] Testcase: test1PhraseLongMVB(org.apache.lucene.search.highlight2.FieldPhraseListTest): FAILED
[junit] expected:sppd(1.0)((8[8,93])) but was:sppd(1.0)((8[7,92]))
[junit] junit.framework.ComparisonFailure: expected:sppd(1.0)((8[8,93])) but was:sppd(1.0)((8[7,92]))
[junit] at org.apache.lucene.search.highlight2.FieldPhraseListTest.test1PhraseLongMVB(FieldPhraseListTest.java:175)
{code}

Is this approach guaranteed to only highlight term occurrences that actually contribute to the document match? Can it handle all / arbitrary Query subclasses? How does it score fragments?

I also like that you first generate hits in the document, and from those hits you generate fragments (if I'm reading the code correctly); this is a nicely scalable approach.

another highlighter
---
Key: LUCENE-1522
URL: https://issues.apache.org/jira/browse/LUCENE-1522
Project: Lucene - Java
Issue Type: Improvement
Components: contrib/highlighter
Reporter: Koji Sekiguchi
Priority: Minor
Attachments: colored-tag-sample.png, LUCENE-1522.patch, LUCENE-1522.patch

I've written this highlighter for my project to support bi-gram token streams (general token streams, e.g. WhitespaceTokenizer, are also supported; see the test code in the patch). The idea was inherited from my previous project with my colleague and LUCENE-644. This approach needs highlight fields to be TermVector.WITH_POSITIONS_OFFSETS, but is fast and can support N-grams. This depends on LUCENE-1448 to get refined term offsets.

usage:
{code:java}
TopDocs docs = searcher.search( query, 10 );
Highlighter h = new Highlighter();
FieldQuery fq = h.getFieldQuery( query );
for( ScoreDoc scoreDoc : docs.scoreDocs ){
  // fieldName="content", fragCharSize=100, numFragments=3
  String[] fragments = h.getBestFragments( fq, reader, scoreDoc.doc, "content", 100, 3 );
  if( fragments != null ){
    for( String fragment : fragments )
      System.out.println( fragment );
  }
}
{code}

features:
- fast for large docs
- supports not only whitespace-based token streams, but also fixed-size N-grams (e.g. (2,2), not (1,3)) (can solve LUCENE-1489)
- supports PhraseQuery, phrase-unit highlighting with slops
{noformat}
q="w1 w2"
<b>w1 w2</b>
---
q="w1 w2"~1
<b>w1</b> w3 <b>w2</b> w3 <b>w1 w2</b>
{noformat}
- highlight fields need to be TermVector.WITH_POSITIONS_OFFSETS
- easy to apply patch due to independent package (contrib/highlighter2)
- uses Java 1.5
- looks at query boost to score fragments (currently doesn't look at idf, but it should be possible)
- pluggable FragListBuilder
- pluggable FragmentsBuilder

to do:
- term positions can be unnecessary when phraseHighlight==false
- collect performance numbers

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
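The "hits first, fragments second" idea mentioned in the comment above can be sketched in plain Java. This is only an illustration under stated assumptions, not the code in LUCENE-1522.patch: the class HitFragmentSketch, the Hit type and the fragments() method are hypothetical names, and the hit offsets are supplied by hand where the real highlighter would read them from term vectors.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch; these are not the classes from LUCENE-1522.patch. */
final class HitFragmentSketch {

    /** A matched span of the text, as character offsets [start, end). */
    static final class Hit {
        final int start;
        final int end;
        Hit(int start, int end) { this.start = start; this.end = end; }
    }

    /**
     * Groups hits into fragments of roughly fragCharSize characters and wraps
     * each hit in <b>...</b>. Assumes hits are sorted by start offset, do not
     * overlap, and lie inside the text.
     */
    static List<String> fragments(String text, List<Hit> hits, int fragCharSize, int maxFragments) {
        List<String> result = new ArrayList<String>();
        int i = 0;
        while (i < hits.size() && result.size() < maxFragments) {
            Hit first = hits.get(i);
            int fragStart = first.start;
            // The window always covers the first hit, even if it is longer than fragCharSize.
            int fragEnd = Math.min(text.length(), Math.max(first.end, fragStart + fragCharSize));
            StringBuilder sb = new StringBuilder();
            int pos = fragStart;
            // Consume every hit that fits entirely inside this fragment window.
            while (i < hits.size() && hits.get(i).end <= fragEnd) {
                Hit h = hits.get(i++);
                sb.append(text, pos, h.start);
                sb.append("<b>").append(text, h.start, h.end).append("</b>");
                pos = h.end;
            }
            // Do not cut the next hit in half: stop the fragment where it begins.
            if (i < hits.size() && hits.get(i).start < fragEnd) {
                fragEnd = hits.get(i).start;
            }
            sb.append(text, pos, fragEnd);
            result.add(sb.toString());
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "w1 w3 w2 w3 w1 w2";
        List<Hit> hits = Arrays.asList(new Hit(0, 2), new Hit(6, 8), new Hit(12, 17));
        for (String fragment : fragments(text, hits, 100, 3)) {
            System.out.println(fragment);
        }
    }
}
```

Because the work is driven by the list of hits rather than by re-analyzing the whole text, the cost of building fragments grows with the number of matches, which is one plausible reading of why the approach scales well on large documents.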
[jira] Commented: (LUCENE-1522) another highlighter
[ https://issues.apache.org/jira/browse/LUCENE-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12681264#action_12681264 ]

Koji Sekiguchi commented on LUCENE-1522:

{quote}
This highlighter looks very interesting! I love the colored tags, and the fast performance on large docs, and the extensive unit tests.
{quote}
Thank you for paying attention to this issue, Mike!

bq. When I applied the patch to current trunk, I see some tests failing,

Note that this issue depends on LUCENE-1448, so you need to apply LUCENE-1448.patch first, then apply LUCENE-1522.patch.

{noformat}
# To apply LUCENE-1448.patch, check out revision 713975!!!
$ svn co -r713975 http://svn.apache.org/repos/asf/lucene/java/trunk
$ cd trunk
$ patch -p0 < LUCENE-1448.patch
$ patch -p0 < LUCENE-1522.patch
{noformat}

I'll post a comment later on the rest of your questions. :)
RE: FIPS compliance?
Or a home-made MD5 (without using System.Security.Cryptography.MD5/java.security.MessageDigest)?

DIGY

-Original Message-
From: Michael McCandless [mailto:luc...@mikemccandless.com]
Sent: Wednesday, March 11, 2009 11:08 PM
To: java-dev@lucene.apache.org
Subject: Re: FIPS compliance?

So... I think this is a .NET-specific issue at this point?

Or... if we could find some common digest that is *not* used for crypto (so .NET won't reject it as insecure), but still has low risk of collision, that seems best. Maybe just CRC32?

Mike

DIGY wrote:

Thanks Mike.

DIGY

-Original Message-
From: Michael McCandless [mailto:luc...@mikemccandless.com]
Sent: Tuesday, March 10, 2009 10:43 PM
To: java-dev@lucene.apache.org
Subject: Re: FIPS compliance?

Interesting... I wonder if in any Java runtime there's ever a rejection of a known-insecure crypto digest alg. I don't think that's come up on java-user/dev that I've seen. It's certainly possible, but it should be rare because we now simply default to write.lock in the index directory (getLockID is only used if you override the LockFactory).

Really we want a digest that does not need to be secure here, but I don't think Java APIs differentiate. (We don't care if someone can reverse the mapping of lock ID --> directory name; we simply want low risk of collision.)

Do .NET APIs offer a "give me a digest and it doesn't have to be secure"? If so, that's probably the best solution.

That said... we could change this to SHA-1, to be safe, but then in another few years we'd probably be having this discussion again when SHA-1 is fully cracked ;) I don't think there's a back-compat issue since it's used only for the naming of the lock file, which is transient.

Mike

de...@ttnet wrote:

Hi All,

There is a discussion about FIPS compliance (using the MD5 hash algorithm in FSDirectory) in Lucene.Net.

http://mail-archives.apache.org/mod_mbox/incubator-lucene-net-user/200903.mbox/%3c006101c99f4e$7bdd3590$7397a0...@rendelmann@gmx.net%3e
https://issues.apache.org/jira/browse/LUCENENET-175

In fact, if the system-wide policy (HKLM\System\CurrentControlSet\Control\Lsa\FIPSAlgorithmPolicy) is set, then trying to use MD5 (which is not FIPS compliant) to compute the hash causes an exception.

So, is a change in Lucene possible to use SHA1 in computing the hash for FIPS compliance (I can see the backward compatibility problems)? Or is this problem specific to Lucene.Net?

What do you think?

DIGY
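For illustration, the CRC32 suggestion from this thread might look roughly like the following in Java. This is a minimal sketch under assumptions, not Lucene's actual getLockID code: the class LockIdSketch, the method lockId() and the "lucene-" prefix are hypothetical. Only java.util.zip.CRC32 is a real JDK class, and since it is a checksum rather than a crypto digest it does not go through the security provider machinery.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

/** Hypothetical sketch; not Lucene's actual lock ID implementation. */
final class LockIdSketch {

    /** Derives a lock-file prefix from a directory path using CRC32 (non-cryptographic). */
    static String lockId(String canonicalDirPath) {
        CRC32 crc = new CRC32();
        crc.update(canonicalDirPath.getBytes(StandardCharsets.UTF_8));
        // getValue() returns the 32-bit checksum as a non-negative long.
        return "lucene-" + Long.toHexString(crc.getValue());
    }

    public static void main(String[] args) {
        System.out.println(lockId("/var/data/index-a"));
        System.out.println(lockId("/var/data/index-b"));
    }
}
```

One trade-off worth noting: CRC32 yields only 32 bits, so the chance of two directories colliding is much higher than with a 128-bit MD5 value, even though neither needs to be secure for this purpose.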
[jira] Created: (LUCENE-1559) Highlighting not working in some instances even though indexsearcher returns result.
Highlighting not working in some instances even though indexsearcher returns result.
---
Key: LUCENE-1559
URL: https://issues.apache.org/jira/browse/LUCENE-1559
Project: Lucene - Java
Issue Type: Bug
Affects Versions: 2.4
Environment: Mac OS 1.5 Eclipse 3.4
Reporter: Amin Mohammed-Coleman

In some instances highlighting does not return a result. However, when you use a different term for the same document, you get results. Please see the attached test case and template file.
[jira] Updated: (LUCENE-1559) Highlighting not working in some instances even though indexsearcher returns result.
[ https://issues.apache.org/jira/browse/LUCENE-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Amin Mohammed-Coleman updated LUCENE-1559:
--
Attachment: HighLightingSummaryTest.java
            AJiA CH 02.doc
[jira] Commented: (LUCENE-1522) another highlighter
[ https://issues.apache.org/jira/browse/LUCENE-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12681282#action_12681282 ]

Michael McCandless commented on LUCENE-1522:

bq. Note that this issue depends on LUCENE-1448

Woops, right, I had skipped that step.
[jira] Commented: (LUCENE-1559) Highlighting not working in some instances even though indexsearcher returns result.
[ https://issues.apache.org/jira/browse/LUCENE-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12681287#action_12681287 ]

Mark Harwood commented on LUCENE-1559:
--
Sorry to be picky, but can you submit a self-contained test with no external dependencies other than Lucene+Highlighter+JUnit? I don't want POI versions to be a factor here.

Cheers
Mark
Re: FIPS compliance?
That'd work too. In which case, I think we should simply leave Lucene using the builtin MD5 (since JREs don't seem to reject it as insecure).

Mike

Digy wrote:

Or a home-made MD5 (without using System.Security.Cryptography.MD5/java.security.MessageDigest)?

DIGY
[jira] Updated: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-1458:
---
Fix Version/s: (was: 2.9)

Clearing fix version.

Further steps towards flexible indexing
---
Key: LUCENE-1458
URL: https://issues.apache.org/jira/browse/LUCENE-1458
Project: Lucene - Java
Issue Type: New Feature
Components: Index
Affects Versions: 2.9
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
Attachments: LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch

I attached a very rough checkpoint of my current patch, to get early feedback. All tests pass, though back compat tests don't pass due to changes to package-private APIs plus certain bugs in tests that happened to work (eg call TermPositions.nextPosition() too many times, which the new API asserts against).

[Aside: I think, when we commit changes to package-private APIs such that back-compat tests don't pass, we could go back, make a branch on the back-compat tag, commit changes to the tests to use the new package private APIs on that branch, then fix nightly build to use the tip of that branch?]

There's still plenty to do before this is committable! This is a rather large change:

* Switches to a new more efficient terms dict format. This still uses tii/tis files, but the tii only stores term & long offset (not a TermInfo). At seek points, tis encodes term freq/prox offsets absolutely instead of with deltas. Also, tis/tii are structured by field, so we don't have to record field number in every term.
.
On first 1 M docs of Wikipedia, tii file is 36% smaller (0.99 MB -> 0.64 MB) and tis file is 9% smaller (75.5 MB -> 68.5 MB).
.
RAM usage when loading terms dict index is significantly less since we only load an array of offsets and an array of String (no more TermInfo array). It should be faster to init too.
.
This part is basically done.

* Introduces modular reader codec that strongly decouples terms dict from docs/positions readers. EG there is no more TermInfo used when reading the new format.
.
There's nice symmetry now between reading & writing in the codec chain -- the current docs/prox format is captured in:
{code}
FormatPostingsTermsDictWriter/Reader
FormatPostingsDocsWriter/Reader (.frq file) and
FormatPostingsPositionsWriter/Reader (.prx file).
{code}
This part is basically done.

* Introduces a new flex API for iterating through the fields, terms, docs and positions:
{code}
FieldProducer -> TermsEnum -> DocsEnum -> PostingsEnum
{code}
This replaces TermEnum/Docs/Positions. SegmentReader emulates the old API on top of the new API to keep back-compat.

Next steps:

* Plug in new codecs (pulsing, pfor) to exercise the modularity / fix any hidden assumptions.
* Expose new API out of IndexReader, deprecate old API but emulate old API on top of new one, switch all core/contrib users to the new API.
* Maybe switch to AttributeSources as the base class for TermsEnum, DocsEnum, PostingsEnum -- this would give readers API flexibility (not just index-file-format flexibility). EG if someone wanted to store payload at the term-doc level instead of term-doc-position level, you could just add a new attribute.
* Test performance & iterate.
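The point about encoding offsets absolutely at seek points, and as deltas elsewhere, can be illustrated with a small stand-alone Java sketch. It makes no claim about the actual tii/tis layout in the LUCENE-1458 patch: the class SeekPointSketch, its methods and the interval parameter are all hypothetical, and plain long arrays stand in for the on-disk encoding.

```java
import java.util.Arrays;

/** Hypothetical sketch of delta coding with absolute seek points; not the LUCENE-1458 file format. */
final class SeekPointSketch {

    /** Encodes ascending offsets: absolute at every interval-th entry, delta from the previous entry otherwise. */
    static long[] encode(long[] offsets, int interval) {
        long[] out = new long[offsets.length];
        for (int i = 0; i < offsets.length; i++) {
            out[i] = (i % interval == 0) ? offsets[i] : offsets[i] - offsets[i - 1];
        }
        return out;
    }

    /** Decodes entry i by starting at the nearest preceding seek point instead of at entry 0. */
    static long decodeAt(long[] encoded, int interval, int i) {
        int seek = (i / interval) * interval;
        long value = encoded[seek];
        for (int j = seek + 1; j <= i; j++) {
            value += encoded[j];
        }
        return value;
    }

    public static void main(String[] args) {
        long[] offsets = {100, 130, 190, 400, 410, 475, 900};
        long[] encoded = encode(offsets, 3);
        System.out.println(Arrays.toString(encoded));
        System.out.println(decodeAt(encoded, 3, 5));
    }
}
```

With absolute values at the seek points, a reader that jumps to a seek point needs no state carried over from earlier entries, which fits the description of the index holding only a term and a long offset rather than a full TermInfo.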
[jira] Updated: (LUCENE-1522) another highlighter
[ https://issues.apache.org/jira/browse/LUCENE-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-1522:
---
Fix Version/s: 2.9
[jira] Assigned: (LUCENE-1522) another highlighter
[ https://issues.apache.org/jira/browse/LUCENE-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless reassigned LUCENE-1522:
--
Assignee: Michael McCandless
[jira] Commented: (LUCENE-979) Remove Deprecated Benchmarking Utilities from contrib/benchmark
[ https://issues.apache.org/jira/browse/LUCENE-979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12681308#action_12681308 ]

Grant Ingersoll commented on LUCENE-979:

I see no reason why it can't happen w/ any release. Contribs don't need to have the same back compat, and I seriously doubt anyone is using the old way.

Remove Deprecated Benchmarking Utilities from contrib/benchmark
---
Key: LUCENE-979
URL: https://issues.apache.org/jira/browse/LUCENE-979
Project: Lucene - Java
Issue Type: Improvement
Components: contrib/benchmark
Reporter: Grant Ingersoll
Priority: Minor
Fix For: 3.0

The old Benchmark utilities in contrib/benchmark have been deprecated and should be removed in 2.9 of Lucene.
[jira] Updated: (LUCENE-979) Remove Deprecated Benchmarking Utilities from contrib/benchmark
[ https://issues.apache.org/jira/browse/LUCENE-979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-979:
--
Fix Version/s: (was: 3.0)
               2.9

OK, moving back to 2.9.
[jira] Commented: (LUCENE-1559) Highlighting not working in some instances even though indexsearcher returns result.
[ https://issues.apache.org/jira/browse/LUCENE-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12681310#action_12681310 ]

Amin Mohammed-Coleman commented on LUCENE-1559:
--
This problem occurs when using this exact document and another document, which is a PDF. I'm not sure the test will be valid if I just use a normal test file. The version of POI I am currently using is:

3.1-Final
poi-scratchpad-3.1-final

I can try to extract the test with no other libraries, but I'm not sure if it will work.
[jira] Updated: (LUCENE-1559) Highlighting not working in some instances even though indexsearcher returns result.
[ https://issues.apache.org/jira/browse/LUCENE-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Amin Mohammed-Coleman updated LUCENE-1559:
--
Attachment: HighLightingSummaryTest(2).java

Updated test case with no external dependencies
[jira] Issue Comment Edited: (LUCENE-1559) Highlighting not working in some instances even though indexsearcher returns result.
[ https://issues.apache.org/jira/browse/LUCENE-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12681315#action_12681315 ]
Amin Mohammed-Coleman edited comment on LUCENE-1559 at 3/12/09 6:43 AM:
Updated test case with no external dependencies: HighLightingSummaryTest(2).java
was (Author: amin): Updated test case with no external dependencies
[jira] Commented: (LUCENE-1559) Highlighting not working in some instances even though indexsearcher returns result.
[ https://issues.apache.org/jira/browse/LUCENE-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12681323#action_12681323 ]
Mark Harwood commented on LUCENE-1559:
Your code still imports POI and is now importing a .doc file without parsing it, producing garbage. You'll need to supply an example JUnit test which illustrates this problem with plain text before we can look at it. You should be able to turn the .doc into text at your end using POI and then supply the file. Are you sure there isn't a problem with POI failing to parse the file correctly?
[jira] Commented: (LUCENE-1559) Highlighting not working in some instances even though indexsearcher returns result.
[ https://issues.apache.org/jira/browse/LUCENE-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12681329#action_12681329 ]
Amin Mohammed-Coleman commented on LUCENE-1559:
I don't think there is an error with POI parsing the document, as a summary is generated when I use the term "aspectj". I will modify the code to use an RTF file and see if this problem still occurs.
[jira] Issue Comment Edited: (LUCENE-1559) Highlighting not working in some instances even though indexsearcher returns result.
[ https://issues.apache.org/jira/browse/LUCENE-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12681329#action_12681329 ]
Amin Mohammed-Coleman edited comment on LUCENE-1559 at 3/12/09 7:34 AM:
Ok. So it looks like there is an issue when POI extracts the text. I don't understand this, to be honest. When indexing I am obviously indexing the Word document, and when I perform the search with the term "document" I get the correct result. It seems strange that I cannot have the term "document" in the file. This also happens for a PDF file, which makes it even more confusing.
was (Author: amin): I don't think there is an error with POI parsing the document as summary is generated when I use the term aspectj. I will modify the code to use an rtf file and see if this problem still occurs.
[jira] Commented: (LUCENE-1559) Highlighting not working in some instances even though indexsearcher returns result.
[ https://issues.apache.org/jira/browse/LUCENE-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12681336#action_12681336 ]
Mark Harwood commented on LUCENE-1559:
Can I close this then, as it appears to be an issue with your parser, not Lucene?
[jira] Commented: (LUCENE-1559) Highlighting not working in some instances even though indexsearcher returns result.
[ https://issues.apache.org/jira/browse/LUCENE-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12681339#action_12681339 ]
Amin Mohammed-Coleman commented on LUCENE-1559:
Yep. I'm still confused: I don't understand how Lucene indexes the term "document" and lets me search on it. The content of the file is stored compressed in the Lucene document (I'm not reparsing the file for highlighting). The term "document" must be in the Lucene document, otherwise I would not be able to find the document from the search. Sorry... I don't know what I should do at this stage. As I mentioned earlier, it's also happening with a certain PDF document (unless something is being chopped off during compression).
[jira] Commented: (LUCENE-1559) Highlighting not working in some instances even though indexsearcher returns result.
[ https://issues.apache.org/jira/browse/LUCENE-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12681344#action_12681344 ]
Mark Harwood commented on LUCENE-1559:
"Sorry...I don't know what I should do at this stage"
Give us a JUnit example of your problem code working with plain text (not PDF, Word or .doc) that clearly demonstrates where Lucene fails to index, search or highlight this text correctly.
[jira] Commented: (LUCENE-1559) Highlighting not working in some instances even though indexsearcher returns result.
[ https://issues.apache.org/jira/browse/LUCENE-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12681347#action_12681347 ]
Uwe Schindler commented on LUCENE-1559:
The problems with POI often come from the fact that POI does not filter the characters it outputs and sometimes even generates char values that do not conform to Unicode (0xd000). E.g. you sometimes get non-breaking spaces instead of normal spaces, or other such things. Depending on the Lucene Analyzer you use, this may cause problems. TIKA, for example, uses a filter that maps all incorrect characters coming from POI to the characters allowed in XML (because it generates XHTML from the docs, which can be indexed using TikaAnalyzer). I think your problem is invalid plain text content coming from POI.
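The kind of filtering described above can be sketched in plain Java. This is only an illustration, not TIKA's actual filter: the class name ExtractedTextCleaner and the choice to replace rejected characters with a space are assumptions made for the example. It keeps code points allowed by the XML 1.0 Char production and turns non-breaking spaces into ordinary spaces before the text is handed to an analyzer.

```java
/**
 * Hypothetical helper (not part of POI, TIKA or Lucene): normalizes text
 * extracted from a binary document before it is analyzed and stored.
 */
final class ExtractedTextCleaner {
    private ExtractedTextCleaner() {}

    /** True if the code point is allowed by the XML 1.0 "Char" production. */
    static boolean isXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Replaces non-breaking spaces and non-XML characters with a plain space. */
    static String clean(String raw) {
        StringBuilder out = new StringBuilder(raw.length());
        for (int i = 0; i < raw.length(); ) {
            int cp = raw.codePointAt(i);
            i += Character.charCount(cp);
            if (cp == 0x00A0) {
                out.append(' ');          // non-breaking space -> ordinary space
            } else if (isXmlChar(cp)) {
                out.appendCodePoint(cp);  // keep valid characters unchanged
            } else {
                out.append(' ');          // control chars, lone surrogates, U+FFFE/U+FFFF
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(clean("search\u00A0term\u0001here"));
    }
}
```

Replacing rather than deleting keeps the character offsets of the cleaned text aligned with the raw text, which matters if offsets are later used for highlighting.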
[jira] Commented: (LUCENE-1559) Highlighting not working in some instances even though indexsearcher returns result.
[ https://issues.apache.org/jira/browse/LUCENE-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12681351#action_12681351 ]
Amin Mohammed-Coleman commented on LUCENE-1559:
Seems to make sense. I am using the StandardAnalyzer when indexing. I can understand that there may be an issue with POI; my only concern is how Lucene managed to index the term "document" in the first place. The term "document" is in the content of the Word document. If there was a problem as you mentioned, then I would expect that the document would not be indexed. I am toying with the idea of using TIKA, however I can't find an example to work from. I know the new Lucene in Action book uses TIKA; does anyone have some sample code that I could look at? I presume I should bring this up on the Lucene mailing list rather than adding to the JIRA issue. Cheers
Re: extending the query parser
On 11 Mar 2009, at 23:21, Earwin Burrfoot wrote: Take ANTLR and roll your own query parser from scratch? It's pretty easy.
Hi Earwin, That would be fantastic, since our parser is already specified as an ANTLR grammar. However, I can't seem to find an ANTLR grammar in the Lucene source. Obviously what we want is to extend the existing query support, not just create a new one from scratch.
Regards, Candide
[jira] Updated: (LUCENE-1559) Highlighting not working in some instances even though indexsearcher returns result.
[ https://issues.apache.org/jira/browse/LUCENE-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Amin Mohammed-Coleman updated LUCENE-1559:
Attachment: HighLightingSummaryTestV3.java, fileToSearch.txt
Updated test case with no external dependencies except for lucene and junit.
[jira] Updated: (LUCENE-1539) Improve Benchmark
[ https://issues.apache.org/jira/browse/LUCENE-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jason Rutherglen updated LUCENE-1539:
Attachment: LUCENE-1539.patch
* Added deletepercent.alg as an example of these tasks
* CommitIndexTask commits an IndexWriter using a commit name
* OpenReaderTask opens a specific commit point by name
* FlushReaderTask flushes a reader using a commit name
* DeleteByPercentTask deletes a percentage of reader documents
Improve Benchmark
Key: LUCENE-1539
URL: https://issues.apache.org/jira/browse/LUCENE-1539
Project: Lucene - Java
Issue Type: Improvement
Components: contrib/benchmark
Affects Versions: 2.4
Reporter: Jason Rutherglen
Priority: Minor
Fix For: 2.9
Attachments: LUCENE-1539.patch, LUCENE-1539.patch, sortBench2.py, sortCollate2.py
Original Estimate: 336h
Remaining Estimate: 336h
Benchmark can be improved by incorporating recent suggestions posted on java-dev. M. McCandless' Python scripts that execute multiple rounds of tests can either be incorporated into the codebase or converted to Java.
Re: extending the query parser
On Thu, Mar 12, 2009 at 21:16, Candide Kemmler cand...@palacehotel.org wrote: On 11 Mar 2009, at 23:21, Earwin Burrfoot wrote: Take ANTLR and roll your own query parser from scratch? It's pretty easy. Hi Earwin, That would be fantastic, since our parser is already specified as an ANTLR grammar. However, I can't seem to find an antlr grammar in the lucene source. Obviously what we want is to extend the existing query support, not just create a new one from scratch.
Lucene's default QueryParser uses JavaCC, if I'm not mistaken, and I don't see any way to extend it except by patching it and using the modified version. If you want to explore some existing alternatives, Mark has an article here: http://www.lucidimagination.com/blog/2009/02/22/exploring-query-parsers/
My personal opinion is that the default parser is only suitable for something that isn't going to see real-world use.
-- Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
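As a rough idea of what a from-scratch front end for clauses like "... AND ( some_IEML_expression) AND ..." has to do first, here is a small hand-written sketch in plain Java. It is not ANTLR and not Lucene's QueryParser; the class name ClauseSplitter is hypothetical, and the sketch assumes clauses are separated by the literal " AND " and that parentheses mark the embedded expression. A real grammar would hand each parenthesized clause to the IEML parser and build a BooleanQuery from the rest.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: splits a query string into its top-level AND clauses. */
final class ClauseSplitter {
    private ClauseSplitter() {}

    /** Splits on " AND " at nesting depth zero, keeping parenthesized groups intact. */
    static List<String> splitTopLevelAnd(String query) {
        List<String> clauses = new ArrayList<>();
        int depth = 0;
        int start = 0;
        for (int i = 0; i < query.length(); i++) {
            char c = query.charAt(i);
            if (c == '(') {
                depth++;
            } else if (c == ')') {
                if (--depth < 0) {
                    throw new IllegalArgumentException("unbalanced ')' at offset " + i);
                }
            } else if (depth == 0 && query.startsWith(" AND ", i)) {
                clauses.add(query.substring(start, i).trim());
                i += 4;          // now on the trailing space of " AND "
                start = i + 1;   // next clause begins right after it
            }
        }
        if (depth != 0) {
            throw new IllegalArgumentException("unbalanced '('");
        }
        clauses.add(query.substring(start).trim());
        return clauses;
    }

    public static void main(String[] args) {
        for (String clause : splitTopLevelAnd("title:foo AND (a AND b) AND body:bar")) {
            System.out.println(clause);
        }
    }
}
```

The parenthesized clause comes back untouched, so whatever syntax it contains never has to be understood by the outer parser.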
Nightly log files
The log files for the nightly checkout are now stored in /tmp/lucene-nightly.log. The crontab now looks like:
03 6 * * * /home/gsingers/bin/exportLuceneDocs.sh > /tmp/lucene-nightly.log 2>&1
Thanks to Otis for pointing out that the nightly was not checking out.
Re: Nightly log files
Can you update the wiki with that? http://wiki.apache.org/lucene-java/HowToUpdateTheWebsite
Thanks. Mike
On Mar 12, 2009, at 3:52 PM, Grant Ingersoll wrote: The log files for the nightly checkout are now stored in /tmp/lucene-nightly.log. The crontab now looks like: 03 6 * * * /home/gsingers/bin/exportLuceneDocs.sh > /tmp/lucene-nightly.log 2>&1 Thanks to Otis for pointing out that the nightly was not checking out.
Re: extending the query parser
OK great! I'll see what I can do from here. Thanks!
On 12 Mar 2009, at 12:45, Earwin Burrfoot wrote: Lucene's default QueryParser uses javacc if I'm not mistaken. And I don't see any way to extend it except by patching and using modified version. If you want to explore some existing alternatives, Mark has an article here - http://www.lucidimagination.com/blog/2009/02/22/exploring-query-parsers/ My personal opinion is that default parser is only suitable for something that isn't going to see real world use.
[jira] Commented: (LUCENE-1559) Highlighting not working in some instances even though indexsearcher returns result.
[ https://issues.apache.org/jira/browse/LUCENE-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12681466#action_12681466 ]
Mark Harwood commented on LUCENE-1559:
I ran a quick test and I don't think I could see "document" in the Token.termText() of any tokens in the TokenStream you provide to the Highlighter. It's late and I need to be elsewhere, but if you have time to pursue this, check that the above statement is true. If so, check that the body text retrieved from Document.get(body) in the search results is the same as the String you store at index time (just in case the act of storing/retrieving has altered the text somehow). Will look into this more later.
[jira] Commented: (LUCENE-1559) Highlighting not working in some instances even though indexsearcher returns result.
[ https://issues.apache.org/jira/browse/LUCENE-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12681468#action_12681468 ]
Amin Mohammed-Coleman commented on LUCENE-1559:
Hi Mark, thanks for looking into this, your help is much appreciated. I compared the body of the file (the value to be indexed) against doc.get(body) and they are both the same:
assertEquals(bodyToBeStored, bodyText);
Cheers
[jira] Issue Comment Edited: (LUCENE-1559) Highlighting not working in some instances even though indexsearcher returns result.
[ https://issues.apache.org/jira/browse/LUCENE-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12681468#action_12681468 ]
Amin Mohammed-Coleman edited comment on LUCENE-1559 at 3/12/09 1:22 PM:
Hi Mark, thanks for looking into this, your help is much appreciated. I compared the body of the file (the value to be indexed) against doc.get(body) and they are both the same:
assertEquals(bodyToBeStored, bodyText);
Also, tokenText = text.substring(startOffset, endOffset); at line 240 of Highlighter never returns "document"; all I get is "documentation".
Cheers
was (Author: amin): Hi Mark Thanks for looking into this, your help is much appreciated. I compared the body of the file (value to be indexed) against the doc.get(body) and they are both the same. assertEquals(bodyToBeStored, bodyText); Cheers
[jira] Commented: (LUCENE-1559) Highlighting not working in some instances even though indexsearcher returns result.
[ https://issues.apache.org/jira/browse/LUCENE-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12681507#action_12681507 ]
Mark Harwood commented on LUCENE-1559:
Ah. Try setting this:
highlighter.setMaxDocCharsToAnalyze(Integer.MAX_VALUE);
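Why that setting matters can be shown without Lucene at all. The sketch below is a simplified stand-in, not the Highlighter's implementation: the class AnalysisLimitDemo, the regex tokenizer and the exact cut-off rule (stop at the first token whose start offset reaches the limit) are assumptions made for illustration. It shows how a term that only occurs late in a long text is never seen when analysis stops early, even though the index search still finds the document.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical illustration of a "maximum characters to analyze" cut-off. */
final class AnalysisLimitDemo {
    private AnalysisLimitDemo() {}

    /** Start offsets of tokens equal to term, looking only at tokens that start before maxChars. */
    static List<Integer> matchOffsets(String text, String term, int maxChars) {
        List<Integer> offsets = new ArrayList<>();
        Matcher tokens = Pattern.compile("\\w+").matcher(text);
        while (tokens.find()) {
            if (tokens.start() >= maxChars) {
                break; // analysis stops here; later occurrences are never examined
            }
            if (tokens.group().equalsIgnoreCase(term)) {
                offsets.add(tokens.start());
            }
        }
        return offsets;
    }

    public static void main(String[] args) {
        String text = "intro documentation here and much later the word document";
        System.out.println(matchOffsets(text, "document", 30));
        System.out.println(matchOffsets(text, "document", Integer.MAX_VALUE));
    }
}
```

With the small limit only "documentation" is inspected, which matches the symptom reported earlier in the thread; raising the limit trades extra analysis time on large documents for complete highlighting.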
[jira] Closed: (LUCENE-1559) Highlighting not working in some instances even though indexsearcher returns result.
[ https://issues.apache.org/jira/browse/LUCENE-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Mark Harwood closed LUCENE-1559.
Resolution: Invalid
Working as designed: the limit on analyzed characters is a feature intended to prevent too-costly analysis.
[jira] Commented: (LUCENE-1559) Highlighting not working in some instances even though indexsearcher returns result.
[ https://issues.apache.org/jira/browse/LUCENE-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12681514#action_12681514 ]
Amin Mohammed-Coleman commented on LUCENE-1559:
That did the trick. Thanks.
[jira] Commented: (LUCENE-1522) another highlighter
[ https://issues.apache.org/jira/browse/LUCENE-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12681517#action_12681517 ]

Michael McCandless commented on LUCENE-1522:

Does this highlighter have a "max tokens to analyze" setting? Or does it always visit all terms in each document?

another highlighter
---
Key: LUCENE-1522
URL: https://issues.apache.org/jira/browse/LUCENE-1522
Project: Lucene - Java
Issue Type: Improvement
Components: contrib/highlighter
Reporter: Koji Sekiguchi
Assignee: Michael McCandless
Priority: Minor
Fix For: 2.9
Attachments: colored-tag-sample.png, LUCENE-1522.patch, LUCENE-1522.patch

I've written this highlighter for my project to support bi-gram token streams (general token streams, e.g. WhitespaceTokenizer, are also supported; see the test code in the patch). The idea was inherited from my previous project with my colleague and from LUCENE-644. This approach needs highlight fields to be TermVector.WITH_POSITIONS_OFFSETS, but it is fast and can support N-grams. It depends on LUCENE-1448 to get refined term offsets.

usage:
{code:java}
TopDocs docs = searcher.search( query, 10 );
Highlighter h = new Highlighter();
FieldQuery fq = h.getFieldQuery( query );
for( ScoreDoc scoreDoc : docs.scoreDocs ){
  // fieldName=content, fragCharSize=100, numFragments=3
  String[] fragments = h.getBestFragments( fq, reader, scoreDoc.doc, "content", 100, 3 );
  if( fragments != null ){
    for( String fragment : fragments )
      System.out.println( fragment );
  }
}
{code}

features:
- fast for large docs
- supports not only whitespace-based token streams, but also fixed-size N-grams (e.g. (2,2), not (1,3)) (can solve LUCENE-1489)
- supports PhraseQuery, phrase-unit highlighting with slops
{noformat}
q="w1 w2"
<b>w1 w2</b>
---
q="w1 w2"~1
<b>w1</b> w3 <b>w2</b> w3 <b>w1 w2</b>
{noformat}
- highlight fields need to be TermVector.WITH_POSITIONS_OFFSETS
- easy to apply the patch because it lives in an independent package (contrib/highlighter2)
- uses Java 1.5
- uses the query boost to score fragments (currently ignores idf, but adding it should be possible)
- pluggable FragListBuilder
- pluggable FragmentsBuilder

to do:
- term positions can be unnecessary when phraseHighlight==false
- collect performance numbers
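The phrase-unit behavior listed under features can be sketched in plain Java. This is a simplified illustration, not the patch's code: it assumes two-term phrases matched in order, treats slop as the number of tokens allowed between the two terms, and stands in a whitespace tokenizer for reading stored positions and offsets. `SlopPhraseHighlightSketch`, `Occurrence`, `tokenize`, and `highlight` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

final class SlopPhraseHighlightSketch {

    /** One term occurrence: token position plus character offsets. */
    static final class Occurrence {
        final String term;
        final int position, start, end;

        Occurrence(String term, int position, int start, int end) {
            this.term = term;
            this.position = position;
            this.start = start;
            this.end = end;
        }
    }

    /** Whitespace tokenizer standing in for a stored term vector. */
    static List<Occurrence> tokenize(String text) {
        List<Occurrence> out = new ArrayList<>();
        int pos = 0, i = 0, n = text.length();
        while (i < n) {
            while (i < n && Character.isWhitespace(text.charAt(i))) i++;
            int start = i;
            while (i < n && !Character.isWhitespace(text.charAt(i))) i++;
            if (i > start) out.add(new Occurrence(text.substring(start, i), pos++, start, i));
        }
        return out;
    }

    /** Highlights in-order matches of "first second" with at most slop tokens between them. */
    static String highlight(String text, String first, String second, int slop) {
        List<Occurrence> occ = tokenize(text);
        boolean[] used = new boolean[occ.size()];
        List<int[]> spans = new ArrayList<>();
        for (int a = 0; a < occ.size(); a++) {
            Occurrence o1 = occ.get(a);
            if (used[a] || !o1.term.equals(first)) continue;
            for (int b = a + 1; b < occ.size(); b++) {
                Occurrence o2 = occ.get(b);
                int gap = o2.position - o1.position - 1;
                if (gap > slop) break;
                if (used[b] || !o2.term.equals(second)) continue;
                used[a] = used[b] = true;
                if (gap == 0) {
                    // Adjacent terms are tagged as one phrase unit.
                    spans.add(new int[] {o1.start, o2.end});
                } else {
                    spans.add(new int[] {o1.start, o1.end});
                    spans.add(new int[] {o2.start, o2.end});
                }
                break;
            }
        }
        // Each token belongs to at most one span, so sorting by start is enough.
        spans.sort((x, y) -> Integer.compare(x[0], y[0]));
        StringBuilder sb = new StringBuilder();
        int last = 0;
        for (int[] s : spans) {
            sb.append(text, last, s[0]).append("<b>").append(text, s[0], s[1]).append("</b>");
            last = s[1];
        }
        return sb.append(text.substring(last)).toString();
    }

    public static void main(String[] args) {
        // prints: <b>w1</b> w3 <b>w2</b> w3 <b>w1 w2</b>
        System.out.println(highlight("w1 w3 w2 w3 w1 w2", "w1", "w2", 1));
    }
}
```

Because the tags are placed purely from recorded offsets, nothing here re-analyzes the text, which is the property the TermVector.WITH_POSITIONS_OFFSETS requirement is meant to provide.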
[jira] Commented: (LUCENE-1522) another highlighter
[ https://issues.apache.org/jira/browse/LUCENE-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12681531#action_12681531 ]

Mark Harwood commented on LUCENE-1522:

I'm guessing that's not an issue given it uses stored TermVectors rather than re-analyzing?

At some stage I hope to take a closer look at this contribution. I'd be interested to see if all the Highlighter1 JUnit tests could be adapted to work with Highlighter2 and get some comparative benchmarks.