[jira] [Commented] (LUCENE-5012) Make graph-based TokenFilters easier
[ https://issues.apache.org/jira/browse/LUCENE-5012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16294139#comment-16294139 ]

Michael McCandless commented on LUCENE-5012:

This issue should make it easier to fix the bug you're seeing, but we can also fix the bug (in {{ShingleFilter}} I'm guessing?) before doing this more ambitious change. It sounds like {{ShingleFilter}} is not looking at {{PositionLengthAttribute}}?

> Make graph-based TokenFilters easier
>
> Key: LUCENE-5012
> URL: https://issues.apache.org/jira/browse/LUCENE-5012
> Project: Lucene - Core
> Issue Type: Improvement
> Components: modules/analysis
> Reporter: Michael McCandless
> Assignee: Michael McCandless
> Attachments: LUCENE-5012.patch, LUCENE-5012.patch
>
> SynonymFilter has two limitations today:
> * It cannot create positions, so eg dns -> domain name service
> creates blatantly wrong highlights (SOLR-3390, LUCENE-4499 and
> others).
> * It cannot consume a graph, so e.g. if you try to apply synonyms
> after Kuromoji tokenizer I'm not sure what will happen.
> I've thought about how to fix these issues but it's really quite
> difficult with the current PosInc/PosLen graph representation, so I'd
> like to explore an alternative approach.

--
This message was sent by Atlassian JIRA (v6.4.14#64029)
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
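To make the position-length point concrete, here is a minimal sketch in plain Java (no Lucene dependency) of how a bigram builder goes wrong when it ignores position length. Everything in it is an assumption for illustration: the {{Tok}} class, the {{bigrams}} helper and the sample tokens for "wi-fi network" are hypothetical, and this is not the actual {{ShingleFilter}} code. Each token is treated as an arc that starts at its position and ends posLen positions later; a bigram is valid only when the first token's end meets the second token's start.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class PosLenShingleSketch {

    /** Hypothetical token holder; only the posInc/posLen names echo the thread. */
    static final class Tok {
        final String term;
        final int posInc;
        final int posLen;

        Tok(String term, int posInc, int posLen) {
            this.term = term;
            this.posInc = posInc;
            this.posLen = posLen;
        }
    }

    /**
     * Builds bigrams by pairing token a with token b when a ends where b starts.
     * With honorPosLen == false every token is assumed to span exactly one position.
     */
    static List<String> bigrams(List<Tok> toks, boolean honorPosLen) {
        int n = toks.size();
        int[] start = new int[n];
        int[] end = new int[n];
        int pos = -1;
        for (int i = 0; i < n; i++) {
            pos += toks.get(i).posInc;               // position increment moves the start
            start[i] = pos;
            end[i] = pos + (honorPosLen ? toks.get(i).posLen : 1);
        }
        List<String> out = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                if (end[i] == start[j]) {
                    out.add(toks.get(i).term + " " + toks.get(j).term);
                }
            }
        }
        return out;
    }

    /** Assumed graph for "wi-fi network" with both split and concatenated forms. */
    static List<Tok> sample() {
        return Arrays.asList(
            new Tok("wifi", 1, 2),     // concatenated form spans two positions
            new Tok("wi", 0, 1),
            new Tok("fi", 1, 1),
            new Tok("network", 1, 1));
    }

    public static void main(String[] args) {
        System.out.println(bigrams(sample(), true));   // [wifi network, wi fi, fi network]
        System.out.println(bigrams(sample(), false));  // [wifi fi, wi fi, fi network]
    }
}
```

In this sketch, ignoring position length produces the bogus bigram "wifi fi" and loses "wifi network", which is the kind of symptom one would expect if a filter does not consult position length.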
[jira] [Commented] (LUCENE-5012) Make graph-based TokenFilters easier
[ https://issues.apache.org/jira/browse/LUCENE-5012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16269235#comment-16269235 ]

Ryan Pedela commented on LUCENE-5012:

I just have a question. I am using the word delimiter graph set to both split and concat followed by a shingle filter set to bigrams. I am getting what appears to be incorrect results. Is that expected and is the aim of this issue to fix that?
[jira] [Commented] (LUCENE-5012) Make graph-based TokenFilters easier
[ https://issues.apache.org/jira/browse/LUCENE-5012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15830136#comment-15830136 ]

Matt Weber commented on LUCENE-5012:

Thanks [~mikemccand], there were a lot of additional changes! I am going to start getting familiar with this and hopefully will be able to help move it forward as I get time.
[jira] [Commented] (LUCENE-5012) Make graph-based TokenFilters easier
[ https://issues.apache.org/jira/browse/LUCENE-5012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15828939#comment-15828939 ]

Michael McCandless commented on LUCENE-5012:

[~mattweber], I realized I had more private changes that I never pushed to that old branch, so I recovered them, fixed them to apply to current master, and pushed here: https://github.com/mikemccand/lucene-solr/commits/graph_token_filters

I also removed the controversial {{InsertDeletedPunctuationStage}}. Some tests are still failing ... I'll try to fix them.

I think the ideas here are very promising. The write-once attributes (LUCENE-2450, folded into this branch) are cleaner than what Lucene has today, and the ease of making new positions without having to re-number previous ones makes graph token streams much easier.

I tried to add the equivalent of {{CharFilter}} here, by using a new {{TextAttribute}} that stages before tokenization can use to read from a {{Reader}} or a {{String}}, and remap; I like that this makes offset correction more local than what {{correctOffset}} exposes today. And it means char filtering is simply another stage, not a separate class.

I also added {{int[] parts}} to {{OffsetAttribute}}; the idea here is to empower token filters (not just tokenizers) to properly correct offsets, so that e.g. WDGF could work "correctly", but I'm not sure it's worth the hassle: I haven't fully implemented it, and doing so is surprisingly tricky.
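As a rough illustration of why creating positions without renumbering helps, the following plain-Java sketch models a token graph as arcs between node ids. The {{ArcGraphSketch}} class and its {{newNode}}, {{addArc}} and {{addPath}} methods are hypothetical names made up for this example, and the sketch makes no claim about how the branch actually represents its graph. The point is only that adding a multi-token alternative such as "domain name service" for "dns" can allocate fresh interior nodes while every existing arc stays untouched.

```java
import java.util.ArrayList;
import java.util.List;

final class ArcGraphSketch {

    /** Hypothetical arc: a token leading from one node id to another. */
    static final class Arc {
        final int from;
        final int to;
        final String term;

        Arc(int from, int to, String term) {
            this.from = from;
            this.to = to;
            this.term = term;
        }
    }

    final List<Arc> arcs = new ArrayList<>();
    private int nextNode = 0;

    /** Allocates a fresh node id; existing ids are never changed. */
    int newNode() {
        return nextNode++;
    }

    void addArc(int from, int to, String term) {
        arcs.add(new Arc(from, to, term));
    }

    /** Adds an alternative path between two existing nodes, creating interior nodes as needed. */
    void addPath(int from, int to, String... terms) {
        int prev = from;
        for (int i = 0; i < terms.length; i++) {
            int next = (i == terms.length - 1) ? to : newNode();
            addArc(prev, next, terms[i]);
            prev = next;
        }
    }

    /** Builds "the dns server" and then adds the synonym path for "dns". */
    static ArcGraphSketch demo() {
        ArcGraphSketch g = new ArcGraphSketch();
        int n0 = g.newNode(), n1 = g.newNode(), n2 = g.newNode(), n3 = g.newNode();
        g.addArc(n0, n1, "the");
        g.addArc(n1, n2, "dns");
        g.addArc(n2, n3, "server");
        g.addPath(n1, n2, "domain", "name", "service"); // uses new nodes 4 and 5
        return g;
    }

    public static void main(String[] args) {
        for (Arc a : demo().arcs) {
            System.out.println(a.from + " -> " + a.to + " : " + a.term);
        }
    }
}
```

With a PosInc/PosLen encoding, the same insertion would require "dns" to span three positions and "server" to move, which is the renumbering this kind of representation avoids.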
[jira] [Commented] (LUCENE-5012) Make graph-based TokenFilters easier
[ https://issues.apache.org/jira/browse/LUCENE-5012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15827939#comment-15827939 ]

Michael McCandless commented on LUCENE-5012:

Wow, thanks for modernizing the patch [~mattweber]; I'll push this to a branch on my GitHub account for easier iterating...
[jira] [Commented] (LUCENE-5012) Make graph-based TokenFilters easier
[ https://issues.apache.org/jira/browse/LUCENE-5012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15826675#comment-15826675 ]

Michael McCandless commented on LUCENE-5012:

I think we really should advance this: the synonym filter here already handles incoming graphs correctly, and the token stream API improvements here make it much easier to consume and create graph token streams. I think this will only get more important with time, e.g. the new {{WordDelimiterGraphFilter}} now creates correct graphs, but if you want to run the current {{SynonymGraphFilter}} after it, it won't work.

That said, there is still a lot of work to make this committable. I think we need to find a way to make the stage-based analysis components interchangeable with the current API, to give us the freedom to gradually cut over the many tokenizers and token filters Lucene has today.
[jira] [Commented] (LUCENE-5012) Make graph-based TokenFilters easier
[ https://issues.apache.org/jira/browse/LUCENE-5012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15826257#comment-15826257 ]

Matt Weber commented on LUCENE-5012:

[~mikemccand] So I was looking into supporting incoming graphs in {{SynonymGraphFilter}} and found this when you mentioned it in LUCENE-7638. What do you think the state of this patch is? Would it be best to look into advancing this instead of just {{SynonymGraphFilter}} itself?
[jira] [Commented] (LUCENE-5012) Make graph-based TokenFilters easier
[ https://issues.apache.org/jira/browse/LUCENE-5012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15670617#comment-15670617 ]

David Smiley commented on LUCENE-5012:

Seems very promising. Is LUCENE-2450 a dependency of this issue? There's no dependency JIRA issue link, but the first comment suggests it is.
[jira] [Commented] (LUCENE-5012) Make graph-based TokenFilters easier
[ https://issues.apache.org/jira/browse/LUCENE-5012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14660091#comment-14660091 ]

ASF subversion and git services commented on LUCENE-5012:

Commit 1694511 from [~mikemccand] in branch 'dev/branches/lucene5012'
[ https://svn.apache.org/r1694511 ]

LUCENE-5012: don't separate interface from impl for attributes
[jira] [Commented] (LUCENE-5012) Make graph-based TokenFilters easier
[ https://issues.apache.org/jira/browse/LUCENE-5012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14008345#comment-14008345 ]

ASF subversion and git services commented on LUCENE-5012:

Commit 1597427 from [~mikemccand] in branch 'dev/branches/lucene5012'
[ https://svn.apache.org/r1597427 ]

LUCENE-5012: get tests passing again
[jira] [Commented] (LUCENE-5012) Make graph-based TokenFilters easier
[ https://issues.apache.org/jira/browse/LUCENE-5012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14007257#comment-14007257 ]

ASF subversion and git services commented on LUCENE-5012:

Commit 1597118 from [~mikemccand] in branch 'dev/branches/lucene5012'
[ https://svn.apache.org/r1597118 ]

LUCENE-5012: merge trunk, but some tests are failing
[jira] [Commented] (LUCENE-5012) Make graph-based TokenFilters easier
[ https://issues.apache.org/jira/browse/LUCENE-5012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13696626#comment-13696626 ]

Artem Lukanin commented on LUCENE-5012:

I guess WordDelimiterFilter is a good candidate for transforming into a graph-based filter (see LUCENE-5051).
[jira] [Commented] (LUCENE-5012) Make graph-based TokenFilters easier
[ https://issues.apache.org/jira/browse/LUCENE-5012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13667418#comment-13667418 ]

Commit Tag Bot commented on LUCENE-5012:

[lucene5012 commit] mikemccand
http://svn.apache.org/viewvc?view=revisionrevision=1486483

LUCENE-5012: add CharFilter, fix some bugs with SynFilter, add new InsertDeletedPunctuationStage
[jira] [Commented] (LUCENE-5012) Make graph-based TokenFilters easier
[ https://issues.apache.org/jira/browse/LUCENE-5012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13667419#comment-13667419 ]

Michael McCandless commented on LUCENE-5012:

I committed some changes:

* Got charFilter working
* Fixed a few bugs in SynFilterStage not clearing its state on end / reset
* Created a new fun stage: InsertDeletedPunctuationStage. This stage detects when a tokenizer has skipped over punctuation chars and inserts a deleted token representing the punctuation, e.g. to prevent a synonym or phrase query from matching over the punctuation. I had previously thought we would need to modify Tokenizers to do this but now I think maybe this Stage could do it for any Tokenizer ...
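The punctuation-detection idea can be sketched in plain Java using only token offsets. This is an illustration under stated assumptions, not the code of InsertDeletedPunctuationStage: the {{Tok}} class, the {{insertDeletedPunctuation}} helper and the rule that "punctuation" means any character that is neither whitespace nor a letter or digit are all hypothetical. The sketch looks at the text between the end of one token and the start of the next and, if punctuation was skipped there, inserts a token flagged as deleted.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class PunctuationGapSketch {

    /** Hypothetical token with character offsets and a "deleted" flag. */
    static final class Tok {
        final String term;
        final int start;
        final int end;
        final boolean deleted;

        Tok(String term, int start, int end, boolean deleted) {
            this.term = term;
            this.start = start;
            this.end = end;
            this.deleted = deleted;
        }
    }

    /** Copies the tokens, adding a deleted token wherever punctuation was skipped between two tokens. */
    static List<Tok> insertDeletedPunctuation(String text, List<Tok> in) {
        List<Tok> out = new ArrayList<>();
        int prevEnd = -1;
        for (Tok t : in) {
            if (prevEnd >= 0) {
                int first = -1;
                int last = -1;
                // Scan the gap the tokenizer skipped; remember the punctuation span, if any.
                for (int i = prevEnd; i < t.start; i++) {
                    char c = text.charAt(i);
                    if (!Character.isWhitespace(c) && !Character.isLetterOrDigit(c)) {
                        if (first < 0) {
                            first = i;
                        }
                        last = i;
                    }
                }
                if (first >= 0) {
                    out.add(new Tok(text.substring(first, last + 1), first, last + 1, true));
                }
            }
            out.add(t);
            prevEnd = t.end;
        }
        return out;
    }

    /** Tokens a whitespace/punctuation-splitting tokenizer might emit for "dns, domain name". */
    static List<Tok> sample() {
        return Arrays.asList(
            new Tok("dns", 0, 3, false),
            new Tok("domain", 5, 11, false),
            new Tok("name", 12, 16, false));
    }

    public static void main(String[] args) {
        for (Tok t : insertDeletedPunctuation("dns, domain name", sample())) {
            System.out.println(t.term + " [" + t.start + "," + t.end + ")" + (t.deleted ? " deleted" : ""));
        }
    }
}
```

A downstream consumer that refuses to match across deleted tokens would then not treat "dns domain" as adjacent words in this input.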
[jira] [Commented] (LUCENE-5012) Make graph-based TokenFilters easier
[ https://issues.apache.org/jira/browse/LUCENE-5012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13666519#comment-13666519 ]

Artem Lukanin commented on LUCENE-5012:

Great! I was asking several people about it at Lucene/Solr Revolution 2013. Just don't forget that if you set your default operator to AND, you should still use ORs for synonyms. I was trying to solve this problem partly in https://issues.apache.org/jira/browse/SOLR-4533. I'm not sure how the combination of ORs and ANDs is done in an FSA if you are going to use WordAutomaton right away.
[jira] [Commented] (LUCENE-5012) Make graph-based TokenFilters easier
[ https://issues.apache.org/jira/browse/LUCENE-5012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13663340#comment-13663340 ]

Jack Krupansky commented on LUCENE-5012:

Will this Jira include some test code that query parsers can use so that they can retrieve the graph for a stream containing multiple multi-term synonyms, so that they can then individually sausage the term sequences as well as generate OR operators for a string of sausages?
[jira] [Commented] (LUCENE-5012) Make graph-based TokenFilters easier
[ https://issues.apache.org/jira/browse/LUCENE-5012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13663402#comment-13663402 ]

Michael McCandless commented on LUCENE-5012:

bq. Will this Jira include some test code that query parsers can use so that they can retrieve the graph for a stream containing multiple multi-term synonyms so that they can then individually sausage the term sequences as well as generate OR operators for string of sausages?

I think so ... the test case (TestStages) sort of does that already: it turns the tokens into an automaton, to check that the automaton accepts the specified sequences of tokens (and ONLY those sequences of tokens).

But, I think ideally we'd have a WordAutomatonQuery, which could take the Automaton directly and match documents using that, instead of enumerating all phrases and OR'ing them (which would be equivalent but presumably slower...). I would do WordAutomatonQuery on a separate issue ...
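The "enumerate all phrases and OR them" alternative can be sketched in plain Java as a walk over every path of a small token graph. This is only an illustration: the {{Arc}} class, the {{phrases}} and {{asOrQuery}} helpers and the sample graph are hypothetical, the graph is assumed to be acyclic, and the output is a plain string rather than any real query object. It does not show how TestStages builds its automaton.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

final class PhraseEnumerationSketch {

    /** Hypothetical arc: a token leading from one node id to another. */
    static final class Arc {
        final int from;
        final int to;
        final String term;

        Arc(int from, int to, String term) {
            this.from = from;
            this.to = to;
            this.term = term;
        }
    }

    /** Returns every token sequence leading from start to end (graph must be acyclic). */
    static List<String> phrases(List<Arc> arcs, int start, int end) {
        List<String> out = new ArrayList<>();
        walk(arcs, start, end, new ArrayDeque<String>(), out);
        return out;
    }

    private static void walk(List<Arc> arcs, int node, int end, Deque<String> path, List<String> out) {
        if (node == end) {
            out.add(String.join(" ", path));
            return;
        }
        for (Arc a : arcs) {
            if (a.from == node) {
                path.addLast(a.term);          // extend the current phrase
                walk(arcs, a.to, end, path, out);
                path.removeLast();             // backtrack
            }
        }
    }

    /** Joins the phrases as quoted alternatives, e.g. "a b" OR "c d". */
    static String asOrQuery(List<String> phrases) {
        List<String> quoted = new ArrayList<>();
        for (String p : phrases) {
            quoted.add("\"" + p + "\"");
        }
        return String.join(" OR ", quoted);
    }

    /** "dns server" with the multi-token synonym "domain name service" for "dns". */
    static List<Arc> sample() {
        return Arrays.asList(
            new Arc(0, 1, "dns"),
            new Arc(0, 2, "domain"),
            new Arc(2, 3, "name"),
            new Arc(3, 1, "service"),
            new Arc(1, 4, "server"));
    }

    public static void main(String[] args) {
        // "dns server" OR "domain name service server"
        System.out.println(asOrQuery(phrases(sample(), 0, 4)));
    }
}
```

The number of enumerated phrases grows multiplicatively with each position that has alternatives, which is one reason matching directly against the automaton could be attractive.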
[jira] [Commented] (LUCENE-5012) Make graph-based TokenFilters easier
[ https://issues.apache.org/jira/browse/LUCENE-5012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13663416#comment-13663416 ]

Adrien Grand commented on LUCENE-5012:

This looks very promising! I've been looking at a few TokenFilters recently and anything that would make working with graphs easier is very welcome!

Maybe we should create a branch to make it easier to collaborate and to track incremental updates?
[jira] [Commented] (LUCENE-5012) Make graph-based TokenFilters easier
[ https://issues.apache.org/jira/browse/LUCENE-5012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13663433#comment-13663433 ]

Michael McCandless commented on LUCENE-5012:

Good idea Adrien! I'll cut a branch and commit the patch ...
[jira] [Commented] (LUCENE-5012) Make graph-based TokenFilters easier
[ https://issues.apache.org/jira/browse/LUCENE-5012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13663434#comment-13663434 ]

Commit Tag Bot commented on LUCENE-5012:

[lucene5012 commit] mikemccand
http://svn.apache.org/viewvc?view=revisionrevision=1484977

LUCENE-5012: create branch
[jira] [Commented] (LUCENE-5012) Make graph-based TokenFilters easier
[ https://issues.apache.org/jira/browse/LUCENE-5012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13663444#comment-13663444 ] Commit Tag Bot commented on LUCENE-5012: [lucene5012 commit] mikemccand http://svn.apache.org/viewvc?view=revision&revision=1484980 LUCENE-5012: initial prototype
[jira] [Commented] (LUCENE-5012) Make graph-based TokenFilters easier
[ https://issues.apache.org/jira/browse/LUCENE-5012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13663445#comment-13663445 ] Michael McCandless commented on LUCENE-5012: OK I committed the initial patch to this branch: https://svn.apache.org/repos/asf/lucene/dev/branches/lucene5012
[jira] [Commented] (LUCENE-5012) Make graph-based TokenFilters easier
[ https://issues.apache.org/jira/browse/LUCENE-5012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13663471#comment-13663471 ] Jack Krupansky commented on LUCENE-5012:

bq. WordAutomatonQuery

Sounds quite promising.

Back to the query parsers...

So, they would present a term or quoted string - and eventually hopefully a sequence of terms if the query parser sees that there is only white space between them (an issue Robert filed long ago) - and invoke analysis. Then what?

Sometimes a single term or a clean sausage string comes out and a TermQuery or simple BooleanQuery or PhraseQuery needs to be generated, but if synonym-like filtering has generated a graph, then the query parser would hand it directly to WordAutomatonQuery, if I understand correctly.

Then the question becomes how to tell that a WordAutomatonQuery graph is needed - unless WordAutomatonQuery automatically detects the cases for TermQuery and BooleanQuery/PhraseQuery as internal optimizations. (Well, I don't expect that WordAutomatonQuery would know how to do BooleanQuery vs. PhraseQuery, unless it has a phrase flag.)

In short, it would be nice if this issue directly or at least partially produced enough logic for that Term vs. Boolean vs. Phrase vs. WordAutomaton Query generation. Either to actually generate the final query, or at least some example code that documents the design pattern that a query parser needs for consumption of a query phrase graph.

In other words, the query parsers should not simply do a next for the entire output of query term analysis. A new design pattern is needed.

Also, at index time, the output of analysis is consumed as a single sausage stream, using next and token position increment, but any multiple multi-word synonyms would traditionally get somewhat mangled.
There may not be a clean solution for the current index term posting format, but at a minimum we should reconsider how the output of index-time term analysis is consumed and flag potential improvements for the future for posting of multiple multi-term phrases at the same token position. In any case, thanks for moving the multiple multi-term synonym ball forward!
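To make the Term vs. Boolean vs. Phrase vs. WordAutomaton question above concrete, here is a minimal, self-contained sketch of how a query parser might inspect the analyzed tokens before choosing a query type. It is an illustration under stated assumptions, not code from the patch or the branch: `QueryShapeSketch`, `Token`, `Shape`, `positions` and `classify` are hypothetical names, and `Token` is a plain stand-in for the position increment and position length values that a token stream carries. The rule used here (any token with position length greater than 1 means a graph) is one plausible heuristic, not something this thread settled on.

```java
import java.util.List;

public class QueryShapeSketch {

    /** Hypothetical stand-in for one analyzed token (not a Lucene class). */
    static final class Token {
        final String term;
        final int posInc; // position increment: 0 stacks the token on the previous one
        final int posLen; // position length: how many positions the token spans

        Token(String term, int posInc, int posLen) {
            this.term = term;
            this.posInc = posInc;
            this.posLen = posLen;
        }
    }

    /** Coarse shapes a query parser might branch on (hypothetical). */
    enum Shape { EMPTY, SINGLE_TERM, SAUSAGE, GRAPH }

    /**
     * Absolute start position of each token, as a sausage-style consumer
     * sees it: only the position increment is used, position length is ignored.
     */
    static int[] positions(List<Token> tokens) {
        int[] out = new int[tokens.size()];
        int pos = -1;
        for (int i = 0; i < tokens.size(); i++) {
            pos += tokens.get(i).posInc;
            out[i] = pos;
        }
        return out;
    }

    /**
     * Assumed heuristic: a token spanning more than one position means the
     * stream has side paths and cannot be treated as a sausage.
     */
    static Shape classify(List<Token> tokens) {
        if (tokens.isEmpty()) {
            return Shape.EMPTY;
        }
        for (Token t : tokens) {
            if (t.posLen > 1) {
                return Shape.GRAPH;
            }
        }
        return tokens.size() == 1 ? Shape.SINGLE_TERM : Shape.SAUSAGE;
    }
}
```

Under this sketch a parser could map SINGLE_TERM to TermQuery, SAUSAGE to BooleanQuery or PhraseQuery depending on whether the input was quoted, and GRAPH to the proposed WordAutomatonQuery. The `positions` helper also shows the index-time concern: for the tokens dns (posInc 1, posLen 3), domain (0, 1), name (1, 1), service (1, 1), a consumer that ignores position length records dns at position 0 with no trace that it spans three positions.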