[jira] [Commented] (SOLR-5228) Deprecate fields and types tags in schema.xml
[ https://issues.apache.org/jira/browse/SOLR-5228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13944920#comment-13944920 ] Benson Margulies commented on SOLR-5228: Allow the person extending the schema to provide a, well, extended schema. Deprecate fields and types tags in schema.xml - Key: SOLR-5228 URL: https://issues.apache.org/jira/browse/SOLR-5228 Project: Solr Issue Type: Improvement Components: Schema and Analysis Reporter: Hoss Man Assignee: Erick Erickson Fix For: 4.8, 5.0 Attachments: SOLR-5228.patch, SOLR-5228.patch On the solr-user mailing list, Nutan recently mentioned spending days trying to track down a problem that turned out to be because he had attempted to add a {{<dynamicField .. />}} that was outside of the {{<fields>}} block in his schema.xml -- Solr was just silently ignoring it. We have made improvements in other areas of config validation by generating startup errors when tags/attributes are found that are not expected -- but in this case I think we should just stop expecting/requiring that the {{<fields>}} and {{<types>}} tags will be used to group these sorts of things. I think schema.xml parsing should just start ignoring them and only care about finding the {{<field>}}, {{<dynamicField>}}, and {{<fieldType>}} tags wherever they may be. If people want to keep using them, fine. If people want to mix fieldTypes and fields side by side (perhaps specify a fieldType, then list all the fields using it) fine. I don't see any value in forcing people to use them, but we definitely shouldn't leave things the way they are with otherwise perfectly valid field/type declarations being silently ignored. --- I'll take this on unless I see any objections. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
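The difference between container-bound and position-independent lookup described in the issue can be illustrated with plain JDK XPath. This is only a sketch: the schema fragment, the class name `SchemaFieldLookup`, and both XPath expressions are assumptions made for illustration, and it is not Solr's actual schema-parsing code.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class SchemaFieldLookup {
  // Hypothetical schema.xml fragment: one dynamicField sits outside the fields block.
  static final String SCHEMA =
      "<schema name=\"example\">"
          + "<fields><field name=\"id\" type=\"string\"/></fields>"
          + "<dynamicField name=\"*_s\" type=\"string\"/>"
          + "</schema>";

  /** Counts the nodes that an XPath expression matches in SCHEMA. */
  static int count(String expression) throws Exception {
    XPath xpath = XPathFactory.newInstance().newXPath();
    NodeList nodes = (NodeList) xpath.evaluate(
        expression, new InputSource(new StringReader(SCHEMA)), XPathConstants.NODESET);
    return nodes.getLength();
  }

  public static void main(String[] args) throws Exception {
    // Container-bound lookup: the stray dynamicField is silently missed.
    System.out.println(count("/schema/fields/field | /schema/fields/dynamicField"));
    // Position-independent lookup: both declarations are found.
    System.out.println(count("/schema//field | /schema//dynamicField"));
  }
}
```

With the fragment above, the first expression matches one node and the second matches two, which is the "silently ignored" case from the issue description in miniature.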
[jira] [Commented] (SOLR-5228) Deprecate fields and types tags in schema.xml
[ https://issues.apache.org/jira/browse/SOLR-5228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13944971#comment-13944971 ] Benson Margulies commented on SOLR-5228: DTDs are useless. We need to pick one of W3C XML Schema or RELAX NG (RNG). RNG is a lot easier to work with. Schematron is another possibility, but I have no experience with it. See http://docs.oracle.com/javase/7/docs/api/javax/xml/validation/package-summary.html. Choices are:
* validation is easy to disable; people who customize disable it
* customizers take the entire schema, add to it, and provide their added one. Not so good for multiples.
* customizers are constrained to use _namespaces_ -- you customize, you add an XML namespace, and you provide a schema for your namespace.
Of course the first time we try this we'll find problems in the test schemas. Has anyone done anything in this area that I could start from if I was inclined to try to work on this?
Deprecate fields and types tags in schema.xml - Key: SOLR-5228 URL: https://issues.apache.org/jira/browse/SOLR-5228 Project: Solr Issue Type: Improvement Components: Schema and Analysis Reporter: Hoss Man Assignee: Erick Erickson Fix For: 4.8, 5.0 Attachments: SOLR-5228.patch, SOLR-5228.patch
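A validating check of the kind proposed in the comment above could look roughly like this with the `javax.xml.validation` package it links to. This is a sketch under assumptions: the toy grammar, the class name `SchemaXmlValidation`, and the sample documents are invented for illustration and do not describe Solr's real schema.xml format. It uses W3C XML Schema because that is what the JDK supports out of the box; RELAX NG would need an additional library.

```java
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

public class SchemaXmlValidation {
  // Hypothetical, heavily simplified grammar: <schema> holds exactly one <fields>
  // block, which may contain any number of <field> and <dynamicField> elements.
  static final String XSD =
      "<xs:schema xmlns:xs=\"http://www.w3.org/2001/XMLSchema\">"
          + "<xs:complexType name=\"decl\">"
          + "<xs:attribute name=\"name\" type=\"xs:string\" use=\"required\"/>"
          + "<xs:attribute name=\"type\" type=\"xs:string\" use=\"required\"/>"
          + "</xs:complexType>"
          + "<xs:element name=\"schema\"><xs:complexType><xs:sequence>"
          + "<xs:element name=\"fields\"><xs:complexType>"
          + "<xs:choice minOccurs=\"0\" maxOccurs=\"unbounded\">"
          + "<xs:element name=\"field\" type=\"decl\"/>"
          + "<xs:element name=\"dynamicField\" type=\"decl\"/>"
          + "</xs:choice></xs:complexType></xs:element>"
          + "</xs:sequence></xs:complexType></xs:element>"
          + "</xs:schema>";

  static final String GOOD =
      "<schema><fields><field name=\"id\" type=\"string\"/></fields></schema>";

  // The dynamicField is declared outside the fields block.
  static final String MISPLACED =
      "<schema><fields><field name=\"id\" type=\"string\"/></fields>"
          + "<dynamicField name=\"*_s\" type=\"string\"/></schema>";

  /** Returns null when the document is valid, otherwise the validator's message. */
  static String validate(String xml) throws Exception {
    SchemaFactory factory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
    Schema grammar = factory.newSchema(new StreamSource(new StringReader(XSD)));
    try {
      grammar.newValidator().validate(new StreamSource(new StringReader(xml)));
      return null;
    } catch (SAXException e) {
      // With no ErrorHandler set, validation errors surface as exceptions.
      return e.getMessage();
    }
  }

  public static void main(String[] args) throws Exception {
    System.out.println("good: " + validate(GOOD));
    System.out.println("misplaced: " + validate(MISPLACED));
  }
}
```

The misplaced declaration produces a diagnostic instead of being dropped, which is the behaviour the comment argues for. How customizers would extend such a grammar (the three choices listed above) is left open here.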
[jira] [Commented] (SOLR-5228) Deprecate fields and types tags in schema.xml
[ https://issues.apache.org/jira/browse/SOLR-5228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13944636#comment-13944636 ] Benson Margulies commented on SOLR-5228: I apologize for showing up so late with an opinion. I can't get over the feeling that this might be solving the wrong problem. In XML, the structure of
{code}
<SOME_ITEMs>
  <SOME_ITEM> </SOME_ITEM>
  ...
</SOME_ITEMs>
{code}
is ancient and honorable. Yea, some schemas dispense with the container for the group, but plenty do not. The source of this was someone who misplaced an item and didn't get a diagnosis. _Why don't we concentrate on diagnosis?_ Why not create a schema and, by default, check it? It's not like we're in a giant hurry at start-up compared to the extra time of enabling a validating parse. Grouping these guys together is harmless at worst and slightly helpful at best. If we are going to change the schema, I would beg that anyone changing it put forth an actual, well, _schema_ that is an accurate representation of what is allowed. So I'm belatedly -1 on this change, for what tiny little bit it's worth.
Deprecate fields and types tags in schema.xml - Key: SOLR-5228 URL: https://issues.apache.org/jira/browse/SOLR-5228 Project: Solr Issue Type: Improvement Components: Schema and Analysis Reporter: Hoss Man Assignee: Erick Erickson Attachments: SOLR-5228.patch, SOLR-5228.patch
[jira] [Assigned] (LUCENE-5449) Two ancient classes renamed to be less peculiar: _TestHelper and _TestUtil
[ https://issues.apache.org/jira/browse/LUCENE-5449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benson Margulies reassigned LUCENE-5449: Assignee: Benson Margulies Two ancient classes renamed to be less peculiar: _TestHelper and _TestUtil -- Key: LUCENE-5449 URL: https://issues.apache.org/jira/browse/LUCENE-5449 Project: Lucene - Core Issue Type: Improvement Affects Versions: 4.6.1 Reporter: Benson Margulies Assignee: Benson Margulies Priority: Minor _TestUtil and _TestHelper begin with _ for historical reasons that don't apply any longer. Let's eliminate those _'s. -- This message was sent by Atlassian JIRA (v6.1.5#6160) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-5449) Two ancient classes renamed to be less peculiar: _TestHelper and _TestUtil
[ https://issues.apache.org/jira/browse/LUCENE-5449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13906174#comment-13906174 ] Benson Margulies commented on LUCENE-5449: I'm unable to reconstruct how I laid this egg. My only theory is that I had somehow cd'd back to the wrong tree before running ant precommit after thinking i've set up the merge. Rob's commit really just finishes my work on 'part 1': part 2 was always going to be the _TestHelper commit. Let's see if I can get that one right. Two ancient classes renamed to be less peculiar: _TestHelper and _TestUtil -- Key: LUCENE-5449 URL: https://issues.apache.org/jira/browse/LUCENE-5449 Project: Lucene - Core Issue Type: Improvement Affects Versions: 4.6.1 Reporter: Benson Margulies Assignee: Benson Margulies Priority: Minor
[jira] [Comment Edited] (LUCENE-5449) Two ancient classes renamed to be less peculiar: _TestHelper and _TestUtil
[ https://issues.apache.org/jira/browse/LUCENE-5449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13906174#comment-13906174 ] Benson Margulies edited comment on LUCENE-5449 at 2/19/14 10:02 PM: I'm unable to reconstruct how I laid this egg. My only theory is that I had somehow cd'd back to the wrong tree before running ant precommit after thinking I had made all the corrections after the merge. Rob's commit really just finishes my work on 'part 1': part 2 was always going to be the _TestHelper commit. Let's see if I can get that one right. was (Author: bmargulies): I'm unable to reconstruct how I laid this egg. My only theory is that I had somehow cd'd back to the wrong tree before running ant precommit after thinking i've set up the merge. Rob's commit really just finishes my work on 'part 1': part 2 was always going to be the _TestHelper commit. Let's see if I can get that one right. Two ancient classes renamed to be less peculiar: _TestHelper and _TestUtil -- Key: LUCENE-5449 URL: https://issues.apache.org/jira/browse/LUCENE-5449 Project: Lucene - Core Issue Type: Improvement Affects Versions: 4.6.1 Reporter: Benson Margulies Assignee: Benson Margulies Priority: Minor
[jira] [Updated] (LUCENE-5449) Two ancient classes renamed to be less peculiar: _TestHelper and _TestUtil
[ https://issues.apache.org/jira/browse/LUCENE-5449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benson Margulies updated LUCENE-5449: Fix Version/s: 5.0 4.8 Two ancient classes renamed to be less peculiar: _TestHelper and _TestUtil -- Key: LUCENE-5449 URL: https://issues.apache.org/jira/browse/LUCENE-5449 Project: Lucene - Core Issue Type: Improvement Affects Versions: 4.6.1 Reporter: Benson Margulies Assignee: Benson Margulies Priority: Minor Fix For: 4.8, 5.0
[jira] [Resolved] (LUCENE-5449) Two ancient classes renamed to be less peculiar: _TestHelper and _TestUtil
[ https://issues.apache.org/jira/browse/LUCENE-5449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benson Margulies resolved LUCENE-5449. Resolution: Fixed Two ancient classes renamed to be less peculiar: _TestHelper and _TestUtil -- Key: LUCENE-5449 URL: https://issues.apache.org/jira/browse/LUCENE-5449 Project: Lucene - Core Issue Type: Improvement Affects Versions: 4.6.1 Reporter: Benson Margulies Assignee: Benson Margulies Priority: Minor Fix For: 4.8, 5.0
[jira] [Commented] (LUCENE-5449) Two ancient classes renamed to be less peculiar: _TestHelper and _TestUtil
[ https://issues.apache.org/jira/browse/LUCENE-5449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13903983#comment-13903983 ] Benson Margulies commented on LUCENE-5449: [~thetaphi], I am not enthusiastic about 1000 edits to change from importing the class to static importing the methods. Do you see this as a requirement, or just a desirable practice going forward? Two ancient classes renamed to be less peculiar: _TestHelper and _TestUtil -- Key: LUCENE-5449 URL: https://issues.apache.org/jira/browse/LUCENE-5449 Project: Lucene - Core Issue Type: Improvement Affects Versions: 4.6.1 Reporter: Benson Margulies Priority: Minor
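For readers unfamiliar with the two import styles discussed in the comment above, here is a small sketch of the difference at the call site. It uses `java.util.Collections` purely as a stand-in; the class name `ImportStyles` and its methods are hypothetical and have nothing to do with the Lucene test utilities being renamed.

```java
import static java.util.Collections.min; // static import: call sites write min(...)

import java.util.Arrays;
import java.util.Collections; // class import: call sites write Collections.max(...)
import java.util.List;

public class ImportStyles {
  /** Qualified call through the imported class. */
  static int largest(List<Integer> values) {
    return Collections.max(values);
  }

  /** Unqualified call through the statically imported method. */
  static int smallest(List<Integer> values) {
    return min(values);
  }

  public static void main(String[] args) {
    List<Integer> values = Arrays.asList(3, 1, 2);
    System.out.println(largest(values) + " " + smallest(values));
  }
}
```

Switching an existing code base from the first style to the second means touching every call site as well as every import, which is presumably where the "1000 edits" estimate comes from.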
[jira] [Commented] (LUCENE-5449) Two ancient classes renamed to be less peculiar: _TestHelper and _TestUtil
[ https://issues.apache.org/jira/browse/LUCENE-5449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13904004#comment-13904004 ] Benson Margulies commented on LUCENE-5449: OK, then this is good to go. (I did include one example of switching to a static import, even though I agree with [~mikemccand] in general.) Two ancient classes renamed to be less peculiar: _TestHelper and _TestUtil -- Key: LUCENE-5449 URL: https://issues.apache.org/jira/browse/LUCENE-5449 Project: Lucene - Core Issue Type: Improvement Affects Versions: 4.6.1 Reporter: Benson Margulies Priority: Minor
[jira] [Created] (LUCENE-5448) Random string generation centralized in _TestUtil
Benson Margulies created LUCENE-5448: Summary: Random string generation centralized in _TestUtil Key: LUCENE-5448 URL: https://issues.apache.org/jira/browse/LUCENE-5448 Project: Lucene - Core Issue Type: Improvement Affects Versions: 4.6.1 Reporter: Benson Margulies The random string generators in BaseTokenStreamTestCase have wider applicability and should move in with their cousins.
[jira] [Resolved] (LUCENE-5448) Random string generation centralized in _TestUtil
[ https://issues.apache.org/jira/browse/LUCENE-5448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benson Margulies resolved LUCENE-5448. Resolution: Fixed Fix Version/s: 5.0 Assignee: Benson Margulies Random string generation centralized in _TestUtil - Key: LUCENE-5448 URL: https://issues.apache.org/jira/browse/LUCENE-5448 Project: Lucene - Core Issue Type: Improvement Affects Versions: 4.6.1 Reporter: Benson Margulies Assignee: Benson Margulies Fix For: 5.0
[jira] [Updated] (LUCENE-5448) Random string generation centralized in _TestUtil
[ https://issues.apache.org/jira/browse/LUCENE-5448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benson Margulies updated LUCENE-5448: Fix Version/s: 4.7 Random string generation centralized in _TestUtil - Key: LUCENE-5448 URL: https://issues.apache.org/jira/browse/LUCENE-5448 Project: Lucene - Core Issue Type: Improvement Affects Versions: 4.6.1 Reporter: Benson Margulies Assignee: Benson Margulies Fix For: 5.0, 4.7
[jira] [Created] (LUCENE-5449) Two ancient classes renamed to be less peculiar: _TestHelper and _TestUtil
Benson Margulies created LUCENE-5449: Summary: Two ancient classes renamed to be less peculiar: _TestHelper and _TestUtil Key: LUCENE-5449 URL: https://issues.apache.org/jira/browse/LUCENE-5449 Project: Lucene - Core Issue Type: Improvement Affects Versions: 4.6.1 Reporter: Benson Margulies Priority: Minor _TestUtil and _TestHelper begin with _ for historical reasons that don't apply any longer. Let's eliminate those _'s.
[jira] [Commented] (LUCENE-4956) the korean analyzer that has a korean morphological analyzer and dictionaries
[ https://issues.apache.org/jira/browse/LUCENE-4956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13895706#comment-13895706 ] Benson Margulies commented on LUCENE-4956: This is a patch, not an accepted component of Apache Lucene. There's no guarantee that anyone will work on it. the korean analyzer that has a korean morphological analyzer and dictionaries - Key: LUCENE-4956 URL: https://issues.apache.org/jira/browse/LUCENE-4956 Project: Lucene - Core Issue Type: New Feature Components: modules/analysis Affects Versions: 4.2 Reporter: SooMyung Lee Assignee: Christian Moen Labels: newbie Attachments: LUCENE-4956.patch, eval.patch, kr.analyzer.4x.tar, lucene-4956.patch, lucene4956.patch The Korean language has specific characteristics. When developing a search service with Lucene and Solr in Korean, there are some problems in searching and indexing. The Korean analyzer solves these problems with a Korean morphological analyzer. It consists of a Korean morphological analyzer, dictionaries, a Korean tokenizer and a Korean filter. The Korean analyzer is made for Lucene and Solr. If you develop a search service with Lucene in Korean, it is the best idea to choose the Korean analyzer.
[jira] [Commented] (SOLR-5623) Better diagnosis of RuntimeExceptions in analysis
[ https://issues.apache.org/jira/browse/SOLR-5623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13892049#comment-13892049 ] Benson Margulies commented on SOLR-5623: [~shalinmangar] Apparently I haven't learned to read the output of ant test very well, and fooled myself into believing that all was well. Thanks for cleaning up after me. Better diagnosis of RuntimeExceptions in analysis - Key: SOLR-5623 URL: https://issues.apache.org/jira/browse/SOLR-5623 Project: Solr Issue Type: Bug Affects Versions: 4.6.1 Reporter: Benson Margulies Assignee: Benson Margulies Fix For: 5.0, 4.7 Attachments: SOLR-5623-nowrap.patch, SOLR-5623-nowrap.patch If an analysis component (tokenizer, filter, etc) gets really into a hissy fit and throws a RuntimeException, the resulting log traffic is less than informative, lacking any pointer to the doc under discussion (in the doc case). It would be more better if there was a catch/try shortstop that logged this more informatively.
[jira] [Commented] (SOLR-5623) Better diagnosis of RuntimeExceptions in analysis
[ https://issues.apache.org/jira/browse/SOLR-5623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13891512#comment-13891512 ] Benson Margulies commented on SOLR-5623: trunk patch 1564584. Better diagnosis of RuntimeExceptions in analysis - Key: SOLR-5623 URL: https://issues.apache.org/jira/browse/SOLR-5623 Project: Solr Issue Type: Bug Reporter: Benson Margulies Assignee: Benson Margulies
[jira] [Commented] (LUCENE-5405) Exception strategy for analysis improved
[ https://issues.apache.org/jira/browse/LUCENE-5405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13889406#comment-13889406 ] Benson Margulies commented on LUCENE-5405: Will do. Thanks, this is exactly the sort of feedback I was looking for. Exception strategy for analysis improved Key: LUCENE-5405 URL: https://issues.apache.org/jira/browse/LUCENE-5405 Project: Lucene - Core Issue Type: Improvement Reporter: Benson Margulies Assignee: Benson Margulies Fix For: 5.0 Attachments: LUCENE-5405-4.x.patch SOLR-5623 included some conversation about the dilemmas of exception management and reporting in the analysis chain. I've belatedly become educated about the infostream, and this situation is a job for it. The DocInverterPerField can note exceptions in the analysis chain, log out to the infostream, and then rethrow them as before. No wrapping, no muss, no fuss. There are comments on this JIRA from a more complex prior idea that readers might want to ignore.
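The "note the exception, log it, rethrow it unchanged" strategy in the issue description can be sketched in plain Java. Everything named here is an assumption for illustration: `MessageSink` and `FieldAnalysis` are hypothetical stand-ins rather than Lucene's InfoStream or analysis classes, and this is not the code in the attached patch.

```java
import java.util.ArrayList;
import java.util.List;

public class LogAndRethrow {
  /** Hypothetical stand-in for an info stream that accepts diagnostic messages. */
  public interface MessageSink {
    void message(String component, String text);
  }

  /** Hypothetical stand-in for the analysis applied to one field's text. */
  public interface FieldAnalysis {
    void run(String text);
  }

  /** Runs the analysis; on failure, records which field failed and rethrows as-is. */
  static void analyzeField(
      String fieldName, String text, FieldAnalysis analysis, MessageSink sink) {
    try {
      analysis.run(text);
    } catch (RuntimeException e) {
      // "analysis" is an arbitrary component label chosen for this sketch.
      sink.message("analysis", "exception while analyzing field '" + fieldName + "': " + e);
      throw e; // the same exception object propagates: no wrapping
    }
  }

  public static void main(String[] args) {
    List<String> log = new ArrayList<>();
    try {
      analyzeField(
          "body",
          "some text",
          t -> { throw new IllegalStateException("bad token"); },
          (component, text) -> log.add(component + ": " + text));
    } catch (IllegalStateException e) {
      System.out.println("caller still sees: " + e.getMessage());
    }
    System.out.println(log);
  }
}
```

Because the original exception is rethrown rather than wrapped, callers that already catch specific exception types keep working, while the sink gains the field name that the SOLR-5623 report found missing from the logs.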
[jira] [Updated] (LUCENE-5405) Exception strategy for analysis improved
[ https://issues.apache.org/jira/browse/LUCENE-5405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benson Margulies updated LUCENE-5405: Fix Version/s: 4.7 Exception strategy for analysis improved Key: LUCENE-5405 URL: https://issues.apache.org/jira/browse/LUCENE-5405 Project: Lucene - Core Issue Type: Improvement Reporter: Benson Margulies Assignee: Benson Margulies Fix For: 5.0, 4.7 Attachments: LUCENE-5405-4.x.patch
[jira] [Commented] (LUCENE-5405) Exception strategy for analysis improved
[ https://issues.apache.org/jira/browse/LUCENE-5405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13889424#comment-13889424 ] Benson Margulies commented on LUCENE-5405: rev 1563850 provides the backport. Exception strategy for analysis improved Key: LUCENE-5405 URL: https://issues.apache.org/jira/browse/LUCENE-5405 Project: Lucene - Core Issue Type: Improvement Reporter: Benson Margulies Assignee: Benson Margulies Fix For: 5.0, 4.7 Attachments: LUCENE-5405-4.x.patch
[jira] [Resolved] (LUCENE-5405) Exception strategy for analysis improved
[ https://issues.apache.org/jira/browse/LUCENE-5405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benson Margulies resolved LUCENE-5405. Resolution: Fixed backported, CHANGES.txt filled in. 'this time for sure' Exception strategy for analysis improved Key: LUCENE-5405 URL: https://issues.apache.org/jira/browse/LUCENE-5405 Project: Lucene - Core Issue Type: Improvement Reporter: Benson Margulies Assignee: Benson Margulies Fix For: 5.0, 4.7 Attachments: LUCENE-5405-4.x.patch
[jira] [Commented] (LUCENE-5405) Exception strategy for analysis improved
[ https://issues.apache.org/jira/browse/LUCENE-5405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13888958#comment-13888958 ] Benson Margulies commented on LUCENE-5405: I can backport, [~mikemccand]. Is there any doc on how the project manages branches? If not, I can add some to the web site to help guide patch-offerers. Exception strategy for analysis improved Key: LUCENE-5405 URL: https://issues.apache.org/jira/browse/LUCENE-5405 Project: Lucene - Core Issue Type: Improvement Reporter: Benson Margulies Assignee: Benson Margulies Fix For: 5.0
[jira] [Commented] (LUCENE-5405) Exception strategy for analysis improved
[ https://issues.apache.org/jira/browse/LUCENE-5405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13889051#comment-13889051 ] Benson Margulies commented on LUCENE-5405: Somehow the unit test escaped the prior commit. 1563711 fills it in. Exception strategy for analysis improved Key: LUCENE-5405 URL: https://issues.apache.org/jira/browse/LUCENE-5405 Project: Lucene - Core Issue Type: Improvement Reporter: Benson Margulies Assignee: Benson Margulies Fix For: 5.0
[jira] [Commented] (LUCENE-5405) Exception strategy for analysis improved
[ https://issues.apache.org/jira/browse/LUCENE-5405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13889095#comment-13889095 ] Benson Margulies commented on LUCENE-5405: Well, svn merge did something I can't make heads or tails of, so I'm going to merge by hand. Exception strategy for analysis improved Key: LUCENE-5405 URL: https://issues.apache.org/jira/browse/LUCENE-5405 Project: Lucene - Core Issue Type: Improvement Reporter: Benson Margulies Assignee: Benson Margulies Fix For: 5.0
[jira] [Updated] (LUCENE-5405) Exception strategy for analysis improved
[ https://issues.apache.org/jira/browse/LUCENE-5405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benson Margulies updated LUCENE-5405: Attachment: LUCENE-5405-4.x.patch Reviewable port. Exception strategy for analysis improved Key: LUCENE-5405 URL: https://issues.apache.org/jira/browse/LUCENE-5405 Project: Lucene - Core Issue Type: Improvement Reporter: Benson Margulies Assignee: Benson Margulies Fix For: 5.0 Attachments: LUCENE-5405-4.x.patch
[jira] [Commented] (LUCENE-5405) Exception strategy for analysis improved
[ https://issues.apache.org/jira/browse/LUCENE-5405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13889101#comment-13889101 ] Benson Margulies commented on LUCENE-5405: [~mikemccand] and [~rcmuir]: The code in the 4.x branch is more complex. I _think_ I've managed to carry the strategy across, but I'd be grateful for some skeptical eyeballs before I commit the attached patch that does the backport. Exception strategy for analysis improved Key: LUCENE-5405 URL: https://issues.apache.org/jira/browse/LUCENE-5405 Project: Lucene - Core Issue Type: Improvement Reporter: Benson Margulies Assignee: Benson Margulies Fix For: 5.0 Attachments: LUCENE-5405-4.x.patch
[jira] [Assigned] (SOLR-5623) Better diagnosis of RuntimeExceptions in analysis
[ https://issues.apache.org/jira/browse/SOLR-5623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benson Margulies reassigned SOLR-5623: -- Assignee: Benson Margulies Better diagnosis of RuntimeExceptions in analysis - Key: SOLR-5623 URL: https://issues.apache.org/jira/browse/SOLR-5623 Project: Solr Issue Type: Bug Reporter: Benson Margulies Assignee: Benson Margulies If an analysis component (tokenizer, filter, etc) gets really into a hissy fit and throws a RuntimeException, the resulting log traffic is less than informative, lacking any pointer to the doc under discussion (in the doc case). It would be more better if there was a catch/try shortstop that logged this more informatively. -- This message was sent by Atlassian JIRA (v6.1.5#6160) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-5405) Exception strategy for analysis improved
[ https://issues.apache.org/jira/browse/LUCENE-5405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13886067#comment-13886067 ] Benson Margulies commented on LUCENE-5405: -- Am I good to commit here? Exception strategy for analysis improved Key: LUCENE-5405 URL: https://issues.apache.org/jira/browse/LUCENE-5405 Project: Lucene - Core Issue Type: Improvement Reporter: Benson Margulies Assignee: Benson Margulies SOLR-5623 included some conversation about the dilemmas of exception management and reporting in the analysis chain. I've belatedly become educated about the infostream, and this situation is a job for it. The DocInverterPerField can note exceptions in the analysis chain, log out to the infostream, and then rethrow them as before. No wrapping, no muss, no fuss. There are comments on this JIRA from a more complex prior idea that readers might want to ignore. -- This message was sent by Atlassian JIRA (v6.1.5#6160) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Assigned] (LUCENE-5405) Exception strategy for analysis improved
[ https://issues.apache.org/jira/browse/LUCENE-5405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benson Margulies reassigned LUCENE-5405: Assignee: Benson Margulies Exception strategy for analysis improved Key: LUCENE-5405 URL: https://issues.apache.org/jira/browse/LUCENE-5405 Project: Lucene - Core Issue Type: Improvement Reporter: Benson Margulies Assignee: Benson Margulies SOLR-5623 included some conversation about the dilemmas of exception management and reporting in the analysis chain. I've belatedly become educated about the infostream, and this situation is a job for it. The DocInverterPerField can note exceptions in the analysis chain, log out to the infostream, and then rethrow them as before. No wrapping, no muss, no fuss. There are comments on this JIRA from a more complex prior idea that readers might want to ignore. -- This message was sent by Atlassian JIRA (v6.1.5#6160) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-5623) Better diagnosis of RuntimeExceptions in analysis
[ https://issues.apache.org/jira/browse/SOLR-5623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13886068#comment-13886068 ] Benson Margulies commented on SOLR-5623: [~hossman_luc...@fucit.org] have you looked at my revs? Better diagnosis of RuntimeExceptions in analysis - Key: SOLR-5623 URL: https://issues.apache.org/jira/browse/SOLR-5623 Project: Solr Issue Type: Bug Reporter: Benson Margulies Assignee: Benson Margulies If an analysis component (tokenizer, filter, etc) gets really into a hissy fit and throws a RuntimeException, the resulting log traffic is less than informative, lacking any pointer to the doc under discussion (in the doc case). It would be more better if there was a catch/try shortstop that logged this more informatively. -- This message was sent by Atlassian JIRA (v6.1.5#6160) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Moved] (SOLR-5677) HaversineConstFunction ignores one of its two values, is this on purpose?
[ https://issues.apache.org/jira/browse/SOLR-5677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benson Margulies moved LUCENE-4036 to SOLR-5677: Component/s: (was: core/other) Schema and Analysis Lucene Fields: (was: New) Affects Version/s: (was: 4.0-ALPHA) 4.0-ALPHA Key: SOLR-5677 (was: LUCENE-4036) Project: Solr (was: Lucene - Core) HaversineConstFunction ignores one of its two values, is this on purpose? - Key: SOLR-5677 URL: https://issues.apache.org/jira/browse/SOLR-5677 Project: Solr Issue Type: Bug Components: Schema and Analysis Affects Versions: 4.0-ALPHA Reporter: Benson Margulies org.apache.solr.search.function.distance.HaversineConstFunction.parser.new ValueSourceParser() {...}.parse(FunctionQParser) has an unused variable warning for 'vs2', and uses vs1 to initialize mv2. Maybe vs2 should just be deleted? -- This message was sent by Atlassian JIRA (v6.1.5#6160) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Resolved] (SOLR-5677) HaversineConstFunction ignores one of its two values, is this on purpose?
[ https://issues.apache.org/jira/browse/SOLR-5677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benson Margulies resolved SOLR-5677. Resolution: Fixed Fix Version/s: 5.0 Assignee: Benson Margulies Well, the trunk code no longer has this problem. HaversineConstFunction ignores one of its two values, is this on purpose? - Key: SOLR-5677 URL: https://issues.apache.org/jira/browse/SOLR-5677 Project: Solr Issue Type: Bug Components: Schema and Analysis Affects Versions: 4.0-ALPHA Reporter: Benson Margulies Assignee: Benson Margulies Fix For: 5.0 org.apache.solr.search.function.distance.HaversineConstFunction.parser.new ValueSourceParser() {...}.parse(FunctionQParser) has an unused variable warning for 'vs2', and uses vs1 to initialize mv2. Maybe vs2 should just be deleted? -- This message was sent by Atlassian JIRA (v6.1.5#6160) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Resolved] (LUCENE-5405) Exception strategy for analysis improved
[ https://issues.apache.org/jira/browse/LUCENE-5405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benson Margulies resolved LUCENE-5405. -- Resolution: Fixed Fix Version/s: 5.0 Fixed in rev 1562657. Exception strategy for analysis improved Key: LUCENE-5405 URL: https://issues.apache.org/jira/browse/LUCENE-5405 Project: Lucene - Core Issue Type: Improvement Reporter: Benson Margulies Assignee: Benson Margulies Fix For: 5.0 SOLR-5623 included some conversation about the dilemmas of exception management and reporting in the analysis chain. I've belatedly become educated about the infostream, and this situation is a job for it. The DocInverterPerField can note exceptions in the analysis chain, log out to the infostream, and then rethrow them as before. No wrapping, no muss, no fuss. There are comments on this JIRA from a more complex prior idea that readers might want to ignore. -- This message was sent by Atlassian JIRA (v6.1.5#6160) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-5623) Better diagnosis of RuntimeExceptions in analysis
[ https://issues.apache.org/jira/browse/SOLR-5623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13875702#comment-13875702 ] Benson Margulies commented on SOLR-5623: OK, pushed changes as per remarks. Better diagnosis of RuntimeExceptions in analysis - Key: SOLR-5623 URL: https://issues.apache.org/jira/browse/SOLR-5623 Project: Solr Issue Type: Bug Reporter: Benson Margulies If an analysis component (tokenizer, filter, etc) gets really into a hissy fit and throws a RuntimeException, the resulting log traffic is less than informative, lacking any pointer to the doc under discussion (in the doc case). It would be more better if there was a catch/try shortstop that logged this more informatively. -- This message was sent by Atlassian JIRA (v6.1.5#6160) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Created] (LUCENE-5405) Exception strategy for analysis improved
Benson Margulies created LUCENE-5405: Summary: Exception strategy for analysis improved Key: LUCENE-5405 URL: https://issues.apache.org/jira/browse/LUCENE-5405 Project: Lucene - Core Issue Type: Improvement Reporter: Benson Margulies SOLR-5623 included some conversation about the dilemmas of exception management and reporting in the analysis chain. Here is a 5.0 proposal: TokenStream.reset and TokenStream.incrementToken (and perhaps the rest) should have a new, checked, exception in their signatures: call it AnalysisError if you like. Unlike IOException, it will have a full set of constructors, including the constructors that can wrap a 'cause'. Its constructors will accept a field name. TokenStream will have a fieldName field, accepted in a constructor argument. (OK, this might be a bit authoritarian.) TokenStream will have: protected void throwAnalysisException(String explanation, Throwable cause) { throw new AnalysisException(fieldName, explanation, cause); } Implementors of analysis will be thus encouraged to write things like: try { doSomething(); } catch (IOExceptionOrWhatever e) { throwAnalysisException("Some Explanation", e); } Then, situations like Solr can diagnose the field name. Note that no information is lost here, due to the use of exception wrapping. -- This message was sent by Atlassian JIRA (v6.1.5#6160) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
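The proposal in the message above can be sketched in a few lines of Java. This is only an illustration of the idea as described, not Lucene code: AnalysisException, SketchTokenStream and FailingStream are hypothetical names, and the real TokenStream has no fieldName member.

```java
import java.io.IOException;

/** Hypothetical checked exception from the proposal; not part of Lucene. */
class AnalysisException extends Exception {
  private final String fieldName;

  AnalysisException(String fieldName, String explanation, Throwable cause) {
    super(explanation + " (field: " + fieldName + ")", cause);
    this.fieldName = fieldName;
  }

  String getFieldName() {
    return fieldName;
  }
}

/** Stand-in for a token stream that knows its field name (an assumption of the proposal). */
abstract class SketchTokenStream {
  private final String fieldName;

  protected SketchTokenStream(String fieldName) {
    this.fieldName = fieldName;
  }

  /** Wraps the cause together with the field name; always throws. */
  protected void throwAnalysisException(String explanation, Throwable cause)
      throws AnalysisException {
    throw new AnalysisException(fieldName, explanation, cause);
  }

  abstract boolean incrementToken() throws AnalysisException;
}

/** Example implementor that hits a low-level failure and reports it with context. */
class FailingStream extends SketchTokenStream {
  FailingStream(String fieldName) {
    super(fieldName);
  }

  @Override
  boolean incrementToken() throws AnalysisException {
    try {
      throw new IOException("simulated low-level failure");
    } catch (IOException e) {
      throwAnalysisException("Some Explanation", e);
    }
    return false; // never reached at runtime, but the compiler requires it
  }
}
```

A caller that catches AnalysisException can read getFieldName() for diagnosis and getCause() for the original failure, which is the "no information is lost" property the proposal relies on.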
[jira] [Commented] (LUCENE-5405) Exception strategy for analysis improved
[ https://issues.apache.org/jira/browse/LUCENE-5405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13875710#comment-13875710 ] Benson Margulies commented on LUCENE-5405: -- I've been frustrated for years by the coincidence that IOException lacks constructors for 'cause', and the Lucene API is full of 'throws IOException'. However, I only just now noticed that Java fixed this in 1.6. So, a weaker form of this would be a subclass of IOException that can carry a field name, and a place for TokenStream to hide a field name. Then something like the Solr error handler could instanceof to see if there's a field name to be had. Given the other API changes to token stream component construction for 5.0, one might argue that adding a ctor arg isn't so bad. Exception strategy for analysis improved Key: LUCENE-5405 URL: https://issues.apache.org/jira/browse/LUCENE-5405 Project: Lucene - Core Issue Type: Improvement Reporter: Benson Margulies SOLR-5623 included some conversation about the dilemmas of exception management and reporting in the analysis chain. Here is a 5.0 proposal: TokenStream.reset and TokenStream.incrementToken (and perhaps the rest) should have a new, checked, exception in their signatures: call it AnalysisError if you like. Unlike IOException, it will have a full set of constructors, including the constructors that can wrap a 'cause'. Its constructors will accept a field name. TokenStream will have a fieldName field, accepted in a constructor argument. (OK, this might be a bit authoritarian.) TokenStream will have: protected void throwAnalysisException(String explanation, Throwable cause) { throw new AnalysisException(fieldName, explanation, cause); } Implementors of analysis will be thus encouraged to write things like: try { doSomething(); } catch (IOExceptionOrWhatever e) { throwAnalysisException("Some Explanation", e); } Then, situations like Solr can diagnose the field name.
Note that no information is lost here, due to the use of exception wrapping. -- This message was sent by Atlassian JIRA (v6.1.5#6160) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
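The weaker form mentioned in the comment above, an IOException subclass that carries a field name plus an instanceof check in the error handler, might look roughly like the following. FieldAwareIOException and ErrorReporter are hypothetical names made up for this sketch; only the IOException(String, Throwable) constructor, available since Java 6, is taken as given.

```java
import java.io.IOException;

/** Hypothetical IOException subclass that carries a field name. */
class FieldAwareIOException extends IOException {
  private final String fieldName;

  FieldAwareIOException(String fieldName, String message, Throwable cause) {
    super(message, cause); // (String, Throwable) constructor exists since Java 6
    this.fieldName = fieldName;
  }

  String getFieldName() {
    return fieldName;
  }
}

/** Stand-in for an error handler such as the one in Solr; the name is illustrative. */
class ErrorReporter {
  /** Builds a log message, adding the field name when the exception carries one. */
  static String describe(Throwable t) {
    if (t instanceof FieldAwareIOException) {
      String field = ((FieldAwareIOException) t).getFieldName();
      return "analysis failed in field '" + field + "': " + t.getMessage();
    }
    return "analysis failed: " + t.getMessage();
  }
}
```

Because the subclass is still an IOException, existing 'throws IOException' signatures would not need to change under this variant.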
[jira] [Commented] (LUCENE-5405) Exception strategy for analysis improved
[ https://issues.apache.org/jira/browse/LUCENE-5405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13875711#comment-13875711 ] Benson Margulies commented on LUCENE-5405: -- Hmm, well, backing up. In the other discussion, you and others seemed very unhappy with schemes of the form: throw new SomeException("Some local explanation", someExceptionObject); Based on your most recent remark, I don't see any other way to get around this; my idea about storing field names is stupid, since the chains are reusable. So, either this sort of wrapping is tolerable or not. If tolerable, I'll rewrite this JIRA, else I'll close it. Exception strategy for analysis improved Key: LUCENE-5405 URL: https://issues.apache.org/jira/browse/LUCENE-5405 Project: Lucene - Core Issue Type: Improvement Reporter: Benson Margulies SOLR-5623 included some conversation about the dilemmas of exception management and reporting in the analysis chain. Here is a 5.0 proposal: TokenStream.reset and TokenStream.incrementToken (and perhaps the rest) should have a new, checked, exception in their signatures: call it AnalysisError if you like. Unlike IOException, it will have a full set of constructors, including the constructors that can wrap a 'cause'. Its constructors will accept a field name. TokenStream will have a fieldName field, accepted in a constructor argument. (OK, this might be a bit authoritarian.) TokenStream will have: protected void throwAnalysisException(String explanation, Throwable cause) { throw new AnalysisException(fieldName, explanation, cause); } Implementors of analysis will be thus encouraged to write things like: try { doSomething(); } catch (IOExceptionOrWhatever e) { throwAnalysisException("Some Explanation", e); } Then, situations like Solr can diagnose the field name. Note that no information is lost here, due to the use of exception wrapping.
[jira] [Commented] (LUCENE-5405) Exception strategy for analysis improved
[ https://issues.apache.org/jira/browse/LUCENE-5405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13875713#comment-13875713 ] Benson Margulies commented on LUCENE-5405: -- Yes, we're now in the same place. Does a catch/throw in DocInverterPerField that does something like throw new LuceneAnalysisException("Error analyzing text", fieldName, originalException); make life better? I think it makes life better, as I don't see much evil in exception wrapping like this. Exception strategy for analysis improved Key: LUCENE-5405 URL: https://issues.apache.org/jira/browse/LUCENE-5405 Project: Lucene - Core Issue Type: Improvement Reporter: Benson Margulies SOLR-5623 included some conversation about the dilemmas of exception management and reporting in the analysis chain. Here is a 5.0 proposal: TokenStream.reset and TokenStream.incrementToken (and perhaps the rest) should have a new, checked, exception in their signatures: call it AnalysisError if you like. Unlike IOException, it will have a full set of constructors, including the constructors that can wrap a 'cause'. Its constructors will accept a field name. TokenStream will have a fieldName field, accepted in a constructor argument. (OK, this might be a bit authoritarian.) TokenStream will have: protected void throwAnalysisException(String explanation, Throwable cause) { throw new AnalysisException(fieldName, explanation, cause); } Implementors of analysis will be thus encouraged to write things like: try { doSomething(); } catch (IOExceptionOrWhatever e) { throwAnalysisException("Some Explanation", e); } Then, situations like Solr can diagnose the field name. Note that no information is lost here, due to the use of exception wrapping. -- This message was sent by Atlassian JIRA (v6.1.5#6160) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-5405) Exception strategy for analysis improved
[ https://issues.apache.org/jira/browse/LUCENE-5405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benson Margulies updated LUCENE-5405: - Description: SOLR-5623 included some conversation about the dilemmas of exception management and reporting in the analysis chain. I've belatedly become educated about the infostream, and this situation is a job for it. The DocInverterPerField can note exceptions in the analysis chain, log out to the infostream, and then rethrow them as before. No wrapping, no muss, no fuss. There are comments on this JIRA from a more complex prior idea that readers might want to ignore. was: SOLR-5623 included some conversation about the dilemmas of exception management and reporting in the analysis chain. Here is a 5.0 proposal: TokenStream.reset and TokenStream.incrementToken (and perhaps the rest) should have a new, checked, exception in their signatures: call it AnalysisError if you like. Unlike IOException, it will have a full set of constructors, including the constructors that can wrap a 'cause'. Its constructors will accept a field name. TokenStream will have a fieldName field, accepted in a constructor argument. (OK, this might be a bit authoritarian.) TokenStream will have: protected void throwAnalysisException(String explanation, Throwable cause) { throw new AnalysisException(fieldName, explanation, cause); } Implementors of analysis will be thus encouraged to write things like: try { doSomething(); } catch (IOExceptionOrWhatever e) { throwAnalysisException("Some Explanation", e); } Then, situations like Solr can diagnose the field name. Note that no information is lost here, due to the use of exception wrapping. Exception strategy for analysis improved Key: LUCENE-5405 URL: https://issues.apache.org/jira/browse/LUCENE-5405 Project: Lucene - Core Issue Type: Improvement Reporter: Benson Margulies SOLR-5623 included some conversation about the dilemmas of exception management and reporting in the analysis chain.
I've belatedly become educated about the infostream, and this situation is a job for it. The DocInverterPerField can note exceptions in the analysis chain, log out to the infostream, and then rethrow them as before. No wrapping, no muss, no fuss. There are comments on this JIRA from a more complex prior idea that readers might want to ignore. -- This message was sent by Atlassian JIRA (v6.1.5#6160) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
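The revised description (note the failure on the infostream, then let the original exception propagate unchanged) can be illustrated with a success-flag try/finally. This is a sketch under stated assumptions, not the committed patch: SketchInfoStream is a stand-in interface modeled on the isEnabled/message pair of Lucene's InfoStream, and FieldInverter, RecordingInfoStream, the "DW" component tag and the message text are all illustrative choices.

```java
import java.util.ArrayList;
import java.util.List;

/** Stand-in modeled on Lucene's InfoStream (isEnabled/message); not the real class. */
interface SketchInfoStream {
  boolean isEnabled(String component);

  void message(String component, String message);
}

/** Test helper that simply records every message it receives. */
class RecordingInfoStream implements SketchInfoStream {
  final List<String> messages = new ArrayList<>();

  @Override
  public boolean isEnabled(String component) {
    return true;
  }

  @Override
  public void message(String component, String message) {
    messages.add(component + ": " + message);
  }
}

/** Hypothetical stand-in for the per-field inversion step. */
class FieldInverter {
  private final SketchInfoStream infoStream;

  FieldInverter(SketchInfoStream infoStream) {
    this.infoStream = infoStream;
  }

  /** Runs the analysis; on failure logs the field name and lets the exception propagate. */
  void invert(String fieldName, Runnable analysis) {
    boolean success = false;
    try {
      analysis.run();
      success = true;
    } finally {
      // No catch block: the original exception is not wrapped or altered.
      if (!success && infoStream.isEnabled("DW")) {
        infoStream.message("DW", "An exception was thrown while processing field " + fieldName);
      }
    }
  }
}
```

The design choice here is that the caller sees exactly the exception the analysis chain threw, while the field name is available to anyone who has the infostream turned on.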
[jira] [Commented] (LUCENE-5405) Exception strategy for analysis improved
[ https://issues.apache.org/jira/browse/LUCENE-5405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13875724#comment-13875724 ] Benson Margulies commented on LUCENE-5405: -- https://github.com/apache/lucene-solr/pull/21 is a seemingly simple idea for how to code this. I'm off to write the test. In the meantime, I offer the PR just to show a concrete idea. Exception strategy for analysis improved Key: LUCENE-5405 URL: https://issues.apache.org/jira/browse/LUCENE-5405 Project: Lucene - Core Issue Type: Improvement Reporter: Benson Margulies SOLR-5623 included some conversation about the dilemmas of exception management and reporting in the analysis chain. I've belatedly become educated about the infostream, and this situation is a job for it. The DocInverterPerField can note exceptions in the analysis chain, log out to the infostream, and then rethrow them as before. No wrapping, no muss, no fuss. There are comments on this JIRA from a more complex prior idea that readers might want to ignore. -- This message was sent by Atlassian JIRA (v6.1.5#6160) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-5405) Exception strategy for analysis improved
[ https://issues.apache.org/jira/browse/LUCENE-5405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13875726#comment-13875726 ] Benson Margulies commented on LUCENE-5405: -- Test added. It passes. I'm sure I've missed something here. Exception strategy for analysis improved Key: LUCENE-5405 URL: https://issues.apache.org/jira/browse/LUCENE-5405 Project: Lucene - Core Issue Type: Improvement Reporter: Benson Margulies SOLR-5623 included some conversation about the dilemmas of exception management and reporting in the analysis chain. I've belatedly become educated about the infostream, and this situation is a job for it. The DocInverterPerField can note exceptions in the analysis chain, log out to the infostream, and then rethrow them as before. No wrapping, no muss, no fuss. There are comments on this JIRA from a more complex prior idea that readers might want to ignore. -- This message was sent by Atlassian JIRA (v6.1.5#6160) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Comment Edited] (SOLR-5623) Better diagnosis of RuntimeExceptions in analysis
[ https://issues.apache.org/jira/browse/SOLR-5623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13868400#comment-13868400 ] Benson Margulies edited comment on SOLR-5623 at 1/11/14 7:59 PM: - [~hossman_luc...@fucit.org] I think the patch request is now good to go, again sticking with a Solr change and considering a Lucene change later on. was (Author: bmargulies): I think the patch request is now good to go, again sticking with a Solr change and considering a Lucene change later on. Better diagnosis of RuntimeExceptions in analysis - Key: SOLR-5623 URL: https://issues.apache.org/jira/browse/SOLR-5623 Project: Solr Issue Type: Bug Reporter: Benson Margulies If an analysis component (tokenizer, filter, etc) gets really into a hissy fit and throws a RuntimeException, the resulting log traffic is less than informative, lacking any pointer to the doc under discussion (in the doc case). It would be more better if there was a catch/try shortstop that logged this more informatively. -- This message was sent by Atlassian JIRA (v6.1.5#6160) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Created] (LUCENE-5392) Documentation for modified token / analysis pipeline
Benson Margulies created LUCENE-5392: Summary: Documentation for modified token / analysis pipeline Key: LUCENE-5392 URL: https://issues.apache.org/jira/browse/LUCENE-5392 Project: Lucene - Core Issue Type: Bug Affects Versions: 5.0 Reporter: Benson Margulies The changes to the tokenizer and analyzer need to be reflected in the package overview for core analysis. -- This message was sent by Atlassian JIRA (v6.1.5#6160) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-5392) Documentation for modified token / analysis pipeline
[ https://issues.apache.org/jira/browse/LUCENE-5392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13867801#comment-13867801 ] Benson Margulies commented on LUCENE-5392: -- https://github.com/apache/lucene-solr/pull/17 [~rcmuir] and [~thetaphi] 'Look out below', here's a bunch of work on the analysis doc. Documentation for modified token / analysis pipeline Key: LUCENE-5392 URL: https://issues.apache.org/jira/browse/LUCENE-5392 Project: Lucene - Core Issue Type: Bug Affects Versions: 5.0 Reporter: Benson Margulies The changes to the tokenizer and analyzer need to be reflected in the package overview for core analysis. -- This message was sent by Atlassian JIRA (v6.1.5#6160) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Created] (SOLR-5623) Better diagnosis of RuntimeExceptions in analysis
Benson Margulies created SOLR-5623: -- Summary: Better diagnosis of RuntimeExceptions in analysis Key: SOLR-5623 URL: https://issues.apache.org/jira/browse/SOLR-5623 Project: Solr Issue Type: Bug Reporter: Benson Margulies If an analysis component (tokenizer, filter, etc) gets really into a hissy fit and throws a RuntimeException, the resulting log traffic is less than informative, lacking any pointer to the doc under discussion (in the doc case). It would be more better if there was a catch/try shortstop that logged this more informatively. -- This message was sent by Atlassian JIRA (v6.1.5#6160) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-5623) Better diagnosis of RuntimeExceptions in analysis
[ https://issues.apache.org/jira/browse/SOLR-5623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13868265#comment-13868265 ] Benson Margulies commented on SOLR-5623: [~hossman_luc...@fucit.org] https://github.com/apache/lucene-solr/pull/18 shows the failing test case. How do I make a test that asserts facts about logging? I can certainly use this to make some improvements to the logging, but I don't know how to automate proving that I did it? Better diagnosis of RuntimeExceptions in analysis - Key: SOLR-5623 URL: https://issues.apache.org/jira/browse/SOLR-5623 Project: Solr Issue Type: Bug Reporter: Benson Margulies If an analysis component (tokenizer, filter, etc) gets really into a hissy fit and throws a RuntimeException, the resulting log traffic is less than informative, lacking any pointer to the doc under discussion (in the doc case). It would be more better if there was a catch/try shortstop that logged this more informatively. -- This message was sent by Atlassian JIRA (v6.1.5#6160) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-5623) Better diagnosis of RuntimeExceptions in analysis
[ https://issues.apache.org/jira/browse/SOLR-5623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13868325#comment-13868325 ] Benson Margulies commented on SOLR-5623: OK. Does it make sense to adopt the idea that 'if there is an ID field with a value, include that in the exception'? Better diagnosis of RuntimeExceptions in analysis - Key: SOLR-5623 URL: https://issues.apache.org/jira/browse/SOLR-5623 Project: Solr Issue Type: Bug Reporter: Benson Margulies If an analysis component (tokenizer, filter, etc) gets really into a hissy fit and throws a RuntimeException, the resulting log traffic is less than informative, lacking any pointer to the doc under discussion (in the doc case). It would be more better if there was a catch/try shortstop that logged this more informatively. -- This message was sent by Atlassian JIRA (v6.1.5#6160) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-5623) Better diagnosis of RuntimeExceptions in analysis
[ https://issues.apache.org/jira/browse/SOLR-5623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13868347#comment-13868347 ] Benson Margulies commented on SOLR-5623: [~rcmuir] The identity of the field we are processing is known down in Lucene core. What do you think about wrapping generic Throwables in org.apache.lucene.index.DocInverterPerField.processFields in some specific runtime exception that carries the field name? Then I can in turn make it into a Solr exception in DirectUpdateHandler2. Better diagnosis of RuntimeExceptions in analysis - Key: SOLR-5623 URL: https://issues.apache.org/jira/browse/SOLR-5623 Project: Solr Issue Type: Bug Reporter: Benson Margulies If an analysis component (tokenizer, filter, etc) gets really into a hissy fit and throws a RuntimeException, the resulting log traffic is less than informative, lacking any pointer to the doc under discussion (in the doc case). It would be more better if there was a catch/try shortstop that logged this more informatively. -- This message was sent by Atlassian JIRA (v6.1.5#6160) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Comment Edited] (SOLR-5623) Better diagnosis of RuntimeExceptions in analysis
[ https://issues.apache.org/jira/browse/SOLR-5623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13868347#comment-13868347 ] Benson Margulies edited comment on SOLR-5623 at 1/10/14 9:46 PM: - [~rcmuir] The identity of the field we are processing is known down in Lucene core. What do you think about wrapping Throwables in org.apache.lucene.index.DocInverterPerField.processFields in some specific runtime exception that carries the field name? Then I can in turn make it into a Solr exception in DirectUpdateHandler2. was (Author: bmargulies): [~rcmuir] The identity of the field we are processing is known down in Lucene core. What do you think about wrapping generic Throwables in org.apache.lucene.index.DocInverterPerField.processFields in some specific runtime exception that carries the field name? Then I can in turn make it into a Solr exception in DirectUpdateHandler2. Better diagnosis of RuntimeExceptions in analysis - Key: SOLR-5623 URL: https://issues.apache.org/jira/browse/SOLR-5623 Project: Solr Issue Type: Bug Reporter: Benson Margulies If an analysis component (tokenizer, filter, etc) gets really into a hissy fit and throws a RuntimeException, the resulting log traffic is less than informative, lacking any pointer to the doc under discussion (in the doc case). It would be more better if there was a catch/try shortstop that logged this more informatively. -- This message was sent by Atlassian JIRA (v6.1.5#6160) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-5623) Better diagnosis of RuntimeExceptions in analysis
[ https://issues.apache.org/jira/browse/SOLR-5623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13868359#comment-13868359 ] Benson Margulies commented on SOLR-5623: OK, we can log and return the ID and not the field name, and that's already an improvement, so I'll stick with that. Better diagnosis of RuntimeExceptions in analysis - Key: SOLR-5623 URL: https://issues.apache.org/jira/browse/SOLR-5623 Project: Solr Issue Type: Bug Reporter: Benson Margulies If an analysis component (tokenizer, filter, etc) gets really into a hissy fit and throws a RuntimeException, the resulting log traffic is less than informative, lacking any pointer to the doc under discussion (in the doc case). It would be more better if there was a catch/try shortstop that logged this more informatively. -- This message was sent by Atlassian JIRA (v6.1.5#6160) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
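The Solr-level approach settled on in the comment above (report the document ID when there is one, and leave the field name out) could be sketched as follows. DocumentAddException and SketchUpdateHandler are hypothetical names for illustration; they are not the classes touched by the actual patch, and the message wording is an assumption.

```java
/** Hypothetical exception that names the document being indexed when the failure occurred. */
class DocumentAddException extends RuntimeException {
  DocumentAddException(String docId, Throwable cause) {
    super(buildMessage(docId), cause);
  }

  private static String buildMessage(String docId) {
    // Include the ID only when the document actually has one.
    String which = (docId == null) ? "document" : "document id " + docId;
    return "Exception writing " + which + " to the index; possible analysis error.";
  }
}

/** Stand-in for the update handler's add path; the name is illustrative. */
class SketchUpdateHandler {
  /** Runs the indexing step and decorates any RuntimeException with the document ID. */
  static void addDoc(String docId, Runnable indexer) {
    try {
      indexer.run();
    } catch (RuntimeException e) {
      throw new DocumentAddException(docId, e);
    }
  }
}
```

The original exception stays reachable through getCause(), so the stack trace from the analysis component is still available to whoever reads the log.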
[jira] [Commented] (SOLR-5623) Better diagnosis of RuntimeExceptions in analysis
[ https://issues.apache.org/jira/browse/SOLR-5623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13868368#comment-13868368 ] Benson Margulies commented on SOLR-5623: Here's another ouch. Doc for Document#get says: {code} /** Returns the string value of the field with the given name if any exist in * this document, or null. If multiple fields exist with this name, this * method returns the first value added. If only binary fields with this name * exist, returns null. * For {@link IntField}, {@link LongField}, {@link * FloatField} and {@link DoubleField} it returns the string value of the number. If you want * the actual numeric field instance back, use {@link #getField}. */ {code} But given a Solr field like the following, with a value of '1', I end up with \u0080\u\u0001. It doesn't appear to be an IntField, just a Field. What am I missing? {code} <field name="id" type="sint" indexed="true" stored="true" multiValued="false"/> {code} Better diagnosis of RuntimeExceptions in analysis - Key: SOLR-5623 URL: https://issues.apache.org/jira/browse/SOLR-5623 Project: Solr Issue Type: Bug Reporter: Benson Margulies If an analysis component (tokenizer, filter, etc) gets really into a hissy fit and throws a RuntimeException, the resulting log traffic is less than informative, lacking any pointer to the doc under discussion (in the doc case). It would be more better if there was a catch/try shortstop that logged this more informatively. -- This message was sent by Atlassian JIRA (v6.1.5#6160) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-5623) Better diagnosis of RuntimeExceptions in analysis
[ https://issues.apache.org/jira/browse/SOLR-5623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13868374#comment-13868374 ] Benson Margulies commented on SOLR-5623: Back to the exception decoration problem: There's a general design puzzle here: an outer function knows something that an inner function does not, and the catcher of the exception wants to know both. I share your distaste for the obvious Java solution of endless exception wrapping. Another option is to log, but does the Lucene code log things? I'm working against trunk because I don't know any better. I'm inclined to stay out at the Solr level for now, and maybe make another patch for this idea in the core later.
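One way to picture the puzzle in the comment above, where the outer caller knows the document ID and the inner analysis code throws, is to catch once at the outer boundary, log the extra context there, and rethrow the original exception unchanged. The sketch below is only an illustration under that assumption: `AnalysisDemo`, `analyze`, `indexDocument`, and the in-memory `LOG` list are hypothetical names, not Solr or Lucene code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins; none of these names come from Solr or Lucene.
class AnalysisDemo {
    // Collected log lines, standing in for a real logger.
    static final List<String> LOG = new ArrayList<>();

    // Inner function: knows nothing about which document it is processing.
    static int analyze(String text) {
        if (text.isEmpty()) {
            throw new IllegalStateException("tokenizer choked on empty input");
        }
        return text.split("\\s+").length;
    }

    // Outer function: knows the document ID, so it records that context once
    // at the boundary and rethrows the original exception without wrapping it.
    static int indexDocument(String docId, String text) {
        try {
            return analyze(text);
        } catch (RuntimeException e) {
            LOG.add("analysis failed for document id=" + docId + ": " + e.getMessage());
            throw e;
        }
    }

    public static void main(String[] args) {
        System.out.println(indexDocument("doc-1", "hello analysis world"));
        try {
            indexDocument("doc-2", "");
        } catch (IllegalStateException e) {
            System.out.println(LOG.get(0));
        }
    }
}
```

The exception type and stack trace reach the caller untouched, while the log line carries the one fact the inner code could not know.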
[jira] [Issue Comment Deleted] (SOLR-5623) Better diagnosis of RuntimeExceptions in analysis
[ https://issues.apache.org/jira/browse/SOLR-5623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benson Margulies updated SOLR-5623: Comment: was deleted (was: Here's another ouch. Doc for Document#get says: {code} /** Returns the string value of the field with the given name if any exist in * this document, or null. If multiple fields exist with this name, this * method returns the first value added. If only binary fields with this name * exist, returns null. * For {@link IntField}, {@link LongField}, {@link * FloatField} and {@link DoubleField} it returns the string value of the number. If you want * the actual numeric field instance back, use {@link #getField}. */ {code} But given a Solr field like the following, with a value of '1', I end up with \u0080\u\u0001. It doesn't appear to be an IntField, just a Field. What am I missing? {code} <field name="id" type="sint" indexed="true" stored="true" multiValued="false"/> {code} )
[jira] [Commented] (SOLR-5623) Better diagnosis of RuntimeExceptions in analysis
[ https://issues.apache.org/jira/browse/SOLR-5623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13868400#comment-13868400 ] Benson Margulies commented on SOLR-5623: I think the patch request is now good to go, again sticking with a Solr change and considering a Lucene change later on.
[jira] [Commented] (LUCENE-5388) Eliminate construction over readers for Tokenizer
[ https://issues.apache.org/jira/browse/LUCENE-5388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13866561#comment-13866561 ] Benson Margulies commented on LUCENE-5388: Should I try to get the branch in git to match the .patch, or should I just let you proceed from here? I guess that might depend on reactions of others. Eliminate construction over readers for Tokenizer Key: LUCENE-5388 URL: https://issues.apache.org/jira/browse/LUCENE-5388 Project: Lucene - Core Issue Type: Improvement Components: core/other Reporter: Benson Margulies Attachments: LUCENE-5388.patch In the modern world, Tokenizers are intended to be reusable, with input supplied via #setReader. The constructors that take Reader are a vestige. Worse yet, they invite people to make mistakes in handling the reader that tangle them up with the state machine in Tokenizer. The sensible thing is to eliminate these ctors, and force setReader usage.
[jira] [Commented] (LUCENE-5389) Even more doc for construction of TokenStream components
[ https://issues.apache.org/jira/browse/LUCENE-5389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13866765#comment-13866765 ] Benson Margulies commented on LUCENE-5389: [~rcmuir] I think that this is ready to go. If you commit this and merge down to 4.x, I can then tackle work on this file for the new stuff. Even more doc for construction of TokenStream components Key: LUCENE-5389 URL: https://issues.apache.org/jira/browse/LUCENE-5389 Project: Lucene - Core Issue Type: Improvement Reporter: Benson Margulies There are more useful things to tell would-be authors of tokenizers. Let's tell them.
[jira] [Commented] (LUCENE-5386) Make Tokenizers deliver their final offsets
[ https://issues.apache.org/jira/browse/LUCENE-5386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13867437#comment-13867437 ] Benson Margulies commented on LUCENE-5386: Let me try to restate the above in my own words to make sure I understand it. At #end(), all the pieces of an analysis chain are responsible for putting the attributes into a consistent state that reflects the end of the input. TokenStream itself takes care of PositionIncrementAttribute. Only the Tokenizer can take care of OffsetAttribute, but it's easy to forget -- and if there are other interesting things going on, a Tokenizer or anything else may have other work to do. So Rob's thoughts above are to make Tokenizer or a derivative track the final offset, which is simple, and have a protocol to keep PositionIncrement in line given the possibility of skipped tokens. To avoid loading up the 'Tokenizer' class with too much stuff that someone might want to do for themselves, add an intermediate class for this and let Tokenizer proper be lean. I'll get organized to sketch some code. Make Tokenizers deliver their final offsets Key: LUCENE-5386 URL: https://issues.apache.org/jira/browse/LUCENE-5386 Project: Lucene - Core Issue Type: Improvement Reporter: Benson Margulies Tokenizers _must_ have an implementation of #end() in which they set up the final offset. Currently, nothing enforces this. end() has a useful implementation in TokenStream, so just making it abstract is not attractive. Proposal: add abstract int finalOffset(); to tokenizer, and then make void end() { super.end(); int fo = finalOffset(); offsetAttr.setOffsets(fo, fo); } or something to that effect. Other alternative to be considered depending on how this looks.
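The intermediate-class idea restated above can be sketched roughly as follows. This is a hedged illustration, not the eventual patch: `SketchTokenStream`, `OffsetTrackingTokenizer`, and `WhitespaceSketchTokenizer` are hypothetical stand-ins that only mirror the shape of the discussion, and plain int fields take the place of Lucene's attribute objects. Only the `finalOffset()` hook and the body of `end()` follow the proposal quoted in the issue description.

```java
// Stand-in types that only mirror the shape of the discussion; they are not
// the real Lucene TokenStream/Tokenizer classes.
abstract class SketchTokenStream {
    int positionIncrement = 1;
    int startOffset, endOffset;

    abstract boolean incrementToken();

    // Base end(): the stream itself resets the position increment.
    void end() {
        positionIncrement = 0;
    }
}

// Hypothetical intermediate class: subclasses report how far they consumed
// the input, and end() publishes that value as the final offset.
abstract class OffsetTrackingTokenizer extends SketchTokenStream {
    protected abstract int finalOffset();

    @Override
    final void end() {
        super.end();
        int fo = finalOffset();
        startOffset = fo;
        endOffset = fo;
    }
}

// Toy whitespace tokenizer built on the intermediate class.
class WhitespaceSketchTokenizer extends OffsetTrackingTokenizer {
    private final String input;
    private int pos = 0;
    String term;

    WhitespaceSketchTokenizer(String input) {
        this.input = input;
    }

    @Override
    boolean incrementToken() {
        // Skip leading spaces, then consume one run of non-space characters.
        while (pos < input.length() && input.charAt(pos) == ' ') pos++;
        if (pos >= input.length()) return false;
        startOffset = pos;
        while (pos < input.length() && input.charAt(pos) != ' ') pos++;
        endOffset = pos;
        term = input.substring(startOffset, endOffset);
        return true;
    }

    @Override
    protected int finalOffset() {
        return input.length();
    }
}

class FinalOffsetDemo {
    public static void main(String[] args) {
        WhitespaceSketchTokenizer t = new WhitespaceSketchTokenizer("two words  ");
        while (t.incrementToken()) {
            System.out.println(t.term + " [" + t.startOffset + "," + t.endOffset + ")");
        }
        t.end();
        System.out.println("final offset: " + t.endOffset);
    }
}
```

Because `end()` is final in the intermediate class, a subclass cannot forget the final offset; it has to supply `finalOffset()` to compile at all. The trailing spaces in the demo input show why the final offset can differ from the end of the last token.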
[jira] [Commented] (LUCENE-5389) Even more doc for construction of TokenStream components
[ https://issues.apache.org/jira/browse/LUCENE-5389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13867441#comment-13867441 ] Benson Margulies commented on LUCENE-5389: Sorry, I forgot to lint after accepting the suggestion about delegation. Yes, I'll start making various next-step patches. Even more doc for construction of TokenStream components Key: LUCENE-5389 URL: https://issues.apache.org/jira/browse/LUCENE-5389 Project: Lucene - Core Issue Type: Improvement Reporter: Benson Margulies Fix For: 5.0, 4.7 There are more useful things to tell would-be authors of tokenizers. Let's tell them.
[jira] [Commented] (LUCENE-5388) Eliminate construction over readers for Tokenizer
[ https://issues.apache.org/jira/browse/LUCENE-5388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13865385#comment-13865385 ] Benson Margulies commented on LUCENE-5388: Uwe, what's that mean practically? No PR yet? A PR just in trunk? Merging my recent doc to a 4.x branch?
[jira] [Comment Edited] (LUCENE-5388) Eliminate construction over readers for Tokenizer
[ https://issues.apache.org/jira/browse/LUCENE-5388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13865385#comment-13865385 ] Benson Margulies edited comment on LUCENE-5388 at 1/8/14 12:59 PM: Uwe, what's that mean practically? No PR yet? A PR just in trunk? Merging my recent doc to a 4.x branch? A feature branch where this goes to be merged in when the time is ripe? was (Author: bmargulies): Uwe, what's that mean practically? No PR yet? A PR just in trunk? Merging my recent doc to a 4.x branch?
[jira] [Commented] (LUCENE-5388) Eliminate construction over readers for Tokenizer
[ https://issues.apache.org/jira/browse/LUCENE-5388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13865573#comment-13865573 ] Benson Margulies commented on LUCENE-5388: How about we start by adding ctors that don't require a reader, and do treat them as 4.x fodder?
[jira] [Commented] (LUCENE-5388) Eliminate construction over readers for Tokenizer
[ https://issues.apache.org/jira/browse/LUCENE-5388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13865580#comment-13865580 ] Benson Margulies commented on LUCENE-5388: setReader throws IOException, but the existing constructors don't. Analyzer 'createComponents' doesn't. How to sort this out?
[jira] [Commented] (LUCENE-5388) Eliminate construction over readers for Tokenizer
[ https://issues.apache.org/jira/browse/LUCENE-5388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13865626#comment-13865626 ] Benson Margulies commented on LUCENE-5388: OK, I see. If we don't do compatibility, then no one calls setReader in createComponents, and all is well. OK, I'm proceeding.
[jira] [Commented] (LUCENE-5388) Eliminate construction over readers for Tokenizer
[ https://issues.apache.org/jira/browse/LUCENE-5388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13865634#comment-13865634 ] Benson Margulies commented on LUCENE-5388: Why does the reader get passed to createComponents in this model? Should that param go away?
[jira] [Commented] (LUCENE-5388) Eliminate construction over readers for Tokenizer
[ https://issues.apache.org/jira/browse/LUCENE-5388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13865661#comment-13865661 ] Benson Margulies commented on LUCENE-5388: https://github.com/apache/lucene-solr/pull/16 is available for your reading pleasure to see what these changes look like.
[jira] [Commented] (LUCENE-5388) Eliminate construction over readers for Tokenizer
[ https://issues.apache.org/jira/browse/LUCENE-5388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13865667#comment-13865667 ] Benson Margulies commented on LUCENE-5388: [~rcmuir] Next frontier is TokenizerFactory. Do we change #create to not take a reader, or do we add 'throws IOException'? Based on comments above, I'd think we take out the reader. [~mikemccand] I would love help. If you tell me your github id, I'll add you to my repo, and you can take up some of the ton of editing.
[jira] [Commented] (LUCENE-5388) Eliminate construction over readers for Tokenizer
[ https://issues.apache.org/jira/browse/LUCENE-5388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13865907#comment-13865907 ] Benson Margulies commented on LUCENE-5388: [~rcmuir] or [~mikemccand] I could really use some help here with TestRandomChains.
[jira] [Commented] (LUCENE-5388) Eliminate construction over readers for Tokenizer
[ https://issues.apache.org/jira/browse/LUCENE-5388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13865925#comment-13865925 ] Benson Margulies commented on LUCENE-5388: It does something complex with the input reader in a createComponents. The challenge is to move all that to initReader so that it works. I think I'm too fried from 1000 other edits. I'll look in after dinner, but anyone who wants to grab my branch from github and pitch in is more than welcome.
[jira] [Commented] (LUCENE-5388) Eliminate construction over readers for Tokenizer
[ https://issues.apache.org/jira/browse/LUCENE-5388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13866077#comment-13866077 ] Benson Margulies commented on LUCENE-5388: You got me off the dot on RandomChains.
[jira] [Commented] (LUCENE-5388) Eliminate construction over readers for Tokenizer
[ https://issues.apache.org/jira/browse/LUCENE-5388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13866131#comment-13866131 ] Benson Margulies commented on LUCENE-5388: I've got all of lucene to compile, and a bunch of tests running.
[jira] [Commented] (LUCENE-5388) Eliminate construction over readers for Tokenizer
[ https://issues.apache.org/jira/browse/LUCENE-5388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13866227#comment-13866227 ] Benson Margulies commented on LUCENE-5388: I have Solr test failures in PreAnalyzedField, which has some stubborn fondness for the idea of a reader passed to a constructor.
[jira] [Comment Edited] (LUCENE-5388) Eliminate construction over readers for Tokenizer
[ https://issues.apache.org/jira/browse/LUCENE-5388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13866227#comment-13866227 ] Benson Margulies edited comment on LUCENE-5388 at 1/9/14 2:41 AM: I have Solr test failures in PreAnalyzedField, which has some stubborn fondness for the idea of a reader passed to a constructor. But that seems to be all that's broken; a few Solr failures (based on 'ant test'). was (Author: bmargulies): I have Solr test failures in PreAnalyzedField, which has some stubborn fondness for the idea of a reader passed to a constructor.
[jira] [Created] (LUCENE-5388) Eliminate construction over readers for Tokenizer
Benson Margulies created LUCENE-5388: Summary: Eliminate construction over readers for Tokenizer Key: LUCENE-5388 URL: https://issues.apache.org/jira/browse/LUCENE-5388 Project: Lucene - Core Issue Type: Improvement Components: core/other Reporter: Benson Margulies In the modern world, Tokenizers are intended to be reusable, with input supplied via #setReader. The constructors that take Reader are a vestige. Worse yet, they invite people to make mistakes in handling the reader that tangle them up with the state machine in Tokenizer. The sensible thing is to eliminate these ctors, and force setReader usage.
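The reuse pattern the issue describes, one tokenizer instance fed successive inputs through a setter rather than through its constructor, might look like the sketch below. It is an assumption-laden illustration: `ReusableSketchTokenizer` and its `tokens()` method are hypothetical stand-ins, not the Lucene `Tokenizer` API, and unlike the real `setReader` discussed in this thread, the stand-in setter does not throw `IOException`.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in, not the real Lucene Tokenizer: no constructor takes
// a Reader, so the only way to supply input is setReader().
class ReusableSketchTokenizer {
    private Reader input;

    // Swap in fresh input; the same instance can then be consumed again.
    void setReader(Reader reader) {
        this.input = reader;
    }

    // Read the current input to exhaustion and split it on whitespace.
    List<String> tokens() {
        if (input == null) {
            throw new IllegalStateException("setReader() must be called first");
        }
        StringBuilder sb = new StringBuilder();
        try {
            int c;
            while ((c = input.read()) != -1) {
                sb.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        input = null; // consumed; a new reader is needed for the next pass
        List<String> out = new ArrayList<>();
        for (String s : sb.toString().trim().split("\\s+")) {
            if (!s.isEmpty()) out.add(s);
        }
        return out;
    }
}

class ReuseDemo {
    public static void main(String[] args) {
        ReusableSketchTokenizer t = new ReusableSketchTokenizer();
        for (String doc : new String[] {"first document", "second one here"}) {
            t.setReader(new StringReader(doc));
            System.out.println(t.tokens());
        }
    }
}
```

With no Reader-taking constructor, there is only one place where input enters the object, which is the simplification the issue argues for.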
[jira] [Created] (LUCENE-5389) Even more doc for construction of TokenStream components
Benson Margulies created LUCENE-5389: Summary: Even more doc for construction of TokenStream components Key: LUCENE-5389 URL: https://issues.apache.org/jira/browse/LUCENE-5389 Project: Lucene - Core Issue Type: Improvement Reporter: Benson Margulies There are more useful things to tell would-be authors of tokenizers. Let's tell them.
[jira] [Commented] (LUCENE-5389) Even more doc for construction of TokenStream components
[ https://issues.apache.org/jira/browse/LUCENE-5389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13864825#comment-13864825 ] Benson Margulies commented on LUCENE-5389: https://github.com/apache/lucene-solr/pull/14
[jira] [Created] (LUCENE-5386) Make Tokenizers deliver their final offsets
Benson Margulies created LUCENE-5386: Summary: Make Tokenizers deliver their final offsets Key: LUCENE-5386 URL: https://issues.apache.org/jira/browse/LUCENE-5386 Project: Lucene - Core Issue Type: Improvement Reporter: Benson Margulies Tokenizers _must_ have an implementation of #end() in which they set up the final offset. Currently, nothing enforces this. end() has a useful implementation in TokenStream, so just making it abstract is not attractive. Proposal: add abstract int finalOffset(); to tokenizer, and then make void end() { super.end(); int fo = finalOffset(); offsetAttr.setOffsets(fo, fo); } or something to that effect. Other alternative to be considered depending on how this looks.
[jira] [Commented] (LUCENE-5386) Make Tokenizers deliver their final offsets
[ https://issues.apache.org/jira/browse/LUCENE-5386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13863521#comment-13863521 ] Benson Margulies commented on LUCENE-5386: How about, then: Tokenizer: abstract void tokenizerEnd(); final void end() { super.end(); tokenizerEnd(); } ?
[jira] [Commented] (LUCENE-5386) Make Tokenizers deliver their final offsets
[ https://issues.apache.org/jira/browse/LUCENE-5386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13863570#comment-13863570 ] Benson Margulies commented on LUCENE-5386: Can you help me with how this relates to your previous remark about attributes other than Offset? What other attributes would get manipulated and how?
[jira] [Created] (LUCENE-5384) Analysis overview could mention clearAttributes and end
Benson Margulies created LUCENE-5384: Summary: Analysis overview could mention clearAttributes and end Key: LUCENE-5384 URL: https://issues.apache.org/jira/browse/LUCENE-5384 Project: Lucene - Core Issue Type: Improvement Reporter: Benson Margulies Assignee: Benson Margulies It would be helpful to tokenizer implementors for the analysis package overview to mention more things. I'll supply a patch. -- This message was sent by Atlassian JIRA (v6.1.5#6160) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-5384) Analysis overview could mention clearAttributes and end
[ https://issues.apache.org/jira/browse/LUCENE-5384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13862696#comment-13862696 ] Benson Margulies commented on LUCENE-5384: -- https://github.com/apache/lucene-solr/pull/12 contains some more documentation. Yes, this is offered under the terms of the Apache license, in case anyone is still uncertain as to the relationship of github pull requests to the AL. Analysis overview could mention clearAttributes and end --- Key: LUCENE-5384 URL: https://issues.apache.org/jira/browse/LUCENE-5384 Project: Lucene - Core Issue Type: Improvement Reporter: Benson Margulies Assignee: Benson Margulies It would be helpful to tokenizer implementors for the analysis package overview to mention more things. I'll supply a patch. -- This message was sent by Atlassian JIRA (v6.1.5#6160) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-2899) Add OpenNLP Analysis capabilities as a module
[ https://issues.apache.org/jira/browse/LUCENE-2899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13820031#comment-13820031 ] Benson Margulies commented on LUCENE-2899: -- I know of an NER model that looks at the entire text to bias towards consistent tagging of entities in larger units. However, I agree that crocks are bad. Perhaps this is an opportunity to think about how to expand the analysis protocol to support this sort of thing more smoothly? It would be desirable if this integration were to start with a set of Token Attributes that could be used in any number of analysis components, inside or outside of Lucene, that were in a position to deliver similar items. I suppose I'm late to ask for this, as the UIMA component must pose the same question. In some languages, NER is very clumsy as a token filter, because entities don't obey token boundaries very well. Also, in my experience, entities aren't useful as additional tokens in the same field as their source text, but rather in their own field (where they can be facetted upon, for example). Is there any appetite to look at Lucene support for a stream that delivers to more than one field? Or is there such a thing and I've missed it? I agree with Rob about UIMA because I think that Lucene analysis attributes are a weak data model for interconnecting NLP modules and flowing data through them -- and one frequently needs to do that. Add OpenNLP Analysis capabilities as a module - Key: LUCENE-2899 URL: https://issues.apache.org/jira/browse/LUCENE-2899 Project: Lucene - Core Issue Type: New Feature Components: modules/analysis Reporter: Grant Ingersoll Assignee: Grant Ingersoll Priority: Minor Fix For: 4.6 Attachments: LUCENE-2899-RJN.patch, LUCENE-2899.patch, OpenNLPFilter.java, OpenNLPTokenizer.java Now that OpenNLP is an ASF project and has a nice license, it would be nice to have a submodule (under analysis) that exposed capabilities for it. 
Drew Farris, Tom Morton and I have code that does: * Sentence Detection as a Tokenizer (could also be a TokenFilter, although it would have to change slightly to buffer tokens) * NamedEntity recognition as a TokenFilter We are also planning a Tokenizer/TokenFilter that can put parts of speech as either payloads (PartOfSpeechAttribute?) on a token or at the same position. I'd propose it go under: modules/analysis/opennlp
[jira] [Commented] (LUCENE-2899) Add OpenNLP Analysis capabilities as a module
[ https://issues.apache.org/jira/browse/LUCENE-2899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13820085#comment-13820085 ] Benson Margulies commented on LUCENE-2899: -- Fair enough. Solr URP's do this very well upstream of analysis. ES doesn't have the concept, perhaps it should. It clarifies the situation nicely to think of Lucene as serial token operations. Add OpenNLP Analysis capabilities as a module - Key: LUCENE-2899 URL: https://issues.apache.org/jira/browse/LUCENE-2899 Project: Lucene - Core Issue Type: New Feature Components: modules/analysis Reporter: Grant Ingersoll Assignee: Grant Ingersoll Priority: Minor Fix For: 4.6 Attachments: LUCENE-2899-RJN.patch, LUCENE-2899.patch, OpenNLPFilter.java, OpenNLPTokenizer.java Now that OpenNLP is an ASF project and has a nice license, it would be nice to have a submodule (under analysis) that exposed capabilities for it. Drew Farris, Tom Morton and I have code that does: * Sentence Detection as a Tokenizer (could also be a TokenFilter, although it would have to change slightly to buffer tokens) * NamedEntity recognition as a TokenFilter We are also planning a Tokenizer/TokenFilter that can put parts of speech as either payloads (PartOfSpeechAttribute?) on a token or at the same position. I'd propose it go under: modules/analysis/opennlp -- This message was sent by Atlassian JIRA (v6.1#6144) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4956) the korean analyzer that has a korean morphological analyzer and dictionaries
[ https://issues.apache.org/jira/browse/LUCENE-4956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13812034#comment-13812034 ] Benson Margulies commented on LUCENE-4956: -- Looks like mapHanja.dic needs some adjustment of the legal notice? Or was this going to be replaced? the korean analyzer that has a korean morphological analyzer and dictionaries - Key: LUCENE-4956 URL: https://issues.apache.org/jira/browse/LUCENE-4956 Project: Lucene - Core Issue Type: New Feature Components: modules/analysis Affects Versions: 4.2 Reporter: SooMyung Lee Assignee: Christian Moen Labels: newbie Attachments: LUCENE-4956.patch, eval.patch, kr.analyzer.4x.tar, lucene-4956.patch, lucene4956.patch Korean language has specific characteristic. When developing search service with lucene solr in korean, there are some problems in searching and indexing. The korean analyer solved the problems with a korean morphological anlyzer. It consists of a korean morphological analyzer, dictionaries, a korean tokenizer and a korean filter. The korean anlyzer is made for lucene and solr. If you develop a search service with lucene in korean, It is the best idea to choose the korean analyzer. -- This message was sent by Atlassian JIRA (v6.1#6144) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4956) the korean analyzer that has a korean morphological analyzer and dictionaries
[ https://issues.apache.org/jira/browse/LUCENE-4956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13812044#comment-13812044 ] Benson Margulies commented on LUCENE-4956: -- My point is that it might have a bit too much legal notice. Generally, when someone grants a license, the headers all move up to some global NOTICE file, and the file is left with just an Apache license. I also noted the following: ! Except as contained in this notice, the name of a copyright holder shall not be ! used in advertising or otherwise to promote the sale, use or other dealings in ! these Data Files or Software without prior written authorization of the copyright holder. and then noticed: that http://www.apache.org/legal/resolved.html says that it approves of * BSD (without advertising clause). So that Unicode license is possibly an issue. Right now I'm using the git clone, but I just did a pull, and the pathname is lucene/analysis/arirang/src/data/mapHanja.dic the korean analyzer that has a korean morphological analyzer and dictionaries - Key: LUCENE-4956 URL: https://issues.apache.org/jira/browse/LUCENE-4956 Project: Lucene - Core Issue Type: New Feature Components: modules/analysis Affects Versions: 4.2 Reporter: SooMyung Lee Assignee: Christian Moen Labels: newbie Attachments: LUCENE-4956.patch, eval.patch, kr.analyzer.4x.tar, lucene-4956.patch, lucene4956.patch Korean language has specific characteristic. When developing search service with lucene solr in korean, there are some problems in searching and indexing. The korean analyer solved the problems with a korean morphological anlyzer. It consists of a korean morphological analyzer, dictionaries, a korean tokenizer and a korean filter. The korean anlyzer is made for lucene and solr. If you develop a search service with lucene in korean, It is the best idea to choose the korean analyzer. 
[jira] [Commented] (LUCENE-4956) the korean analyzer that has a korean morphological analyzer and dictionaries
[ https://issues.apache.org/jira/browse/LUCENE-4956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13812050#comment-13812050 ] Benson Margulies commented on LUCENE-4956: -- That jira concerns a different license. The license on the file pointed-to there has no advertising clause that I can spot. the korean analyzer that has a korean morphological analyzer and dictionaries - Key: LUCENE-4956 URL: https://issues.apache.org/jira/browse/LUCENE-4956 Project: Lucene - Core Issue Type: New Feature Components: modules/analysis Affects Versions: 4.2 Reporter: SooMyung Lee Assignee: Christian Moen Labels: newbie Attachments: LUCENE-4956.patch, eval.patch, kr.analyzer.4x.tar, lucene-4956.patch, lucene4956.patch Korean language has specific characteristic. When developing search service with lucene solr in korean, there are some problems in searching and indexing. The korean analyer solved the problems with a korean morphological anlyzer. It consists of a korean morphological analyzer, dictionaries, a korean tokenizer and a korean filter. The korean anlyzer is made for lucene and solr. If you develop a search service with lucene in korean, It is the best idea to choose the korean analyzer. -- This message was sent by Atlassian JIRA (v6.1#6144) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Comment Edited] (LUCENE-4956) the korean analyzer that has a korean morphological analyzer and dictionaries
[ https://issues.apache.org/jira/browse/LUCENE-4956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13812050#comment-13812050 ] Benson Margulies edited comment on LUCENE-4956 at 11/2/13 4:11 PM: --- That jira concerns a different license. The license on the file pointed-to there has no advertising clause that I can spot. Which isn't to say that legal would have a problem with this, just that I don't think that the JIRA in question tells us. was (Author: bmargulies): That jira concerns a different license. The license on the file pointed-to there has no advertising clause that I can spot. the korean analyzer that has a korean morphological analyzer and dictionaries - Key: LUCENE-4956 URL: https://issues.apache.org/jira/browse/LUCENE-4956 Project: Lucene - Core Issue Type: New Feature Components: modules/analysis Affects Versions: 4.2 Reporter: SooMyung Lee Assignee: Christian Moen Labels: newbie Attachments: LUCENE-4956.patch, eval.patch, kr.analyzer.4x.tar, lucene-4956.patch, lucene4956.patch Korean language has specific characteristic. When developing search service with lucene solr in korean, there are some problems in searching and indexing. The korean analyer solved the problems with a korean morphological anlyzer. It consists of a korean morphological analyzer, dictionaries, a korean tokenizer and a korean filter. The korean anlyzer is made for lucene and solr. If you develop a search service with lucene in korean, It is the best idea to choose the korean analyzer. -- This message was sent by Atlassian JIRA (v6.1#6144) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4956) the korean analyzer that has a korean morphological analyzer and dictionaries
[ https://issues.apache.org/jira/browse/LUCENE-4956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13812053#comment-13812053 ] Benson Margulies commented on LUCENE-4956: -- Rob, I got shat on at great length over this for merely test data over at the WS project. I had to make the build pull the data over the network to get certain directors off of my back. I'm trying to spare you the experience. That's all. As a low-intensity member of the UTC, I would also expect there to be only one license. However, I compare: {noformat} # Copyright (c) 1991-2011 Unicode, Inc. All Rights reserved. # # This file is provided as-is by Unicode, Inc. (The Unicode Consortium). No # claims are made as to fitness for any particular purpose. No warranties of # any kind are expressed or implied. The recipient agrees to determine # applicability of information provided. If this file has been provided on # magnetic media by Unicode, Inc., the sole remedy for any claim will be # exchange of defective media within 90 days of receipt. # # Unicode, Inc. hereby grants the right to freely use the information # supplied in this file in the creation of products supporting the # Unicode Standard, and to make copies of this file in any form for # internal or external distribution as long as this notice remains # attached. {noformat} with {noformat} ! Copyright (c) 1991-2013 Unicode, Inc. ! All rights reserved. ! Distributed under the Terms of Use in http://www.unicode.org/copyright.html. ! ! Permission is hereby granted, free of charge, to any person obtaining a copy ! of the Unicode data files and any associated documentation (the Data Files) ! or Unicode software and any associated documentation (the Software) to deal ! in the Data Files or Software without restriction, including without limitation ! the rights to use, copy, modify, merge, publish, distribute, and/or sell copies ! of the Data Files or Software, and to permit persons to whom the Data Files or ! 
Software are furnished to do so, provided that (a) the above copyright notice(s) ! and this permission notice appear with all copies of the Data Files or Software, ! (b) both the above copyright notice(s) and this permission notice appear in ! associated documentation, and (c) there is clear notice in each modified Data ! File or in the Software as well as in the documentation associated with the Data ! File(s) or Software that the data or software has been modified. ! ! THE DATA FILES AND SOFTWARE ARE PROVIDED AS IS, WITHOUT WARRANTY OF ANY KIND, ! EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, ! FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT OF THIRD PARTY RIGHTS. IN NO ! EVENT SHALL THE COPYRIGHT HOLDER OR HOLDERS INCLUDED IN THIS NOTICE BE LIABLE FOR ! ANY CLAIM, OR ANY SPECIAL INDIRECT OR CONSEQUENTIAL DAMAGES, OR ANY DAMAGES ! WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF ! CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION ! WITH THE USE OR PERFORMANCE OF THE DATA FILES OR SOFTWARE. ! ! Except as contained in this notice, the name of a copyright holder shall not be ! used in advertising or otherwise to promote the sale, use or other dealings in ! these Data Files or Software without prior written authorization of the copyright holder. {noformat} They look pretty different to me. Go figure? the korean analyzer that has a korean morphological analyzer and dictionaries - Key: LUCENE-4956 URL: https://issues.apache.org/jira/browse/LUCENE-4956 Project: Lucene - Core Issue Type: New Feature Components: modules/analysis Affects Versions: 4.2 Reporter: SooMyung Lee Assignee: Christian Moen Labels: newbie Attachments: LUCENE-4956.patch, eval.patch, kr.analyzer.4x.tar, lucene-4956.patch, lucene4956.patch Korean language has specific characteristic. When developing search service with lucene solr in korean, there are some problems in searching and indexing. 
The korean analyzer solved the problems with a korean morphological analyzer. It consists of a korean morphological analyzer, dictionaries, a korean tokenizer and a korean filter. The korean analyzer is made for lucene and solr. If you develop a search service with lucene in korean, it is the best idea to choose the korean analyzer.
[jira] [Commented] (LUCENE-4956) the korean analyzer that has a korean morphological analyzer and dictionaries
[ https://issues.apache.org/jira/browse/LUCENE-4956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13812059#comment-13812059 ] Benson Margulies commented on LUCENE-4956: -- OK, I see, the email thread about Unicode data in general does certainly cover this. Sometimes the workings of Legal are pretty perplexing. the korean analyzer that has a korean morphological analyzer and dictionaries - Key: LUCENE-4956 URL: https://issues.apache.org/jira/browse/LUCENE-4956 Project: Lucene - Core Issue Type: New Feature Components: modules/analysis Affects Versions: 4.2 Reporter: SooMyung Lee Assignee: Christian Moen Labels: newbie Attachments: LUCENE-4956.patch, eval.patch, kr.analyzer.4x.tar, lucene-4956.patch, lucene4956.patch Korean language has specific characteristic. When developing search service with lucene solr in korean, there are some problems in searching and indexing. The korean analyer solved the problems with a korean morphological anlyzer. It consists of a korean morphological analyzer, dictionaries, a korean tokenizer and a korean filter. The korean anlyzer is made for lucene and solr. If you develop a search service with lucene in korean, It is the best idea to choose the korean analyzer. -- This message was sent by Atlassian JIRA (v6.1#6144) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4956) the korean analyzer that has a korean morphological analyzer and dictionaries
[ https://issues.apache.org/jira/browse/LUCENE-4956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13807877#comment-13807877 ] Benson Margulies commented on LUCENE-4956: -- Hmm. When I followed the link, I found a .tar.gz. I guess the zip was further down the page. the korean analyzer that has a korean morphological analyzer and dictionaries - Key: LUCENE-4956 URL: https://issues.apache.org/jira/browse/LUCENE-4956 Project: Lucene - Core Issue Type: New Feature Components: modules/analysis Affects Versions: 4.2 Reporter: SooMyung Lee Assignee: Christian Moen Labels: newbie Attachments: eval.patch, kr.analyzer.4x.tar, lucene-4956.patch, lucene4956.patch, LUCENE-4956.patch Korean language has specific characteristic. When developing search service with lucene solr in korean, there are some problems in searching and indexing. The korean analyer solved the problems with a korean morphological anlyzer. It consists of a korean morphological analyzer, dictionaries, a korean tokenizer and a korean filter. The korean anlyzer is made for lucene and solr. If you develop a search service with lucene in korean, It is the best idea to choose the korean analyzer. -- This message was sent by Atlassian JIRA (v6.1#6144) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4956) the korean analyzer that has a korean morphological analyzer and dictionaries
[ https://issues.apache.org/jira/browse/LUCENE-4956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13807918#comment-13807918 ] Benson Margulies commented on LUCENE-4956: -- Something's funny here. On this page (http://www.kristalinfo.com/TestCollections/), the zip file has directories like HANTEC-2.0/relevance_file/+�++/L2.rel The code in the patch expects the word 'full' in the Latin alphabet, no funny full-width, in that intermediate directory. On the other hand, I can't seem to find an unzip with a documented -O option on Linux. the korean analyzer that has a korean morphological analyzer and dictionaries - Key: LUCENE-4956 URL: https://issues.apache.org/jira/browse/LUCENE-4956 Project: Lucene - Core Issue Type: New Feature Components: modules/analysis Affects Versions: 4.2 Reporter: SooMyung Lee Assignee: Christian Moen Labels: newbie Attachments: eval.patch, kr.analyzer.4x.tar, lucene-4956.patch, lucene4956.patch, LUCENE-4956.patch Korean language has specific characteristics. When developing search service with lucene solr in korean, there are some problems in searching and indexing. The korean analyzer solved the problems with a korean morphological analyzer. It consists of a korean morphological analyzer, dictionaries, a korean tokenizer and a korean filter. The korean analyzer is made for lucene and solr. If you develop a search service with lucene in korean, it is the best idea to choose the korean analyzer.
[jira] [Comment Edited] (LUCENE-4956) the korean analyzer that has a korean morphological analyzer and dictionaries
[ https://issues.apache.org/jira/browse/LUCENE-4956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13807918#comment-13807918 ] Benson Margulies edited comment on LUCENE-4956 at 10/29/13 12:32 PM: - Something's funny here. On this page (http://www.kristalinfo.com/TestCollections/), the zip file has directories like HANTEC-2.0/relevance_file/과학기술분야/ HANTEC-2.0/relevance_file/전체/ The code in the patch expects the word 'full' in the Latin alphabet, no funny full-width, in that intermediate directory. So I don't see how a code-page option to unzip got there. I'm suspecting that an 'mv' is in order. was (Author: bmargulies): Something's funny here. On this page (http://www.kristalinfo.com/TestCollections/), the zip file has directories like HANTEC-2.0/relevance_file/+�++/L2.rel The code in the patch expects the word 'full' in the Latin alphabet, no funny full-width, in that intermediate directory. On the other hand, I can't seem to find an unzip with a documented -O option on Linux. the korean analyzer that has a korean morphological analyzer and dictionaries - Key: LUCENE-4956 URL: https://issues.apache.org/jira/browse/LUCENE-4956 Project: Lucene - Core Issue Type: New Feature Components: modules/analysis Affects Versions: 4.2 Reporter: SooMyung Lee Assignee: Christian Moen Labels: newbie Attachments: eval.patch, kr.analyzer.4x.tar, lucene-4956.patch, lucene4956.patch, LUCENE-4956.patch Korean language has specific characteristics. When developing search service with lucene solr in korean, there are some problems in searching and indexing. The korean analyzer solved the problems with a korean morphological analyzer. It consists of a korean morphological analyzer, dictionaries, a korean tokenizer and a korean filter. The korean analyzer is made for lucene and solr. If you develop a search service with lucene in korean, it is the best idea to choose the korean analyzer.
[jira] [Comment Edited] (LUCENE-4956) the korean analyzer that has a korean morphological analyzer and dictionaries
[ https://issues.apache.org/jira/browse/LUCENE-4956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13807918#comment-13807918 ] Benson Margulies edited comment on LUCENE-4956 at 10/29/13 12:34 PM: - Something's funny here. On this page (http://www.kristalinfo.com/TestCollections/), the zip file has directories like HANTEC-2.0/relevance_file/과학기술분야/ HANTEC-2.0/relevance_file/전체/ The first translates as 'Science and Technology' and the second as 'All'. The code in the patch expects the word 'full' in the Latin alphabet, no funny full-width, in that intermediate directory. So I don't see how a code-page option to unzip got there. I'm suspecting that an 'mv' is in order. was (Author: bmargulies): Something's funny here. On this page (http://www.kristalinfo.com/TestCollections/), the zip file has directories like HANTEC-2.0/relevance_file/과학기술분야/ HANTEC-2.0/relevance_file/전체/ The code in the patch expects the word 'full' in the Latin alphabet, no funny full-width, in that intermediate directory. So I don't see how a code-page option to unzip got there. I'm suspecting that an 'mv' is in order. the korean analyzer that has a korean morphological analyzer and dictionaries - Key: LUCENE-4956 URL: https://issues.apache.org/jira/browse/LUCENE-4956 Project: Lucene - Core Issue Type: New Feature Components: modules/analysis Affects Versions: 4.2 Reporter: SooMyung Lee Assignee: Christian Moen Labels: newbie Attachments: eval.patch, kr.analyzer.4x.tar, lucene-4956.patch, lucene4956.patch, LUCENE-4956.patch Korean language has specific characteristics. When developing search service with lucene solr in korean, there are some problems in searching and indexing. The korean analyzer solved the problems with a korean morphological analyzer. It consists of a korean morphological analyzer, dictionaries, a korean tokenizer and a korean filter. The korean analyzer is made for lucene and solr.
If you develop a search service with lucene in korean, it is the best idea to choose the korean analyzer.
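The 'mv' suspected in the comment above could look roughly like the following Java sketch. It assumes that the directory whose name translates as 'All' is the one the patch wants to find under the name full; the thread does not confirm that mapping, so treat it as a guess. RelevanceDirRenamer and renameChild are names made up for this example.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper, not part of the patch under discussion. */
final class RelevanceDirRenamer {
    private RelevanceDirRenamer() {}

    /**
     * Renames the directory parent/from to parent/to and returns the new path.
     * Fails if the source is not a directory or if the target already exists.
     */
    static Path renameChild(Path parent, String from, String to) throws IOException {
        Path source = parent.resolve(from);
        Path target = parent.resolve(to);
        if (!Files.isDirectory(source)) {
            throw new IOException("not a directory: " + source);
        }
        // Without REPLACE_EXISTING, Files.move refuses to overwrite an existing target.
        return Files.move(source, target);
    }
}
```

A call for the case in this thread might be renameChild(Paths.get("HANTEC-2.0", "relevance_file"), "\uC804\uCCB4", "full"), where the escaped string is the Korean directory name that translates as 'All'. Writing it with Unicode escapes keeps the source file ASCII-only, which sidesteps the same kind of encoding trouble the unzip step ran into.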
[jira] [Commented] (LUCENE-4956) the korean analyzer that has a korean morphological analyzer and dictionaries
[ https://issues.apache.org/jira/browse/LUCENE-4956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13807567#comment-13807567 ] Benson Margulies commented on LUCENE-4956: -- Could you share the trick of unpacking the big tarball, locale-wise? I ended up with: [benson] /data/HANTEC-2.0 % ls relevance_file %B0%FA%C7б%E2%BC%FA%BAо%DF %C0%FCü which does not work so well. Did you set LOCALE to something before unpacking? the korean analyzer that has a korean morphological analyzer and dictionaries - Key: LUCENE-4956 URL: https://issues.apache.org/jira/browse/LUCENE-4956 Project: Lucene - Core Issue Type: New Feature Components: modules/analysis Affects Versions: 4.2 Reporter: SooMyung Lee Assignee: Christian Moen Labels: newbie Attachments: eval.patch, kr.analyzer.4x.tar, lucene-4956.patch, lucene4956.patch, LUCENE-4956.patch Korean language has specific characteristic. When developing search service with lucene solr in korean, there are some problems in searching and indexing. The korean analyer solved the problems with a korean morphological anlyzer. It consists of a korean morphological analyzer, dictionaries, a korean tokenizer and a korean filter. The korean anlyzer is made for lucene and solr. If you develop a search service with lucene in korean, It is the best idea to choose the korean analyzer. -- This message was sent by Atlassian JIRA (v6.1#6144) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4956) the korean analyzer that has a korean morphological analyzer and dictionaries
[ https://issues.apache.org/jira/browse/LUCENE-4956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13797876#comment-13797876 ] Benson Margulies commented on LUCENE-4956: -- As a potential user of this technology, I'd like to ask for it to have documentation of its linguistic approach. * What is the goal of the tokenizer? Is it to deliver eojeol or hyung-tae-so? If eojeol, does it split up the case where Korean writers are sometimes relaxed about whitespace between them? * Similarly, what does it set out to index? Does it index eojeol and then also their contained eumjeol or hyung-tae-so, using position-increment / position-length to indicate compound relationships? the korean analyzer that has a korean morphological analyzer and dictionaries - Key: LUCENE-4956 URL: https://issues.apache.org/jira/browse/LUCENE-4956 Project: Lucene - Core Issue Type: New Feature Components: modules/analysis Affects Versions: 4.2 Reporter: SooMyung Lee Assignee: Christian Moen Labels: newbie Attachments: eval.patch, kr.analyzer.4x.tar, lucene-4956.patch, lucene4956.patch, LUCENE-4956.patch Korean language has specific characteristics. When developing search service with lucene solr in korean, there are some problems in searching and indexing. The korean analyzer solved the problems with a korean morphological analyzer. It consists of a korean morphological analyzer, dictionaries, a korean tokenizer and a korean filter. The korean analyzer is made for lucene and solr. If you develop a search service with lucene in korean, it is the best idea to choose the korean analyzer.
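For readers unfamiliar with the position-increment / position-length encoding the question refers to, the sketch below shows how a compound and its parts can share positions. Whether the Korean analyzer actually emits tokens this way is exactly what the comment asks, so nothing here describes its real behavior. SketchToken and CompoundPositions are hypothetical names, and the terms "AB", "A" and "B" are placeholders for an eojeol and its two components.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical token: a term plus the two position fields named in the comment. */
final class SketchToken {
    final String term;
    final int positionIncrement; // distance from the previous token's position
    final int positionLength;    // how many positions this token spans

    SketchToken(String term, int positionIncrement, int positionLength) {
        this.term = term;
        this.positionIncrement = positionIncrement;
        this.positionLength = positionLength;
    }
}

final class CompoundPositions {
    private CompoundPositions() {}

    /** Resolves increments to absolute positions and renders each token as term[start,end). */
    static List<String> spans(List<SketchToken> tokens) {
        List<String> out = new ArrayList<>();
        int position = -1; // a first token with increment 1 lands on position 0
        for (SketchToken t : tokens) {
            position += t.positionIncrement;
            out.add(t.term + "[" + position + "," + (position + t.positionLength) + ")");
        }
        return out;
    }
}
```

Feeding it the tokens ("AB", 1, 2), ("A", 0, 1), ("B", 1, 1) yields AB[0,2), A[0,1), B[1,2): the compound spans two positions, its first part starts at the same position (increment 0), and its second part fills the next one.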
[jira] [Commented] (LUCENE-4956) the korean analyzer that has a korean morphological analyzer and dictionaries
[ https://issues.apache.org/jira/browse/LUCENE-4956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13798393#comment-13798393 ] Benson Margulies commented on LUCENE-4956: -- I am told (I don't read Korean myself) that people often leave out the white space between eojeol that are made up entirely of Hangul letters (Korean letters). Are you just defining these very long things to be single eojeol? Prof Kang in his own work has a module that splits these using some rules. the korean analyzer that has a korean morphological analyzer and dictionaries - Key: LUCENE-4956 URL: https://issues.apache.org/jira/browse/LUCENE-4956 Project: Lucene - Core Issue Type: New Feature Components: modules/analysis Affects Versions: 4.2 Reporter: SooMyung Lee Assignee: Christian Moen Labels: newbie Attachments: eval.patch, kr.analyzer.4x.tar, lucene-4956.patch, lucene4956.patch, LUCENE-4956.patch Korean language has specific characteristic. When developing search service with lucene solr in korean, there are some problems in searching and indexing. The korean analyer solved the problems with a korean morphological anlyzer. It consists of a korean morphological analyzer, dictionaries, a korean tokenizer and a korean filter. The korean anlyzer is made for lucene and solr. If you develop a search service with lucene in korean, It is the best idea to choose the korean analyzer. -- This message was sent by Atlassian JIRA (v6.1#6144) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Created] (LUCENE-5244) NPE in Japanese Analyzer
Benson Margulies created LUCENE-5244:
Summary: NPE in Japanese Analyzer
Key: LUCENE-5244
URL: https://issues.apache.org/jira/browse/LUCENE-5244
Project: Lucene - Core
Issue Type: Bug
Components: modules/analysis
Affects Versions: 4.4
Reporter: Benson Margulies

I've got a test case that shows an NPE with the Japanese analyzer. It's all available in https://github.com/benson-basis/kuromoji-npe, and I explicitly grant a license to the Foundation. If anyone would prefer that I attach a tarball here, just let me know.
{noformat}
--- T E S T S ---
Running com.basistech.testcase.JapaneseNpeTest
Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 0.298 sec FAILURE! - in com.basistech.testcase.JapaneseNpeTest
japaneseNpe(com.basistech.testcase.JapaneseNpeTest) Time elapsed: 0.282 sec ERROR!
java.lang.NullPointerException: null
    at org.apache.lucene.analysis.util.RollingCharBuffer.get(RollingCharBuffer.java:86)
    at org.apache.lucene.analysis.ja.JapaneseTokenizer.parse(JapaneseTokenizer.java:618)
    at org.apache.lucene.analysis.ja.JapaneseTokenizer.incrementToken(JapaneseTokenizer.java:468)
    at com.basistech.testcase.JapaneseNpeTest.japaneseNpe(JapaneseNpeTest.java:28)
{noformat}
[jira] [Resolved] (LUCENE-5244) NPE in Japanese Analyzer
[ https://issues.apache.org/jira/browse/LUCENE-5244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Benson Margulies resolved LUCENE-5244.
Resolution: Invalid
This was pilot error, I forgot to call reset().

NPE in Japanese Analyzer
Key: LUCENE-5244
URL: https://issues.apache.org/jira/browse/LUCENE-5244
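The resolution above comes down to the TokenStream consumer workflow: call reset() before the first incrementToken(), then end() and close() once the stream is exhausted. The sketch below models that contract with a hypothetical stand-in class, `SketchTokenStream`, so that it runs without Lucene on the classpath. It is not Lucene's implementation, and as an assumption of this sketch it fails with an IllegalStateException instead of the NullPointerException shown in the report.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

final class ResetContractSketch {

    /** Hypothetical stand-in for a token stream; not a Lucene class. */
    static final class SketchTokenStream implements AutoCloseable {
        private final List<String> source;
        private Iterator<String> cursor;  // stays null until reset() is called
        private String current;

        SketchTokenStream(String... tokens) {
            this.source = Arrays.asList(tokens);
        }

        /** Must be called before the first incrementToken(). */
        void reset() {
            cursor = source.iterator();
        }

        /** Advances to the next token; returns false when the input is exhausted. */
        boolean incrementToken() {
            if (cursor == null) {
                throw new IllegalStateException("reset() was not called before incrementToken()");
            }
            if (!cursor.hasNext()) {
                return false;
            }
            current = cursor.next();
            return true;
        }

        /** The term of the token most recently produced by incrementToken(). */
        String term() {
            return current;
        }

        /** End-of-stream hook; there is nothing to do in this sketch. */
        void end() {
        }

        @Override
        public void close() {
            cursor = null;
        }
    }

    /** Consumes a stream in the order reset, incrementToken (repeated), end, close. */
    static List<String> consume(SketchTokenStream stream) {
        List<String> terms = new ArrayList<>();
        try (SketchTokenStream ts = stream) {
            ts.reset();
            while (ts.incrementToken()) {
                terms.add(ts.term());
            }
            ts.end();
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(consume(new SketchTokenStream("a", "b")));
        try {
            new SketchTokenStream("a").incrementToken();  // reset() skipped on purpose
        } catch (IllegalStateException e) {
            System.out.println("caught: " + e.getMessage());
        }
    }
}
```

With a real analyzer the same ordering applies to the stream returned by the analyzer; only the stand-in class and its failure mode are specific to this sketch.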
[jira] [Created] (SOLR-5259) Typo in error message from missing / wrong _version_ field
Benson Margulies created SOLR-5259:
Summary: Typo in error message from missing / wrong _version_ field
Key: SOLR-5259
URL: https://issues.apache.org/jira/browse/SOLR-5259
Project: Solr
Issue Type: Bug
Affects Versions: 4.4
Reporter: Benson Margulies

Note the missing space between _version_ and field.
Caused by: org.apache.solr.common.SolrException: Unable to use updateLog: _version_field must exist in schema, using indexed=true stored=true and multiValued=false (_version_ is not indexed
[jira] [Commented] (LUCENE-5202) LookaheadTokenFilter consumes an extra token in nextToken
[ https://issues.apache.org/jira/browse/LUCENE-5202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13761254#comment-13761254 ]
Benson Margulies commented on LUCENE-5202:
Yes, that's what I have and it works, except for the problem I wrote this test case to demonstrate. There's a call to peekToken in nextToken that is used to detect the end of the input. When that gets called, a token 'moves' from the input to the positions, so the calls to peekToken in my code never see it. Either I'm supposed to call restoreState to examine it, or there's a problem here. If I'm supposed to call restoreState, I need to figure out how to notice (by looking at positions?) that I'm in that situation. Or there's some problem in my logic for deciding when to do my next load of peeks, so that nextToken is never supposed to reach that call to peek, but I can't figure out what it is.

LookaheadTokenFilter consumes an extra token in nextToken
Key: LUCENE-5202
URL: https://issues.apache.org/jira/browse/LUCENE-5202
Project: Lucene - Core
Issue Type: Bug
Affects Versions: 4.3.1
Reporter: Benson Margulies
Attachments: LUCENE-5202.patch, LUCENE-5202.patch

This is a bit hard to explain except by looking at the test case. I've coded a filter that uses LookaheadTokenFilter. The incrementToken method peeks some tokens. Then, it seems, nextToken in the Lookahead class calls peekToken itself, which seems to me to consume a token so that it's not seen when the derived class sets out to process the next set of tokens. In passing, this test case can be used to demonstrate that it does not work to try to use the afterPosition method to set up attributes of the token that we're 'after'. Probably that was never intended. However, I'm hoping for some feedback as to whether the rest of the structure here is as intended for subclasses of LookaheadTokenFilter.
[jira] [Commented] (LUCENE-5202) LookaheadTokenFilter consumes an extra token in nextToken
[ https://issues.apache.org/jira/browse/LUCENE-5202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13761475#comment-13761475 ]
Benson Margulies commented on LUCENE-5202:
OK, I see. So I'll leave it to you to apply this patch to pick up the fix you made. Thanks.

LookaheadTokenFilter consumes an extra token in nextToken
Key: LUCENE-5202
URL: https://issues.apache.org/jira/browse/LUCENE-5202