
[jira] [Commented] (SOLR-5228) Deprecate fields and types tags in schema.xml

2014-03-24 Thread Benson Margulies (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-5228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13944920#comment-13944920
 ] 

Benson Margulies commented on SOLR-5228:


Allow the person extending the schema to provide a, well, extended schema.



 Deprecate fields and types tags in schema.xml
 -

 Key: SOLR-5228
 URL: https://issues.apache.org/jira/browse/SOLR-5228
 Project: Solr
  Issue Type: Improvement
  Components: Schema and Analysis
Reporter: Hoss Man
Assignee: Erick Erickson
 Fix For: 4.8, 5.0

 Attachments: SOLR-5228.patch, SOLR-5228.patch


 On the solr-user mailing list, Nutan recently mentioned spending days trying
 to track down a problem that turned out to be because he had attempted to add
 a {{<dynamicField .. />}} that was outside of the {{<fields>}} block in his
 schema.xml -- Solr was just silently ignoring it.
 We have made improvements in other areas of config validation by generating
 startup errors when tags/attributes are found that are not expected -- but in
 this case I think we should just stop expecting/requiring that the
 {{<fields>}} and {{<types>}} tags will be used to group these sorts of
 things.  I think schema.xml parsing should just start ignoring them and only
 care about finding the {{<field>}}, {{<dynamicField>}}, and {{<fieldType>}}
 tags wherever they may be.
 If people want to keep using them, fine.  If people want to mix fieldTypes 
 and fields side by side (perhaps specify a fieldType, then list all the 
 fields using it) fine.  I don't see any value in forcing people to use them, 
 but we definitely shouldn't leave things the way they are with otherwise 
 perfectly valid field/type declarations being silently ignored.
 ---
 I'll take this on unless I see any objections.
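
The proposal in the description above amounts to locating the field, dynamicField and fieldType declarations wherever they sit, rather than only under the grouping tags. A minimal sketch of that kind of lookup with the JDK's XPath API follows. The class name, method name and sample schema are made up for illustration, and this is not Solr's actual schema-loading code.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration only; not part of Solr.
class FieldFinderSketch {

  /** Returns "tagName:name" for every declaration found at any depth under the root schema element. */
  static List<String> findDeclarations(String schemaXml) {
    try {
      Document doc = DocumentBuilderFactory.newInstance()
          .newDocumentBuilder()
          .parse(new InputSource(new StringReader(schemaXml)));
      // "//" matches descendants at any depth, so the grouping wrappers become optional.
      NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath().evaluate(
          "/schema//field | /schema//dynamicField | /schema//fieldType",
          doc, XPathConstants.NODESET);
      List<String> found = new ArrayList<>();
      for (int i = 0; i < nodes.getLength(); i++) {
        Element e = (Element) nodes.item(i);
        found.add(e.getTagName() + ":" + e.getAttribute("name"));
      }
      return found;
    } catch (Exception e) {
      throw new IllegalArgumentException("could not scan schema", e);
    }
  }

  public static void main(String[] args) {
    String xml = "<schema name='demo'>"
        + "<types><fieldType name='string' class='solr.StrField'/></types>"
        + "<fields><field name='id' type='string'/></fields>"
        + "<dynamicField name='*_s' type='string'/>" // outside the fields block, still found
        + "</schema>";
    System.out.println(findDeclarations(xml));
  }
}
```

With a lookup of this shape, a dynamicField placed outside the fields block is picked up like any other declaration instead of being skipped.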



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (SOLR-5228) Deprecate fields and types tags in schema.xml

2014-03-24 Thread Benson Margulies (JIRA)

[
https://issues.apache.org/jira/browse/SOLR-5228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13944971#comment-13944971
]

Benson Margulies commented on SOLR-5228:

DTDs are useless. We need to pick one of W3C XML Schema or RNG. RNG is a lot
easier to work with. Schematron is another possibility, but I have no
experience with it. See
http://docs.oracle.com/javase/7/docs/api/javax/xml/validation/package-summary.html.

Choices are:

* validation is easy to disable; people who customize disable it
* customizers take the entire schema, add to it, and provide their added one.
Not so good for multiples.
* customizers are constrained to use _namespaces_ -- you customize, you add an
XML namespace, and you provide a schema for your namespace.

Of course the first time we try this we'll find problems in the test schemas.

Has anyone done anything in this area that I could start from if I were inclined
to try to work on this?
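
For the namespace option in the list above, here is a rough sketch using the javax.xml.validation package linked in this comment. It assumes W3C XML Schema, since that is what the JDK supports out of the box. The toy XSD, the class name and the urn:example:custom namespace are all invented for the example and are not a proposed Solr schema.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.SAXException;

// Hypothetical sketch; the XSD below is a toy, not Solr's schema format.
class NamespaceExtensionSketch {

  // Core elements live in no namespace. The wildcard accepts elements from any
  // OTHER namespace, so customizers can add their own without touching this XSD.
  static final String XSD =
      "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
      + "<xs:element name='schema'><xs:complexType><xs:sequence>"
      + "<xs:element name='field' minOccurs='0' maxOccurs='unbounded'>"
      + "<xs:complexType><xs:attribute name='name' type='xs:string' use='required'/></xs:complexType>"
      + "</xs:element>"
      + "<xs:any namespace='##other' processContents='lax' minOccurs='0' maxOccurs='unbounded'/>"
      + "</xs:sequence></xs:complexType></xs:element>"
      + "</xs:schema>";

  /** Returns true if the document validates, false on any schema or validation error. */
  static boolean isValid(String xml) {
    try {
      SchemaFactory factory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
      Schema schema = factory.newSchema(new StreamSource(new StringReader(XSD)));
      Validator validator = schema.newValidator();
      // With no ErrorHandler set, validation errors are thrown as SAXException.
      validator.validate(new StreamSource(new StringReader(xml)));
      return true;
    } catch (SAXException e) {
      return false;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println(isValid("<schema><field name='id'/></schema>"));
    System.out.println(isValid(
        "<schema><field name='id'/><ext:boost xmlns:ext='urn:example:custom' factor='2'/></schema>"));
    System.out.println(isValid("<schema><field name='id'/><dynamicField name='*_s'/></schema>"));
  }
}
```

The wildcard uses lax processing, so a customizer's elements are only checked when a schema for their namespace is supplied. An unqualified element the core schema does not know about is still rejected.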

Deprecate fields and types tags in schema.xml
-

Key: SOLR-5228
URL: https://issues.apache.org/jira/browse/SOLR-5228
Project: Solr
Issue Type: Improvement
Components: Schema and Analysis
Reporter: Hoss Man
Assignee: Erick Erickson
Fix For: 4.8, 5.0

Attachments: SOLR-5228.patch, SOLR-5228.patch


[jira] [Commented] (SOLR-5228) Deprecate fields and types tags in schema.xml

2014-03-23 Thread Benson Margulies (JIRA)

[
https://issues.apache.org/jira/browse/SOLR-5228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13944636#comment-13944636
]

Benson Margulies commented on SOLR-5228:

I apologize for showing up so late with an opinion. I can't get over the
feeling that this might be solving the wrong problem.

In XML, the structure of

{code}
<SOME_ITEMs>
<SOME_ITEM>
</SOME_ITEM>
...
</SOME_ITEMs>
{code}

is ancient and honorable. Yea, some schemas dispense with the container for the
group, but plenty do not. The source of this was someone who misplaced an item
and didn't get a diagnosis. _Why don't we concentrate on diagnosis?_ Why not
create a schema and, by default, check it? It's not like we're in a giant hurry
at start-up compared to the extra time of enabling a validating parse.

Grouping these guys together is harmless at worst and slightly helpful at best.

If we are going to change the schema, I would beg that anyone changing it put
forth an actual, well, _schema_ that is an accurate representation of what is
allowed.

So I'm belatedly -1 on this change, for what tiny little bit it's worth.
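
To illustrate the "concentrate on diagnosis" idea in this comment, here is a small sketch of a validating parse that keeps the container element and reports a misplaced item with its line number. It uses the JDK's javax.xml.validation API. The toy XSD and the class and method names are assumptions for the example, not an actual schema for schema.xml.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.ErrorHandler;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

// Hypothetical sketch; the XSD is a toy that only knows <schema>, <fields> and <field>.
class SchemaDiagnosisSketch {

  static final String XSD =
      "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
      + "<xs:element name='schema'><xs:complexType><xs:sequence>"
      + "<xs:element name='fields'><xs:complexType><xs:sequence>"
      + "<xs:element name='field' minOccurs='0' maxOccurs='unbounded'>"
      + "<xs:complexType><xs:attribute name='name' type='xs:string' use='required'/></xs:complexType>"
      + "</xs:element>"
      + "</xs:sequence></xs:complexType></xs:element>"
      + "</xs:sequence></xs:complexType></xs:element>"
      + "</xs:schema>";

  /** Validates the document and returns one human-readable line per problem found. */
  static List<String> diagnose(String xml) {
    List<String> problems = new ArrayList<>();
    try {
      Schema schema = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
          .newSchema(new StreamSource(new StringReader(XSD)));
      Validator validator = schema.newValidator();
      // Collect recoverable errors instead of stopping at the first one.
      validator.setErrorHandler(new ErrorHandler() {
        public void warning(SAXParseException e) { problems.add(describe(e)); }
        public void error(SAXParseException e) { problems.add(describe(e)); }
        public void fatalError(SAXParseException e) throws SAXParseException { throw e; }
      });
      validator.validate(new StreamSource(new StringReader(xml)));
    } catch (SAXException | IOException e) {
      problems.add("fatal: " + e.getMessage());
    }
    return problems;
  }

  private static String describe(SAXParseException e) {
    return "line " + e.getLineNumber() + ": " + e.getMessage();
  }

  public static void main(String[] args) {
    String misplaced = "<schema>\n<fields>\n<field name='id'/>\n</fields>\n"
        + "<field name='stray'/>\n</schema>";
    diagnose(misplaced).forEach(System.out::println);
  }
}
```

Run against the misplaced item in main, this yields a diagnostic rather than a silently ignored declaration, which is the failure the original report describes.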

Deprecate fields and types tags in schema.xml
-

Key: SOLR-5228
URL: https://issues.apache.org/jira/browse/SOLR-5228
Project: Solr
Issue Type: Improvement
Components: Schema and Analysis
Reporter: Hoss Man
Assignee: Erick Erickson
Attachments: SOLR-5228.patch, SOLR-5228.patch


[jira] [Assigned] (LUCENE-5449) Two ancient classes renamed to be less peculiar: _TestHelper and _TestUtil

2014-02-19 Thread Benson Margulies (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-5449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benson Margulies reassigned LUCENE-5449:


Assignee: Benson Margulies

 Two ancient classes renamed to be less peculiar: _TestHelper and _TestUtil
 --

 Key: LUCENE-5449
 URL: https://issues.apache.org/jira/browse/LUCENE-5449
 Project: Lucene - Core
  Issue Type: Improvement
Affects Versions: 4.6.1
Reporter: Benson Margulies
Assignee: Benson Margulies
Priority: Minor

 _TestUtil and _TestHelper begin with _ for historical reasons that don't 
 apply any longer. Let's eliminate those _'s.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (LUCENE-5449) Two ancient classes renamed to be less peculiar: _TestHelper and _TestUtil

2014-02-19 Thread Benson Margulies (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-5449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13906174#comment-13906174
 ] 

Benson Margulies commented on LUCENE-5449:
--

I'm unable to reconstruct how I laid this egg. My only theory is that I had 
somehow cd'd back to the wrong tree before running ant precommit after thinking 
I'd set up the merge. Rob's commit really just finishes my work on 'part 1':
part 2 was always going to be the _TestHelper commit. Let's see if I can get 
that one right.


 Two ancient classes renamed to be less peculiar: _TestHelper and _TestUtil
 --

 Key: LUCENE-5449
 URL: https://issues.apache.org/jira/browse/LUCENE-5449
 Project: Lucene - Core
  Issue Type: Improvement
Affects Versions: 4.6.1
Reporter: Benson Margulies
Assignee: Benson Margulies
Priority: Minor


[jira] [Comment Edited] (LUCENE-5449) Two ancient classes renamed to be less peculiar: _TestHelper and _TestUtil

2014-02-19 Thread Benson Margulies (JIRA)

[
https://issues.apache.org/jira/browse/LUCENE-5449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13906174#comment-13906174
]

Benson Margulies edited comment on LUCENE-5449 at 2/19/14 10:02 PM:

I'm unable to reconstruct how I laid this egg. My only theory is that I had
somehow cd'd back to the wrong tree before running ant precommit after thinking
I had made all the corrections after the merge. Rob's commit really just
finishes my work on 'part 1': part 2 was always going to be the _TestHelper
commit. Let's see if I can get that one right.

was (Author: bmargulies):
I'm unable to reconstruct how I laid this egg. My only theory is that I had
somehow cd'd back to the wrong tree before running ant precommit after thinking
i've set up the merge. Rob's commit really just finishes my work on 'part 1':
part 2 was always going to be the _TestHelper commit. Let's see if I can get
that one right.

Two ancient classes renamed to be less peculiar: _TestHelper and _TestUtil
--

Key: LUCENE-5449
URL: https://issues.apache.org/jira/browse/LUCENE-5449
Project: Lucene - Core
Issue Type: Improvement
Affects Versions: 4.6.1
Reporter: Benson Margulies
Assignee: Benson Margulies
Priority: Minor


[jira] [Updated] (LUCENE-5449) Two ancient classes renamed to be less peculiar: _TestHelper and _TestUtil

2014-02-19 Thread Benson Margulies (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-5449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benson Margulies updated LUCENE-5449:
-

Fix Version/s: 5.0
   4.8

 Two ancient classes renamed to be less peculiar: _TestHelper and _TestUtil
 --

 Key: LUCENE-5449
 URL: https://issues.apache.org/jira/browse/LUCENE-5449
 Project: Lucene - Core
  Issue Type: Improvement
Affects Versions: 4.6.1
Reporter: Benson Margulies
Assignee: Benson Margulies
Priority: Minor
 Fix For: 4.8, 5.0



[jira] [Resolved] (LUCENE-5449) Two ancient classes renamed to be less peculiar: _TestHelper and _TestUtil

2014-02-19 Thread Benson Margulies (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-5449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benson Margulies resolved LUCENE-5449.
--

Resolution: Fixed

 Two ancient classes renamed to be less peculiar: _TestHelper and _TestUtil
 --

 Key: LUCENE-5449
 URL: https://issues.apache.org/jira/browse/LUCENE-5449
 Project: Lucene - Core
  Issue Type: Improvement
Affects Versions: 4.6.1
Reporter: Benson Margulies
Assignee: Benson Margulies
Priority: Minor
 Fix For: 4.8, 5.0



[jira] [Commented] (LUCENE-5449) Two ancient classes renamed to be less peculiar: _TestHelper and _TestUtil

2014-02-18 Thread Benson Margulies (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-5449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13903983#comment-13903983
 ] 

Benson Margulies commented on LUCENE-5449:
--

[~thetaphi], I am not enthusiastic about 1000 edits to change from importing
the class to static importing the methods. Do you see this as a requirement, or 
just a desirable practice going forward?
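
For readers unfamiliar with the two styles being weighed here, a tiny sketch follows. It uses java.util.Collections as a stand-in, because the Lucene test utility classes are not assumed to be on the classpath. The class and method names are illustrative only.

```java
import java.util.Arrays;
import java.util.Collections;            // style 1: import the class, qualify each call
import java.util.List;
import static java.util.Collections.min; // style 2: static-import the method itself

// Hypothetical example class; the Lucene test code is not reproduced here.
class ImportStylesSketch {

  /** Style 1: the call site names the class, so a class rename touches every call. */
  static int largest(List<Integer> values) {
    return Collections.max(values);
  }

  /** Style 2: the call site is bare, so a class rename only touches the import line. */
  static int smallest(List<Integer> values) {
    return min(values);
  }

  public static void main(String[] args) {
    List<Integer> values = Arrays.asList(3, 1, 2);
    System.out.println(largest(values) + " " + smallest(values));
  }
}
```

Switching an existing code base from the first style to the second means editing every call site once, which is the cost being questioned in this comment.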


 Two ancient classes renamed to be less peculiar: _TestHelper and _TestUtil
 --

 Key: LUCENE-5449
 URL: https://issues.apache.org/jira/browse/LUCENE-5449
 Project: Lucene - Core
  Issue Type: Improvement
Affects Versions: 4.6.1
Reporter: Benson Margulies
Priority: Minor


[jira] [Commented] (LUCENE-5449) Two ancient classes renamed to be less peculiar: _TestHelper and _TestUtil

2014-02-18 Thread Benson Margulies (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-5449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13904004#comment-13904004
 ] 

Benson Margulies commented on LUCENE-5449:
--

OK, then this is good to go. (I did include one example of switching to a
static import, even though I agree with [~mikemccand] in general.)

 Two ancient classes renamed to be less peculiar: _TestHelper and _TestUtil
 --

 Key: LUCENE-5449
 URL: https://issues.apache.org/jira/browse/LUCENE-5449
 Project: Lucene - Core
  Issue Type: Improvement
Affects Versions: 4.6.1
Reporter: Benson Margulies
Priority: Minor


[jira] [Created] (LUCENE-5448) Random string generation centralized in _TestUtil

2014-02-17 Thread Benson Margulies (JIRA)

Benson Margulies created LUCENE-5448:


 Summary: Random string generation centralized in _TestUtil
 Key: LUCENE-5448
 URL: https://issues.apache.org/jira/browse/LUCENE-5448
 Project: Lucene - Core
  Issue Type: Improvement
Affects Versions: 4.6.1
Reporter: Benson Margulies


The random string generators in BaseTokenStreamTestCase have wider 
applicability and should move in with their cousins.
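
As a rough idea of what a shared random string generator for tests can look like, here is a minimal sketch. The class name, method name and behaviour are assumptions for illustration; this is not the code that was moved in this issue.

```java
import java.util.Random;

// Hypothetical test helper; not the actual Lucene test utility.
class RandomStringsSketch {

  /**
   * Returns a string of at most maxLength code points. The result is
   * reproducible for a given Random seed, which keeps test failures repeatable.
   */
  static String randomUnicodeString(Random random, int maxLength) {
    int length = random.nextInt(maxLength + 1);
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < length; i++) {
      int codePoint;
      do {
        codePoint = random.nextInt(Character.MAX_CODE_POINT + 1);
        // Skip the surrogate range so the string never contains an unpaired surrogate.
      } while (codePoint >= Character.MIN_SURROGATE && codePoint <= Character.MAX_SURROGATE);
      sb.appendCodePoint(codePoint);
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    String s = randomUnicodeString(new Random(42L), 20);
    System.out.println(s.codePointCount(0, s.length()) + " code points");
  }
}
```

Keeping helpers like this in one utility class lets tokenizer tests and other tests draw from the same generators.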





[jira] [Resolved] (LUCENE-5448) Random string generation centralized in _TestUtil

2014-02-17 Thread Benson Margulies (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-5448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benson Margulies resolved LUCENE-5448.
--

   Resolution: Fixed
Fix Version/s: 5.0
 Assignee: Benson Margulies

 Random string generation centralized in _TestUtil
 -

 Key: LUCENE-5448
 URL: https://issues.apache.org/jira/browse/LUCENE-5448
 Project: Lucene - Core
  Issue Type: Improvement
Affects Versions: 4.6.1
Reporter: Benson Margulies
Assignee: Benson Margulies
 Fix For: 5.0



[jira] [Updated] (LUCENE-5448) Random string generation centralized in _TestUtil

2014-02-17 Thread Benson Margulies (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-5448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benson Margulies updated LUCENE-5448:
-

Fix Version/s: 4.7

 Random string generation centralized in _TestUtil
 -

 Key: LUCENE-5448
 URL: https://issues.apache.org/jira/browse/LUCENE-5448
 Project: Lucene - Core
  Issue Type: Improvement
Affects Versions: 4.6.1
Reporter: Benson Margulies
Assignee: Benson Margulies
 Fix For: 5.0, 4.7



[jira] [Created] (LUCENE-5449) Two ancient classes renamed to be less peculiar: _TestHelper and _TestUtil

2014-02-17 Thread Benson Margulies (JIRA)

Benson Margulies created LUCENE-5449:


 Summary: Two ancient classes renamed to be less peculiar: 
_TestHelper and _TestUtil
 Key: LUCENE-5449
 URL: https://issues.apache.org/jira/browse/LUCENE-5449
 Project: Lucene - Core
  Issue Type: Improvement
Affects Versions: 4.6.1
Reporter: Benson Margulies
Priority: Minor


_TestUtil and _TestHelper begin with _ for historical reasons that don't apply 
any longer. Let's eliminate those _'s.





[jira] [Commented] (LUCENE-4956) the korean analyzer that has a korean morphological analyzer and dictionaries

2014-02-08 Thread Benson Margulies (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-4956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13895706#comment-13895706
 ] 

Benson Margulies commented on LUCENE-4956:
--

This is a patch, not an accepted component of Apache Lucene. There's no 
guarantee that anyone will work on it.

 the korean analyzer that has a korean morphological analyzer and dictionaries
 -

 Key: LUCENE-4956
 URL: https://issues.apache.org/jira/browse/LUCENE-4956
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/analysis
Affects Versions: 4.2
Reporter: SooMyung Lee
Assignee: Christian Moen
  Labels: newbie
 Attachments: LUCENE-4956.patch, eval.patch, kr.analyzer.4x.tar, 
 lucene-4956.patch, lucene4956.patch


 The Korean language has specific characteristics. When developing a search
 service with Lucene and Solr in Korean, there are some problems in searching
 and indexing. The Korean analyzer solves these problems with a Korean
 morphological analyzer. It consists of a Korean morphological analyzer,
 dictionaries, a Korean tokenizer and a Korean filter. The Korean analyzer is
 made for Lucene and Solr. If you develop a search service with Lucene in
 Korean, the Korean analyzer is the best choice.




[jira] [Commented] (SOLR-5623) Better diagnosis of RuntimeExceptions in analysis

2014-02-05 Thread Benson Margulies (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-5623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13892049#comment-13892049
 ] 

Benson Margulies commented on SOLR-5623:


[~shalinmangar] Apparently I haven't learned to read the output of ant test 
very well, and fooled myself into believing that all was well. Thanks for
cleaning up after me.


 Better diagnosis of RuntimeExceptions in analysis
 -

 Key: SOLR-5623
 URL: https://issues.apache.org/jira/browse/SOLR-5623
 Project: Solr
  Issue Type: Bug
Affects Versions: 4.6.1
Reporter: Benson Margulies
Assignee: Benson Margulies
 Fix For: 5.0, 4.7

 Attachments: SOLR-5623-nowrap.patch, SOLR-5623-nowrap.patch


 If an analysis component (tokenizer, filter, etc) gets really into a hissy 
 fit and throws a RuntimeException, the resulting log traffic is less than 
 informative, lacking any pointer to the doc under discussion (in the doc 
 case). It would be better if there were a try/catch shortstop that logged
 this more informatively.
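
The kind of try/catch shortstop described above could look roughly like the sketch below. It logs which document and field were being analyzed and then rethrows the original exception. The class name, method signature and logger name are hypothetical, and java.util.logging stands in for whatever logging Solr actually uses.

```java
import java.util.function.Supplier;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical wrapper for illustration; not Solr's update or analysis code.
class AnalysisShortstopSketch {

  private static final Logger LOG = Logger.getLogger("analysis-sketch");

  /** Runs the analysis step, and on a RuntimeException logs the doc context before rethrowing. */
  static <T> T analyze(String docId, String fieldName, Supplier<T> analysis) {
    try {
      return analysis.get();
    } catch (RuntimeException e) {
      // The log line carries the pointer to the document that the bare stack trace lacks.
      LOG.log(Level.SEVERE, "Analysis failed for doc " + docId + ", field " + fieldName, e);
      throw e; // rethrow unchanged so callers see the original exception
    }
  }

  public static void main(String[] args) {
    System.out.println(analyze("doc-1", "body", () -> "tokens ok"));
  }
}
```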




[jira] [Commented] (SOLR-5623) Better diagnosis of RuntimeExceptions in analysis

2014-02-04 Thread Benson Margulies (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-5623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13891512#comment-13891512
 ] 

Benson Margulies commented on SOLR-5623:


trunk patch 1564584.

 Better diagnosis of RuntimeExceptions in analysis
 -

 Key: SOLR-5623
 URL: https://issues.apache.org/jira/browse/SOLR-5623
 Project: Solr
  Issue Type: Bug
Reporter: Benson Margulies
Assignee: Benson Margulies


[jira] [Commented] (LUCENE-5405) Exception strategy for analysis improved

2014-02-03 Thread Benson Margulies (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-5405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13889406#comment-13889406
 ] 

Benson Margulies commented on LUCENE-5405:
--

Will do. Thanks, this is exactly the sort of feedback I was looking for.

 Exception strategy for analysis improved
 

 Key: LUCENE-5405
 URL: https://issues.apache.org/jira/browse/LUCENE-5405
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Benson Margulies
Assignee: Benson Margulies
 Fix For: 5.0

 Attachments: LUCENE-5405-4.x.patch


 SOLR-5623 included some conversation about the dilemmas of exception 
 management and reporting in the analysis chain. 
 I've belatedly become educated about the infostream, and this situation is a 
 job for it. The DocInverterPerField can note exceptions in the analysis 
 chain, log out to the infostream, and then rethrow them as before. No 
 wrapping, no muss, no fuss.
 There are comments on this JIRA from a more complex prior idea that readers 
 might want to ignore.
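
The note, log, rethrow strategy in the description above can be sketched as follows. The MessageSink interface is a stand-in written for this example so that it compiles without Lucene on the classpath. The "DI" component label, the method names and the message text are likewise assumptions, not the committed code.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of "log to the infostream, then rethrow as before".
class InfoStreamRethrowSketch {

  /** Stand-in for an info stream: a component label plus a message. */
  interface MessageSink {
    void message(String component, String message);
  }

  /** Stand-in for running the analysis chain over one field. */
  interface AnalysisStep {
    void run() throws IOException;
  }

  static void invertField(String fieldName, AnalysisStep step, MessageSink infoStream)
      throws IOException {
    try {
      step.run();
    } catch (IOException | RuntimeException e) {
      infoStream.message("DI", "exception while analyzing field " + fieldName + ": " + e);
      throw e; // no wrapping: the caller gets the very same exception object
    }
  }

  public static void main(String[] args) throws IOException {
    List<String> log = new ArrayList<>();
    try {
      invertField("body", () -> { throw new IllegalArgumentException("bad token"); },
          (component, message) -> log.add(component + ": " + message));
    } catch (IllegalArgumentException expected) {
      System.out.println(log);
    }
  }
}
```

Because nothing is wrapped, existing callers that catch specific exception types keep working, and the extra context only shows up where the info stream is being recorded.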




[jira] [Updated] (LUCENE-5405) Exception strategy for analysis improved

2014-02-03 Thread Benson Margulies (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-5405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benson Margulies updated LUCENE-5405:
-

Fix Version/s: 4.7

 Exception strategy for analysis improved
 

 Key: LUCENE-5405
 URL: https://issues.apache.org/jira/browse/LUCENE-5405
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Benson Margulies
Assignee: Benson Margulies
 Fix For: 5.0, 4.7

 Attachments: LUCENE-5405-4.x.patch



[jira] [Commented] (LUCENE-5405) Exception strategy for analysis improved

2014-02-03 Thread Benson Margulies (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-5405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13889424#comment-13889424
 ] 

Benson Margulies commented on LUCENE-5405:
--

rev 1563850 provides the backport.

 Exception strategy for analysis improved
 

 Key: LUCENE-5405
 URL: https://issues.apache.org/jira/browse/LUCENE-5405
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Benson Margulies
Assignee: Benson Margulies
 Fix For: 5.0, 4.7

 Attachments: LUCENE-5405-4.x.patch



[jira] [Resolved] (LUCENE-5405) Exception strategy for analysis improved

2014-02-03 Thread Benson Margulies (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-5405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benson Margulies resolved LUCENE-5405.
--

Resolution: Fixed

backported, CHANGES.txt filled in. 'this time for sure'

 Exception strategy for analysis improved
 

 Key: LUCENE-5405
 URL: https://issues.apache.org/jira/browse/LUCENE-5405
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Benson Margulies
Assignee: Benson Margulies
 Fix For: 5.0, 4.7

 Attachments: LUCENE-5405-4.x.patch



[jira] [Commented] (LUCENE-5405) Exception strategy for analysis improved

2014-02-02 Thread Benson Margulies (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-5405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13888958#comment-13888958
 ] 

Benson Margulies commented on LUCENE-5405:
--

I can backport, [~mikemccand]. Is there any doc on how the project manages 
branches? If not, I can add some to the web site to help guide patch-offerers.

 Exception strategy for analysis improved
 

 Key: LUCENE-5405
 URL: https://issues.apache.org/jira/browse/LUCENE-5405
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Benson Margulies
Assignee: Benson Margulies
 Fix For: 5.0



[jira] [Commented] (LUCENE-5405) Exception strategy for analysis improved

2014-02-02 Thread Benson Margulies (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-5405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13889051#comment-13889051
 ] 

Benson Margulies commented on LUCENE-5405:
--

Somehow the unit test escaped the prior commit. 1563711 fills it in.

 Exception strategy for analysis improved
 

 Key: LUCENE-5405
 URL: https://issues.apache.org/jira/browse/LUCENE-5405
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Benson Margulies
Assignee: Benson Margulies
 Fix For: 5.0



[jira] [Commented] (LUCENE-5405) Exception strategy for analysis improved

2014-02-02 Thread Benson Margulies (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-5405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13889095#comment-13889095
 ] 

Benson Margulies commented on LUCENE-5405:
--

Well, svn merge did something I can't make heads or tails of, so I'm going to 
merge by hand.  

 Exception strategy for analysis improved
 

 Key: LUCENE-5405
 URL: https://issues.apache.org/jira/browse/LUCENE-5405
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Benson Margulies
Assignee: Benson Margulies
 Fix For: 5.0



[jira] [Updated] (LUCENE-5405) Exception strategy for analysis improved

2014-02-02 Thread Benson Margulies (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-5405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benson Margulies updated LUCENE-5405:
-

Attachment: LUCENE-5405-4.x.patch

Reviewable port.

 Exception strategy for analysis improved
 

 Key: LUCENE-5405
 URL: https://issues.apache.org/jira/browse/LUCENE-5405
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Benson Margulies
Assignee: Benson Margulies
 Fix For: 5.0

 Attachments: LUCENE-5405-4.x.patch


 SOLR-5623 included some conversation about the dilemmas of exception 
 management and reporting in the analysis chain. 
 I've belatedly become educated about the infostream, and this situation is a 
 job for it. The DocInverterPerField can note exceptions in the analysis 
 chain, log out to the infostream, and then rethrow them as before. No 
 wrapping, no muss, no fuss.
 There are comments on this JIRA from a more complex prior idea that readers 
 might want to ignore.




[jira] [Commented] (LUCENE-5405) Exception strategy for analysis improved

2014-02-02 Thread Benson Margulies (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-5405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13889101#comment-13889101
 ] 

Benson Margulies commented on LUCENE-5405:
--

[~mikemccand] and [~rcmuir]: The code in the 4.x branch is more complex. I 
_think_ I've managed to carry the strategy across, but I'd be grateful for some 
skeptical eyeballs before I commit the attached patch that does the backport.

 Exception strategy for analysis improved
 

 Key: LUCENE-5405
 URL: https://issues.apache.org/jira/browse/LUCENE-5405
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Benson Margulies
Assignee: Benson Margulies
 Fix For: 5.0

 Attachments: LUCENE-5405-4.x.patch


 SOLR-5623 included some conversation about the dilemmas of exception 
 management and reporting in the analysis chain. 
 I've belatedly become educated about the infostream, and this situation is a 
 job for it. The DocInverterPerField can note exceptions in the analysis 
 chain, log out to the infostream, and then rethrow them as before. No 
 wrapping, no muss, no fuss.
 There are comments on this JIRA from a more complex prior idea that readers 
 might want to ignore.




[jira] [Assigned] (SOLR-5623) Better diagnosis of RuntimeExceptions in analysis

2014-01-29 Thread Benson Margulies (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-5623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benson Margulies reassigned SOLR-5623:
--

Assignee: Benson Margulies

 Better diagnosis of RuntimeExceptions in analysis
 -

 Key: SOLR-5623
 URL: https://issues.apache.org/jira/browse/SOLR-5623
 Project: Solr
  Issue Type: Bug
Reporter: Benson Margulies
Assignee: Benson Margulies

 If an analysis component (tokenizer, filter, etc) gets really into a hissy 
 fit and throws a RuntimeException, the resulting log traffic is less than 
 informative, lacking any pointer to the doc under discussion (in the doc 
 case). It would be more better if there was a catch/try shortstop that logged 
 this more informatively.




[jira] [Commented] (LUCENE-5405) Exception strategy for analysis improved

2014-01-29 Thread Benson Margulies (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-5405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13886067#comment-13886067
 ] 

Benson Margulies commented on LUCENE-5405:
--

Am I good to commit here?

 Exception strategy for analysis improved
 

 Key: LUCENE-5405
 URL: https://issues.apache.org/jira/browse/LUCENE-5405
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Benson Margulies
Assignee: Benson Margulies

 SOLR-5623 included some conversation about the dilemmas of exception 
 management and reporting in the analysis chain. 
 I've belatedly become educated about the infostream, and this situation is a 
 job for it. The DocInverterPerField can note exceptions in the analysis 
 chain, log out to the infostream, and then rethrow them as before. No 
 wrapping, no muss, no fuss.
 There are comments on this JIRA from a more complex prior idea that readers 
 might want to ignore.




[jira] [Assigned] (LUCENE-5405) Exception strategy for analysis improved

2014-01-29 Thread Benson Margulies (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-5405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benson Margulies reassigned LUCENE-5405:


Assignee: Benson Margulies

 Exception strategy for analysis improved
 

 Key: LUCENE-5405
 URL: https://issues.apache.org/jira/browse/LUCENE-5405
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Benson Margulies
Assignee: Benson Margulies

 SOLR-5623 included some conversation about the dilemmas of exception 
 management and reporting in the analysis chain. 
 I've belatedly become educated about the infostream, and this situation is a 
 job for it. The DocInverterPerField can note exceptions in the analysis 
 chain, log out to the infostream, and then rethrow them as before. No 
 wrapping, no muss, no fuss.
 There are comments on this JIRA from a more complex prior idea that readers 
 might want to ignore.




[jira] [Commented] (SOLR-5623) Better diagnosis of RuntimeExceptions in analysis

2014-01-29 Thread Benson Margulies (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-5623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13886068#comment-13886068
 ] 

Benson Margulies commented on SOLR-5623:


[~hossman_luc...@fucit.org] have you looked at my revs?

 Better diagnosis of RuntimeExceptions in analysis
 -

 Key: SOLR-5623
 URL: https://issues.apache.org/jira/browse/SOLR-5623
 Project: Solr
  Issue Type: Bug
Reporter: Benson Margulies
Assignee: Benson Margulies

 If an analysis component (tokenizer, filter, etc) gets really into a hissy 
 fit and throws a RuntimeException, the resulting log traffic is less than 
 informative, lacking any pointer to the doc under discussion (in the doc 
 case). It would be more better if there was a catch/try shortstop that logged 
 this more informatively.




[jira] [Moved] (SOLR-5677) HaversineConstFunction ignores one of its two values, is this on purpose?

2014-01-29 Thread Benson Margulies (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-5677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benson Margulies moved LUCENE-4036 to SOLR-5677:


      Component/s: Schema and Analysis  (was: core/other)
    Lucene Fields:   (was: New)
Affects Version/s: 4.0-ALPHA  (was: 4.0-ALPHA)
              Key: SOLR-5677  (was: LUCENE-4036)
          Project: Solr  (was: Lucene - Core)

 HaversineConstFunction ignores one of its two values, is this on purpose?
 -

 Key: SOLR-5677
 URL: https://issues.apache.org/jira/browse/SOLR-5677
 Project: Solr
  Issue Type: Bug
  Components: Schema and Analysis
Affects Versions: 4.0-ALPHA
Reporter: Benson Margulies

 org.apache.solr.search.function.distance.HaversineConstFunction.parser.new 
 ValueSourceParser() {...}.parse(FunctionQParser)
 has an unused variable warning for 'vs2', and uses vs1 to initialize mv2. 
 Maybe vs2 should just be deleted?




[jira] [Resolved] (SOLR-5677) HaversineConstFunction ignores one of its two values, is this on purpose?

2014-01-29 Thread Benson Margulies (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-5677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benson Margulies resolved SOLR-5677.


   Resolution: Fixed
Fix Version/s: 5.0
 Assignee: Benson Margulies

Well, the trunk code no longer has this problem.

 HaversineConstFunction ignores one of its two values, is this on purpose?
 -

 Key: SOLR-5677
 URL: https://issues.apache.org/jira/browse/SOLR-5677
 Project: Solr
  Issue Type: Bug
  Components: Schema and Analysis
Affects Versions: 4.0-ALPHA
Reporter: Benson Margulies
Assignee: Benson Margulies
 Fix For: 5.0


 org.apache.solr.search.function.distance.HaversineConstFunction.parser.new 
 ValueSourceParser() {...}.parse(FunctionQParser)
 has an unused variable warning for 'vs2', and uses vs1 to initialize mv2. 
 Maybe vs2 should just be deleted?




[jira] [Resolved] (LUCENE-5405) Exception strategy for analysis improved

2014-01-29 Thread Benson Margulies (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-5405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benson Margulies resolved LUCENE-5405.
--

   Resolution: Fixed
Fix Version/s: 5.0

Fixed in rev 1562657.

 Exception strategy for analysis improved
 

 Key: LUCENE-5405
 URL: https://issues.apache.org/jira/browse/LUCENE-5405
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Benson Margulies
Assignee: Benson Margulies
 Fix For: 5.0


 SOLR-5623 included some conversation about the dilemmas of exception 
 management and reporting in the analysis chain. 
 I've belatedly become educated about the infostream, and this situation is a 
 job for it. The DocInverterPerField can note exceptions in the analysis 
 chain, log out to the infostream, and then rethrow them as before. No 
 wrapping, no muss, no fuss.
 There are comments on this JIRA from a more complex prior idea that readers 
 might want to ignore.




[jira] [Commented] (SOLR-5623) Better diagnosis of RuntimeExceptions in analysis

2014-01-18 Thread Benson Margulies (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-5623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13875702#comment-13875702
 ] 

Benson Margulies commented on SOLR-5623:


OK, pushed changes as per remarks.

 Better diagnosis of RuntimeExceptions in analysis
 -

 Key: SOLR-5623
 URL: https://issues.apache.org/jira/browse/SOLR-5623
 Project: Solr
  Issue Type: Bug
Reporter: Benson Margulies

 If an analysis component (tokenizer, filter, etc) gets really into a hissy 
 fit and throws a RuntimeException, the resulting log traffic is less than 
 informative, lacking any pointer to the doc under discussion (in the doc 
 case). It would be more better if there was a catch/try shortstop that logged 
 this more informatively.




[jira] [Created] (LUCENE-5405) Exception strategy for analysis improved

2014-01-18 Thread Benson Margulies (JIRA)

Benson Margulies created LUCENE-5405:


 Summary: Exception strategy for analysis improved
 Key: LUCENE-5405
 URL: https://issues.apache.org/jira/browse/LUCENE-5405
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Benson Margulies


SOLR-5623 included some conversation about the dilemmas of exception management 
and reporting in the analysis chain. Here is a 5.0 proposal:

TokenStream.reset and TokenStream.incrementToken (and perhaps the rest) should 
have a new, checked, exception in their signatures: call it AnalysisError if 
you like. Unlike IOException, it will have a full set of constructors, 
including the constructors that can wrap a 'cause'. Its constructors will 
accept a field name.

TokenStream will have a fieldName field, accepted in a constructor argument. 
(OK, this might be a bit authoritarian.)

TokenStream will have:

  protected void throwAnalysisException(String explanation, Throwable cause) {
    throw new AnalysisException(fieldName, explanation, cause);
  }

Implementors of analysis will be thus encouraged to write things like:

  try {
    doSomething();
  } catch (IOExceptionOrWhatever e) {
    throwAnalysisException("Some Explanation", e);
  }

Then, situations like Solr can diagnose the field name.

Note that no information is lost here, due to the use of exception wrapping.
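
A minimal, self-contained sketch of the wrapping described above may make the claim about "no information lost" easier to check. It assumes the checked exception is called AnalysisException, as in the snippet (the description also floats the name AnalysisError). None of these classes exist in Lucene: SketchTokenStream and FailingStream are hypothetical stand-ins for TokenStream and an analysis component.

```java
import java.io.IOException;

// Hypothetical checked exception from the proposal; not a real Lucene class.
class AnalysisException extends Exception {
  private final String fieldName;

  AnalysisException(String fieldName, String explanation, Throwable cause) {
    super("field '" + fieldName + "': " + explanation, cause);
    this.fieldName = fieldName;
  }

  String getFieldName() {
    return fieldName;
  }
}

// Stand-in for TokenStream, reduced to what the proposal needs.
abstract class SketchTokenStream {
  private final String fieldName;

  SketchTokenStream(String fieldName) {
    this.fieldName = fieldName;
  }

  // Wraps the cause together with the field name and throws.
  protected void throwAnalysisException(String explanation, Throwable cause)
      throws AnalysisException {
    throw new AnalysisException(fieldName, explanation, cause);
  }

  abstract boolean incrementToken() throws AnalysisException;
}

// A component whose underlying work always fails, to exercise the wrapping.
class FailingStream extends SketchTokenStream {
  FailingStream(String fieldName) {
    super(fieldName);
  }

  @Override
  boolean incrementToken() throws AnalysisException {
    try {
      throw new IOException("disk on fire");
    } catch (IOException e) {
      throwAnalysisException("Some Explanation", e);
    }
    return false;
  }
}

class AnalysisExceptionDemo {
  public static void main(String[] args) {
    try {
      new FailingStream("body").incrementToken();
    } catch (AnalysisException e) {
      // The caller can report the field, and the original cause is still attached.
      System.out.println(e.getFieldName() + " / " + e.getCause().getMessage());
    }
  }
}
```

The caller sees both the field name and, through getCause(), the original exception with its own stack trace.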





[jira] [Commented] (LUCENE-5405) Exception strategy for analysis improved

2014-01-18 Thread Benson Margulies (JIRA)

[
https://issues.apache.org/jira/browse/LUCENE-5405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13875710#comment-13875710
]

Benson Margulies commented on LUCENE-5405:
--

I've been frustrated for years by the coincidence that IOException lacks
constructors for 'cause', and the Lucene API is full of 'throws IOException'.
However, I only just now noticed that Java fixed this in 1.6.

So, a weaker form of this would be a subclass of IOException that can carry a
field name, and a place for TokenStream to hide a field name. Then something
like the Solr error handler could instanceof to see if there's a field name to
be had.

Given the other API changes to token stream component construction for 5.0, one
might argue that adding a ctor arg isn't so bad.
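
As a rough sketch of this weaker form, assuming hypothetical names throughout (FieldAwareIOException and FieldNameReporter are not Lucene or Solr classes), an IOException subclass can carry the field name and an error handler can probe for it with instanceof. It relies on the IOException(String, Throwable) constructor added in Java 6.

```java
import java.io.IOException;

// Hypothetical subclass; relies on IOException(String, Throwable), available since Java 6.
class FieldAwareIOException extends IOException {
  private final String fieldName;

  FieldAwareIOException(String fieldName, String message, Throwable cause) {
    super(message, cause);
    this.fieldName = fieldName;
  }

  String getFieldName() {
    return fieldName;
  }
}

class FieldNameReporter {
  // What an error handler (such as Solr's) might do: probe for a field name.
  static String describe(Throwable t) {
    if (t instanceof FieldAwareIOException) {
      return "analysis failed in field '" + ((FieldAwareIOException) t).getFieldName()
          + "': " + t.getMessage();
    }
    return "analysis failed: " + t.getMessage();
  }

  public static void main(String[] args) {
    IOException plain = new IOException("boom");
    IOException tagged = new FieldAwareIOException("title", "boom", plain);
    System.out.println(describe(plain));
    System.out.println(describe(tagged));
  }
}
```

Because the subclass is still an IOException, existing 'throws IOException' signatures would not need to change under this variant.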

Exception strategy for analysis improved

Key: LUCENE-5405
URL: https://issues.apache.org/jira/browse/LUCENE-5405
Project: Lucene - Core
Issue Type: Improvement
Reporter: Benson Margulies

SOLR-5623 included some conversation about the dilemmas of exception
management and reporting in the analysis chain. Here is a 5.0 proposal:
TokenStream.reset and TokenStream.incrementToken (and perhaps the rest)
should have a new, checked, exception in their signatures: call it
AnalysisError if you like. Unlike IOException, it will have a full set of
constructors, including the constructors that can wrap a 'cause'. Its
constructors will accept a field name.
TokenStream will have a fieldName field, accepted in a constructor argument.
(OK, this might be a bit authoritarian.)
TokenStream will have:
  protected void throwAnalysisException(String explanation, Throwable cause) {
    throw new AnalysisException(fieldName, explanation, cause);
  }
Implementors of analysis will be thus encouraged to write things like:
  try {
    doSomething();
  } catch (IOExceptionOrWhatever e) {
    throwAnalysisException("Some Explanation", e);
  }
Then, situations like Solr can diagnose the field name.
Note that no information is lost here, due to the use of exception wrapping.


[jira] [Commented] (LUCENE-5405) Exception strategy for analysis improved

2014-01-18 Thread Benson Margulies (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-5405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13875711#comment-13875711
 ] 

Benson Margulies commented on LUCENE-5405:
--

Hmm, well, backing up. In the other discussion, you and others seemed very 
unhappy with schemes of the form:

  throw new SomeException("Some local explanation", someExceptionObject);

Based on your most recent remark, I don't see any other way to get around this; 
my idea about storing field names is stupid, since the chains are reusable. So, 
either this sort of wrapping is tolerable or not. If tolerable, I'll rewrite 
this JIRA, else I'll close it.


 Exception strategy for analysis improved
 

 Key: LUCENE-5405
 URL: https://issues.apache.org/jira/browse/LUCENE-5405
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Benson Margulies

 SOLR-5623 included some conversation about the dilemmas of exception 
 management and reporting in the analysis chain. Here is a 5.0 proposal:
 TokenStream.reset and TokenStream.incrementToken (and perhaps the rest) 
 should have a new, checked, exception in their signatures: call it 
 AnalysisError if you like. Unlike IOException, it will have a full set of 
 constructors, including the constructors that can wrap a 'cause'. Its 
 constructors will accept a field name.
 TokenStream will have a fieldName field, accepted in a constructor argument. 
 (OK, this might be a bit authoritarian.)
 TokenStream will have:
   protected void throwAnalysisException(String explanation, Throwable cause) {
     throw new AnalysisException(fieldName, explanation, cause);
   }
 Implementors of analysis will be thus encouraged to write things like:
   try {
     doSomething();
   } catch (IOExceptionOrWhatever e) {
     throwAnalysisException("Some Explanation", e);
   }
 Then, situations like Solr can diagnose the field name.
 Note that no information is lost here, due to the use of exception wrapping.




[jira] [Commented] (LUCENE-5405) Exception strategy for analysis improved

2014-01-18 Thread Benson Margulies (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-5405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13875713#comment-13875713
 ] 

Benson Margulies commented on LUCENE-5405:
--

Yes, we're now in the same place. Does a catch/throw in DocInverterPerField 
that does something like

   throw new LuceneAnalysisException("Error analyzing text", fieldName, originalException);

make life better? I think it makes life better, as I don't see much evil in 
exception wrapping like this.

 Exception strategy for analysis improved
 

 Key: LUCENE-5405
 URL: https://issues.apache.org/jira/browse/LUCENE-5405
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Benson Margulies

 SOLR-5623 included some conversation about the dilemmas of exception 
 management and reporting in the analysis chain. Here is a 5.0 proposal:
 TokenStream.reset and TokenStream.incrementToken (and perhaps the rest) 
 should have a new, checked, exception in their signatures: call it 
 AnalysisError if you like. Unlike IOException, it will have a full set of 
 constructors, including the constructors that can wrap a 'cause'. Its 
 constructors will accept a field name.
 TokenStream will have a fieldName field, accepted in a constructor argument. 
 (OK, this might be a bit authoritarian.)
 TokenStream will have:
   protected void throwAnalysisException(String explanation, Throwable cause) {
     throw new AnalysisException(fieldName, explanation, cause);
   }
 Implementors of analysis will be thus encouraged to write things like:
   try {
     doSomething();
   } catch (IOExceptionOrWhatever e) {
     throwAnalysisException("Some Explanation", e);
   }
 Then, situations like Solr can diagnose the field name.
 Note that no information is lost here, due to the use of exception wrapping.




[jira] [Updated] (LUCENE-5405) Exception strategy for analysis improved

2014-01-18 Thread Benson Margulies (JIRA)

[
https://issues.apache.org/jira/browse/LUCENE-5405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Benson Margulies updated LUCENE-5405:
-

Description:
SOLR-5623 included some conversation about the dilemmas of exception management
and reporting in the analysis chain.

I've belatedly become educated about the infostream, and this situation is a
job for it. The DocInverterPerField can note exceptions in the analysis chain,
log out to the infostream, and then rethrow them as before. No wrapping, no
muss, no fuss.

There are comments on this JIRA from a more complex prior idea that readers
might want to ignore.

was:
SOLR-5623 included some conversation about the dilemmas of exception management
and reporting in the analysis chain. Here is a 5.0 proposal:

TokenStream.reset and TokenStream.incrementToken (and perhaps the rest) should
have a new, checked, exception in their signatures: call it AnalysisError if
you like. Unlike IOException, it will have a full set of constructors,
including the constructors that can wrap a 'cause'. Its constructors will
accept a field name.

TokenStream will have a fieldName field, accepted in a constructor argument.
(OK, this might be a bit authoritarian.)

TokenStream will have:

  protected void throwAnalysisException(String explanation, Throwable cause) {
    throw new AnalysisException(fieldName, explanation, cause);
  }

Implementors of analysis will be thus encouraged to write things like:

  try {
    doSomething();
  } catch (IOExceptionOrWhatever e) {
    throwAnalysisException("Some Explanation", e);
  }

Then, situations like Solr can diagnose the field name.

Note that no information is lost here, due to the use of exception wrapping.

Exception strategy for analysis improved

Key: LUCENE-5405
URL: https://issues.apache.org/jira/browse/LUCENE-5405
Project: Lucene - Core
Issue Type: Improvement
Reporter: Benson Margulies

SOLR-5623 included some conversation about the dilemmas of exception
management and reporting in the analysis chain.
I've belatedly become educated about the infostream, and this situation is a
job for it. The DocInverterPerField can note exceptions in the analysis
chain, log out to the infostream, and then rethrow them as before. No
wrapping, no muss, no fuss.
There are comments on this JIRA from a more complex prior idea that readers
might want to ignore.


[jira] [Commented] (LUCENE-5405) Exception strategy for analysis improved

2014-01-18 Thread Benson Margulies (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-5405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13875724#comment-13875724
 ] 

Benson Margulies commented on LUCENE-5405:
--

https://github.com/apache/lucene-solr/pull/21 is a seemingly simple idea for 
how to code this.

I'm off to write the test. In the meantime, I offer the PR just to show a 
concrete idea.
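
For readers without the PR at hand, the general shape of "note the exception on the infostream, then rethrow it as before" can be sketched as follows. This is not the code from the pull request. SketchInfoStream is a stand-in written for this example (loosely modeled on Lucene's InfoStream), FieldInverterSketch is a hypothetical substitute for DocInverterPerField, and the "DI" component label is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in for Lucene's InfoStream, reduced to two methods for this sketch.
interface SketchInfoStream {
  boolean isEnabled(String component);

  void message(String component, String message);
}

// Collects messages in memory so the behavior can be inspected.
class ListInfoStream implements SketchInfoStream {
  final List<String> messages = new ArrayList<>();

  @Override
  public boolean isEnabled(String component) {
    return true;
  }

  @Override
  public void message(String component, String message) {
    messages.add(component + ": " + message);
  }
}

// Hypothetical per-field step: run the analysis, note any failure, rethrow unchanged.
class FieldInverterSketch {
  private final SketchInfoStream infoStream;

  FieldInverterSketch(SketchInfoStream infoStream) {
    this.infoStream = infoStream;
  }

  void invert(String fieldName, Runnable analysis) {
    try {
      analysis.run();
    } catch (RuntimeException e) {
      if (infoStream.isEnabled("DI")) { // "DI" is an assumed component label
        infoStream.message("DI", "exception analyzing field '" + fieldName + "': " + e);
      }
      throw e; // same exception object, no wrapping
    }
  }
}

class InfoStreamDemo {
  public static void main(String[] args) {
    ListInfoStream log = new ListInfoStream();
    FieldInverterSketch inverter = new FieldInverterSketch(log);
    try {
      inverter.invert("body", () -> { throw new IllegalStateException("bad token"); });
    } catch (IllegalStateException e) {
      System.out.println("caller still sees: " + e.getMessage());
    }
    System.out.println(log.messages);
  }
}
```

The point of the design is that callers keep catching exactly the exception types they caught before, while the field name lands in the diagnostic stream.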

 Exception strategy for analysis improved
 

 Key: LUCENE-5405
 URL: https://issues.apache.org/jira/browse/LUCENE-5405
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Benson Margulies

 SOLR-5623 included some conversation about the dilemmas of exception 
 management and reporting in the analysis chain. 
 I've belatedly become educated about the infostream, and this situation is a 
 job for it. The DocInverterPerField can note exceptions in the analysis 
 chain, log out to the infostream, and then rethrow them as before. No 
 wrapping, no muss, no fuss.
 There are comments on this JIRA from a more complex prior idea that readers 
 might want to ignore.




[jira] [Commented] (LUCENE-5405) Exception strategy for analysis improved

2014-01-18 Thread Benson Margulies (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-5405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13875726#comment-13875726
 ] 

Benson Margulies commented on LUCENE-5405:
--

Test added. It passes. I'm sure I've missed something here.


 Exception strategy for analysis improved
 

 Key: LUCENE-5405
 URL: https://issues.apache.org/jira/browse/LUCENE-5405
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Benson Margulies

 SOLR-5623 included some conversation about the dilemmas of exception 
 management and reporting in the analysis chain. 
 I've belatedly become educated about the infostream, and this situation is a 
 job for it. The DocInverterPerField can note exceptions in the analysis 
 chain, log out to the infostream, and then rethrow them as before. No 
 wrapping, no muss, no fuss.
 There are comments on this JIRA from a more complex prior idea that readers 
 might want to ignore.




[jira] [Comment Edited] (SOLR-5623) Better diagnosis of RuntimeExceptions in analysis

2014-01-11 Thread Benson Margulies (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-5623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13868400#comment-13868400
 ] 

Benson Margulies edited comment on SOLR-5623 at 1/11/14 7:59 PM:
-

[~hossman_luc...@fucit.org] I think the patch request is now good to go, again 
sticking with a Solr change and considering a Lucene change later on.


was (Author: bmargulies):
I think the patch request is now good to go, again sticking with a Solr change 
and considering a Lucene change later on.

 Better diagnosis of RuntimeExceptions in analysis
 -

 Key: SOLR-5623
 URL: https://issues.apache.org/jira/browse/SOLR-5623
 Project: Solr
  Issue Type: Bug
Reporter: Benson Margulies

 If an analysis component (tokenizer, filter, etc) gets really into a hissy 
 fit and throws a RuntimeException, the resulting log traffic is less than 
 informative, lacking any pointer to the doc under discussion (in the doc 
 case). It would be more better if there was a catch/try shortstop that logged 
 this more informatively.




[jira] [Created] (LUCENE-5392) Documentation for modified token / analysis pipeline

2014-01-10 Thread Benson Margulies (JIRA)

Benson Margulies created LUCENE-5392:


 Summary: Documentation for modified token / analysis pipeline
 Key: LUCENE-5392
 URL: https://issues.apache.org/jira/browse/LUCENE-5392
 Project: Lucene - Core
  Issue Type: Bug
Affects Versions: 5.0
Reporter: Benson Margulies


The changes to the tokenizer and analyzer need to be reflected in the package 
overview for core analysis.




[jira] [Commented] (LUCENE-5392) Documentation for modified token / analysis pipeline

2014-01-10 Thread Benson Margulies (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-5392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13867801#comment-13867801
 ] 

Benson Margulies commented on LUCENE-5392:
--

https://github.com/apache/lucene-solr/pull/17

[~rcmuir] and [~thetaphi] 'Look out below', here's a bunch of work on the 
analysis doc.


 Documentation for modified token / analysis pipeline
 

 Key: LUCENE-5392
 URL: https://issues.apache.org/jira/browse/LUCENE-5392
 Project: Lucene - Core
  Issue Type: Bug
Affects Versions: 5.0
Reporter: Benson Margulies

 The changes to the tokenizer and analyzer need to be reflected in the package 
 overview for core analysis.




[jira] [Created] (SOLR-5623) Better diagnosis of RuntimeExceptions in analysis

2014-01-10 Thread Benson Margulies (JIRA)

Benson Margulies created SOLR-5623:
--

 Summary: Better diagnosis of RuntimeExceptions in analysis
 Key: SOLR-5623
 URL: https://issues.apache.org/jira/browse/SOLR-5623
 Project: Solr
  Issue Type: Bug
Reporter: Benson Margulies


If an analysis component (tokenizer, filter, etc) gets really into a hissy fit 
and throws a RuntimeException, the resulting log traffic is less than 
informative, lacking any pointer to the doc under discussion (in the doc case). 
It would be more better if there was a catch/try shortstop that logged this 
more informatively.





[jira] [Commented] (SOLR-5623) Better diagnosis of RuntimeExceptions in analysis

2014-01-10 Thread Benson Margulies (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-5623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13868265#comment-13868265
 ] 

Benson Margulies commented on SOLR-5623:


[~hossman_luc...@fucit.org] https://github.com/apache/lucene-solr/pull/18 shows 
the failing test case.

How do I make a test that asserts facts about logging? I can certainly use this 
to make some improvements to the logging, but I don't know how to automate 
proving that I did it.
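
One common way to assert on log output, sketched here with java.util.logging only because it ships with the JDK, is to attach a handler that records what it receives and then inspect the records. Solr's actual logging setup and test utilities may differ, so treat this as an illustration of the technique: CapturingHandler, LogAssertionDemo, indexDocument, and the logger name are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.LogRecord;
import java.util.logging.Logger;

// Test helper: a Handler that remembers every record it is given.
class CapturingHandler extends Handler {
  final List<LogRecord> records = new ArrayList<>();

  @Override
  public void publish(LogRecord record) {
    records.add(record);
  }

  @Override
  public void flush() {}

  @Override
  public void close() {}
}

class LogAssertionDemo {
  // Hypothetical code under test: logs which document failed analysis.
  static void indexDocument(Logger log, String docId) {
    log.log(Level.SEVERE, "Exception analyzing document id=" + docId);
  }

  // Runs the code under test with a capturing handler attached and returns what was logged.
  static List<LogRecord> capture(String docId) {
    Logger log = Logger.getLogger("sketch.update");
    CapturingHandler handler = new CapturingHandler();
    log.addHandler(handler);
    try {
      indexDocument(log, docId);
    } finally {
      log.removeHandler(handler); // leave the logger as we found it
    }
    return handler.records;
  }

  public static void main(String[] args) {
    for (LogRecord r : capture("doc-42")) {
      System.out.println(r.getLevel() + " " + r.getMessage());
    }
  }
}
```

A test can then assert on the captured records, for example that exactly one SEVERE record was written and that its message contains the document id.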


 Better diagnosis of RuntimeExceptions in analysis
 -

 Key: SOLR-5623
 URL: https://issues.apache.org/jira/browse/SOLR-5623
 Project: Solr
  Issue Type: Bug
Reporter: Benson Margulies

 If an analysis component (tokenizer, filter, etc) gets really into a hissy 
 fit and throws a RuntimeException, the resulting log traffic is less than 
 informative, lacking any pointer to the doc under discussion (in the doc 
 case). It would be more better if there was a catch/try shortstop that logged 
 this more informatively.




[jira] [Commented] (SOLR-5623) Better diagnosis of RuntimeExceptions in analysis

2014-01-10 Thread Benson Margulies (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-5623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13868325#comment-13868325
 ] 

Benson Margulies commented on SOLR-5623:


OK. Does it make sense to adopt the idea that 'if there is an ID field with a 
value, include that in the exception'?



 Better diagnosis of RuntimeExceptions in analysis
 -

 Key: SOLR-5623
 URL: https://issues.apache.org/jira/browse/SOLR-5623
 Project: Solr
  Issue Type: Bug
Reporter: Benson Margulies

 If an analysis component (tokenizer, filter, etc) gets really into a hissy 
 fit and throws a RuntimeException, the resulting log traffic is less than 
 informative, lacking any pointer to the doc under discussion (in the doc 
 case). It would be more better if there was a catch/try shortstop that logged 
 this more informatively.




[jira] [Commented] (SOLR-5623) Better diagnosis of RuntimeExceptions in analysis

2014-01-10 Thread Benson Margulies (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-5623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13868347#comment-13868347
 ] 

Benson Margulies commented on SOLR-5623:


[~rcmuir] The identity of the field we are processing is known down in Lucene 
core. What do you think about wrapping generic Throwables in 
org.apache.lucene.index.DocInverterPerField.processFields in some specific 
runtime exception that carries the field name?

Then I can in turn make it into a Solr exception in DirectUpdateHandler2.
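
To make the two-step idea concrete, here is a small sketch under stated assumptions: FieldAnalysisException stands in for the "specific runtime exception that carries the field name", UpdateFailedException stands in for a Solr exception, and processField / addDocument are hypothetical substitutes for DocInverterPerField.processFields and the DirectUpdateHandler2 code path. For brevity the sketch catches RuntimeException rather than every Throwable.

```java
// Hypothetical unchecked exception thrown by the indexing core; carries the field name.
class FieldAnalysisException extends RuntimeException {
  private final String fieldName;

  FieldAnalysisException(String fieldName, Throwable cause) {
    super("Exception analyzing field '" + fieldName + "'", cause);
    this.fieldName = fieldName;
  }

  String getFieldName() {
    return fieldName;
  }
}

// Hypothetical application-level exception, standing in for a Solr exception.
class UpdateFailedException extends RuntimeException {
  UpdateFailedException(String message, Throwable cause) {
    super(message, cause);
  }
}

class TwoLayerSketch {
  // Core layer: knows the field, wraps whatever the analysis chain throws.
  static void processField(String fieldName, Runnable analysis) {
    try {
      analysis.run();
    } catch (RuntimeException e) {
      throw new FieldAnalysisException(fieldName, e);
    }
  }

  // Outer layer: knows the document, translates for its own error reporting.
  static void addDocument(String docId, String fieldName, Runnable analysis) {
    try {
      processField(fieldName, analysis);
    } catch (FieldAnalysisException e) {
      throw new UpdateFailedException(
          "Error adding doc '" + docId + "', field '" + e.getFieldName() + "'", e);
    }
  }

  public static void main(String[] args) {
    try {
      addDocument("doc-1", "body", () -> { throw new IllegalArgumentException("bad input"); });
    } catch (UpdateFailedException e) {
      System.out.println(e.getMessage());
      System.out.println("root cause: " + e.getCause().getCause());
    }
  }
}
```

Each layer adds only what it knows (the field in the core, the document in the update handler), and the original exception stays reachable through the cause chain.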



 Better diagnosis of RuntimeExceptions in analysis
 -

 Key: SOLR-5623
 URL: https://issues.apache.org/jira/browse/SOLR-5623
 Project: Solr
  Issue Type: Bug
Reporter: Benson Margulies

 If an analysis component (tokenizer, filter, etc.) gets really into a hissy 
 fit and throws a RuntimeException, the resulting log traffic is less than 
 informative, lacking any pointer to the doc under discussion (in the doc 
 case). It would be better if there were a try/catch shortstop that logged 
 this more informatively.




[jira] [Comment Edited] (SOLR-5623) Better diagnosis of RuntimeExceptions in analysis

2014-01-10 Thread Benson Margulies (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-5623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13868347#comment-13868347
 ] 

Benson Margulies edited comment on SOLR-5623 at 1/10/14 9:46 PM:
-

[~rcmuir] The identity of the field we are processing is known down in Lucene 
core. What do you think about wrapping Throwables in 
org.apache.lucene.index.DocInverterPerField.processFields in some specific 
runtime exception that carries the field name?

Then I can in turn make it into a Solr exception in DirectUpdateHandler2.




was (Author: bmargulies):
[~rcmuir] The identity of the field we are processing is known down in Lucene 
core. What do you think about wrapping generic Throwables in 
org.apache.lucene.index.DocInverterPerField.processFields in some specific 
runtime exception that carries the field name?

Then I can in turn make it into a Solr exception in DirectUpdateHandler2.



 Better diagnosis of RuntimeExceptions in analysis
 -

 Key: SOLR-5623
 URL: https://issues.apache.org/jira/browse/SOLR-5623
 Project: Solr
  Issue Type: Bug
Reporter: Benson Margulies

 If an analysis component (tokenizer, filter, etc.) gets really into a hissy 
 fit and throws a RuntimeException, the resulting log traffic is less than 
 informative, lacking any pointer to the doc under discussion (in the doc 
 case). It would be better if there were a try/catch shortstop that logged 
 this more informatively.




[jira] [Commented] (SOLR-5623) Better diagnosis of RuntimeExceptions in analysis

2014-01-10 Thread Benson Margulies (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-5623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13868359#comment-13868359
 ] 

Benson Margulies commented on SOLR-5623:


OK, we can log and return the ID and not the field name, and that's already an 
improvement, so I'll stick with that.
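
A rough sketch of what "log and return the ID" could look like, in plain Java. 
DocIdDecorator, describe, and addDocument are hypothetical names, and the 
document is modelled as a simple Map rather than any real Solr type; this is 
an illustration under those assumptions, not the attached patch.

```java
import java.util.Map;

/** Hypothetical helper that attaches a document ID to failures during an add. */
class DocIdDecorator {
    /** Builds a message that names the document when an ID value is available. */
    static String describe(Map<String, Object> doc, String idField) {
        Object id = doc.get(idField);
        return id == null
            ? "Exception writing document (no " + idField + " value available)"
            : "Exception writing document " + idField + "=" + id;
    }

    /** Runs the add and rethrows any RuntimeException with the document ID attached. */
    static void addDocument(Map<String, Object> doc, String idField, Runnable indexer) {
        try {
            indexer.run();
        } catch (RuntimeException e) {
            // The original exception stays available as the cause.
            throw new RuntimeException(describe(doc, idField), e);
        }
    }
}
```

The field name is not reported here, matching the decision above to settle 
for the ID at the Solr level.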

 Better diagnosis of RuntimeExceptions in analysis
 -

 Key: SOLR-5623
 URL: https://issues.apache.org/jira/browse/SOLR-5623
 Project: Solr
  Issue Type: Bug
Reporter: Benson Margulies

 If an analysis component (tokenizer, filter, etc.) gets really into a hissy 
 fit and throws a RuntimeException, the resulting log traffic is less than 
 informative, lacking any pointer to the doc under discussion (in the doc 
 case). It would be better if there were a try/catch shortstop that logged 
 this more informatively.




[jira] [Commented] (SOLR-5623) Better diagnosis of RuntimeExceptions in analysis

2014-01-10 Thread Benson Margulies (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-5623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13868368#comment-13868368
 ] 

Benson Margulies commented on SOLR-5623:


Here's another ouch.

Doc for Document#get says:
{code}
/** Returns the string value of the field with the given name if any exist in
   * this document, or null.  If multiple fields exist with this name, this
   * method returns the first value added. If only binary fields with this name
   * exist, returns null.
   * For {@link IntField}, {@link LongField}, {@link
   * FloatField} and {@link DoubleField} it returns the string value of the
   * number. If you want the actual numeric field instance back, use
   * {@link #getField}.
   */
{code}

But given a Solr field like the following, with a value of '1', I end up with 
\u0080\u\u0001. It doesn't appear to be an IntField, just a Field. What am 
I missing?

{code}
   <field name="id" type="sint" indexed="true" stored="true" multiValued="false"/>
{code}





 Better diagnosis of RuntimeExceptions in analysis
 -

 Key: SOLR-5623
 URL: https://issues.apache.org/jira/browse/SOLR-5623
 Project: Solr
  Issue Type: Bug
Reporter: Benson Margulies

 If an analysis component (tokenizer, filter, etc.) gets really into a hissy 
 fit and throws a RuntimeException, the resulting log traffic is less than 
 informative, lacking any pointer to the doc under discussion (in the doc 
 case). It would be better if there were a try/catch shortstop that logged 
 this more informatively.




[jira] [Commented] (SOLR-5623) Better diagnosis of RuntimeExceptions in analysis

2014-01-10 Thread Benson Margulies (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-5623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13868374#comment-13868374
 ] 

Benson Margulies commented on SOLR-5623:


Back to the exception decoration problem:

There's a general design puzzle here: an outer function knows something that an 
inner function does not, and the catcher of the exception wants to know both. I 
share your distaste for the obvious Java solution of endless exception 
wrapping. Another option is to log, but does the Lucene code log things? 

I'm working against trunk because I don't know any better. I'm inclined to stay 
out at the Solr level for now, and maybe make another patch for this idea in 
the core later.


 Better diagnosis of RuntimeExceptions in analysis
 -

 Key: SOLR-5623
 URL: https://issues.apache.org/jira/browse/SOLR-5623
 Project: Solr
  Issue Type: Bug
Reporter: Benson Margulies

 If an analysis component (tokenizer, filter, etc.) gets really into a hissy 
 fit and throws a RuntimeException, the resulting log traffic is less than 
 informative, lacking any pointer to the doc under discussion (in the doc 
 case). It would be better if there were a try/catch shortstop that logged 
 this more informatively.




[jira] [Issue Comment Deleted] (SOLR-5623) Better diagnosis of RuntimeExceptions in analysis

2014-01-10 Thread Benson Margulies (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-5623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benson Margulies updated SOLR-5623:
---

Comment: was deleted

(was: Here's another ouch.

Doc for Document#get says:
{code}
/** Returns the string value of the field with the given name if any exist in
   * this document, or null.  If multiple fields exist with this name, this
   * method returns the first value added. If only binary fields with this name
   * exist, returns null.
   * For {@link IntField}, {@link LongField}, {@link
   * FloatField} and {@link DoubleField} it returns the string value of the
   * number. If you want the actual numeric field instance back, use
   * {@link #getField}.
   */
{code}

But given a Solr field like the following, with a value of '1', I end up with 
\u0080\u\u0001. It doesn't appear to be an IntField, just a Field. What am 
I missing?

{code}
   <field name="id" type="sint" indexed="true" stored="true" multiValued="false"/>
{code}



)

 Better diagnosis of RuntimeExceptions in analysis
 -

 Key: SOLR-5623
 URL: https://issues.apache.org/jira/browse/SOLR-5623
 Project: Solr
  Issue Type: Bug
Reporter: Benson Margulies

 If an analysis component (tokenizer, filter, etc.) gets really into a hissy 
 fit and throws a RuntimeException, the resulting log traffic is less than 
 informative, lacking any pointer to the doc under discussion (in the doc 
 case). It would be better if there were a try/catch shortstop that logged 
 this more informatively.




[jira] [Commented] (SOLR-5623) Better diagnosis of RuntimeExceptions in analysis

2014-01-10 Thread Benson Margulies (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-5623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13868400#comment-13868400
 ] 

Benson Margulies commented on SOLR-5623:


I think the patch request is now good to go, again sticking with a Solr change 
and considering a Lucene change later on.

 Better diagnosis of RuntimeExceptions in analysis
 -

 Key: SOLR-5623
 URL: https://issues.apache.org/jira/browse/SOLR-5623
 Project: Solr
  Issue Type: Bug
Reporter: Benson Margulies

 If an analysis component (tokenizer, filter, etc.) gets really into a hissy 
 fit and throws a RuntimeException, the resulting log traffic is less than 
 informative, lacking any pointer to the doc under discussion (in the doc 
 case). It would be better if there were a try/catch shortstop that logged 
 this more informatively.




[jira] [Commented] (LUCENE-5388) Eliminate construction over readers for Tokenizer

2014-01-09 Thread Benson Margulies (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-5388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13866561#comment-13866561
 ] 

Benson Margulies commented on LUCENE-5388:
--

Should I try to get the branch in git to match the .patch, or should I just let 
you proceed from here? I guess that might depend on reactions of others.

 Eliminate construction over readers for Tokenizer
 -

 Key: LUCENE-5388
 URL: https://issues.apache.org/jira/browse/LUCENE-5388
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/other
Reporter: Benson Margulies
 Attachments: LUCENE-5388.patch


 In the modern world, Tokenizers are intended to be reusable, with input 
 supplied via #setReader. The constructors that take Reader are a vestige. 
 Worse yet, they invite people to make mistakes in handling the reader that 
 tangle them up with the state machine in Tokenizer. The sensible thing is to 
 eliminate these ctors, and force setReader usage.
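
 To illustrate the reuse pattern described above, here is a small 
 self-contained sketch. ReusableSplitter is a hypothetical class that only 
 mimics the "construct once, call setReader per document" shape; it is not 
 Lucene's Tokenizer and does not model its reset/incrementToken/end/close 
 state machine.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical whitespace splitter: input arrives via setReader, never via the constructor. */
class ReusableSplitter {
    private Reader input;

    void setReader(Reader reader) {
        this.input = reader;
    }

    /** Consumes the current input and returns its whitespace-separated tokens. */
    List<String> tokens() {
        if (input == null) {
            throw new IllegalStateException("setReader must be called before tokens()");
        }
        List<String> out = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        try (Reader in = input) {
            int c;
            while ((c = in.read()) != -1) {
                if (Character.isWhitespace(c)) {
                    if (current.length() > 0) {
                        out.add(current.toString());
                        current.setLength(0);
                    }
                } else {
                    current.append((char) c);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            input = null;   // a fresh setReader is required before the next use
        }
        if (current.length() > 0) {
            out.add(current.toString());
        }
        return out;
    }
}
```

 One instance can then serve many documents: call setReader with a new 
 Reader, consume the tokens, and repeat. Because no constructor takes a 
 Reader, there is no second code path for handing over input.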




[jira] [Commented] (LUCENE-5389) Even more doc for construction of TokenStream components

2014-01-09 Thread Benson Margulies (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-5389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13866765#comment-13866765
 ] 

Benson Margulies commented on LUCENE-5389:
--

[~rcmuir] I think that this is ready to go. If you commit this and merge down 
to 4.x, I can then tackle work on this file for the new stuff.


 Even more doc for construction of TokenStream components
 

 Key: LUCENE-5389
 URL: https://issues.apache.org/jira/browse/LUCENE-5389
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Benson Margulies

 There are more useful things to tell would-be authors of tokenizers. Let's 
 tell them.




[jira] [Commented] (LUCENE-5386) Make Tokenizers deliver their final offsets

2014-01-09 Thread Benson Margulies (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-5386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13867437#comment-13867437
 ] 

Benson Margulies commented on LUCENE-5386:
--

Let me try to restate the above in my own words to make sure I understand it.

At #end(), all the pieces of an analysis chain are responsible for putting the 
attributes into a consistent state that reflects the end of the input. 
TokenStream itself takes care of PositionIncrementAttribute. Only the Tokenizer 
can take care of OffsetAttribute, but it's easy to forget -- and if there are 
other interesting things going on, a Tokenizer or anything else may have other 
work to do. 

So Rob's thoughts above are to make Tokenizer or a derivative track the final 
offset, which is simple, and to have a protocol that keeps PositionIncrement 
in line given the possibility of skipped tokens. To avoid loading up the 
'Tokenizer' class with too much stuff that someone might want to do for 
themselves, add an intermediate class for this and let Tokenizer proper be lean.

I'll get organized to sketch some code.

 Make Tokenizers deliver their final offsets
 ---

 Key: LUCENE-5386
 URL: https://issues.apache.org/jira/browse/LUCENE-5386
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Benson Margulies

 Tokenizers _must_ have an implementation of #end() in which they set up the 
 final offset. Currently, nothing enforces this. end() has a useful 
 implementation in TokenStream, so just making it abstract is not attractive.
 Proposal: add
   abstract int finalOffset();
 to Tokenizer, and then make
   void end() {
     super.end();
     int fo = finalOffset();
     offsetAttr.setOffsets(fo, fo);
   }
 or something to that effect.
 Other alternatives to be considered depending on how this looks.
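
 The proposal can be tried out in isolation with a self-contained sketch. 
 Offsets, SketchStream, SketchTokenizer, and LengthTokenizer are all 
 hypothetical stand-ins invented for this example; they only imitate the 
 shape of TokenStream, Tokenizer, and the offset attribute, and making end() 
 final is an extra assumption not stated in the proposal.

```java
/** Hypothetical stand-in for an offset attribute. */
class Offsets {
    int start;
    int end;

    void setOffsets(int start, int end) {
        this.start = start;
        this.end = end;
    }
}

/** Hypothetical base class whose end() already does useful work. */
class SketchStream {
    boolean endCalled;

    void end() {
        endCalled = true;
    }
}

/** Subclasses must report the final offset, so end() cannot forget to publish it. */
abstract class SketchTokenizer extends SketchStream {
    final Offsets offsetAttr = new Offsets();

    abstract int finalOffset();

    @Override
    final void end() {
        super.end();
        int fo = finalOffset();
        offsetAttr.setOffsets(fo, fo);
    }
}

/** Example subclass that treats the input length as the final offset. */
class LengthTokenizer extends SketchTokenizer {
    private final String text;

    LengthTokenizer(String text) {
        this.text = text;
    }

    @Override
    int finalOffset() {
        return text.length();
    }
}
```

 The compiler now rejects any concrete subclass that omits finalOffset(), 
 which is the enforcement the issue asks for, while the base class's end() 
 logic still runs through super.end().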




[jira] [Commented] (LUCENE-5389) Even more doc for construction of TokenStream components

2014-01-09 Thread Benson Margulies (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-5389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13867441#comment-13867441
 ] 

Benson Margulies commented on LUCENE-5389:
--

Sorry, I forgot to lint after accepting the suggestion about delegation.

Yes, I'll start making various next-step patches.


 Even more doc for construction of TokenStream components
 

 Key: LUCENE-5389
 URL: https://issues.apache.org/jira/browse/LUCENE-5389
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Benson Margulies
 Fix For: 5.0, 4.7


 There are more useful things to tell would-be authors of tokenizers. Let's 
 tell them.




[jira] [Commented] (LUCENE-5388) Eliminate construction over readers for Tokenizer

2014-01-08 Thread Benson Margulies (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-5388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13865385#comment-13865385
 ] 

Benson Margulies commented on LUCENE-5388:
--

Uwe, what's that mean practically? No PR yet? A PR just in trunk? Merging my 
recent doc to a 4.x branch?

 Eliminate construction over readers for Tokenizer
 -

 Key: LUCENE-5388
 URL: https://issues.apache.org/jira/browse/LUCENE-5388
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/other
Reporter: Benson Margulies

 In the modern world, Tokenizers are intended to be reusable, with input 
 supplied via #setReader. The constructors that take Reader are a vestige. 
 Worse yet, they invite people to make mistakes in handling the reader that 
 tangle them up with the state machine in Tokenizer. The sensible thing is to 
 eliminate these ctors, and force setReader usage.




[jira] [Comment Edited] (LUCENE-5388) Eliminate construction over readers for Tokenizer

2014-01-08 Thread Benson Margulies (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-5388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13865385#comment-13865385
 ] 

Benson Margulies edited comment on LUCENE-5388 at 1/8/14 12:59 PM:
---

Uwe, what's that mean practically? No PR yet? A PR just in trunk? Merging my 
recent doc to a 4.x branch? A feature branch where this goes to be merged in 
when the time is ripe?


was (Author: bmargulies):
Uwe, what's that mean practically? No PR yet? A PR just in trunk? Merging my 
recent doc to a 4.x branch?

 Eliminate construction over readers for Tokenizer
 -

 Key: LUCENE-5388
 URL: https://issues.apache.org/jira/browse/LUCENE-5388
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/other
Reporter: Benson Margulies

 In the modern world, Tokenizers are intended to be reusable, with input 
 supplied via #setReader. The constructors that take Reader are a vestige. 
 Worse yet, they invite people to make mistakes in handling the reader that 
 tangle them up with the state machine in Tokenizer. The sensible thing is to 
 eliminate these ctors, and force setReader usage.




[jira] [Commented] (LUCENE-5388) Eliminate construction over readers for Tokenizer

2014-01-08 Thread Benson Margulies (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-5388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13865573#comment-13865573
 ] 

Benson Margulies commented on LUCENE-5388:
--

How about we start by adding ctors that don't require a reader, and do treat 
them as 4.x fodder?

 Eliminate construction over readers for Tokenizer
 -

 Key: LUCENE-5388
 URL: https://issues.apache.org/jira/browse/LUCENE-5388
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/other
Reporter: Benson Margulies

 In the modern world, Tokenizers are intended to be reusable, with input 
 supplied via #setReader. The constructors that take Reader are a vestige. 
 Worse yet, they invite people to make mistakes in handling the reader that 
 tangle them up with the state machine in Tokenizer. The sensible thing is to 
 eliminate these ctors, and force setReader usage.




[jira] [Commented] (LUCENE-5388) Eliminate construction over readers for Tokenizer

2014-01-08 Thread Benson Margulies (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-5388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13865580#comment-13865580
 ] 

Benson Margulies commented on LUCENE-5388:
--

setReader throws IOException, but the existing constructors don't. Analyzer 
'createComponents' doesn't. How to sort this out?

 Eliminate construction over readers for Tokenizer
 -

 Key: LUCENE-5388
 URL: https://issues.apache.org/jira/browse/LUCENE-5388
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/other
Reporter: Benson Margulies

 In the modern world, Tokenizers are intended to be reusable, with input 
 supplied via #setReader. The constructors that take Reader are a vestige. 
 Worse yet, they invite people to make mistakes in handling the reader that 
 tangle them up with the state machine in Tokenizer. The sensible thing is to 
 eliminate these ctors, and force setReader usage.




[jira] [Commented] (LUCENE-5388) Eliminate construction over readers for Tokenizer

2014-01-08 Thread Benson Margulies (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-5388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13865626#comment-13865626
 ] 

Benson Margulies commented on LUCENE-5388:
--

OK, I see. If we don't do compatibility, then no one calls setReader in 
createComponents, and all is well. OK, I'm proceeding.


 Eliminate construction over readers for Tokenizer
 -

 Key: LUCENE-5388
 URL: https://issues.apache.org/jira/browse/LUCENE-5388
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/other
Reporter: Benson Margulies

 In the modern world, Tokenizers are intended to be reusable, with input 
 supplied via #setReader. The constructors that take Reader are a vestige. 
 Worse yet, they invite people to make mistakes in handling the reader that 
 tangle them up with the state machine in Tokenizer. The sensible thing is to 
 eliminate these ctors, and force setReader usage.




[jira] [Commented] (LUCENE-5388) Eliminate construction over readers for Tokenizer

2014-01-08 Thread Benson Margulies (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-5388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13865634#comment-13865634
 ] 

Benson Margulies commented on LUCENE-5388:
--

Why does the reader get passed to createComponents in this model? Should that 
param go away?


 Eliminate construction over readers for Tokenizer
 -

 Key: LUCENE-5388
 URL: https://issues.apache.org/jira/browse/LUCENE-5388
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/other
Reporter: Benson Margulies

 In the modern world, Tokenizers are intended to be reusable, with input 
 supplied via #setReader. The constructors that take Reader are a vestige. 
 Worse yet, they invite people to make mistakes in handling the reader that 
 tangle them up with the state machine in Tokenizer. The sensible thing is to 
 eliminate these ctors, and force setReader usage.




[jira] [Commented] (LUCENE-5388) Eliminate construction over readers for Tokenizer

2014-01-08 Thread Benson Margulies (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-5388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13865661#comment-13865661
 ] 

Benson Margulies commented on LUCENE-5388:
--

https://github.com/apache/lucene-solr/pull/16 is available for your reading 
pleasure to see what these changes look like.


 Eliminate construction over readers for Tokenizer
 -

 Key: LUCENE-5388
 URL: https://issues.apache.org/jira/browse/LUCENE-5388
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/other
Reporter: Benson Margulies

 In the modern world, Tokenizers are intended to be reusable, with input 
 supplied via #setReader. The constructors that take Reader are a vestige. 
 Worse yet, they invite people to make mistakes in handling the reader that 
 tangle them up with the state machine in Tokenizer. The sensible thing is to 
 eliminate these ctors, and force setReader usage.




[jira] [Commented] (LUCENE-5388) Eliminate construction over readers for Tokenizer

2014-01-08 Thread Benson Margulies (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-5388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13865667#comment-13865667
 ] 

Benson Margulies commented on LUCENE-5388:
--

[~rcmuir] Next frontier is TokenizerFactory.

Do we change #create to not take a reader, or do we add 'throws IOException'? 
Based on comments above, I'd think we take out the reader.

[~mikemccand] I would love help. If you tell me your GitHub ID, I'll add you to 
my repo, and you can take up some of the ton of editing.


 Eliminate construction over readers for Tokenizer
 -

 Key: LUCENE-5388
 URL: https://issues.apache.org/jira/browse/LUCENE-5388
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/other
Reporter: Benson Margulies

 In the modern world, Tokenizers are intended to be reusable, with input 
 supplied via #setReader. The constructors that take Reader are a vestige. 
 Worse yet, they invite people to make mistakes in handling the reader that 
 tangle them up with the state machine in Tokenizer. The sensible thing is to 
 eliminate these ctors, and force setReader usage.




[jira] [Commented] (LUCENE-5388) Eliminate construction over readers for Tokenizer

2014-01-08 Thread Benson Margulies (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-5388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13865907#comment-13865907
 ] 

Benson Margulies commented on LUCENE-5388:
--

[~rcmuir] or [~mikemccand] I could really use some help here with 
TestRandomChains.

 Eliminate construction over readers for Tokenizer
 -

 Key: LUCENE-5388
 URL: https://issues.apache.org/jira/browse/LUCENE-5388
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/other
Reporter: Benson Margulies

 In the modern world, Tokenizers are intended to be reusable, with input 
 supplied via #setReader. The constructors that take Reader are a vestige. 
 Worse yet, they invite people to make mistakes in handling the reader that 
 tangle them up with the state machine in Tokenizer. The sensible thing is to 
 eliminate these ctors, and force setReader usage.




[jira] [Commented] (LUCENE-5388) Eliminate construction over readers for Tokenizer

2014-01-08 Thread Benson Margulies (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-5388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13865925#comment-13865925
 ] 

Benson Margulies commented on LUCENE-5388:
--

It does something complex with the input reader in createComponents. The 
challenge is to move all that to initReader so that it works. I think I'm too 
fried from 1000 other edits. I'll look in after dinner, but anyone who wants to 
grab my branch from GitHub and pitch in is more than welcome.


 Eliminate construction over readers for Tokenizer
 -

 Key: LUCENE-5388
 URL: https://issues.apache.org/jira/browse/LUCENE-5388
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/other
Reporter: Benson Margulies

 In the modern world, Tokenizers are intended to be reusable, with input 
 supplied via #setReader. The constructors that take Reader are a vestige. 
 Worse yet, they invite people to make mistakes in handling the reader that 
 tangle them up with the state machine in Tokenizer. The sensible thing is to 
 eliminate these ctors, and force setReader usage.




[jira] [Commented] (LUCENE-5388) Eliminate construction over readers for Tokenizer

2014-01-08 Thread Benson Margulies (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-5388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13866077#comment-13866077
 ] 

Benson Margulies commented on LUCENE-5388:
--

You got me off the dot on RandomChains.

 Eliminate construction over readers for Tokenizer
 -

 Key: LUCENE-5388
 URL: https://issues.apache.org/jira/browse/LUCENE-5388
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/other
Reporter: Benson Margulies

 In the modern world, Tokenizers are intended to be reusable, with input 
 supplied via #setReader. The constructors that take Reader are a vestige. 
 Worse yet, they invite people to make mistakes in handling the reader that 
 tangle them up with the state machine in Tokenizer. The sensible thing is to 
 eliminate these ctors, and force setReader usage.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (LUCENE-5388) Eliminate construction over readers for Tokenizer

2014-01-08 Thread Benson Margulies (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-5388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13866131#comment-13866131
 ] 

Benson Margulies commented on LUCENE-5388:
--

I've got all of Lucene to compile, and a bunch of tests running.

 Eliminate construction over readers for Tokenizer
 -

 Key: LUCENE-5388
 URL: https://issues.apache.org/jira/browse/LUCENE-5388
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/other
Reporter: Benson Margulies

 In the modern world, Tokenizers are intended to be reusable, with input 
 supplied via #setReader. The constructors that take Reader are a vestige. 
 Worse yet, they invite people to make mistakes in handling the reader that 
 tangle them up with the state machine in Tokenizer. The sensible thing is to 
 eliminate these ctors, and force setReader usage.




[jira] [Commented] (LUCENE-5388) Eliminate construction over readers for Tokenizer

2014-01-08 Thread Benson Margulies (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-5388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13866227#comment-13866227
 ] 

Benson Margulies commented on LUCENE-5388:
--

I have Solr test failures in PreAnalyzedField, which has some stubborn fondness 
for the idea of a reader passed to a constructor.


 Eliminate construction over readers for Tokenizer
 -

 Key: LUCENE-5388
 URL: https://issues.apache.org/jira/browse/LUCENE-5388
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/other
Reporter: Benson Margulies

 In the modern world, Tokenizers are intended to be reusable, with input 
 supplied via #setReader. The constructors that take Reader are a vestige. 
 Worse yet, they invite people to make mistakes in handling the reader that 
 tangle them up with the state machine in Tokenizer. The sensible thing is to 
 eliminate these ctors, and force setReader usage.




[jira] [Comment Edited] (LUCENE-5388) Eliminate construction over readers for Tokenizer

2014-01-08 Thread Benson Margulies (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-5388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13866227#comment-13866227
 ] 

Benson Margulies edited comment on LUCENE-5388 at 1/9/14 2:41 AM:
--

I have Solr test failures in PreAnalyzedField, which has some stubborn fondness 
for the idea of a reader passed to a constructor. But that seems to be all 
that's broken; a few Solr failures (based on 'ant test').



was (Author: bmargulies):
I have Solr test failures in PreAnalyzedField, which has some stubborn fondness 
for the idea of a reader passed to a constructor.



[jira] [Created] (LUCENE-5388) Eliminate construction over readers for Tokenizer

2014-01-07 Thread Benson Margulies (JIRA)

Benson Margulies created LUCENE-5388:


 Summary: Eliminate construction over readers for Tokenizer
 Key: LUCENE-5388
 URL: https://issues.apache.org/jira/browse/LUCENE-5388
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/other
Reporter: Benson Margulies


In the modern world, Tokenizers are intended to be reusable, with input 
supplied via #setReader. The constructors that take Reader are a vestige. Worse 
yet, they invite people to make mistakes in handling the reader that tangle 
them up with the state machine in Tokenizer. The sensible thing is to eliminate 
these ctors, and force setReader usage.
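
The reuse pattern described above can be sketched roughly as follows. This is a self-contained model, not Lucene code: SketchWhitespaceTokenizer, next() and ReuseSketch are hypothetical names invented for illustration, and the real Tokenizer API uses attributes and incrementToken() rather than returning strings. The point is only that one instance serves many inputs, with each input arriving through setReader and never through a constructor.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a reusable tokenizer; NOT Lucene's Tokenizer. */
class SketchWhitespaceTokenizer {
    private Reader input;   // replaced on every reuse
    private boolean ready;  // a tiny piece of the "state machine"

    /** No Reader-taking constructor: input always arrives here. */
    void setReader(Reader reader) {
        input = reader;
        ready = false;
    }

    void reset() {
        if (input == null) throw new IllegalStateException("call setReader first");
        ready = true;
    }

    /** Returns the next whitespace-delimited token, or null at end of input. */
    String next() {
        if (!ready) throw new IllegalStateException("call reset first");
        StringBuilder sb = new StringBuilder();
        try {
            int c;
            while ((c = input.read()) != -1) {
                if (!Character.isWhitespace(c)) {
                    sb.append((char) c);
                } else if (sb.length() > 0) {
                    break; // whitespace after a token ends that token
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.length() > 0 ? sb.toString() : null;
    }

    void close() {
        try {
            if (input != null) input.close();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        input = null;
        ready = false;
    }
}

class ReuseSketch {
    /** Runs one document through an already-constructed tokenizer. */
    static List<String> tokenize(SketchWhitespaceTokenizer tok, String text) {
        List<String> out = new ArrayList<>();
        tok.setReader(new StringReader(text));
        tok.reset();
        for (String t = tok.next(); t != null; t = tok.next()) {
            out.add(t);
        }
        tok.close();
        return out;
    }
}
```

Calling ReuseSketch.tokenize twice with the same tokenizer instance and different strings exercises the reuse path; a constructor that captured a Reader would tie the instance to its first input.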





[jira] [Created] (LUCENE-5389) Even more doc for construction of TokenStream components

2014-01-07 Thread Benson Margulies (JIRA)

Benson Margulies created LUCENE-5389:


 Summary: Even more doc for construction of TokenStream components
 Key: LUCENE-5389
 URL: https://issues.apache.org/jira/browse/LUCENE-5389
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Benson Margulies


There are more useful things to tell would-be authors of tokenizers. Let's tell 
them.




[jira] [Commented] (LUCENE-5389) Even more doc for construction of TokenStream components

2014-01-07 Thread Benson Margulies (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-5389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13864825#comment-13864825
 ] 

Benson Margulies commented on LUCENE-5389:
--

https://github.com/apache/lucene-solr/pull/14




[jira] [Created] (LUCENE-5386) Make Tokenizers deliver their final offsets

2014-01-06 Thread Benson Margulies (JIRA)

Benson Margulies created LUCENE-5386:


 Summary: Make Tokenizers deliver their final offsets
 Key: LUCENE-5386
 URL: https://issues.apache.org/jira/browse/LUCENE-5386
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Benson Margulies


Tokenizers _must_ have an implementation of #end() in which they set up the 
final offset. Currently, nothing enforces this. end() has a useful 
implementation in TokenStream, so just making it abstract is not attractive.

Proposal: add

    abstract int finalOffset();

to Tokenizer, and then make

    void end() {
        super.end();
        int fo = finalOffset();
        offsetAttr.setOffset(fo, fo);
    }

or something to that effect.

Other alternatives to be considered depending on how this looks.
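
To make the proposal concrete, here is a minimal, self-contained sketch of the shape it describes. SketchStream, SketchTokenizer and FixedLengthTokenizer are hypothetical stand-ins, not Lucene classes, and plain int fields stand in for the offset attribute; what the base end() really does in Lucene is not modeled here.

```java
/** Hypothetical stand-in for the stream base class; NOT Lucene's TokenStream. */
abstract class SketchStream {
    boolean baseEndRan;

    /** Base end-of-stream bookkeeping that subclasses should not lose. */
    void end() {
        baseEndRan = true;
    }
}

/** Hypothetical tokenizer base that demands a final offset from subclasses. */
abstract class SketchTokenizer extends SketchStream {
    int startOffset;
    int endOffset;

    /** Every concrete tokenizer must say where its input ended. */
    abstract int finalOffset();

    @Override
    void end() {
        super.end();
        int fo = finalOffset();
        // Stand-in for setting both offsets on the offset attribute.
        startOffset = fo;
        endOffset = fo;
    }
}

/** Toy subclass: pretends it consumed the whole input string. */
class FixedLengthTokenizer extends SketchTokenizer {
    private final int consumed;

    FixedLengthTokenizer(String text) {
        this.consumed = text.length();
    }

    @Override
    int finalOffset() {
        return consumed;
    }
}
```

Because finalOffset() is abstract, a subclass that forgets it fails to compile, which is the enforcement the issue asks for.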





[jira] [Commented] (LUCENE-5386) Make Tokenizers deliver their final offsets

2014-01-06 Thread Benson Margulies (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-5386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13863521#comment-13863521
 ] 

Benson Margulies commented on LUCENE-5386:
--

How about, then:

Tokenizer:

    abstract void tokenizerEnd();

    final void end() {
        super.end();
        tokenizerEnd();
    }

?
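
A rough, self-contained sketch of this variant follows. HookStream, HookTokenizer and CharCountingTokenizer are hypothetical names used only for illustration, not Lucene classes; the log list exists purely to show call order.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for the stream base class; NOT Lucene's TokenStream. */
abstract class HookStream {
    final List<String> log = new ArrayList<>();

    void end() {
        log.add("base end");
    }
}

/** end() is final, so subclasses cannot skip super.end(); they fill in the hook. */
abstract class HookTokenizer extends HookStream {
    abstract void tokenizerEnd();

    @Override
    final void end() {
        super.end();
        tokenizerEnd();
    }
}

/** Toy subclass: free to touch any end-of-stream state, not just offsets. */
class CharCountingTokenizer extends HookTokenizer {
    private final String text;
    int finalOffset = -1;

    CharCountingTokenizer(String text) {
        this.text = text;
    }

    @Override
    void tokenizerEnd() {
        finalOffset = text.length();
        log.add("tokenizer end");
    }
}
```

Compared with an abstract finalOffset(), the hook gives subclasses room to set state other than the offset, at the cost of no longer guaranteeing that the offset itself gets set.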



[jira] [Commented] (LUCENE-5386) Make Tokenizers deliver their final offsets

2014-01-06 Thread Benson Margulies (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-5386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13863570#comment-13863570
 ] 

Benson Margulies commented on LUCENE-5386:
--

Can you help me with how this relates to your previous remark about attributes 
other than Offset? What other attributes would get manipulated and how? 


[jira] [Created] (LUCENE-5384) Analysis overview could mention clearAttributes and end

2014-01-05 Thread Benson Margulies (JIRA)

Benson Margulies created LUCENE-5384:


 Summary: Analysis overview could mention clearAttributes and end
 Key: LUCENE-5384
 URL: https://issues.apache.org/jira/browse/LUCENE-5384
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Benson Margulies
Assignee: Benson Margulies


It would be helpful to tokenizer implementors for the analysis package overview 
to mention more things. I'll supply a patch.




[jira] [Commented] (LUCENE-5384) Analysis overview could mention clearAttributes and end

2014-01-05 Thread Benson Margulies (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-5384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13862696#comment-13862696
 ] 

Benson Margulies commented on LUCENE-5384:
--

https://github.com/apache/lucene-solr/pull/12 contains some more documentation.

Yes, this is offered under the terms of the Apache license, in case anyone is 
still uncertain as to the relationship of github pull requests to the AL.



[jira] [Commented] (LUCENE-2899) Add OpenNLP Analysis capabilities as a module

2013-11-12 Thread Benson Margulies (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-2899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13820031#comment-13820031
 ] 

Benson Margulies commented on LUCENE-2899:
--

I know of an NER model that looks at the entire text to bias towards consistent 
tagging of entities in larger units. However, I agree that crocks are bad. 
Perhaps this is an opportunity to think about how to expand the analysis 
protocol to support this sort of thing more smoothly?

It would be desirable if this integration were to start with a set of Token 
Attributes that could be used in any number of analysis components, inside or 
outside of Lucene, that were in a position to deliver similar items. I suppose 
I'm late to ask for this, as the UIMA component must pose the same question.

In some languages, NER is very clumsy as a token filter, because entities don't 
obey token boundaries very well. Also, in my experience, entities aren't useful 
as additional tokens in the same field as their source text, but rather in 
their own field (where they can be facetted upon, for example). Is there any 
appetite to look at Lucene support for a stream that delivers to more than one 
field? Or is there such a thing and I've missed it?

I agree with Rob about UIMA because I think that Lucene analysis attributes are 
a weak data model for interconnecting NLP modules and flowing data through them 
-- and one frequently needs to do that.



 Add OpenNLP Analysis capabilities as a module
 -

 Key: LUCENE-2899
 URL: https://issues.apache.org/jira/browse/LUCENE-2899
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/analysis
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
 Fix For: 4.6

 Attachments: LUCENE-2899-RJN.patch, LUCENE-2899.patch, 
 OpenNLPFilter.java, OpenNLPTokenizer.java


 Now that OpenNLP is an ASF project and has a nice license, it would be nice 
 to have a submodule (under analysis) that exposed capabilities for it. Drew 
 Farris, Tom Morton and I have code that does:
 * Sentence Detection as a Tokenizer (could also be a TokenFilter, although it 
 would have to change slightly to buffer tokens)
 * NamedEntity recognition as a TokenFilter
 We are also planning a Tokenizer/TokenFilter that can put parts of speech as 
 either payloads (PartOfSpeechAttribute?) on a token or at the same position.
 I'd propose it go under:
 modules/analysis/opennlp




[jira] [Commented] (LUCENE-2899) Add OpenNLP Analysis capabilities as a module

2013-11-12 Thread Benson Margulies (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-2899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13820085#comment-13820085
 ] 

Benson Margulies commented on LUCENE-2899:
--

Fair enough. Solr URPs do this very well upstream of analysis. ES doesn't have 
the concept; perhaps it should. It clarifies the situation nicely to think of 
Lucene as serial token operations.


[jira] [Commented] (LUCENE-4956) the korean analyzer that has a korean morphological analyzer and dictionaries

2013-11-02 Thread Benson Margulies (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-4956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13812034#comment-13812034
 ] 

Benson Margulies commented on LUCENE-4956:
--

Looks like mapHanja.dic needs some adjustment of the legal notice? Or was this 
going to be replaced?


 the korean analyzer that has a korean morphological analyzer and dictionaries
 -

 Key: LUCENE-4956
 URL: https://issues.apache.org/jira/browse/LUCENE-4956
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/analysis
Affects Versions: 4.2
Reporter: SooMyung Lee
Assignee: Christian Moen
  Labels: newbie
 Attachments: LUCENE-4956.patch, eval.patch, kr.analyzer.4x.tar, 
 lucene-4956.patch, lucene4956.patch


 The Korean language has specific characteristics. When developing a search 
 service with Lucene and Solr in Korean, there are some problems in searching 
 and indexing. The Korean analyzer solves these problems with a Korean 
 morphological analyzer. It consists of a Korean morphological analyzer, 
 dictionaries, a Korean tokenizer and a Korean filter. The Korean analyzer is 
 made for Lucene and Solr. If you develop a search service with Lucene in 
 Korean, the best idea is to choose the Korean analyzer.




[jira] [Commented] (LUCENE-4956) the korean analyzer that has a korean morphological analyzer and dictionaries

2013-11-02 Thread Benson Margulies (JIRA)

[
https://issues.apache.org/jira/browse/LUCENE-4956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13812044#comment-13812044
]

Benson Margulies commented on LUCENE-4956:
--

My point is that it might have a bit too much legal notice. Generally, when
someone grants a license, the headers all move up to some global NOTICE file,
and the file is left with just an Apache license.

I also noted the following:

! Except as contained in this notice, the name of a copyright holder shall not be
! used in advertising or otherwise to promote the sale, use or other dealings in
! these Data Files or Software without prior written authorization of the
! copyright holder.

and then noticed that http://www.apache.org/legal/resolved.html says that it
approves of

* BSD (without advertising clause).

So that Unicode license is possibly an issue.

Right now I'm using the git clone, but I just did a pull, and the pathname is
lucene/analysis/arirang/src/data/mapHanja.dic


[jira] [Commented] (LUCENE-4956) the korean analyzer that has a korean morphological analyzer and dictionaries

2013-11-02 Thread Benson Margulies (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-4956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13812050#comment-13812050
 ] 

Benson Margulies commented on LUCENE-4956:
--

That jira concerns a different license. The license on the file pointed-to 
there has no advertising clause that I can spot.



[jira] [Comment Edited] (LUCENE-4956) the korean analyzer that has a korean morphological analyzer and dictionaries

2013-11-02 Thread Benson Margulies (JIRA)

[
https://issues.apache.org/jira/browse/LUCENE-4956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13812050#comment-13812050
]

Benson Margulies edited comment on LUCENE-4956 at 11/2/13 4:11 PM:
---

That jira concerns a different license. The license on the file pointed-to
there has no advertising clause that I can spot. Which isn't to say that legal
would have a problem with this, just that I don't think that the JIRA in
question tells us.

was (Author: bmargulies):
That jira concerns a different license. The license on the file pointed-to
there has no advertising clause that I can spot.


[jira] [Commented] (LUCENE-4956) the korean analyzer that has a korean morphological analyzer and dictionaries

2013-11-02 Thread Benson Margulies (JIRA)

[
https://issues.apache.org/jira/browse/LUCENE-4956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13812053#comment-13812053
]

Benson Margulies commented on LUCENE-4956:
--

Rob, I got shat on at great length over this for merely test data over at the
WS project. I had to make the build pull the data over the network to get
certain directors off of my back. I'm trying to spare you the experience.
That's all.

As a low-intensity member of the UTC, I would also expect there to be only one
license. However, I compare:

{noformat}
# Copyright (c) 1991-2011 Unicode, Inc. All Rights reserved.
#
# This file is provided as-is by Unicode, Inc. (The Unicode Consortium). No
# claims are made as to fitness for any particular purpose. No warranties of
# any kind are expressed or implied. The recipient agrees to determine
# applicability of information provided. If this file has been provided on
# magnetic media by Unicode, Inc., the sole remedy for any claim will be
# exchange of defective media within 90 days of receipt.
#
# Unicode, Inc. hereby grants the right to freely use the information
# supplied in this file in the creation of products supporting the
# Unicode Standard, and to make copies of this file in any form for
# internal or external distribution as long as this notice remains
# attached.
{noformat}

with

{noformat}
! Copyright (c) 1991-2013 Unicode, Inc.
! All rights reserved.
! Distributed under the Terms of Use in http://www.unicode.org/copyright.html.
!
! Permission is hereby granted, free of charge, to any person obtaining a copy
! of the Unicode data files and any associated documentation (the Data Files)
! or Unicode software and any associated documentation (the Software) to deal
! in the Data Files or Software without restriction, including without limitation
! the rights to use, copy, modify, merge, publish, distribute, and/or sell copies
! of the Data Files or Software, and to permit persons to whom the Data Files or
! Software are furnished to do so, provided that (a) the above copyright notice(s)
! and this permission notice appear with all copies of the Data Files or Software,
! (b) both the above copyright notice(s) and this permission notice appear in
! associated documentation, and (c) there is clear notice in each modified Data
! File or in the Software as well as in the documentation associated with the Data
! File(s) or Software that the data or software has been modified.
!
! THE DATA FILES AND SOFTWARE ARE PROVIDED AS IS, WITHOUT WARRANTY OF ANY KIND,
! EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
! FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT OF THIRD PARTY RIGHTS. IN NO
! EVENT SHALL THE COPYRIGHT HOLDER OR HOLDERS INCLUDED IN THIS NOTICE BE LIABLE FOR
! ANY CLAIM, OR ANY SPECIAL INDIRECT OR CONSEQUENTIAL DAMAGES, OR ANY DAMAGES
! WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF
! CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION
! WITH THE USE OR PERFORMANCE OF THE DATA FILES OR SOFTWARE.
!
! Except as contained in this notice, the name of a copyright holder shall not be
! used in advertising or otherwise to promote the sale, use or other dealings in
! these Data Files or Software without prior written authorization of the
! copyright holder.
{noformat}

They look pretty different to me. Go figure?


[jira] [Commented] (LUCENE-4956) the korean analyzer that has a korean morphological analyzer and dictionaries

2013-11-02 Thread Benson Margulies (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-4956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13812059#comment-13812059
 ] 

Benson Margulies commented on LUCENE-4956:
--

OK, I see, the email thread about Unicode data in general does certainly cover 
this. Sometimes the workings of Legal are pretty perplexing.


[jira] [Commented] (LUCENE-4956) the korean analyzer that has a korean morphological analyzer and dictionaries

2013-10-29 Thread Benson Margulies (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-4956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13807877#comment-13807877
 ] 

Benson Margulies commented on LUCENE-4956:
--

Hmm. When I followed the link, I found a .tar.gz. I guess the zip was further 
down the page.


[jira] [Commented] (LUCENE-4956) the korean analyzer that has a korean morphological analyzer and dictionaries

2013-10-29 Thread Benson Margulies (JIRA)

[
https://issues.apache.org/jira/browse/LUCENE-4956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13807918#comment-13807918
]

Benson Margulies commented on LUCENE-4956:
--

Something's funny here. On this page
(http://www.kristalinfo.com/TestCollections/), the zip file has directories like

HANTEC-2.0/relevance_file/+�++/L2.rel

The code in the patch expects the word 'full' in the Latin alphabet, no funny
full-width, in that intermediate directory. On the other hand, I can't seem
to find an unzip with a documented -O option on Linux.


[jira] [Comment Edited] (LUCENE-4956) the korean analyzer that has a korean morphological analyzer and dictionaries

2013-10-29 Thread Benson Margulies (JIRA)

[
https://issues.apache.org/jira/browse/LUCENE-4956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13807918#comment-13807918
]

Benson Margulies edited comment on LUCENE-4956 at 10/29/13 12:32 PM:
-

Something's funny here. On this page
(http://www.kristalinfo.com/TestCollections/), the zip file has directories like

HANTEC-2.0/relevance_file/과학기술분야/
HANTEC-2.0/relevance_file/전체/

The code in the patch expects the word 'full' in the Latin alphabet, no funny
full-width, in that intermediate directory. So I don't see how a code-page
option to unzip got there. I'm suspecting that an 'mv' is in order.

was (Author: bmargulies):
Something's funny here. On this page
(http://www.kristalinfo.com/TestCollections/), the zip file has directories like

HANTEC-2.0/relevance_file/+�++/L2.rel


[jira] [Comment Edited] (LUCENE-4956) the korean analyzer that has a korean morphological analyzer and dictionaries

2013-10-29 Thread Benson Margulies (JIRA)

[
https://issues.apache.org/jira/browse/LUCENE-4956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13807918#comment-13807918
]

Benson Margulies edited comment on LUCENE-4956 at 10/29/13 12:34 PM:
-

Something's funny here. On this page
(http://www.kristalinfo.com/TestCollections/), the zip file has directories like

HANTEC-2.0/relevance_file/과학기술분야/
HANTEC-2.0/relevance_file/전체/

The first translates as 'Science and Technology' and the second as 'All'.

was (Author: bmargulies):
Something's funny here. On this page
(http://www.kristalinfo.com/TestCollections/), the zip file has directories like

HANTEC-2.0/relevance_file/과학기술분야/
HANTEC-2.0/relevance_file/전체/


[jira] [Commented] (LUCENE-4956) the korean analyzer that has a korean morphological analyzer and dictionaries

2013-10-28 Thread Benson Margulies (JIRA)

[
https://issues.apache.org/jira/browse/LUCENE-4956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13807567#comment-13807567
]

Benson Margulies commented on LUCENE-4956:
--

Could you share the trick of unpacking the big tarball, locale-wise? I ended up
with:

[benson] /data/HANTEC-2.0 % ls relevance_file
%B0%FA%C7б%E2%BC%FA%BAо%DF %C0%FCü

which does not work so well.

Did you set LOCALE to something before unpacking?
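
One reading of the listing above, offered as an assumption rather than a confirmed diagnosis: the tarball stores its entry names as EUC-KR bytes, and a UTF-8 terminal escapes the bytes it cannot decode. Under that reading the stray Cyrillic letters are byte pairs (D0 B1, D0 BE) that happen to form valid UTF-8, and the trailing u-umlaut is C3 BC. The sketch below decodes the reconstructed byte sequences with the JDK's EUC-KR charset; the class and method names are hypothetical.

```java
import java.nio.charset.Charset;

final class LegacyFileNameDecoder {
    // Assumption: the archive's entry names are encoded as EUC-KR.
    static final Charset EUC_KR = Charset.forName("EUC-KR");

    /** Decodes raw byte values (given as ints for readability) as EUC-KR text. */
    static String decode(int... rawBytes) {
        byte[] bytes = new byte[rawBytes.length];
        for (int i = 0; i < rawBytes.length; i++) {
            bytes[i] = (byte) rawBytes[i];
        }
        return new String(bytes, EUC_KR);
    }

    public static void main(String[] args) {
        // Bytes reconstructed from the first name in the ls output.
        System.out.println(decode(0xB0, 0xFA, 0xC7, 0xD0, 0xB1, 0xE2,
                                  0xBC, 0xFA, 0xBA, 0xD0, 0xBE, 0xDF));
        // Bytes reconstructed from the second name.
        System.out.println(decode(0xC0, 0xFC, 0xC3, 0xBC));
    }
}
```

If the assumption holds, the two decoded names match the directory names quoted in the 2013-10-29 comment on this issue, which would suggest unpacking with a tool or locale that treats the names as EUC-KR.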

[jira] [Commented] (LUCENE-4956) the korean analyzer that has a korean morphological analyzer and dictionaries

2013-10-17 Thread Benson Margulies (JIRA)

[
https://issues.apache.org/jira/browse/LUCENE-4956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13797876#comment-13797876
]

Benson Margulies commented on LUCENE-4956:
--

As a potential user of this technology, I'd like to ask for it to have
documentation of its linguistic approach.

* What is the goal of the tokenizer? Is it to deliver eojeol or hyung-tae-so?
If eojeol, does it split up the case where Korean writers are sometimes relaxed
about whitespace between them?
* Similarly, what does it set out to index? Does it index eojeol and then also
their contained eumjeol or hyung-tae-so, using position-increment /
position-length to indicate compound relationships?
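
To make the second question concrete, here is a minimal sketch of how a compound and its parts could be laid out with position increment and position length. It is a toy model, not the analyzer under review and not Lucene's attribute classes: the Tok class, the spans helper, and the placeholder terms "AB", "A", "B" are all hypothetical. The only convention assumed is the usual one, where an increment of 0 stacks a token on the previous token's position and the length says how many positions the token spans.

```java
import java.util.ArrayList;
import java.util.List;

final class CompoundPositions {
    /** Hypothetical token record: term text plus the two position values. */
    static final class Tok {
        final String term;
        final int posInc;  // positions to advance from the previous token
        final int posLen;  // number of positions this token spans

        Tok(String term, int posInc, int posLen) {
            this.term = term;
            this.posInc = posInc;
            this.posLen = posLen;
        }
    }

    /** Renders each token as term[start,end), with positions counted from 0. */
    static List<String> spans(List<Tok> tokens) {
        List<String> out = new ArrayList<>();
        int pos = -1;  // so that a first increment of 1 lands on position 0
        for (Tok t : tokens) {
            pos += t.posInc;
            out.add(t.term + "[" + pos + "," + (pos + t.posLen) + ")");
        }
        return out;
    }

    public static void main(String[] args) {
        List<Tok> tokens = new ArrayList<>();
        tokens.add(new Tok("AB", 1, 2)); // the whole unit, spanning two positions
        tokens.add(new Tok("A", 0, 1));  // first part, stacked at the same start
        tokens.add(new Tok("B", 1, 1));  // second part, one position later
        System.out.println(spans(tokens));
    }
}
```

In this layout the whole unit and the first part share a start position, so a query for either the unit or its parts can match; whether the Korean analyzer emits anything like this is exactly what the questions above ask.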

[jira] [Commented] (LUCENE-4956) the korean analyzer that has a korean morphological analyzer and dictionaries

2013-10-17 Thread Benson Margulies (JIRA)

[
https://issues.apache.org/jira/browse/LUCENE-4956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13798393#comment-13798393
]

Benson Margulies commented on LUCENE-4956:
--

I am told (I don't read Korean myself) that people often leave out the white
space between eojeol that are made up entirely of Hangul letters (Korean
letters). Are you just defining these very long things to be single eojeol?
Prof Kang in his own work has a module that splits these using some rules.

[jira] [Created] (LUCENE-5244) NPE in Japanese Analyzer

2013-09-25 Thread Benson Margulies (JIRA)

Benson Margulies created LUCENE-5244:


 Summary: NPE in Japanese Analyzer
 Key: LUCENE-5244
 URL: https://issues.apache.org/jira/browse/LUCENE-5244
 Project: Lucene - Core
  Issue Type: Bug
  Components: modules/analysis
Affects Versions: 4.4
Reporter: Benson Margulies


I've got a test case that shows an NPE with the Japanese analyzer.

It's all available in https://github.com/benson-basis/kuromoji-npe, and I 
explicitly grant a license to the Foundation.

If anyone would prefer that I attach a tarball here, just let me know.

{noformat}
---
 T E S T S
---
Running com.basistech.testcase.JapaneseNpeTest
Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 0.298 sec FAILURE! - in com.basistech.testcase.JapaneseNpeTest
japaneseNpe(com.basistech.testcase.JapaneseNpeTest)  Time elapsed: 0.282 sec ERROR!
java.lang.NullPointerException: null
    at org.apache.lucene.analysis.util.RollingCharBuffer.get(RollingCharBuffer.java:86)
    at org.apache.lucene.analysis.ja.JapaneseTokenizer.parse(JapaneseTokenizer.java:618)
    at org.apache.lucene.analysis.ja.JapaneseTokenizer.incrementToken(JapaneseTokenizer.java:468)
    at com.basistech.testcase.JapaneseNpeTest.japaneseNpe(JapaneseNpeTest.java:28)
{noformat}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Resolved] (LUCENE-5244) NPE in Japanese Analyzer

2013-09-25 Thread Benson Margulies (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-5244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benson Margulies resolved LUCENE-5244.
--

Resolution: Invalid

This was pilot error, I forgot to call reset().
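
For readers who hit the same trace: Lucene's documented consumer workflow for a TokenStream is reset(), then incrementToken() until it returns false, then end() and close(). The sketch below is a toy model of why skipping reset() can surface as a NullPointerException inside a read call. ToyTokenizer and countTokens are hypothetical stand-ins, not Lucene classes, and the assumption that the internal reader is only wired up in reset() is an illustration, not a claim about JapaneseTokenizer's actual code.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

final class ResetContractDemo {
    /** Hypothetical stand-in for a tokenizer; not a Lucene class. */
    static final class ToyTokenizer {
        private final Reader pending; // input handed over at construction
        private Reader active;        // stays null until reset() is called

        ToyTokenizer(Reader input) {
            this.pending = input;
        }

        void reset() {
            active = pending;
        }

        /** Treats each character as one "token"; returns false at end of input. */
        boolean incrementToken() {
            try {
                return active.read() != -1; // throws NPE if reset() was skipped
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
    }

    /** Consumes the stream, optionally skipping reset() to reproduce the mistake. */
    static int countTokens(String text, boolean callReset) {
        ToyTokenizer tokenizer = new ToyTokenizer(new StringReader(text));
        if (callReset) {
            tokenizer.reset();
        }
        int count = 0;
        while (tokenizer.incrementToken()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countTokens("abc", true));
        try {
            countTokens("abc", false);
        } catch (NullPointerException e) {
            System.out.println("NPE without reset()");
        }
    }
}
```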

[jira] [Created] (SOLR-5259) Typo in error message from missing / wrong _version_ field

2013-09-23 Thread Benson Margulies (JIRA)

Benson Margulies created SOLR-5259:
--

 Summary: Typo in error message from missing / wrong _version_ field
 Key: SOLR-5259
 URL: https://issues.apache.org/jira/browse/SOLR-5259
 Project: Solr
  Issue Type: Bug
Affects Versions: 4.4
Reporter: Benson Margulies


Note the missing space between _version_ and field.

Caused by: org.apache.solr.common.SolrException: Unable to use updateLog: 
_version_field must exist in schema, using indexed=true stored=true and 
multiValued=false (_version_ is not indexed

[jira] [Commented] (LUCENE-5202) LookaheadTokenFilter consumes an extra token in nextToken

2013-09-08 Thread Benson Margulies (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-5202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13761254#comment-13761254
 ] 

Benson Margulies commented on LUCENE-5202:
--

Yes, that's what I have and it works, except for the problem I wrote this test 
case to demonstrate. There's a call to peekToken in nextToken used to detect 
the end of the input. When that gets called, a token 'moves' from the input to 
the positions, so the calls to peekToken in my code never see it.

Either I'm supposed to call restoreState to examine it, or there's a problem 
here. If I'm supposed to call restoreState, I need to figure out how to notice 
(by looking at positions?) that I'm in that situation. Or there's some problem 
in my logic for deciding when to do my next load of peeks, so that nextToken is 
never supposed to reach that call to peek, but I can't figure out what it is.


 LookaheadTokenFilter consumes an extra token in nextToken
 -

 Key: LUCENE-5202
 URL: https://issues.apache.org/jira/browse/LUCENE-5202
 Project: Lucene - Core
  Issue Type: Bug
Affects Versions: 4.3.1
Reporter: Benson Margulies
 Attachments: LUCENE-5202.patch, LUCENE-5202.patch


 This is a bit hard to explain except by looking at the test case. I've coded 
 a filter that uses LookaheadTokenFilter. The incrementToken method peeks some 
 tokens. Then, it seems, nextToken in the Lookahead class calls peekToken 
 itself, which seems to me to consume a token so that it's not seen when the 
 derived class sets out to process the next set of tokens.
 In passing, this test case can be used to demonstrate that it does not work 
 to try to use the afterPosition method to set up attributes of the token that 
 we're 'after'. Probably that was never intended. However, I'm hoping for some 
 feedback as to whether the rest of the structure here is as intended for 
 subclasses of LookaheadTokenFilter.

[jira] [Commented] (LUCENE-5202) LookaheadTokenFilter consumes an extra token in nextToken

2013-09-08 Thread Benson Margulies (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-5202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13761475#comment-13761475
 ] 

Benson Margulies commented on LUCENE-5202:
--

OK, I see.

So I'll leave it to you to apply this patch to pick up the fix you made.

thanks
