from:"Lance Norskog"


 [ 
https://issues.apache.org/jira/browse/LUCENE-2899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lance Norskog updated LUCENE-2899:
--

Attachment: (was: LUCENE-2899-x.patch)

> Add OpenNLP Analysis capabilities as a module
> -
>
> Key: LUCENE-2899
> URL: https://issues.apache.org/jira/browse/LUCENE-2899
> Project: Lucene - Core
> Issue Type: New Feature
> Components: modules/analysis
> Reporter: Grant Ingersoll
> Assignee: Grant Ingersoll
> Priority: Minor
> Fix For: 4.6
>
> Attachments: LUCENE-2899-RJN.patch, LUCENE-2899.patch, 
> OpenNLPFilter.java, OpenNLPTokenizer.java
>
>
> Now that OpenNLP is an ASF project and has a nice license, it would be nice 
> to have a submodule (under analysis) that exposed capabilities for it. Drew 
> Farris, Tom Morton and I have code that does:
> * Sentence Detection as a Tokenizer (could also be a TokenFilter, although it 
> would have to change slightly to buffer tokens)
> * NamedEntity recognition as a TokenFilter
> We are also planning a Tokenizer/TokenFilter that can put parts of speech as 
> either payloads (PartOfSpeechAttribute?) on a token or at the same position.
> I'd propose it go under:
> modules/analysis/opennlp
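
The "same position" option in the description can be pictured with position increments: a tag emitted with an increment of 0 lands on the position of the word before it. What follows is only a rough sketch in plain Java under that assumption. SamePositionSketch, Token, stackTags and positions are made-up names, not Lucene or OpenNLP classes, the tags are hand-written, and no model is loaded.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch; none of these names are Lucene or OpenNLP APIs. */
class SamePositionSketch {

    /** A term plus its distance from the previous token; 0 means "same position". */
    static final class Token {
        final String term;
        final int positionIncrement;

        Token(String term, int positionIncrement) {
            this.term = term;
            this.positionIncrement = positionIncrement;
        }

        @Override
        public String toString() {
            return term + "(+" + positionIncrement + ")";
        }
    }

    /**
     * Emits each word with increment 1, followed by its tag with increment 0,
     * so the tag occupies the same position as the word it describes.
     */
    static List<Token> stackTags(String[] words, String[] tags) {
        if (words.length != tags.length) {
            throw new IllegalArgumentException("words and tags must align");
        }
        List<Token> out = new ArrayList<>();
        for (int i = 0; i < words.length; i++) {
            out.add(new Token(words[i], 1));
            out.add(new Token("pos:" + tags[i], 0));
        }
        return out;
    }

    /** Recovers the absolute position of every token by summing increments. */
    static int[] positions(List<Token> tokens) {
        int[] result = new int[tokens.size()];
        int position = -1; // a first increment of 1 yields position 0
        for (int i = 0; i < tokens.size(); i++) {
            position += tokens.get(i).positionIncrement;
            result[i] = position;
        }
        return result;
    }

    public static void main(String[] args) {
        String[] words = {"Lucene", "indexes", "text"};
        String[] tags = {"NNP", "VBZ", "NN"}; // hand-written tags, not model output
        System.out.println(stackTags(words, tags));
    }
}
```

The payload alternative mentioned in the description would attach the tag to the word token itself instead of adding a second token; it is not shown here.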



--
This message was sent by Atlassian JIRA
(v6.1#6144)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Comment Edited] (LUCENE-2899) Add OpenNLP Analysis capabilities as a module


[ 
https://issues.apache.org/jira/browse/LUCENE-2899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13812581#comment-13812581
 ] 

Lance Norskog edited comment on LUCENE-2899 at 11/4/13 2:55 AM:


This patch includes a fix for the problem where searching twice doesn't work. 
The file is LUCENE-2899.patch.
It has been tested with trunk, branch_4x and the 4.5.1 release.

I do not know of any outstanding issues. To avoid confusion, I have removed all 
old patches.


was (Author: lancenorskog):
This patch includes a fix for the problem where searching twice doesn't work. 
The file is LUCENE-2899.patch 
It has been tested with trunk, branch_4x and the 4.5.1 release.

I do not know of any outstanding issues.

> Add OpenNLP Analysis capabilities as a module
> -
>
> Key: LUCENE-2899
> URL: https://issues.apache.org/jira/browse/LUCENE-2899
> Project: Lucene - Core
> Issue Type: New Feature
> Components: modules/analysis
> Reporter: Grant Ingersoll
> Assignee: Grant Ingersoll
> Priority: Minor
> Fix For: 4.6
>
> Attachments: LUCENE-2899-RJN.patch, LUCENE-2899.patch, 
> OpenNLPFilter.java, OpenNLPTokenizer.java

[jira] [Updated] (LUCENE-2899) Add OpenNLP Analysis capabilities as a module


 [ 
https://issues.apache.org/jira/browse/LUCENE-2899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lance Norskog updated LUCENE-2899:
--

Attachment: (was: OpenNLPFilter.java)

> Add OpenNLP Analysis capabilities as a module
> -
>
> Key: LUCENE-2899
> URL: https://issues.apache.org/jira/browse/LUCENE-2899
> Project: Lucene - Core
> Issue Type: New Feature
> Components: modules/analysis
> Reporter: Grant Ingersoll
> Assignee: Grant Ingersoll
> Priority: Minor
> Fix For: 4.6
>
> Attachments: LUCENE-2899-RJN.patch, LUCENE-2899-x.patch, 
> LUCENE-2899.patch, OpenNLPFilter.java, OpenNLPTokenizer.java

[jira] [Updated] (LUCENE-2899) Add OpenNLP Analysis capabilities as a module


 [ 
https://issues.apache.org/jira/browse/LUCENE-2899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lance Norskog updated LUCENE-2899:
--

Attachment: (was: LUCENE-2899-x.patch)

> Add OpenNLP Analysis capabilities as a module
> -
>
> Key: LUCENE-2899
> URL: https://issues.apache.org/jira/browse/LUCENE-2899
> Project: Lucene - Core
> Issue Type: New Feature
> Components: modules/analysis
> Reporter: Grant Ingersoll
> Assignee: Grant Ingersoll
> Priority: Minor
> Fix For: 4.6
>
> Attachments: LUCENE-2899-RJN.patch, LUCENE-2899-x.patch, 
> LUCENE-2899.patch, OpenNLPFilter.java, OpenNLPFilter.java, 
> OpenNLPTokenizer.java

[jira] [Updated] (LUCENE-2899) Add OpenNLP Analysis capabilities as a module


 [ 
https://issues.apache.org/jira/browse/LUCENE-2899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lance Norskog updated LUCENE-2899:
--

Attachment: (was: LUCENE-2899.patch)

> Add OpenNLP Analysis capabilities as a module
> -
>
> Key: LUCENE-2899
> URL: https://issues.apache.org/jira/browse/LUCENE-2899
> Project: Lucene - Core
> Issue Type: New Feature
> Components: modules/analysis
> Reporter: Grant Ingersoll
> Assignee: Grant Ingersoll
> Priority: Minor
> Fix For: 4.6
>
> Attachments: LUCENE-2899-RJN.patch, LUCENE-2899-x.patch, 
> LUCENE-2899-x.patch, LUCENE-2899-x.patch, LUCENE-2899.patch, 
> OpenNLPFilter.java, OpenNLPFilter.java, OpenNLPTokenizer.java

[jira] [Updated] (LUCENE-2899) Add OpenNLP Analysis capabilities as a module


 [ 
https://issues.apache.org/jira/browse/LUCENE-2899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lance Norskog updated LUCENE-2899:
--

Attachment: (was: LUCENE-2899-x.patch)

> Add OpenNLP Analysis capabilities as a module
> -
>
> Key: LUCENE-2899
> URL: https://issues.apache.org/jira/browse/LUCENE-2899
> Project: Lucene - Core
> Issue Type: New Feature
> Components: modules/analysis
> Reporter: Grant Ingersoll
> Assignee: Grant Ingersoll
> Priority: Minor
> Fix For: 4.6
>
> Attachments: LUCENE-2899-RJN.patch, LUCENE-2899-x.patch, 
> LUCENE-2899-x.patch, LUCENE-2899.patch, OpenNLPFilter.java, 
> OpenNLPFilter.java, OpenNLPTokenizer.java

[jira] [Updated] (LUCENE-2899) Add OpenNLP Analysis capabilities as a module


 [ 
https://issues.apache.org/jira/browse/LUCENE-2899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lance Norskog updated LUCENE-2899:
--

Attachment: (was: LUCENE-2899-current.patch)

> Add OpenNLP Analysis capabilities as a module
> -
>
> Key: LUCENE-2899
> URL: https://issues.apache.org/jira/browse/LUCENE-2899
> Project: Lucene - Core
> Issue Type: New Feature
> Components: modules/analysis
> Reporter: Grant Ingersoll
> Assignee: Grant Ingersoll
> Priority: Minor
> Fix For: 4.6
>
> Attachments: LUCENE-2899-RJN.patch, LUCENE-2899-x.patch, 
> LUCENE-2899-x.patch, LUCENE-2899-x.patch, LUCENE-2899.patch, 
> OpenNLPFilter.java, OpenNLPFilter.java, OpenNLPTokenizer.java

[jira] [Updated] (LUCENE-2899) Add OpenNLP Analysis capabilities as a module


 [ 
https://issues.apache.org/jira/browse/LUCENE-2899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lance Norskog updated LUCENE-2899:
--

Attachment: (was: LUCENE-2899.patch)

> Add OpenNLP Analysis capabilities as a module
> -
>
> Key: LUCENE-2899
> URL: https://issues.apache.org/jira/browse/LUCENE-2899
> Project: Lucene - Core
> Issue Type: New Feature
> Components: modules/analysis
> Reporter: Grant Ingersoll
> Assignee: Grant Ingersoll
> Priority: Minor
> Fix For: 4.6
>
> Attachments: LUCENE-2899-RJN.patch, LUCENE-2899-x.patch, 
> LUCENE-2899-x.patch, LUCENE-2899-x.patch, LUCENE-2899.patch, 
> OpenNLPFilter.java, OpenNLPFilter.java, OpenNLPTokenizer.java

[jira] [Updated] (LUCENE-2899) Add OpenNLP Analysis capabilities as a module


 [ 
https://issues.apache.org/jira/browse/LUCENE-2899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lance Norskog updated LUCENE-2899:
--

Attachment: (was: LUCENE-2899.patch)

> Add OpenNLP Analysis capabilities as a module
> -
>
> Key: LUCENE-2899
> URL: https://issues.apache.org/jira/browse/LUCENE-2899
> Project: Lucene - Core
> Issue Type: New Feature
> Components: modules/analysis
> Reporter: Grant Ingersoll
> Assignee: Grant Ingersoll
> Priority: Minor
> Fix For: 4.6
>
> Attachments: LUCENE-2899-RJN.patch, LUCENE-2899-x.patch, 
> LUCENE-2899-x.patch, LUCENE-2899-x.patch, LUCENE-2899.patch, 
> OpenNLPFilter.java, OpenNLPFilter.java, OpenNLPTokenizer.java

[jira] [Updated] (LUCENE-2899) Add OpenNLP Analysis capabilities as a module


 [ 
https://issues.apache.org/jira/browse/LUCENE-2899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lance Norskog updated LUCENE-2899:
--

Attachment: (was: LUCENE-2899.patch)

> Add OpenNLP Analysis capabilities as a module
> -
>
> Key: LUCENE-2899
> URL: https://issues.apache.org/jira/browse/LUCENE-2899
> Project: Lucene - Core
> Issue Type: New Feature
> Components: modules/analysis
> Reporter: Grant Ingersoll
> Assignee: Grant Ingersoll
> Priority: Minor
> Fix For: 4.6
>
> Attachments: LUCENE-2899-RJN.patch, LUCENE-2899-x.patch, 
> LUCENE-2899-x.patch, LUCENE-2899-x.patch, LUCENE-2899.patch, 
> OpenNLPFilter.java, OpenNLPFilter.java, OpenNLPTokenizer.java

[jira] [Updated] (LUCENE-2899) Add OpenNLP Analysis capabilities as a module


 [ 
https://issues.apache.org/jira/browse/LUCENE-2899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lance Norskog updated LUCENE-2899:
--

Attachment: (was: opennlp_trunk.patch)

> Add OpenNLP Analysis capabilities as a module
> -
>
> Key: LUCENE-2899
> URL: https://issues.apache.org/jira/browse/LUCENE-2899
> Project: Lucene - Core
> Issue Type: New Feature
> Components: modules/analysis
> Reporter: Grant Ingersoll
> Assignee: Grant Ingersoll
> Priority: Minor
> Fix For: 4.6
>
> Attachments: LUCENE-2899-RJN.patch, LUCENE-2899-x.patch, 
> LUCENE-2899-x.patch, LUCENE-2899-x.patch, LUCENE-2899.patch, 
> OpenNLPFilter.java, OpenNLPFilter.java, OpenNLPTokenizer.java

[jira] [Updated] (LUCENE-2899) Add OpenNLP Analysis capabilities as a module


 [ 
https://issues.apache.org/jira/browse/LUCENE-2899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lance Norskog updated LUCENE-2899:
--

Attachment: (was: LUCENE-2899.patch)

> Add OpenNLP Analysis capabilities as a module
> -
>
> Key: LUCENE-2899
> URL: https://issues.apache.org/jira/browse/LUCENE-2899
> Project: Lucene - Core
> Issue Type: New Feature
> Components: modules/analysis
> Reporter: Grant Ingersoll
> Assignee: Grant Ingersoll
> Priority: Minor
> Fix For: 4.6
>
> Attachments: LUCENE-2899-RJN.patch, LUCENE-2899-x.patch, 
> LUCENE-2899-x.patch, LUCENE-2899-x.patch, LUCENE-2899.patch, 
> OpenNLPFilter.java, OpenNLPFilter.java, OpenNLPTokenizer.java

[jira] [Updated] (LUCENE-2899) Add OpenNLP Analysis capabilities as a module


 [ 
https://issues.apache.org/jira/browse/LUCENE-2899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lance Norskog updated LUCENE-2899:
--

Attachment: (was: LUCENE-2899.patch)

> Add OpenNLP Analysis capabilities as a module
> -
>
> Key: LUCENE-2899
> URL: https://issues.apache.org/jira/browse/LUCENE-2899
> Project: Lucene - Core
> Issue Type: New Feature
> Components: modules/analysis
> Reporter: Grant Ingersoll
> Assignee: Grant Ingersoll
> Priority: Minor
> Fix For: 4.6
>
> Attachments: LUCENE-2899-RJN.patch, LUCENE-2899-x.patch, 
> LUCENE-2899-x.patch, LUCENE-2899-x.patch, LUCENE-2899.patch, 
> OpenNLPFilter.java, OpenNLPFilter.java, OpenNLPTokenizer.java

[jira] [Updated] (LUCENE-2899) Add OpenNLP Analysis capabilities as a module


 [ 
https://issues.apache.org/jira/browse/LUCENE-2899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lance Norskog updated LUCENE-2899:
--

Attachment: LUCENE-2899.patch

This patch includes a fix for the problem where searching twice doesn't work. 
The file is LUCENE-2899.patch.
It has been tested with trunk, branch_4x and the 4.5.1 release.

I do not know of any outstanding issues.

> Add OpenNLP Analysis capabilities as a module
> -
>
> Key: LUCENE-2899
> URL: https://issues.apache.org/jira/browse/LUCENE-2899
> Project: Lucene - Core
> Issue Type: New Feature
> Components: modules/analysis
> Reporter: Grant Ingersoll
> Assignee: Grant Ingersoll
> Priority: Minor
> Fix For: 4.6
>
> Attachments: LUCENE-2899-RJN.patch, LUCENE-2899-current.patch, 
> LUCENE-2899-x.patch, LUCENE-2899-x.patch, LUCENE-2899-x.patch, 
> LUCENE-2899.patch, LUCENE-2899.patch, LUCENE-2899.patch, LUCENE-2899.patch, 
> LUCENE-2899.patch, LUCENE-2899.patch, LUCENE-2899.patch, OpenNLPFilter.java, 
> OpenNLPFilter.java, OpenNLPTokenizer.java, opennlp_trunk.patch

[jira] [Commented] (LUCENE-2899) Add OpenNLP Analysis capabilities as a module

2013-10-23 Thread Lance Norskog (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-2899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13802982#comment-13802982
 ] 

Lance Norskog commented on LUCENE-2899:
---

Hi-

The latest patch is LUCENE-2899-x.patch; please try that one. Apply it with:

    patch -p0 < patchfile

Lance




> Add OpenNLP Analysis capabilities as a module
> -
>
> Key: LUCENE-2899
> URL: https://issues.apache.org/jira/browse/LUCENE-2899
> Project: Lucene - Core
> Issue Type: New Feature
> Components: modules/analysis
> Reporter: Grant Ingersoll
> Assignee: Grant Ingersoll
> Priority: Minor
> Fix For: 4.6
>
> Attachments: LUCENE-2899-current.patch, LUCENE-2899.patch, 
> LUCENE-2899.patch, LUCENE-2899.patch, LUCENE-2899.patch, LUCENE-2899.patch, 
> LUCENE-2899.patch, LUCENE-2899-RJN.patch, LUCENE-2899-x.patch, 
> LUCENE-2899-x.patch, LUCENE-2899-x.patch, OpenNLPFilter.java, 
> OpenNLPFilter.java, OpenNLPTokenizer.java, opennlp_trunk.patch

A little nit: TokenStream.end() throws an IOException but does not need to

2013-10-20 Thread Lance Norskog

org.apache.lucene.analysis.TokenStream.end()

  public void end() throws IOException {
    clearAttributes(); // LUCENE-3849: don't consume dirty atts
    if (hasAttribute(PositionIncrementAttribute.class)) {
      getAttribute(PositionIncrementAttribute.class).setPositionIncrement(0);
    }
  }


Nothing in this body throws a checked exception, so the method does not need
to declare IOException.
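
To make the cost of the unused clause concrete, here is a small stand-alone sketch. It assumes nothing beyond plain Java: BaseStream and QuietStream are hypothetical classes that only mimic the shape of the method quoted above, and they are not Lucene code.

```java
import java.io.IOException;

/** Hypothetical sketch; these classes only mimic the shape of the quoted method. */
class ThrowsClauseSketch {

    static class BaseStream {
        int positionIncrement = 1;

        // Declares IOException although nothing in the body can throw it.
        void end() throws IOException {
            positionIncrement = 0;
        }
    }

    static class QuietStream extends BaseStream {
        // An override is allowed to drop the checked exception from its signature.
        @Override
        void end() {
            positionIncrement = 0;
        }
    }

    /** Through the base type the caller is still forced to handle IOException. */
    static int endViaBaseType(BaseStream stream) {
        try {
            stream.end();
        } catch (IOException e) {
            throw new IllegalStateException("unexpected", e);
        }
        return stream.positionIncrement;
    }

    /** Through the narrower type no try/catch is required. */
    static int endViaQuietType(QuietStream stream) {
        stream.end();
        return stream.positionIncrement;
    }

    public static void main(String[] args) {
        System.out.println(endViaBaseType(new BaseStream()));
        System.out.println(endViaQuietType(new QuietStream()));
    }
}
```

One consideration the message does not raise: in Java an override may drop a checked exception but may not add one, so removing the clause from a base method would stop compiling any subclass whose override still declares IOException.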

-- 
Lance Norskog
goks...@gmail.com

[jira] [Commented] (LUCENE-2899) Add OpenNLP Analysis capabilities as a module

2013-08-03 Thread Lance Norskog (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-2899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13728617#comment-13728617
 ] 

Lance Norskog commented on LUCENE-2899:
---

Wow! Brat looks bitchin! Looking forward to using it.

> Add OpenNLP Analysis capabilities as a module
> -
>
> Key: LUCENE-2899
> URL: https://issues.apache.org/jira/browse/LUCENE-2899
> Project: Lucene - Core
> Issue Type: New Feature
> Components: modules/analysis
> Reporter: Grant Ingersoll
> Assignee: Grant Ingersoll
> Priority: Minor
> Fix For: 5.0, 4.5
>
> Attachments: LUCENE-2899-current.patch, LUCENE-2899.patch, 
> LUCENE-2899.patch, LUCENE-2899.patch, LUCENE-2899.patch, LUCENE-2899.patch, 
> LUCENE-2899.patch, LUCENE-2899-RJN.patch, LUCENE-2899-x.patch, 
> LUCENE-2899-x.patch, LUCENE-2899-x.patch, OpenNLPFilter.java, 
> OpenNLPFilter.java, OpenNLPTokenizer.java, opennlp_trunk.patch

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (LUCENE-2899) Add OpenNLP Analysis capabilities as a module

2013-07-31 Thread Lance Norskog (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-2899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13725911#comment-13725911
 ] 

Lance Norskog commented on LUCENE-2899:
---

Yup! Another NER is always helpful. But the big problem with NLP software is
not the code but the models. Do you have a good source of free models?

> Add OpenNLP Analysis capabilities as a module
> -
>
> Key: LUCENE-2899
> URL: https://issues.apache.org/jira/browse/LUCENE-2899
> Project: Lucene - Core
> Issue Type: New Feature
> Components: modules/analysis
> Reporter: Grant Ingersoll
> Assignee: Grant Ingersoll
> Priority: Minor
> Fix For: 5.0, 4.5
>
> Attachments: LUCENE-2899-current.patch, LUCENE-2899.patch, 
> LUCENE-2899.patch, LUCENE-2899.patch, LUCENE-2899.patch, LUCENE-2899.patch, 
> LUCENE-2899.patch, LUCENE-2899-RJN.patch, LUCENE-2899-x.patch, 
> LUCENE-2899-x.patch, LUCENE-2899-x.patch, OpenNLPFilter.java, 
> OpenNLPFilter.java, OpenNLPTokenizer.java, opennlp_trunk.patch

[jira] [Updated] (LUCENE-2899) Add OpenNLP Analysis capabilities as a module

2013-06-16 Thread Lance Norskog (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-2899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lance Norskog updated LUCENE-2899:
--

Attachment: LUCENE-2899-x.patch

Fixed the Chunker problem. I switched to the newly released version of the
OpenNLP packages. The MaxEnt implementation (statistical modeling) for chunking
changed slightly, and my test data now produces different noun and verb phrase
chunks for the sample text.

At this point the only problem I know of is that the licenses are slightly
wrong, so 'ant validate' fails.

These comments apply only to LUCENE-2899-x.patch, which applies to the current
4.x and trunk codelines. LUCENE-2899.patch applies to the 4.0 through 4.3
releases. It has not been upgraded to the new OpenNLP release.
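
For readers unfamiliar with chunker output, the sketch below shows how a small change in per-token tags changes the resulting phrases. It assumes the common B-/I-/O tagging scheme (B-NP, I-NP, B-VP, O). It does not call OpenNLP, the tags are hand-written and not taken from the patch's test data, and ChunkGroupingSketch and group are made-up names.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch; it does not call OpenNLP and the tags are hand-written. */
class ChunkGroupingSketch {

    /** Groups tokens into phrases such as "NP[The patch]" from B-/I-/O tags. */
    static List<String> group(String[] tokens, String[] tags) {
        if (tokens.length != tags.length) {
            throw new IllegalArgumentException("tokens and tags must align");
        }
        List<String> phrases = new ArrayList<>();
        String type = null;                  // type of the phrase being built, if any
        StringBuilder words = new StringBuilder();
        for (int i = 0; i < tokens.length; i++) {
            String tag = tags[i];
            boolean continues = tag.startsWith("I-") && tag.substring(2).equals(type);
            if (!continues) {
                // Any tag other than a matching I- tag closes the open phrase.
                if (type != null) {
                    phrases.add(type + "[" + words + "]");
                }
                type = null;
                words.setLength(0);
            }
            if (tag.startsWith("B-") || tag.startsWith("I-")) {
                if (type == null) {
                    type = tag.substring(2); // a stray I- tag starts a new phrase
                } else {
                    words.append(' ');
                }
                words.append(tokens[i]);
            }
        }
        if (type != null) {
            phrases.add(type + "[" + words + "]");
        }
        return phrases;
    }

    public static void main(String[] args) {
        String[] tokens = {"The", "patch", "fixes", "the", "chunker", "."};
        String[] before = {"B-NP", "I-NP", "B-VP", "B-NP", "I-NP", "O"};
        String[] after = {"B-NP", "I-NP", "B-VP", "B-NP", "B-NP", "O"};
        System.out.println(group(tokens, before));
        System.out.println(group(tokens, after));
    }
}
```

In the second tag sequence a single I-NP became B-NP, which splits one noun phrase into two; a test that compares expected chunks would need new expected values after a shift like that.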

> Add OpenNLP Analysis capabilities as a module
> -
>
> Key: LUCENE-2899
> URL: https://issues.apache.org/jira/browse/LUCENE-2899
> Project: Lucene - Core
> Issue Type: New Feature
> Components: modules/analysis
> Reporter: Grant Ingersoll
> Assignee: Grant Ingersoll
> Priority: Minor
> Fix For: 4.4
>
> Attachments: LUCENE-2899-current.patch, LUCENE-2899.patch, 
> LUCENE-2899.patch, LUCENE-2899.patch, LUCENE-2899.patch, LUCENE-2899.patch, 
> LUCENE-2899.patch, LUCENE-2899-RJN.patch, LUCENE-2899-x.patch, 
> LUCENE-2899-x.patch, LUCENE-2899-x.patch, OpenNLPFilter.java, 
> OpenNLPFilter.java, OpenNLPTokenizer.java, opennlp_trunk.patch

[jira] [Comment Edited] (LUCENE-2899) Add OpenNLP Analysis capabilities as a module

2013-06-10 Thread Lance Norskog (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-2899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13679293#comment-13679293
 ] 

Lance Norskog edited comment on LUCENE-2899 at 6/10/13 5:45 PM:


I did not make the right changes to OpenNLPFilter.java to handle the API 
changes. I have attached a fixed version of this to this issue. Please try it 
and see if it fixes what you see.

A-a-a-a-a-a-n-n-n-n-d chunking is broken. Oy.



  was (Author: lancenorskog):
I did not make the right changes to OpenNLPFilter.java to handle the API 
changes. I have attached a fixed version of this to this issue. Please try it 
and see if it fixes what you see.


  
> Add OpenNLP Analysis capabilities as a module
> -
>
> Key: LUCENE-2899
> URL: https://issues.apache.org/jira/browse/LUCENE-2899
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/analysis
>Reporter: Grant Ingersoll
>Assignee: Grant Ingersoll
>Priority: Minor
> Fix For: 4.4
>
> Attachments: LUCENE-2899-current.patch, LUCENE-2899.patch, 
> LUCENE-2899.patch, LUCENE-2899.patch, LUCENE-2899.patch, LUCENE-2899.patch, 
> LUCENE-2899.patch, LUCENE-2899-RJN.patch, LUCENE-2899-x.patch, 
> LUCENE-2899-x.patch, OpenNLPFilter.java, OpenNLPFilter.java, 
> OpenNLPTokenizer.java, opennlp_trunk.patch
>
>
> Now that OpenNLP is an ASF project and has a nice license, it would be nice 
> to have a submodule (under analysis) that exposed capabilities for it. Drew 
> Farris, Tom Morton and I have code that does:
> * Sentence Detection as a Tokenizer (could also be a TokenFilter, although it 
> would have to change slightly to buffer tokens)
> * NamedEntity recognition as a TokenFilter
> We are also planning a Tokenizer/TokenFilter that can put parts of speech as 
> either payloads (PartOfSpeechAttribute?) on a token or at the same position.
> I'd propose it go under:
> modules/analysis/opennlp


[jira] [Comment Edited] (LUCENE-2899) Add OpenNLP Analysis capabilities as a module

2013-06-10 Thread Lance Norskog (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-2899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13679293#comment-13679293
 ] 

Lance Norskog edited comment on LUCENE-2899 at 6/10/13 8:56 AM:


I did not make the right changes to OpenNLPFilter.java to handle the API 
changes. I have attached a fixed version of the file to this issue. Please try 
it and see if it fixes the problem you are seeing.



  was (Author: lancenorskog):
I did not make the right changes to OpenNLPFilter.java to handle the API 
changes. Substitute this OpenNLPFilter.java for your version and see if that 
fixes the problem for you.
  
> Add OpenNLP Analysis capabilities as a module
> -
>
> Key: LUCENE-2899
> URL: https://issues.apache.org/jira/browse/LUCENE-2899
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/analysis
>Reporter: Grant Ingersoll
>Assignee: Grant Ingersoll
>Priority: Minor
> Fix For: 4.4
>
> Attachments: LUCENE-2899-current.patch, LUCENE-2899.patch, 
> LUCENE-2899.patch, LUCENE-2899.patch, LUCENE-2899.patch, LUCENE-2899.patch, 
> LUCENE-2899.patch, LUCENE-2899-RJN.patch, LUCENE-2899-x.patch, 
> LUCENE-2899-x.patch, OpenNLPFilter.java, OpenNLPFilter.java, 
> OpenNLPTokenizer.java, opennlp_trunk.patch
>
>
> Now that OpenNLP is an ASF project and has a nice license, it would be nice 
> to have a submodule (under analysis) that exposed capabilities for it. Drew 
> Farris, Tom Morton and I have code that does:
> * Sentence Detection as a Tokenizer (could also be a TokenFilter, although it 
> would have to change slightly to buffer tokens)
> * NamedEntity recognition as a TokenFilter
> We are also planning a Tokenizer/TokenFilter that can put parts of speech as 
> either payloads (PartOfSpeechAttribute?) on a token or at the same position.
> I'd propose it go under:
> modules/analysis/opennlp


[jira] [Updated] (LUCENE-2899) Add OpenNLP Analysis capabilities as a module

2013-06-09 Thread Lance Norskog (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-2899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lance Norskog updated LUCENE-2899:
--

Attachment: OpenNLPFilter.java

I did not make the right changes to OpenNLPFilter.java to handle the API 
changes. Substitute this OpenNLPFilter.java for your version and see if that 
fixes the problem for you.

> Add OpenNLP Analysis capabilities as a module
> -
>
> Key: LUCENE-2899
> URL: https://issues.apache.org/jira/browse/LUCENE-2899
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/analysis
>Reporter: Grant Ingersoll
>Assignee: Grant Ingersoll
>Priority: Minor
> Fix For: 4.4
>
> Attachments: LUCENE-2899-current.patch, LUCENE-2899.patch, 
> LUCENE-2899.patch, LUCENE-2899.patch, LUCENE-2899.patch, LUCENE-2899.patch, 
> LUCENE-2899.patch, LUCENE-2899-RJN.patch, LUCENE-2899-x.patch, 
> LUCENE-2899-x.patch, OpenNLPFilter.java, OpenNLPFilter.java, 
> OpenNLPTokenizer.java, opennlp_trunk.patch
>
>
> Now that OpenNLP is an ASF project and has a nice license, it would be nice 
> to have a submodule (under analysis) that exposed capabilities for it. Drew 
> Farris, Tom Morton and I have code that does:
> * Sentence Detection as a Tokenizer (could also be a TokenFilter, although it 
> would have to change slightly to buffer tokens)
> * NamedEntity recognition as a TokenFilter
> We are also planning a Tokenizer/TokenFilter that can put parts of speech as 
> either payloads (PartOfSpeechAttribute?) on a token or at the same position.
> I'd propose it go under:
> modules/analysis/opennlp


[jira] [Commented] (LUCENE-2899) Add OpenNLP Analysis capabilities as a module

2013-06-06 Thread Lance Norskog (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-2899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13677336#comment-13677336
 ] 

Lance Norskog commented on LUCENE-2899:
---

Yup- upgrading to 1.5.3 is next on the list.

> Add OpenNLP Analysis capabilities as a module
> -
>
> Key: LUCENE-2899
> URL: https://issues.apache.org/jira/browse/LUCENE-2899
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/analysis
>Reporter: Grant Ingersoll
>Assignee: Grant Ingersoll
>Priority: Minor
> Fix For: 4.4
>
> Attachments: LUCENE-2899-current.patch, LUCENE-2899.patch, 
> LUCENE-2899.patch, LUCENE-2899.patch, LUCENE-2899.patch, LUCENE-2899.patch, 
> LUCENE-2899.patch, LUCENE-2899-RJN.patch, LUCENE-2899-x.patch, 
> LUCENE-2899-x.patch, OpenNLPFilter.java, OpenNLPTokenizer.java, 
> opennlp_trunk.patch
>
>
> Now that OpenNLP is an ASF project and has a nice license, it would be nice 
> to have a submodule (under analysis) that exposed capabilities for it. Drew 
> Farris, Tom Morton and I have code that does:
> * Sentence Detection as a Tokenizer (could also be a TokenFilter, although it 
> would have to change slightly to buffer tokens)
> * NamedEntity recognition as a TokenFilter
> We are also planning a Tokenizer/TokenFilter that can put parts of speech as 
> either payloads (PartOfSpeechAttribute?) on a token or at the same position.
> I'd propose it go under:
> modules/analysis/opennlp


[jira] [Updated] (LUCENE-2899) Add OpenNLP Analysis capabilities as a module

2013-06-05 Thread Lance Norskog (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-2899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lance Norskog updated LUCENE-2899:
--

Attachment: LUCENE-2899-x.patch

> Add OpenNLP Analysis capabilities as a module
> -
>
> Key: LUCENE-2899
> URL: https://issues.apache.org/jira/browse/LUCENE-2899
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/analysis
>Reporter: Grant Ingersoll
>Assignee: Grant Ingersoll
>Priority: Minor
> Fix For: 4.4
>
> Attachments: LUCENE-2899-current.patch, LUCENE-2899.patch, 
> LUCENE-2899.patch, LUCENE-2899.patch, LUCENE-2899.patch, LUCENE-2899.patch, 
> LUCENE-2899.patch, LUCENE-2899-RJN.patch, LUCENE-2899-x.patch, 
> LUCENE-2899-x.patch, OpenNLPFilter.java, OpenNLPTokenizer.java, 
> opennlp_trunk.patch
>
>
> Now that OpenNLP is an ASF project and has a nice license, it would be nice 
> to have a submodule (under analysis) that exposed capabilities for it. Drew 
> Farris, Tom Morton and I have code that does:
> * Sentence Detection as a Tokenizer (could also be a TokenFilter, although it 
> would have to change slightly to buffer tokens)
> * NamedEntity recognition as a TokenFilter
> We are also planning a Tokenizer/TokenFilter that can put parts of speech as 
> either payloads (PartOfSpeechAttribute?) on a token or at the same position.
> I'd propose it go under:
> modules/analysis/opennlp


[jira] [Commented] (LUCENE-2899) Add OpenNLP Analysis capabilities as a module

2013-06-05 Thread Lance Norskog (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-2899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13676698#comment-13676698
 ] 

Lance Norskog commented on LUCENE-2899:
---

I found the problem with multiple documents. The API for reusing Tokenizers 
changed to something more sensible, but I only noticed and implemented part of 
the change. The result was that when you uploaded multiple documents, it just 
re-processed the first document.

File LUCENE-2899-x.patch has this fix. It applies against the 4.x branch and 
the trunk. It does not apply against Lucene 4.0, 4.1, 4.2 or 4.3. For all 
released Solr versions you want LUCENE-2899.patch from August 27, 2012. There 
are no new features since that release.
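
To illustrate the kind of bug described above, here is a minimal sketch in 
plain Java. It is not Lucene's API and not the code from the patch: 
ReusableSplitter, setReader(), reset() and next() are hypothetical names that 
only model one tokenizer object being reused across documents. The buggy 
variant reads its input once and keeps that state, so every later document 
replays the first one.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a reusable tokenizer; NOT Lucene's API.
class ReusableSplitter {
    private final boolean refillOnReset; // false models the half-implemented change
    private Reader input;
    private List<String> tokens;         // per-document state
    private int pos;

    ReusableSplitter(boolean refillOnReset) {
        this.refillOnReset = refillOnReset;
    }

    // Called once per document with that document's text.
    void setReader(Reader input) {
        this.input = input;
    }

    // Called before consuming each document.
    void reset() throws IOException {
        pos = 0;
        if (tokens == null || refillOnReset) {
            tokens = readTokens(input);
        }
        // Buggy variant: the tokens of the first document are kept forever.
    }

    // Returns the next token, or null when the document is exhausted.
    String next() {
        return pos < tokens.size() ? tokens.get(pos++) : null;
    }

    private static List<String> readTokens(Reader r) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[64];
        int n;
        while ((n = r.read(buf)) != -1) {
            sb.append(buf, 0, n);
        }
        String text = sb.toString().trim();
        return text.isEmpty() ? new ArrayList<>() : Arrays.asList(text.split("\\s+"));
    }

    // Convenience: run one document through the splitter and collect its tokens.
    static List<String> analyze(ReusableSplitter s, String doc) throws IOException {
        s.setReader(new StringReader(doc));
        s.reset();
        List<String> out = new ArrayList<>();
        for (String t = s.next(); t != null; t = s.next()) {
            out.add(t);
        }
        return out;
    }
}
```

With new ReusableSplitter(false), analyzing a second document returns the 
tokens of the first one, which matches the symptom. With 
new ReusableSplitter(true), each document is read afresh.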

> Add OpenNLP Analysis capabilities as a module
> -
>
> Key: LUCENE-2899
> URL: https://issues.apache.org/jira/browse/LUCENE-2899
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/analysis
>Reporter: Grant Ingersoll
>Assignee: Grant Ingersoll
>Priority: Minor
> Fix For: 4.4
>
> Attachments: LUCENE-2899-current.patch, LUCENE-2899.patch, 
> LUCENE-2899.patch, LUCENE-2899.patch, LUCENE-2899.patch, LUCENE-2899.patch, 
> LUCENE-2899.patch, LUCENE-2899-RJN.patch, LUCENE-2899-x.patch, 
> OpenNLPFilter.java, OpenNLPTokenizer.java, opennlp_trunk.patch
>
>
> Now that OpenNLP is an ASF project and has a nice license, it would be nice 
> to have a submodule (under analysis) that exposed capabilities for it. Drew 
> Farris, Tom Morton and I have code that does:
> * Sentence Detection as a Tokenizer (could also be a TokenFilter, although it 
> would have to change slightly to buffer tokens)
> * NamedEntity recognition as a TokenFilter
> We are also planning a Tokenizer/TokenFilter that can put parts of speech as 
> either payloads (PartOfSpeechAttribute?) on a token or at the same position.
> I'd propose it go under:
> modules/analysis/opennlp


[jira] [Updated] (LUCENE-2899) Add OpenNLP Analysis capabilities as a module

2013-06-03 Thread Lance Norskog (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-2899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lance Norskog updated LUCENE-2899:
--

Attachment: LUCENE-2899-x.patch

> Add OpenNLP Analysis capabilities as a module
> -
>
> Key: LUCENE-2899
> URL: https://issues.apache.org/jira/browse/LUCENE-2899
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/analysis
>Reporter: Grant Ingersoll
>Assignee: Grant Ingersoll
>Priority: Minor
> Fix For: 4.4
>
> Attachments: LUCENE-2899-current.patch, LUCENE-2899.patch, 
> LUCENE-2899.patch, LUCENE-2899.patch, LUCENE-2899.patch, LUCENE-2899.patch, 
> LUCENE-2899.patch, LUCENE-2899-RJN.patch, LUCENE-2899-x.patch, 
> OpenNLPFilter.java, OpenNLPTokenizer.java, opennlp_trunk.patch
>
>
> Now that OpenNLP is an ASF project and has a nice license, it would be nice 
> to have a submodule (under analysis) that exposed capabilities for it. Drew 
> Farris, Tom Morton and I have code that does:
> * Sentence Detection as a Tokenizer (could also be a TokenFilter, although it 
> would have to change slightly to buffer tokens)
> * NamedEntity recognition as a TokenFilter
> We are also planning a Tokenizer/TokenFilter that can put parts of speech as 
> either payloads (PartOfSpeechAttribute?) on a token or at the same position.
> I'd propose it go under:
> modules/analysis/opennlp


Securing Lucene indexes

2013-05-26 Thread Lance Norskog

I would like to store Lucene indexes in an encrypted format. The only
security requirement is that if an intruder copies files from the file
system, no file will have raw data. It is acceptable for raw data to be
visible in raw disk scans. All I want to do is encrypt the readable index
files.

Here is one way to encrypt Lucene indexes: encrypt the entire file on disk
and keep the decrypted version in memory. This is OK with a RAMDirectory,
but does not scale. Using a little-known feature of POSIX, it is possible
to create a memory-mapped file holding a raw copy of the data which cannot
be found through the file system. The POSIX feature is that when you open a
file and then delete it, the file's data still exists on disk but no longer
has a name in any directory. The data lives on as an invisible file, and
the space is reclaimed when you close the file descriptor. (This does not
work on Windows.) Let's call this a 'ghost file'.
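
A minimal sketch of the ghost-file idea, assuming a POSIX file system and
the standard java.nio classes. The class and method names (GhostFile, map)
are made up for illustration and this is not a Lucene Directory. It creates
a temporary file, removes its name while the channel is still open, and
returns a read/write mapping.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Sketch only; class and method names are hypothetical.
class GhostFile {
    // Creates a file in dir, unlinks it, and returns a read/write mapping of size bytes.
    static MappedByteBuffer map(Path dir, int size) throws IOException {
        Path path = Files.createTempFile(dir, "ghost", ".tmp");
        try (FileChannel channel = FileChannel.open(path,
                StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            // On POSIX this removes the directory entry only; the data stays
            // allocated while the descriptor or the mapping is still open.
            Files.delete(path);
            // Mapping read/write past the current end of the file extends it.
            return channel.map(FileChannel.MapMode.READ_WRITE, 0, size);
        }
        // The mapping stays valid after the channel is closed.
    }
}
```

Note that the file is briefly visible (and empty) between creation and
deletion, and that the JVM gives no direct way to unmap the buffer: the
mapping is released only when the buffer is garbage collected.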

If memory-mapping works with ghost files, this seems like it should work: a
new Directory class will create a file and immediately delete it, then
memory-map it. The memory-mapped file will stay allocated inside the JVM
until the JVM closes the associated Directory object. The Directory class
would create an entire 'ghost Lucene index'.

This sequence opens an index:
* open encrypted segment file in memory-mapped format
* create ghost memory-mapped file
* decrypt from encrypted memory into ghost file memory
* close the encrypted index file
Directory.close() wipes the ghost file data and closes the ghost file, and
the file system reclaims the disk space.

This sequence creates an index:
* Directory.createOutput makes a ghost file and a real file.
* All data is written to the ghost file.
* Calling close() on the file encrypts the ghost file data into the real
file and wipes the ghost data.
* Both files are then closed.
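
The encrypt-on-close and decrypt-on-open steps in the two sequences above
could look roughly like this. It is a sketch under assumptions: a plain
ByteBuffer stands in for the ghost mapping, the cipher is AES-GCM from
javax.crypto (my choice for the example; nothing above names a cipher), and
GhostCrypto, sealAndWipe and openInto are hypothetical names. Key management
is out of scope.

```java
import java.nio.ByteBuffer;
import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import java.util.Arrays;
import javax.crypto.Cipher;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;

// Sketch only; names are hypothetical.
class GhostCrypto {
    private static final int IV_BYTES = 12;
    private static final int TAG_BITS = 128;

    // "close": encrypt the whole ghost buffer, then overwrite the plaintext.
    // Returns IV followed by ciphertext, i.e. what would go into the real file.
    static byte[] sealAndWipe(SecretKey key, ByteBuffer ghost) throws GeneralSecurityException {
        byte[] iv = new byte[IV_BYTES];
        new SecureRandom().nextBytes(iv);
        Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
        cipher.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(TAG_BITS, iv));

        // Copy the plaintext out through a duplicate so ghost's position is untouched.
        byte[] plain = new byte[ghost.capacity()];
        ByteBuffer view = ghost.duplicate();
        view.clear();
        view.get(plain);
        byte[] sealed = cipher.doFinal(plain);

        // Wipe both the temporary heap copy and the ghost data.
        Arrays.fill(plain, (byte) 0);
        for (int i = 0; i < ghost.capacity(); i++) {
            ghost.put(i, (byte) 0);
        }

        byte[] stored = new byte[IV_BYTES + sealed.length];
        System.arraycopy(iv, 0, stored, 0, IV_BYTES);
        System.arraycopy(sealed, 0, stored, IV_BYTES, sealed.length);
        return stored;
    }

    // "open": decrypt the stored bytes into the ghost buffer.
    // ghost must be at least as large as the decrypted data.
    static void openInto(SecretKey key, byte[] stored, ByteBuffer ghost) throws GeneralSecurityException {
        Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
        cipher.init(Cipher.DECRYPT_MODE, key, new GCMParameterSpec(TAG_BITS, stored, 0, IV_BYTES));
        byte[] plain = cipher.doFinal(stored, IV_BYTES, stored.length - IV_BYTES);
        ByteBuffer view = ghost.duplicate();
        view.clear();
        view.put(plain);
        Arrays.fill(plain, (byte) 0); // wipe the temporary heap copy
    }
}
```

This version passes the plaintext through a temporary heap array, which is
wiped afterwards; a real implementation would probably want to avoid that
copy altogether.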

One glaring flaw is: what if close() is not called? The raw data will still
exist in the free disk space.
There are two cases where this would happen:
1) the user fails to call close() but the program finishes normally. This
can be countered by adding a finalize() method that makes sure to clear the
memory.
2) the JVM fails and shutdown code is not run. The freed ghost data is on
the hard disk in the free disk space. It can only be found by scanning the
raw disks. One counter to this is to run the app in a virtual machine which
does not have access to the raw disk drivers.

Is this a workable design? Are there any quirks of the Directory
abstraction that make this impossible or pointless? Or quirks in
memory-mapped files or how the JVM implements them?

Thanks for your time,

Lance Norskog






-- 
Lance Norskog
goks...@gmail.com

[jira] [Updated] (LUCENE-2899) Add OpenNLP Analysis capabilities as a module

2013-05-19 Thread Lance Norskog (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-2899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lance Norskog updated LUCENE-2899:
--

Attachment: LUCENE-2899-current.patch

> Add OpenNLP Analysis capabilities as a module
> -
>
> Key: LUCENE-2899
> URL: https://issues.apache.org/jira/browse/LUCENE-2899
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/analysis
>Reporter: Grant Ingersoll
>Assignee: Grant Ingersoll
>Priority: Minor
> Fix For: 4.4
>
> Attachments: LUCENE-2899-current.patch, LUCENE-2899.patch, 
> LUCENE-2899.patch, LUCENE-2899.patch, LUCENE-2899.patch, LUCENE-2899.patch, 
> LUCENE-2899.patch, LUCENE-2899-RJN.patch, OpenNLPFilter.java, 
> OpenNLPTokenizer.java, opennlp_trunk.patch
>
>
> Now that OpenNLP is an ASF project and has a nice license, it would be nice 
> to have a submodule (under analysis) that exposed capabilities for it. Drew 
> Farris, Tom Morton and I have code that does:
> * Sentence Detection as a Tokenizer (could also be a TokenFilter, although it 
> would have to change slightly to buffer tokens)
> * NamedEntity recognition as a TokenFilter
> We are also planning a Tokenizer/TokenFilter that can put parts of speech as 
> either payloads (PartOfSpeechAttribute?) on a token or at the same position.
> I'd propose it go under:
> modules/analysis/opennlp


[jira] [Commented] (LUCENE-2899) Add OpenNLP Analysis capabilities as a module

2013-05-19 Thread Lance Norskog (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-2899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13661758#comment-13661758
 ] 

Lance Norskog commented on LUCENE-2899:
---

I'm updating the patches for 4.x and trunk. Kai's fix works. The unit tests did 
not attempt to analyse text that is longer than the fixed size temp buffer, and 
thus the code for copying successive buffers was never exercised. Kai's fix 
handles this problem. I've added a unit test. 
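
For readers who have not hit this class of bug: a single read into a 
fixed-size buffer only returns the first chunk of the input, so the read has 
to be repeated until end of input. The sketch below is plain Java with a 
hypothetical helper name (ReaderDrain.readAll); it is not the code from the 
patch or from Kai's fix.

```java
import java.io.IOException;
import java.io.Reader;

// Hypothetical helper; not the code from the patch.
class ReaderDrain {
    // Reads the whole input through a fixed-size buffer, one chunk at a time.
    static String readAll(Reader input, int bufferSize) throws IOException {
        char[] buffer = new char[bufferSize];
        StringBuilder text = new StringBuilder();
        int n;
        // read() may return fewer chars than requested; -1 means end of input.
        while ((n = input.read(buffer, 0, buffer.length)) != -1) {
            text.append(buffer, 0, n);
        }
        return text.toString();
    }
}
```

A test for this needs an input longer than bufferSize, which is exactly the 
case the original unit tests never exercised.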

Em: the Lucene Tokenizer lifecycle is that the Tokenizer is created with a 
Reader, and each call to incrementToken() walks the input. When 
incrementToken() returns false, that is all: the Tokenizer is finished. 
TokenStream can support a 'stateful' token stream: with OpenNLPFilter, you call 
incrementToken() until it returns false, and then you can call reset() and it 
will start over from the beginning. The unit tests include a check that reset() 
works. The changes you made support a feature that is not supported by Lucene. 
Also, the changes break most of the unit tests. Please create a unit test that 
shows the bug, and fix the existing unit tests. No unit test = no bug report.

I'm posting a patch for the current 4.x and trunk. It includes some changes for 
TokenStream/TokenFilter method signatures, some refactoring in the unit tests, 
a little tightening in the Tokenizer & Filter, and Kai's fix. There are unit 
tests for the problem Kai found, and also a test that has TokenizerFactory 
create multiple Tokenizer streams. If there is a bug in this patch, please 
write a unit test which demonstrates it.

The patch is called LUCENE-2899-current.patch. It is tested against the current 
4.x branch and the current trunk.

Thanks for your interest and hard work- I know it is really tedious to 
understand this code :)

Lance Norskog


> Add OpenNLP Analysis capabilities as a module
> -
>
> Key: LUCENE-2899
> URL: https://issues.apache.org/jira/browse/LUCENE-2899
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/analysis
>Reporter: Grant Ingersoll
>Assignee: Grant Ingersoll
>Priority: Minor
> Fix For: 4.4
>
> Attachments: LUCENE-2899-current.patch, LUCENE-2899.patch, 
> LUCENE-2899.patch, LUCENE-2899.patch, LUCENE-2899.patch, LUCENE-2899.patch, 
> LUCENE-2899.patch, LUCENE-2899-RJN.patch, OpenNLPFilter.java, 
> OpenNLPTokenizer.java, opennlp_trunk.patch
>
>
> Now that OpenNLP is an ASF project and has a nice license, it would be nice 
> to have a submodule (under analysis) that exposed capabilities for it. Drew 
> Farris, Tom Morton and I have code that does:
> * Sentence Detection as a Tokenizer (could also be a TokenFilter, although it 
> would have to change slightly to buffer tokens)
> * NamedEntity recognition as a TokenFilter
> We are also planning a Tokenizer/TokenFilter that can put parts of speech as 
> either payloads (PartOfSpeechAttribute?) on a token or at the same position.
> I'd propose it go under:
> modules/analysis/opennlp


[jira] [Commented] (LUCENE-2899) Add OpenNLP Analysis capabilities as a module

2013-04-25 Thread Lance Norskog (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-2899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13641968#comment-13641968
 ] 

Lance Norskog commented on LUCENE-2899:
---

Maciej- This is a good point. This package needs changes in a lot of places and 
it might be easier to package it the way you say. 

Zack- The "churn" in the APIs is a major problem in the Lucene code management. 
The original patch worked in the 4.x branch and trunk when it was posted. What 
Em fixed is in an area which is very very basic to Lucene. The API changed with 
no notice and no change in versions or method names. 

Everyone- It's great that this has gained some interest. Please create a new 
master patch with whatever changes are needed for the current code base.

Lucene grand masters- Please don't say "hey kids, write plugins, they're cool!" 
and then make subtle incompatible changes in APIs. 

> Add OpenNLP Analysis capabilities as a module
> -
>
> Key: LUCENE-2899
> URL: https://issues.apache.org/jira/browse/LUCENE-2899
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/analysis
>Reporter: Grant Ingersoll
>Assignee: Grant Ingersoll
>Priority: Minor
> Fix For: 4.3
>
> Attachments: LUCENE-2899.patch, LUCENE-2899.patch, LUCENE-2899.patch, 
> LUCENE-2899.patch, LUCENE-2899.patch, LUCENE-2899.patch, 
> LUCENE-2899-RJN.patch, OpenNLPFilter.java, OpenNLPTokenizer.java, 
> opennlp_trunk.patch
>
>
> Now that OpenNLP is an ASF project and has a nice license, it would be nice 
> to have a submodule (under analysis) that exposed capabilities for it. Drew 
> Farris, Tom Morton and I have code that does:
> * Sentence Detection as a Tokenizer (could also be a TokenFilter, although it 
> would have to change slightly to buffer tokens)
> * NamedEntity recognition as a TokenFilter
> We are also planning a Tokenizer/TokenFilter that can put parts of speech as 
> either payloads (PartOfSpeechAttribute?) on a token or at the same position.
> I'd propose it go under:
> modules/analysis/opennlp


[jira] [Closed] (SOLR-1413) Add MockSolrServer to SolrJ client tests

2013-04-17 Thread Lance Norskog (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-1413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lance Norskog closed SOLR-1413.
---

   Resolution: Implemented
Fix Version/s: 3.3

The test infrastructure has had a huge upgrade since 3 years ago. This is no 
longer a valid thang.

> Add MockSolrServer to SolrJ client tests
> 
>
> Key: SOLR-1413
> URL: https://issues.apache.org/jira/browse/SOLR-1413
> Project: Solr
>  Issue Type: Test
>  Components: clients - java
> Environment: Any Solr distribution. Uses only the SolrJ client code, 
> nothing in the Solr core.
>    Reporter: Lance Norskog
>Priority: Minor
> Fix For: 3.3
>
> Attachments: SOLR-1413.patch, SOLR-1413.patch
>
>
> The SolrJ unit test suite has no "mock" solr server for HTTP access, and 
> there are no low-level tests of the Solrj HTTP wire protocols.
> This patch includes org.apache.solr.client.solrj.MockHTTPServer.java and 
> org.apache.solr.client.solrj.TestHTTP_XML_single.java. The mock server does 
> not parse its input and responds with pre-configured byte streams. The latter 
> does a few tests in the XML wire format. Most of the tests do one request and 
> set up success and failure responses.
> Unfortunately, there is a bug: I could not get 2 successive requests to work. 
> The mock server's TCP socket does not work when reading the second request.  
> If someone who knows the JDK socket classes could look at the mock server, I 
> would greatly appreciate it.
> The alternative is to steal a bunch of files from the apache commons 
> httpclient test suite. This is quite a sophisticated bunch of code:
> http://svn.apache.org/repos/asf/httpcomponents/oac.hc3x/trunk/src/test/org/apache/commons/httpclient/server/


[jira] [Commented] (LUCENE-2899) Add OpenNLP Analysis capabilities as a module

2013-01-31 Thread Lance Norskog (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-2899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13568150#comment-13568150
 ] 

Lance Norskog commented on LUCENE-2899:
---

Thank you. Have you tried this on the trunk? The Solr components did not work: 
they could not find the OpenNLP jars.


> Add OpenNLP Analysis capabilities as a module
> -
>
> Key: LUCENE-2899
> URL: https://issues.apache.org/jira/browse/LUCENE-2899
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/analysis
>Reporter: Grant Ingersoll
>Assignee: Grant Ingersoll
>Priority: Minor
> Fix For: 4.2, 5.0
>
> Attachments: LUCENE-2899.patch, LUCENE-2899.patch, LUCENE-2899.patch, 
> LUCENE-2899.patch, LUCENE-2899.patch, LUCENE-2899.patch, OpenNLPFilter.java, 
> OpenNLPTokenizer.java, opennlp_trunk.patch
>
>
> Now that OpenNLP is an ASF project and has a nice license, it would be nice 
> to have a submodule (under analysis) that exposed capabilities for it. Drew 
> Farris, Tom Morton and I have code that does:
> * Sentence Detection as a Tokenizer (could also be a TokenFilter, although it 
> would have to change slightly to buffer tokens)
> * NamedEntity recognition as a TokenFilter
> We are also planning a Tokenizer/TokenFilter that can put parts of speech as 
> either payloads (PartOfSpeechAttribute?) on a token or at the same position.
> I'd propose it go under:
> modules/analysis/opennlp


[jira] [Commented] (LUCENE-2899) Add OpenNLP Analysis capabilities as a module

2012-12-30 Thread Lance Norskog (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-2899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13541285#comment-13541285
 ] 

Lance Norskog commented on LUCENE-2899:
---

Wow, someone tried it! I apologize for not noticing your question.

bq. I'm able to get the posTagger working, yet I still have not found a way to 
incorporate either the Chunker or the NER Models into my Solr project.

The schema.xml file includes samples for all of the models:

{{/lusolr_4x_opennlp/solr/contrib/opennlp/src/test-files/opennlp/solr/collection1/conf/schema.xml}}

This is for the chunker. The chunker works from parts-of-speech tags, not the 
original words. The chunker needs a parts-of-speech model as well as a chunker 
model. This should throw an error if the parts-of-speech model is not there. I 
will fix this.

{code:xml}
 
{code}

Is the NER configuration still not working?


> Add OpenNLP Analysis capabilities as a module
> -
>
> Key: LUCENE-2899
> URL: https://issues.apache.org/jira/browse/LUCENE-2899
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/analysis
>Reporter: Grant Ingersoll
>Assignee: Grant Ingersoll
>Priority: Minor
> Fix For: 4.1
>
> Attachments: LUCENE-2899.patch, LUCENE-2899.patch, LUCENE-2899.patch, 
> LUCENE-2899.patch, LUCENE-2899.patch, LUCENE-2899.patch, OpenNLPFilter.java, 
> OpenNLPTokenizer.java, opennlp_trunk.patch
>
>
> Now that OpenNLP is an ASF project and has a nice license, it would be nice 
> to have a submodule (under analysis) that exposed capabilities for it. Drew 
> Farris, Tom Morton and I have code that does:
> * Sentence Detection as a Tokenizer (could also be a TokenFilter, although it 
> would have to change slightly to buffer tokens)
> * NamedEntity recognition as a TokenFilter
> We are also planning a Tokenizer/TokenFilter that can put parts of speech as 
> either payloads (PartOfSpeechAttribute?) on a token or at the same position.
> I'd propose it go under:
> modules/analysis/opennlp


[jira] [Commented] (SOLR-1972) Need additional query stats in admin interface - median, 95th and 99th percentile

2012-12-28 Thread Lance Norskog (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-1972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13540735#comment-13540735
 ] 

Lance Norskog commented on SOLR-1972:
-

bq. 2) Solr already includes mahout-math 0.3 as a dependency of carrot2. 
I did not mean to suggest using the Mahout libraries. I would just copy the 
class source code and change the weights. It has no other dependencies inside 
the Mahout project.

> Need additional query stats in admin interface - median, 95th and 99th 
> percentile
> -
>
> Key: SOLR-1972
> URL: https://issues.apache.org/jira/browse/SOLR-1972
> Project: Solr
>  Issue Type: Improvement
>  Components: web gui
>Affects Versions: 1.4
>Reporter: Shawn Heisey
>Assignee: Alan Woodward
>Priority: Minor
> Fix For: 4.2, 5.0
>
> Attachments: elyograg-1972-3.2.patch, elyograg-1972-3.2.patch, 
> elyograg-1972-trunk.patch, elyograg-1972-trunk.patch, leak-closeable.patch, 
> leak.patch, revert-SOLR-1972.patch, SOLR-1972-branch3x-url_pattern.patch, 
> SOLR-1972-branch4x.patch, SOLR-1972-branch4x.patch, SOLR-1972_metrics.patch, 
> SOLR-1972_metrics.patch, SOLR-1972_metrics.patch, SOLR-1972_metrics.patch, 
> SOLR-1972_metrics.patch, SOLR-1972_metrics.patch, SOLR-1972_metrics.patch, 
> SOLR-1972_metrics.patch, solr1972-metricsregistry-branch4x-failure.log, 
> SOLR-1972.patch, SOLR-1972.patch, SOLR-1972.patch, SOLR-1972.patch, 
> SOLR-1972-url_pattern.patch, stacktraces.tar.gz
>
>
> I would like to see more detailed query statistics from the admin GUI.  This 
> is what you can get now:
> requests : 809
> errors : 0
> timeouts : 0
> totalTime : 70053
> avgTimePerRequest : 86.59209
> avgRequestsPerSecond : 0.8148785 
> I'd like to see more data on the time per request - median, 95th percentile, 
> 99th percentile, and any other statistical function that makes sense to 
> include.  In my environment, the first bunch of queries after startup tend to 
> take several seconds each.  I find that the average value tends to be useless 
> until it has several thousand queries under its belt and the caches are 
> thoroughly warmed.  The statistical functions I have mentioned would quickly 
> eliminate the influence of those initial slow queries.
> The system will have to store individual data about each query.  I don't know 
> if this is something Solr does already.  It would be nice to have a 
> configurable count of how many of the most recent data points are kept, to 
> control the amount of memory the feature uses.  The default value could be 
> something like 1024 or 4096.
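
The bounded-window idea in the description above can be sketched in a few lines of Java. This is only an illustration under stated assumptions, not what any of the attached patches do: the class name `RecentRequestTimes`, its methods, and the nearest-rank percentile rule are all hypothetical choices. It keeps the most recent N request times in a ring buffer, so early slow queries drop out once N newer samples arrive.

```java
import java.util.Arrays;

/** Hypothetical sketch: keeps the most recent N request times and reports percentiles. */
final class RecentRequestTimes {
    private final long[] window; // ring buffer of the latest samples, in milliseconds
    private int next = 0;        // slot the next sample will overwrite
    private int count = 0;       // number of valid samples, at most window.length

    RecentRequestTimes(int capacity) {
        if (capacity <= 0) {
            throw new IllegalArgumentException("capacity must be > 0");
        }
        window = new long[capacity];
    }

    /** Records one request time, overwriting the oldest sample once the window is full. */
    synchronized void record(long elapsedMillis) {
        window[next] = elapsedMillis;
        next = (next + 1) % window.length;
        if (count < window.length) {
            count++;
        }
    }

    /** Nearest-rank percentile for p in (0, 100]; returns -1 when nothing was recorded. */
    synchronized long percentile(double p) {
        if (p <= 0 || p > 100) {
            throw new IllegalArgumentException("p must be in (0, 100]");
        }
        if (count == 0) {
            return -1;
        }
        // Slots 0..count-1 are the valid ones, both while filling and when full.
        long[] sorted = Arrays.copyOf(window, count);
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * count); // 1-based rank
        return sorted[Math.max(rank, 1) - 1];
    }

    public static void main(String[] args) {
        RecentRequestTimes times = new RecentRequestTimes(1024);
        for (long ms : new long[] {5000, 10, 20, 30, 40}) {
            times.record(ms);
        }
        System.out.println("median=" + times.percentile(50)
                + " p95=" + times.percentile(95)
                + " p99=" + times.percentile(99));
    }
}
```

Sorting a copy on every read costs O(N log N), which is probably acceptable for a window of 1024 or 4096 samples read from an admin page. A streaming estimator, such as the Mahout class suggested in the comments, trades exactness for constant memory.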


[jira] [Commented] (LUCENE-3413) CombiningFilter to recombine tokens into a single token for sorting

2012-12-27 Thread Lance Norskog (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-3413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13540206#comment-13540206
 ] 

Lance Norskog commented on LUCENE-3413:
---

For sorting, would you want 'grapes_of_wrath'? This distinguishes the word 
'grapes' from words that might start with 'grapes'. (I don't know of any, but 
you see the problem :)

Also, in this use case numerical canonicalization makes sense for searching and 
sorting. Twenty-two -> 22, and also 'twenty two' -> 22. Or maybe 'twenty two' 
-> 'twenty-two'.



> CombiningFilter to recombine tokens into a single token for sorting
> ---
>
> Key: LUCENE-3413
> URL: https://issues.apache.org/jira/browse/LUCENE-3413
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/analysis
>Affects Versions: 2.9.3
>Reporter: Chris A. Mattmann
>Priority: Minor
> Attachments: LUCENE-3413.Mattmann.090311.patch.txt, 
> LUCENE-3413.Mattmann.090511.patch.txt
>
>
> I whipped up this CombiningFilter for the following use case:
> I've got a bunch of titles of e.g., Books, such as:
> The Grapes of Wrath
> Tommy Tommerson saves the World
> Top of the World
> The Tales of Beedle the Bard
> Born Free
> etc.
> I want to sort these titles using a String field that includes stopword 
> analysis (e.g., to remove "The"), and synonym filtering (e.g., for grouping), 
> etc. I created an analysis chain in Solr for this that was based off of 
> *alphaOnlySort*, which looks like this:
> {code:xml}
>  omitNorms="true">
>
> 
> 
> 
> 
> 
> 
> 
>  pattern="([^a-z])" replacement="" replace="all"
> /> 
>
> 
> {code}
> The issue with alphaOnlySort is that it doesn't support stopword removal or 
> synonyms, because those operate on individual tokens rather than on the 
> full string produced by the KeywordTokenizer (which does not do 
> tokenization). I needed a filter that would allow me to change alphaOnlySort 
> and its analysis chain from using KeywordTokenizer to using 
> WhitespaceTokenizer, and then a way to recombine the tokens at the end. So, 
> take "The Grapes of Wrath". I needed a way for it to get turned into:
> {noformat}
> grapes of wrath
> {noformat}
> And then to combine those tokens into a single token:
> {noformat}
> grapesofwrath
> {noformat}
> The attached CombiningFilter takes care of that. It doesn't do it super 
> efficiently I'm guessing (since I used a StringBuffer), but I'm open to 
> suggestions on how to make it better. 
> One other thing is that apparently this analyzer works fine for analysis 
> (e.g., it produces the desired tokens), however, for sorting in Solr I'm 
> getting null sort tokens. Need to figure out why. 
> Here ya go!
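
The chain described above (split on whitespace, lowercase, strip non-letters, drop stopwords, then recombine) can be illustrated without Lucene at all. The following is a plain-Java sketch, not the attached CombiningFilter: the class `SortKeySketch`, the method `sortKey`, and the three-word stopword list are hypothetical. The `separator` parameter makes it easy to compare the concatenated form with the 'grapes_of_wrath' form suggested in the comment.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical sketch of the tokenize, filter, recombine idea; not the attached filter. */
final class SortKeySketch {
    // Illustrative stopword list only; a real chain would load its own.
    private static final Set<String> STOPWORDS =
            new HashSet<>(Arrays.asList("the", "a", "an"));

    /** Builds a single sort key from a title, joining surviving tokens with separator. */
    static String sortKey(String title, String separator) {
        List<String> kept = new ArrayList<>();
        for (String token : title.trim().split("\\s+")) {
            // Lowercase, then keep letters only (mirrors the ([^a-z]) pattern above).
            String t = token.toLowerCase(Locale.ROOT).replaceAll("[^a-z]", "");
            if (!t.isEmpty() && !STOPWORDS.contains(t)) {
                kept.add(t);
            }
        }
        return String.join(separator, kept);
    }

    public static void main(String[] args) {
        System.out.println(sortKey("The Grapes of Wrath", ""));
        System.out.println(sortKey("The Grapes of Wrath", "_"));
    }
}
```

The separator can change the sort order. With plain concatenation, "Grapes Ahoy" sorts before "Grape Zoo" because 's' sorts before 'z'. With an underscore separator, "Grape Zoo" comes first, because '_' sorts before every lowercase ASCII letter and so the shorter first word wins.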


[jira] [Commented] (SOLR-1972) Need additional query stats in admin interface - median, 95th and 99th percentile

2012-12-27 Thread Lance Norskog (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-1972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13540200#comment-13540200
 ] 

Lance Norskog commented on SOLR-1972:
-

The 25th/75th percentile values come from weights, and the weights can be 
changed to give the 95th/99th instead. I have a patch for that but never 
submitted it.



[jira] [Commented] (SOLR-1972) Need additional query stats in admin interface - median, 95th and 99th percentile

2012-12-26 Thread Lance Norskog (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-1972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13539779#comment-13539779
 ] 

Lance Norskog commented on SOLR-1972:
-

The OnlineSummarizer class in Mahout does the calculations you want. It is one 
small class you can copy. No dependencies necessary.

> Need additional query stats in admin interface - median, 95th and 99th 
> percentile
> -
>
> Key: SOLR-1972
> URL: https://issues.apache.org/jira/browse/SOLR-1972
> Project: Solr
>  Issue Type: Improvement
>  Components: web gui
>Affects Versions: 1.4
>Reporter: Shawn Heisey
>Assignee: Alan Woodward
>Priority: Minor
> Fix For: 4.1, 5.0
>
> Attachments: elyograg-1972-3.2.patch, elyograg-1972-3.2.patch, 
> elyograg-1972-trunk.patch, elyograg-1972-trunk.patch, leak-closeable.patch, 
> leak.patch, SOLR-1972-branch3x-url_pattern.patch, SOLR-1972-branch4x.patch, 
> SOLR-1972-branch4x.patch, SOLR-1972_metrics.patch, SOLR-1972_metrics.patch, 
> SOLR-1972_metrics.patch, SOLR-1972_metrics.patch, SOLR-1972_metrics.patch, 
> SOLR-1972_metrics.patch, SOLR-1972_metrics.patch, SOLR-1972_metrics.patch, 
> solr1972-metricsregistry-branch4x-failure.log, SOLR-1972.patch, 
> SOLR-1972.patch, SOLR-1972.patch, SOLR-1972.patch, 
> SOLR-1972-url_pattern.patch, stacktraces.tar.gz
>
>
> I would like to see more detailed query statistics from the admin GUI.  This 
> is what you can get now:
> requests : 809
> errors : 0
> timeouts : 0
> totalTime : 70053
> avgTimePerRequest : 86.59209
> avgRequestsPerSecond : 0.8148785 
> I'd like to see more data on the time per request - median, 95th percentile, 
> 99th percentile, and any other statistical function that makes sense to 
> include.  In my environment, the first bunch of queries after startup tend to 
> take several seconds each.  I find that the average value tends to be useless 
> until it has several thousand queries under its belt and the caches are 
> thoroughly warmed.  The statistical functions I have mentioned would quickly 
> eliminate the influence of those initial slow queries.
> The system will have to store individual data about each query.  I don't know 
> if this is something Solr does already.  It would be nice to have a 
> configurable count of how many of the most recent data points are kept, to 
> control the amount of memory the feature uses.  The default value could be 
> something like 1024 or 4096.


[jira] [Commented] (SOLR-4164) Result Grouping fails if no hits

2012-12-16 Thread Lance Norskog (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-4164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13533675#comment-13533675
 ] 

Lance Norskog commented on SOLR-4164:
-

I can't recreate it. It may have been another problem I was having: a shard 
server ran out of memory during the query and threw an exception to the 
distributor. Maybe the group query collection code ignores these remote 
exceptions?

> Result Grouping fails if no hits
> 
>
> Key: SOLR-4164
> URL: https://issues.apache.org/jira/browse/SOLR-4164
> Project: Solr
>  Issue Type: Bug
>  Components: SearchComponents - other, SolrCloud
>Affects Versions: 4.0
>Reporter: Lance Norskog
>
> In SolrCloud, found a result grouping bug in the 4.0 release.
> A distributed result grouping request under SolrCloud got this result:
> {noformat}
> Dec 10, 2012 10:32:07 PM org.apache.solr.common.SolrException log
> SEVERE: null:java.lang.IllegalArgumentException: numHits must be > 0; please 
> use TotalHitCountCollector if you just need the total hit count
> at 
> org.apache.lucene.search.TopFieldCollector.create(TopFieldCollector.java:1120)
> at 
> org.apache.lucene.search.TopFieldCollector.create(TopFieldCollector.java:1069)
> at 
> org.apache.lucene.search.grouping.AbstractSecondPassGroupingCollector.<init>(AbstractSecondPassGroupingCollector.java:75)
> at 
> org.apache.lucene.search.grouping.term.TermSecondPassGroupingCollector.<init>(TermSecondPassGroupingCollector.java:49)
> at 
> org.apache.solr.search.grouping.distributed.command.TopGroupsFieldCommand.create(TopGroupsFieldCommand.java:128)
> at 
> org.apache.solr.search.grouping.CommandHandler.execute(CommandHandler.java:132)
> at 
> org.apache.solr.handler.component.QueryComponent.process(QueryComponent.java:339)
> at 
> org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:206)
> {noformat}


[jira] [Created] (SOLR-4164) Result Grouping fails if no hits

2012-12-10 Thread Lance Norskog (JIRA)

Lance Norskog created SOLR-4164:
---

 Summary: Result Grouping fails if no hits
 Key: SOLR-4164
 URL: https://issues.apache.org/jira/browse/SOLR-4164
 Project: Solr
  Issue Type: Bug
  Components: SearchComponents - other, SolrCloud
Affects Versions: 4.0
Reporter: Lance Norskog


In SolrCloud, found a result grouping bug in the 4.0 release.
A distributed result grouping request under SolrCloud got this result:

{noformat}
Dec 10, 2012 10:32:07 PM org.apache.solr.common.SolrException log
SEVERE: null:java.lang.IllegalArgumentException: numHits must be > 0; please 
use TotalHitCountCollector if you just need the total hit count
at 
org.apache.lucene.search.TopFieldCollector.create(TopFieldCollector.java:1120)
at 
org.apache.lucene.search.TopFieldCollector.create(TopFieldCollector.java:1069)
at 
org.apache.lucene.search.grouping.AbstractSecondPassGroupingCollector.<init>(AbstractSecondPassGroupingCollector.java:75)
at 
org.apache.lucene.search.grouping.term.TermSecondPassGroupingCollector.<init>(TermSecondPassGroupingCollector.java:49)
at 
org.apache.solr.search.grouping.distributed.command.TopGroupsFieldCommand.create(TopGroupsFieldCommand.java:128)
at 
org.apache.solr.search.grouping.CommandHandler.execute(CommandHandler.java:132)
at 
org.apache.solr.handler.component.QueryComponent.process(QueryComponent.java:339)
at 
org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:206)
{noformat}


[jira] [Created] (SOLR-4150) NPE in distributed result grouping if group.query has no results

2012-12-05 Thread Lance Norskog (JIRA)

Lance Norskog created SOLR-4150:
---

 Summary: NPE in distributed result grouping if group.query has no 
results
 Key: SOLR-4150
 URL: https://issues.apache.org/jira/browse/SOLR-4150
 Project: Solr
  Issue Type: Bug
  Components: SolrCloud
Affects Versions: 4.0
Reporter: Lance Norskog


If group.query has no results in a distributed search, there is an NPE in the 
front-end:
{noformat}
Dec 5, 2012 10:40:31 PM org.apache.solr.core.SolrCore execute
INFO: [collection1] webapp=/solr path=/select 
params={debugQuery=true&group.ngroups=true&fl=thing,eventid&indent=true&q=thing:("CODE:20517")&group.field=eventid&group.query=thing:CODE*&group=true&wt=json&fq=source:somewhere}
 status=500 QTime=745 
Dec 5, 2012 10:40:31 PM org.apache.solr.common.SolrException log
SEVERE: null:java.lang.NullPointerException
at 
org.apache.solr.search.grouping.distributed.shardresultserializer.TopGroupsResultTransformer.transformToNative(TopGroupsResultTransformer.java:110)
at 
org.apache.solr.search.grouping.distributed.responseprocessor.TopGroupsShardResponseProcessor.process(TopGroupsShardResponseProcessor.java:80)
at 
org.apache.solr.handler.component.QueryComponent.handleGroupedResponses(QueryComponent.java:620)
at 
org.apache.solr.handler.component.QueryComponent.handleResponses(QueryComponent.java:603)
at 
org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:309)
at 
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1699)
at 
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:455)
at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:276)
at 
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1337)
at 
org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:484)
at 
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:119)
at 
org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:524)
at 
org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:233)
at 
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1065)
at 
org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:413)
at 
org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:192)
at 
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:999)
at 
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:117)
at 
org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:250)
at 
org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:149)
at 
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:111)
at org.eclipse.jetty.server.Server.handle(Server.java:351)
at 
org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:454)
at 
org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:47)
at 
org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:890)
at 
org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:944)
at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:634)
at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:230)
at 
org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:66)
at 
org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:254)
at 
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:599)
at 
org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:534)
at java.lang.Thread.run(Thread.java:662)
{noformat}

(This is in sharding, maybe not a SolrCloud problem.)


Re: Active 4.x branches?

2012-11-27 Thread Lance Norskog

To be fair, it is worth having other people look at your patches. 
Not that anybody looks at mine :(

- Original Message -
| From: "Mark Miller" 
| To: dev@lucene.apache.org
| Sent: Tuesday, November 27, 2012 4:55:21 PM
| Subject: Re: Active 4.x branches?
| 
| 
| On Nov 27, 2012, at 7:50 PM, Radim Kolar  wrote:
| 
| > why you do not have more committers to process patches quickly?
| 
| There are almost like 40 committers. Most with day jobs, families,
| hobbies, etc. Many, many, many that are not paid to commit your
| patches.
| 
| It's a marathon, not a race. Welcome to Open Source :)
| 
| - Mark
| -
| To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
| For additional commands, e-mail: dev-h...@lucene.apache.org
| 
| 


[jira] [Commented] (SOLR-4041) Allow segment merge monitoring in Solr Admin gui

2012-11-27 Thread Lance Norskog (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-4041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13505129#comment-13505129
 ] 

Lance Norskog commented on SOLR-4041:
-

Cool! I have done monitoring of segment sizes with fixed-time polling, and 
post-commit polling of the data/index directory. This makes it easier to chart 
other aspects of merging. Another useful number is the current number of 
segments.
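
Polling the index directory for per-segment sizes, as described above, could look roughly like this. It is a sketch under assumptions: the class and method names are made up, and it relies only on the Lucene file-naming convention that a segment's files start with an underscore followed by the segment name (for example `_0.cfs` or `_1_Lucene40_0.frq`), while `segments_N` and `write.lock` do not.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch: sums file sizes per segment by looking at file names only. */
final class SegmentSizePoller {

    /** Returns total bytes per segment name (such as "_0"), sorted by name. */
    static Map<String, Long> segmentSizes(Path indexDir) throws IOException {
        Map<String, Long> sizes = new TreeMap<>();
        try (DirectoryStream<Path> files = Files.newDirectoryStream(indexDir)) {
            for (Path file : files) {
                String name = file.getFileName().toString();
                // Skip segments_N, write.lock and anything that is not a plain file.
                if (!name.startsWith("_") || !Files.isRegularFile(file)) {
                    continue;
                }
                // The segment name ends at the next '_' or '.' after the leading '_'.
                int end = 1;
                while (end < name.length()
                        && name.charAt(end) != '_' && name.charAt(end) != '.') {
                    end++;
                }
                sizes.merge(name.substring(0, end), Files.size(file), Long::sum);
            }
        }
        return sizes;
    }

    public static void main(String[] args) throws IOException {
        // The default path is an assumption; pass the real data/index directory.
        Path indexDir = Paths.get(args.length > 0 ? args[0] : "data/index");
        Map<String, Long> sizes = segmentSizes(indexDir);
        System.out.println("segments=" + sizes.size() + " sizes=" + sizes);
    }
}
```

The size of the returned map is the current segment count, so one poll yields both numbers. A merge running during the poll can add or delete files mid-listing, so a caller should treat each reading as approximate.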

> Allow segment merge monitoring in Solr Admin gui
> 
>
> Key: SOLR-4041
> URL: https://issues.apache.org/jira/browse/SOLR-4041
> Project: Solr
>  Issue Type: Improvement
>  Components: web gui
>Reporter: Radim Kolar
>Assignee: Mark Miller
>Priority: Minor
>  Labels: patch
> Fix For: 4.1, 5.0
>
> Attachments: solr-monitormerge.txt
>
>
> add solrMbean for ConcurrentMergeScheduler


[jira] [Commented] (SOLR-1306) Support pluggable persistence/loading of solr.xml details

2012-11-13 Thread Lance Norskog (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-1306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13496844#comment-13496844
 ] 

Lance Norskog commented on SOLR-1306:
-

bq. I think we should drop the top level config (eg solr.xml). Instead, we 
should auto load folders 
+1 

There are often groups of cores with the same schema, for example shards in 
the same Solr instance. How would this dynamic discovery support groups of 
collections?



> Support pluggable persistence/loading of solr.xml details
> -
>
> Key: SOLR-1306
> URL: https://issues.apache.org/jira/browse/SOLR-1306
> Project: Solr
>  Issue Type: New Feature
>  Components: multicore
>Reporter: Noble Paul
>Assignee: Erick Erickson
> Fix For: 4.1
>
> Attachments: SOLR-1306.patch, SOLR-1306.patch, SOLR-1306.patch, 
> SOLR-1306.patch
>
>
> Persisting and loading details from one XML file is fine if the number of 
> cores is small and fixed. If there are tens of thousands of cores on a single 
> box, adding a new core (with persistent=true) becomes very expensive because 
> every core creation has to rewrite this huge XML file. 
> Moreover, there is a good chance that the file gets corrupted and all the 
> cores become unusable. In that case I would prefer it to be stored in a 
> centralized DB which is backed up/replicated, so that all the information is 
> available in a centralized location. 
> We may need to refactor CoreContainer to have a pluggable implementation 
> which can load/persist the details. The default implementation should 
> read from and write to solr.xml. The class should be pluggable as follows in 
> solr.xml:
> {code:xml}
> 
>   
> 
> {code}
> There will be a new interface (or abstract class ) called SolrDataProvider 
> which this class must implement


[jira] [Commented] (SOLR-1487) Add expungeDelete to SolrJ's SolrServer.commit(..)

2012-10-31 Thread Lance Norskog (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-1487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13488406#comment-13488406
 ] 

Lance Norskog commented on SOLR-1487:
-

a) They both require changes to the same path, so should probably be in one 
commit.
b) SOLR-3938 has a unit test, while this does not. It is really easy for this 
kind of feature to stop working. The SolrJ code paths for 
commit/rollback/prepareCommit etc. need unit tests. 



> Add  expungeDelete to SolrJ's SolrServer.commit(..)
> ---
>
> Key: SOLR-1487
> URL: https://issues.apache.org/jira/browse/SOLR-1487
> Project: Solr
>  Issue Type: Improvement
>  Components: clients - java
>Affects Versions: 1.3
> Environment: N/A
>Reporter: Jibo John
> Attachments: expunge-patch.txt
>
>
> Add  expungeDelete to SolrJ's SolrServer.commit(..).
> Currently, this can be done only through updatehandler (  ( curl update -F 
> stream.body=' ' )) 
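
For reference, the update-handler route mentioned above amounts to posting a small XML commit message. Here is a hedged Java sketch of that workaround using only the JDK: the class name, the helper methods, and the URL in `main` are assumptions, and the message body shown is the standard XML commit element with its `expungeDeletes` attribute rather than anything copied from the attached patch.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch: sends an XML commit message to a Solr update handler. */
final class ExpungeDeletesCommit {

    /** Builds the XML commit message, with or without expungeDeletes. */
    static String commitXml(boolean expungeDeletes) {
        return "<commit expungeDeletes=\"" + expungeDeletes + "\"/>";
    }

    /** POSTs the body to the update handler and returns the HTTP status code. */
    static int post(String updateUrl, String body) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(updateUrl).openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", "text/xml; charset=UTF-8");
        try (OutputStream out = conn.getOutputStream()) {
            out.write(body.getBytes(StandardCharsets.UTF_8));
        }
        return conn.getResponseCode();
    }

    public static void main(String[] args) throws IOException {
        // The URL is an assumption for a local example install; adjust as needed.
        String url = args.length > 0 ? args[0] : "http://localhost:8983/solr/update";
        System.out.println("HTTP " + post(url, commitXml(true)));
    }
}
```

A SolrJ-level option, which is what this issue asks for, would spare callers from building the message by hand.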


[jira] [Commented] (SOLR-4007) Morfologik dictionaries not available in Solr field type

2012-10-31 Thread Lance Norskog (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-4007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13487627#comment-13487627
 ] 

Lance Norskog commented on SOLR-4007:
-

What is the change? I would like to change my OpenNLP patch to work in the same 
directory/jar structure.

> Morfologik dictionaries not available in Solr field type
> 
>
> Key: SOLR-4007
> URL: https://issues.apache.org/jira/browse/SOLR-4007
> Project: Solr
>  Issue Type: Bug
>  Components: Schema and Analysis
>Affects Versions: 4.1
>Reporter: Lance Norskog
>Assignee: Dawid Weiss
>Priority: Minor
> Fix For: 4.1
>
>
> The Polish Morfologik type does not find its dictionaries when used in Solr. 
> To demonstrate:
> 1) Add this to example/solr/collection1/conf/schema.xml:
> {noformat}
> 
>  positionIncrementGap="100">
>   
> 
>  />
>   
> 
> {noformat}
> 2) Add this to example/solr/collection1/conf/solrconfig.xml:
> {noformat}
>   
>   
>   
> {noformat}
> 3) Test 'text_pl' in the analysis page. You will get an exception.
> {noformat}
> Oct 28, 2012 8:27:19 PM org.apache.solr.core.SolrCore execute
> INFO: [collection1] webapp=/solr path=/analysis/field 
> params={analysis.showmatch=true&analysis.query=&wt=json&analysis.fieldvalue=blah+blah&analysis.fieldtype=text_pl}
>  status=500 QTime=26 
> Oct 28, 2012 8:27:19 PM org.apache.solr.common.SolrException log
> SEVERE: null:java.lang.RuntimeException: Default dictionary resource for 
> language 'plnot found.
>   at morfologik.stemming.Dictionary.getForLanguage(Dictionary.java:163)
>   at morfologik.stemming.PolishStemmer.<init>(PolishStemmer.java:64)
>   at 
> org.apache.lucene.analysis.morfologik.MorfologikFilter.<init>(MorfologikFilter.java:70)
>   at 
> org.apache.lucene.analysis.morfologik.MorfologikFilterFactory.create(MorfologikFilterFactory.java:63)
>   at 
> org.apache.solr.handler.AnalysisRequestHandlerBase.analyzeValue(AnalysisRequestHandlerBase.java:125)
>   at 
> org.apache.solr.handler.FieldAnalysisRequestHandler.analyzeValues(FieldAnalysisRequestHandler.java:220)
>   at 
> org.apache.solr.handler.FieldAnalysisRequestHandler.handleAnalysisRequest(FieldAnalysisRequestHandler.java:181)
>   at 
> org.apache.solr.handler.FieldAnalysisRequestHandler.doAnalysis(FieldAnalysisRequestHandler.java:100)
>   at 
> [...]
> Caused by: java.io.IOException: Could not locate resource: 
> morfologik/dictionaries/pl.dict
>   at morfologik.util.ResourceUtils.openInputStream(ResourceUtils.java:56)
>   at morfologik.stemming.Dictionary.getForLanguage(Dictionary.java:156)
>   ... 38 more
> {noformat}
> {{morfologik-polish-1.5.3.jar}} has {{morfologik/dictionaries/pl.dict}}.
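
One way to narrow down a "resource not found" failure like the one above is to check which classloader can actually see the dictionary. The report does not say that classloader visibility is the cause, so treat that as an assumption. The class and method names below are hypothetical, while the resource path is the one from the stack trace.

```java
/** Hypothetical diagnostic: reports whether a classloader can see a resource. */
final class ResourceVisibility {

    /** True if the given loader (or one of its parents) can locate the resource. */
    static boolean isVisible(ClassLoader loader, String resourcePath) {
        return loader != null && loader.getResource(resourcePath) != null;
    }

    public static void main(String[] args) {
        String path = "morfologik/dictionaries/pl.dict";
        ClassLoader context = Thread.currentThread().getContextClassLoader();
        ClassLoader system = ClassLoader.getSystemClassLoader();
        System.out.println("context loader sees it: " + isVisible(context, path));
        System.out.println("system loader sees it:  " + isVisible(system, path));
    }
}
```

If the jar is loaded by a plugin classloader but the lookup goes through a different one, a check like this would return false even though the jar contains the file.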


Problem with package resolution - 'ant resolve' hangs

2012-10-31 Thread Lance Norskog

With a full trunk checkout and no ~/.ivy2 repository, 'ant -d resolve' hangs. 
Here are the final 85 lines of debug output, after all of the ant startup 
logging. It hangs after: 
[ivy:retrieve] don't use cache for 
com.carrotsearch.randomizedtesting#junit4-ant;2.0.4: checkModified=true 

Any ideas? 

resolve: 
Setting project property: ivy.version -> 2.2.0 
[ivy:retrieve] parameter not found: ivy.organisation 
[ivy:retrieve] parameter not found: ivy.module 
[ivy:retrieve] parameter not found: ivy.resolved.file 
[ivy:retrieve] using standard ensure resolved 
[ivy:retrieve] parameter found as attribute value: 
ivy.resolved.configurations=* 
[ivy:retrieve] calculating configurations to resolve 
[ivy:retrieve] module not yet resolved, all confs still need to be resolved 
[ivy:retrieve] no resolved descriptor found: launching default resolve 
Overriding previous definition of property "ivy.version" 
Setting project property: ivy.version -> 2.2.0 
[ivy:retrieve] parameter found as attribute value: ivy.configurations=* 
[ivy:retrieve] parameter found as ivy variable: 
ivy.resolve.default.type.filter=* 
[ivy:retrieve] parameter found as ivy variable: ivy.dep.file=ivy.xml 
[ivy:retrieve] using ivy parser to parse 
file:/Users/lancenorskog/Documents/open/solr/trunk/lucene/test-framework/ivy.xml
 
[ivy:retrieve] post 1.3 ivy file: using exact as default matcher 
[ivy:retrieve] :: resolving dependencies :: 
org.apache.lucene#core-test-framework;working@Lance-Norskogs-MacBook-Pro.local 
[ivy:retrieve] confs: [default, junit4-stdalone] 
[ivy:retrieve] validate = true 
[ivy:retrieve] refresh = false 
[ivy:retrieve] resolving dependencies for configuration 'default' 
[ivy:retrieve] == resolving dependencies for 
org.apache.lucene#core-test-framework;working@Lance-Norskogs-MacBook-Pro.local 
[default] 
[ivy:retrieve] loadData of 
org.apache.lucene#core-test-framework;working@Lance-Norskogs-MacBook-Pro.local 
of rootConf=default 
[ivy:retrieve] == resolving dependencies 
org.apache.lucene#core-test-framework;working@Lance-Norskogs-MacBook-Pro.local->org.apache.ant#ant;1.8.2
 [default->default] 
[ivy:retrieve] loadData of org.apache.ant#ant;1.8.2 of rootConf=default 
[ivy:retrieve] using default to resolve org.apache.ant#ant;1.8.2 
[ivy:retrieve] default: Checking cache for: dependency: 
org.apache.ant#ant;1.8.2 {default=[default]} 
[ivy:retrieve] don't use cache for org.apache.ant#ant;1.8.2: checkModified=true 
[ivy:retrieve] No entry is found in the ModuleDescriptorCache : 
/Users/lancenorskog/.ivy2/cache/org.apache.ant/ant/ivy-1.8.2.xml 
[ivy:retrieve] post 1.3 ivy file: using exact as default matcher 
[ivy:retrieve] found ivy file in cache for org.apache.ant#ant;1.8.2 (resolved 
by public): /Users/lancenorskog/.ivy2/cache/org.apache.ant/ant/ivy-1.8.2.xml 
[ivy:retrieve] found module in cache but with a different resolver: discarding: 
org.apache.ant#ant;1.8.2; expected resolver=local; resolver=public 
[ivy:retrieve] trying 
/Users/lancenorskog/.ivy2/local/org.apache.ant/ant/1.8.2/ivys/ivy.xml 
[ivy:retrieve] tried 
/Users/lancenorskog/.ivy2/local/org.apache.ant/ant/1.8.2/ivys/ivy.xml 
[ivy:retrieve] local: resource not reachable for org.apache.ant#ant;1.8.2: 
res=/Users/lancenorskog/.ivy2/local/org.apache.ant/ant/1.8.2/ivys/ivy.xml 
[ivy:retrieve] trying 
/Users/lancenorskog/.ivy2/local/org.apache.ant/ant/1.8.2/jars/ant.jar 
[ivy:retrieve] tried 
/Users/lancenorskog/.ivy2/local/org.apache.ant/ant/1.8.2/jars/ant.jar 
[ivy:retrieve] local: resource not reachable for org.apache.ant#ant;1.8.2: 
res=/Users/lancenorskog/.ivy2/local/org.apache.ant/ant/1.8.2/jars/ant.jar 
[ivy:retrieve] local: no ivy file nor artifact found for 
org.apache.ant#ant;1.8.2 
[ivy:retrieve] main: Checking cache for: dependency: org.apache.ant#ant;1.8.2 
{default=[default]} 
[ivy:retrieve] Entry is found in the ModuleDescriptorCache : 
/Users/lancenorskog/.ivy2/cache/org.apache.ant/ant/ivy-1.8.2.xml 
[ivy:retrieve] found ivy file in cache for org.apache.ant#ant;1.8.2 (resolved 
by public): /Users/lancenorskog/.ivy2/cache/org.apache.ant/ant/ivy-1.8.2.xml 
[ivy:retrieve] main: module revision found in cache: org.apache.ant#ant;1.8.2 
[ivy:retrieve] found org.apache.ant#ant;1.8.2 in public 
[ivy:retrieve] == resolving dependencies 
org.apache.lucene#core-test-framework;working@Lance-Norskogs-MacBook-Pro.local->org.apache.ant#ant;1.8.2
 [default->runtime] 
[ivy:retrieve] loadData of org.apache.ant#ant;1.8.2 of rootConf=default 
[ivy:retrieve] == resolving dependencies 
org.apache.lucene#core-test-framework;working@Lance-Norskogs-MacBook-Pro.local->org.apache.ant#ant;1.8.2
 [default->compile] 
[ivy:retrieve] loadData of org.apache.ant#ant;1.8.2 of rootConf=default 
[ivy:retrieve] == resolving dependencies 
org.apache.lucene#core-test-framework;working@Lance-Norskogs-MacBook-Pro.local->org.apache.ant#ant;1.8.2
 [default->master] 
[ivy:retrieve] loadData of org.apache.ant#ant;1.8.2 of rootConf=default 
[ivy:retrieve] == resolving dependencies 
org.apache.luc

[jira] [Commented] (SOLR-1487) Add expungeDelete to SolrJ's SolrServer.commit(..)

2012-10-30 Thread Lance Norskog (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-1487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13487421#comment-13487421
 ] 

Lance Norskog commented on SOLR-1487:
-

[SOLR-3938]-unit.patch adds prepareCommit().

A question to the experts: what is a good unit test to enhance for this? It 
needs to check numDoc vs. maxDoc, so the test would be one that adds docs and 
then reads back the stats.


> Add  expungeDelete to SolrJ's SolrServer.commit(..)
> ---
>
> Key: SOLR-1487
> URL: https://issues.apache.org/jira/browse/SOLR-1487
> Project: Solr
>  Issue Type: Improvement
>  Components: clients - java
>Affects Versions: 1.3
> Environment: N/A
>Reporter: Jibo John
> Attachments: expunge-patch.txt
>
>
> Add  expungeDelete to SolrJ's SolrServer.commit(..).
> Currently, this can be done only through updatehandler (  ( curl update -F 
> stream.body=' ' )) 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (SOLR-1972) Need additional query stats in admin interface - median, 95th and 99th percentile

2012-10-29 Thread Lance Norskog (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-1972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13486272#comment-13486272
 ] 

Lance Norskog commented on SOLR-1972:
-

The 5th percentile is really useful. There is always a maximum query time of 
30s just because of a garbage collection failure, and people look at that 
number and freak out. For query times, the 5th percentile shows what is 
repeatedly "too slow". 
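
For reference, the statistics the issue asks for (median, 95th and 99th 
percentile over a bounded number of recent data points) can be sketched in a 
few lines. This is only an illustration in Python, not what any of the attached 
patches do; the class name, the nearest-rank percentile method and the window 
size are all assumptions.

```python
import math
from collections import deque


class QueryTimeStats:
    """Keeps the most recent query times and reports percentiles (hypothetical helper)."""

    def __init__(self, window=1024):
        # A deque with maxlen silently drops the oldest sample once it is full,
        # which bounds the memory used by the feature.
        self.samples = deque(maxlen=window)

    def record(self, millis):
        self.samples.append(millis)

    def percentile(self, p):
        """Nearest-rank percentile, p in (0, 100]."""
        if not self.samples:
            raise ValueError("no samples recorded")
        ordered = sorted(self.samples)
        rank = max(1, math.ceil(p * len(ordered) / 100))
        return ordered[rank - 1]
```

Calling percentile(50) gives the median of the current window, and because old 
samples fall out of the window, slow queries from startup stop influencing the 
numbers once enough newer queries have been recorded.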

> Need additional query stats in admin interface - median, 95th and 99th 
> percentile
> -
>
> Key: SOLR-1972
> URL: https://issues.apache.org/jira/browse/SOLR-1972
> Project: Solr
>  Issue Type: Improvement
>  Components: web gui
>Affects Versions: 1.4
>Reporter: Shawn Heisey
>Assignee: Alan Woodward
>Priority: Minor
> Fix For: 4.1
>
> Attachments: elyograg-1972-3.2.patch, elyograg-1972-3.2.patch, 
> elyograg-1972-trunk.patch, elyograg-1972-trunk.patch, 
> SOLR-1972-branch3x-url_pattern.patch, SOLR-1972-branch4x.patch, 
> SOLR-1972-branch4x.patch, SOLR-1972_metrics.patch, SOLR-1972_metrics.patch, 
> SOLR-1972_metrics.patch, SOLR-1972_metrics.patch, SOLR-1972_metrics.patch, 
> SOLR-1972_metrics.patch, SOLR-1972_metrics.patch, 
> solr1972-metricsregistry-branch4x-failure.log, SOLR-1972.patch, 
> SOLR-1972.patch, SOLR-1972.patch, SOLR-1972.patch, SOLR-1972-url_pattern.patch
>
>
> I would like to see more detailed query statistics from the admin GUI.  This 
> is what you can get now:
> requests : 809
> errors : 0
> timeouts : 0
> totalTime : 70053
> avgTimePerRequest : 86.59209
> avgRequestsPerSecond : 0.8148785 
> I'd like to see more data on the time per request - median, 95th percentile, 
> 99th percentile, and any other statistical function that makes sense to 
> include.  In my environment, the first bunch of queries after startup tend to 
> take several seconds each.  I find that the average value tends to be useless 
> until it has several thousand queries under its belt and the caches are 
> thoroughly warmed.  The statistical functions I have mentioned would quickly 
> eliminate the influence of those initial slow queries.
> The system will have to store individual data about each query.  I don't know 
> if this is something Solr does already.  It would be nice to have a 
> configurable count of how many of the most recent data points are kept, to 
> control the amount of memory the feature uses.  The default value could be 
> something like 1024 or 4096.


[jira] [Created] (SOLR-4007) Morfologik dictionaries not available in Solr field type

2012-10-28 Thread Lance Norskog (JIRA)

Lance Norskog created SOLR-4007:
---

 Summary: Morfologik dictionaries not available in Solr field type
 Key: SOLR-4007
 URL: https://issues.apache.org/jira/browse/SOLR-4007
 Project: Solr
  Issue Type: Bug
  Components: Schema and Analysis
Affects Versions: 4.1
Reporter: Lance Norskog
Priority: Minor


The Polish Morfologik type does not find its dictionaries when used in Solr. To 
demonstrate:

1) Add this to example/solr/collection1/conf/schema.xml:
{noformat}


  


  

{noformat}

2) Add this to example/solr/collection1/conf/solrconfig.xml:

{noformat}
  
  
  
{noformat}

3) Test 'text_pl' in the analysis page. You will get an exception.
{noformat}
Oct 28, 2012 8:27:19 PM org.apache.solr.core.SolrCore execute
INFO: [collection1] webapp=/solr path=/analysis/field 
params={analysis.showmatch=true&analysis.query=&wt=json&analysis.fieldvalue=blah+blah&analysis.fieldtype=text_pl}
 status=500 QTime=26 
Oct 28, 2012 8:27:19 PM org.apache.solr.common.SolrException log
SEVERE: null:java.lang.RuntimeException: Default dictionary resource for 
language 'plnot found.
at morfologik.stemming.Dictionary.getForLanguage(Dictionary.java:163)
at morfologik.stemming.PolishStemmer.<init>(PolishStemmer.java:64)
at org.apache.lucene.analysis.morfologik.MorfologikFilter.<init>(MorfologikFilter.java:70)
at org.apache.lucene.analysis.morfologik.MorfologikFilterFactory.create(MorfologikFilterFactory.java:63)
at org.apache.solr.handler.AnalysisRequestHandlerBase.analyzeValue(AnalysisRequestHandlerBase.java:125)
at org.apache.solr.handler.FieldAnalysisRequestHandler.analyzeValues(FieldAnalysisRequestHandler.java:220)
at org.apache.solr.handler.FieldAnalysisRequestHandler.handleAnalysisRequest(FieldAnalysisRequestHandler.java:181)
at org.apache.solr.handler.FieldAnalysisRequestHandler.doAnalysis(FieldAnalysisRequestHandler.java:100)
at 

[...]

Caused by: java.io.IOException: Could not locate resource: 
morfologik/dictionaries/pl.dict
at morfologik.util.ResourceUtils.openInputStream(ResourceUtils.java:56)
at morfologik.stemming.Dictionary.getForLanguage(Dictionary.java:156)
... 38 more

{noformat}

{{morfologik-polish-1.5.3.jar}} has {{morfologik/dictionaries/pl.dict}}.
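
For anyone who wants to repeat that check, here is a minimal Python sketch that 
looks for a resource inside a jar (a jar is a zip archive). The function name 
is made up for illustration. It only confirms what is in the jar file; it says 
nothing about whether the classloader used by Solr can see that jar.

```python
import zipfile


def jar_contains(jar_path, resource):
    """Return True if the jar at jar_path has an entry named resource."""
    with zipfile.ZipFile(jar_path) as jar:
        return resource in jar.namelist()
```

For example, jar_contains("morfologik-polish-1.5.3.jar", 
"morfologik/dictionaries/pl.dict") should return True for the jar described 
above, assuming it is in the current directory.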


[jira] [Updated] (SOLR-3938) prepareCommit command omits commitData

2012-10-26 Thread Lance Norskog (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-3938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lance Norskog updated SOLR-3938:


Attachment: SOLR-3938-unit.patch

Add unit test to TestReplicationHandler. This requires solrj support for 
prepareCommit, and thus includes that. 

> prepareCommit command omits commitData
> --
>
> Key: SOLR-3938
> URL: https://issues.apache.org/jira/browse/SOLR-3938
> Project: Solr
>  Issue Type: Bug
>Affects Versions: 4.0
>Reporter: Yonik Seeley
>  Labels: 4.0.1_Candidate
> Fix For: 4.1
>
> Attachments: SOLR-3938.patch, SOLR-3938-unit.patch
>
>
> Solr's prepareCommit doesn't set any commitData, and then when a commit is 
> done, it's too late.


[jira] [Commented] (SOLR-2216) Highlighter query exceeds maxBooleanClause limit due to range query

2012-10-26 Thread Lance Norskog (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-2216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13485241#comment-13485241
 ] 

Lance Norskog commented on SOLR-2216:
-

Is this still a problem in 3.6, 4.0 or the trunk?

> Highlighter query exceeds maxBooleanClause limit due to range query
> ---
>
> Key: SOLR-2216
> URL: https://issues.apache.org/jira/browse/SOLR-2216
> Project: Solr
>  Issue Type: Bug
>  Components: highlighter
>Affects Versions: 1.4.1
> Environment: Linux solr-2.bizjournals.int 2.6.18-194.3.1.el5 #1 SMP 
> Thu May 13 13:08:30 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux
> java version "1.6.0_21"
> Java(TM) SE Runtime Environment (build 1.6.0_21-b06)
> Java HotSpot(TM) 64-Bit Server VM (build 17.0-b16, mixed mode)
> JAVA_OPTS="-client -Dcom.sun.management.jmxremote=true 
> -Dcom.sun.management.jmxremote.port= 
> -Dcom.sun.management.jmxremote.authenticate=true 
> -Dcom.sun.management.jmxremote.access.file=/root/.jmxaccess 
> -Dcom.sun.management.jmxremote.password.file=/root/.jmxpasswd 
> -Dcom.sun.management.jmxremote.ssl=false -XX:+UseCompressedOops 
> -XX:MaxPermSize=512M -Xms10240M -Xmx15360M -XX:+UseParallelGC 
> -XX:+AggressiveOpts -XX:NewRatio=5"
> top - 11:38:49 up 124 days, 22:37,  1 user,  load average: 5.20, 4.35, 3.90
> Tasks: 220 total,   1 running, 219 sleeping,   0 stopped,   0 zombie
> Cpu(s): 47.5%us,  2.9%sy,  0.0%ni, 49.5%id,  0.1%wa,  0.0%hi,  0.0%si,  0.0%st
> Mem:  24679008k total, 18179980k used,  6499028k free,   125424k buffers
> Swap: 26738680k total,29276k used, 26709404k free,  8187444k cached
>Reporter: Ken Stanley
>
> For a full detail of the issue, please see the mailing list: 
> http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201011.mbox/%3CAANLkTimE8z8yOni+u0Nsbgct1=ef7e+su0_waku2c...@mail.gmail.com%3E
> The nutshell version of the issue is that when I have a query that contains 
> ranges on a specific (non-highlighted) field, the highlighter component is 
> attempting to create a query that exceeds the value of maxBooleanClauses set 
> from solrconfig.xml. This is despite my explicit setting of hl.field, 
> hl.requireFieldMatch, and various other highlight options in the query. 
> As suggested by Koji in the follow-up response, I removed the range queries 
> from my main query, and SOLR and highlighting were happy to fulfill my 
> request. It was suggested that if removing the range queries worked that this 
> might potentially be a bug, hence my filing this JIRA ticket. For what it is 
> worth, if I move my range queries into an fq, I do not get the exception 
> about exceeding maxBooleanClauses, and I get the effect that I was looking 
> for. 


[jira] [Commented] (SOLR-3975) Document Summarization toolkit, using LSA techniques

2012-10-25 Thread Lance Norskog (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-3975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13484559#comment-13484559
 ] 

Lance Norskog commented on SOLR-3975:
-

It's a first draft, not ready for committing. It needs strategies for 
controlling processing time, and code cleanups. I wanted to get it out for 
review before sinking even more time into it.

> Document Summarization toolkit, using LSA techniques
> 
>
> Key: SOLR-3975
> URL: https://issues.apache.org/jira/browse/SOLR-3975
> Project: Solr
>  Issue Type: New Feature
>    Reporter: Lance Norskog
>Priority: Minor
> Attachments: 4.1.summary.patch, reuters.sh
>
>
> This package analyzes sentences and words as used across sentences to rank 
> the most important sentences and words. The general topic is called "document 
> summarization" and is a popular research topic in textual analysis. 
> How to use:
> 1) Check out the 4.x branch, apply the patch, build, and run the solr/example 
> instance.
> 2) Download the first Reuters article corpus from:
> http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.tar.gz
> 3) Unpack this into a directory.
> 4) Run the attached 'reuters.sh' script:
> sh reuters.sh directory http://localhost:8983/solr/collection1
> 5) Wait several minutes.
> Now go to http://localhost:8983/solr/collection1/browse?summary=true and look 
> at the large gray box marked 'Document Summary'. This has a table of 
> statistics about the analysis, the three most important sentences, and 
> several of the most important words in the documents. The sentences have the 
> important words in italics.
> The code is packaged as a search component and as an analysis handler. The 
> /browse demo uses the search component, and you can also post raw text to  
> http://localhost:8983/solr/collection1/analysis/summary. Here is a sample 
> command:
> {code}
> curl -s "http://localhost:8983/solr/analysis/summary?indent=true&echoParams=all&file=$FILE&wt=xml" \
>   --data-binary @$FILE -H 'Content-type:application/xml'
> {code}
> This is an implementation of LSA-based document summarization. A short 
> explanation and a long evaluation are described in my blog, [Uncle Lance's 
> Ultra Whiz Bang|http://ultrawhizbang.blogspot.com], starting here: 
> [http://ultrawhizbang.blogspot.com/2012/09/document-summarization-with-lsa-1.html]


Re: Lucene build & ivy problems

2012-10-23 Thread Lance Norskog

Yes. That's what I thought. And that is how it should work. But that is not 
what happened. Ivy did not need to resolve anything, but it called out to a 
resolver which it could not see. After that, it called out over and over. I had 
to completely re-download everything. Since a full download worked, I think it 
boogered up its cache and then confused itself. 

Lance 

- Original Message -

| From: "Uwe Schindler" 
| To: dev@lucene.apache.org
| Sent: Monday, October 22, 2012 12:03:35 AM
| Subject: RE: Lucene build & ivy problems

| It only downloads on the first try, later builds never download
| anything unless dependencies have changed. And if you would be able
| to * not * download them, your build would not succeed.

[jira] [Updated] (SOLR-3975) Document Summarization toolkit, using LSA techniques

2012-10-22 Thread Lance Norskog (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-3975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lance Norskog updated SOLR-3975:


Description: 
This package analyzes sentences and words as used across sentences to rank the 
most important sentences and words. The general topic is called "document 
summarization" and is a popular research topic in textual analysis. 

How to use:
1) Check out the 4.x branch, apply the patch, build, and run the solr/example 
instance.
2) Download the first Reuters article corpus from:
http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.tar.gz
3) Unpack this into a directory.
4) Run the attached 'reuters.sh' script:
sh reuters.sh directory http://localhost:8983/solr/collection1
5) Wait several minutes.

Now go to http://localhost:8983/solr/collection1/browse?summary=true and look 
at the large gray box marked 'Document Summary'. This has a table of statistics 
about the analysis, the three most important sentences, and several of the most 
important words in the documents. The sentences have the important words in 
italics.

The code is packaged as a search component and as an analysis handler. The 
/browse demo uses the search component, and you can also post raw text to  
http://localhost:8983/solr/collection1/analysis/summary. Here is a sample 
command:
{code}
curl -s "http://localhost:8983/solr/analysis/summary?indent=true&echoParams=all&file=$FILE&wt=xml" \
  --data-binary @$FILE -H 'Content-type:application/xml'
{code}

This is an implementation of LSA-based document summarization. A short 
explanation and a long evaluation are described in my blog, [Uncle Lance's 
Ultra Whiz Bang|http://ultrawhizbang.blogspot.com], starting here: 
[http://ultrawhizbang.blogspot.com/2012/09/document-summarization-with-lsa-1.html]



  was:
This package analyzes sentences and words as used across sentences to rank the 
most important sentences and words. The general topic is called "document 
summarization" and is a popular research topic in textual analysis. 

How to use:
1) Check out the 4.x branch, apply the patch, build, and run the solr/example 
instance.
2) Download the first Reuters article corpus from:
http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.tar.gz
3) Unpack this into a directory.
4) Run the attached 'reuters.sh' script:
sh reuters.sh directory http://localhost:8983/solr/collection1
5) Wait several minutes.

Now go to http://localhost:8983/solr/collection1/browse?summary=true and look 
at the large gray box marked 'Document Summary'. This has a table of statistics 
about the analysis, the three most important sentences, and several of the most 
important words in the documents. The sentences have the important tags in 
italics.

The code is packaged as a search component and as an analysis handler. The 
/browse demo uses the search component, and you can also post raw text to  
http://localhost:8983/solr/collection1/analysis/summary. Here is a sample 
command:
curl -s "http://localhost:8983/solr/analysis/summary?indent=true&echoParams=all&file=$FILE&wt=xml" \
  --data-binary @$FILE -H 'Content-type:application/xml'

This is an implementation of LSA-based document summarization. A short 
explanation and a long evaluation are described in my blog, [Uncle Lance's 
Ultra Whiz Bang|http://ultrawhizbang.blogspot.com], starting here: 
[http://ultrawhizbang.blogspot.com/2012/09/document-summarization-with-lsa-1.html]




> Document Summarization toolkit, using LSA techniques
> 
>
> Key: SOLR-3975
>     URL: https://issues.apache.org/jira/browse/SOLR-3975
> Project: Solr
>  Issue Type: New Feature
>Reporter: Lance Norskog
>Priority: Minor
> Attachments: 4.1.summary.patch, reuters.sh
>
>
> This package analyzes sentences and words as used across sentences to rank 
> the most important sentences and words. The general topic is called "document 
> summarization" and is a popular research topic in textual analysis. 
> How to use:
> 1) Check out the 4.x branch, apply the patch, build, and run the solr/example 
> instance.
> 2) Download the first Reuters article corpus from:
> http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.tar.gz
> 3) Unpack this into a directory.
> 4) Run the attached 'reuters.sh' script:
> sh reuters.sh directory http://localhost:8983/solr/collection1
> 5) Wait several minutes.
> Now go to http://localhost:8983/solr/collection1/browse?summary=true and look 
> at the large gray box marked 'Document Summary'. This has a table of 
> statistics about the analysis, the three most important sentences, a

[jira] [Updated] (SOLR-3975) Document Summarization toolkit, using LSA techniques

2012-10-22 Thread Lance Norskog (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-3975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lance Norskog updated SOLR-3975:


Attachment: reuters.sh
4.1.summary.patch

> Document Summarization toolkit, using LSA techniques
> 
>
> Key: SOLR-3975
> URL: https://issues.apache.org/jira/browse/SOLR-3975
> Project: Solr
>  Issue Type: New Feature
>        Reporter: Lance Norskog
>Priority: Minor
> Attachments: 4.1.summary.patch, reuters.sh
>
>
> This package analyzes sentences and words as used across sentences to rank 
> the most important sentences and words. The general topic is called "document 
> summarization" and is a popular research topic in textual analysis. 
> How to use:
> 1) Check out the 4.x branch, apply the patch, build, and run the solr/example 
> instance.
> 2) Download the first Reuters article corpus from:
> http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.tar.gz
> 3) Unpack this into a directory.
> 4) Run the attached 'reuters.sh' script:
> sh reuters.sh directory http://localhost:8983/solr/collection1
> 5) Wait several minutes.
> Now go to http://localhost:8983/solr/collection1/browse?summary=true and look 
> at the large gray box marked 'Document Summary'. This has a table of 
> statistics about the analysis, the three most important sentences, and 
> several of the most important words in the documents. The sentences have the 
> important tags in italics.
> The code is packaged as a search component and as an analysis handler. The 
> /browse demo uses the search component, and you can also post raw text to  
> http://localhost:8983/solr/collection1/analysis/summary. Here is a sample 
> command:
> curl -s "http://localhost:8983/solr/analysis/summary?indent=true&echoParams=all&file=$FILE&wt=xml" \
>   --data-binary @$FILE -H 'Content-type:application/xml'
> This is an implementation of LSA-based document summarization. A short 
> explanation and a long evaluation are described in my blog, [Uncle Lance's 
> Ultra Whiz Bang|http://ultrawhizbang.blogspot.com], starting here: 
> [http://ultrawhizbang.blogspot.com/2012/09/document-summarization-with-lsa-1.html]


[jira] [Created] (SOLR-3975) Document Summarization toolkit, using LSA techniques

2012-10-22 Thread Lance Norskog (JIRA)

Lance Norskog created SOLR-3975:
---

Summary: Document Summarization toolkit, using LSA techniques
Key: SOLR-3975
URL: https://issues.apache.org/jira/browse/SOLR-3975
Project: Solr
Issue Type: New Feature
Reporter: Lance Norskog
Priority: Minor
Attachments: 4.1.summary.patch, reuters.sh

This package analyzes sentences and words as used across sentences to rank the
most important sentences and words. The general topic is called "document
summarization" and is a popular research topic in textual analysis.
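
To give a feel for the idea, here is a toy Python sketch that ranks sentences 
by their weight in the first latent dimension of a binary term-by-sentence 
matrix, found by power iteration. It is a simplified illustration under my own 
assumptions (whitespace tokenizing, one dimension only, hypothetical function 
name) and is not the code in the patch.

```python
import math


def rank_sentences(sentences, iterations=50):
    """Return sentence indexes, most important first (toy LSA-style ranking)."""
    tokenized = [set(s.lower().split()) for s in sentences]
    vocab = sorted(set().union(*tokenized))
    # matrix[t][s] is 1.0 when term t occurs in sentence s.
    matrix = [[1.0 if term in toks else 0.0 for toks in tokenized]
              for term in vocab]
    v = [1.0] * len(sentences)
    for _ in range(iterations):
        # One power-iteration step on A^T A: w = A v, then v = A^T w.
        w = [sum(row[j] * v[j] for j in range(len(v))) for row in matrix]
        v = [sum(matrix[i][j] * w[i] for i in range(len(vocab)))
             for j in range(len(v))]
        norm = math.sqrt(sum(x * x for x in v)) or 1.0
        v = [x / norm for x in v]
    return sorted(range(len(sentences)), key=lambda j: v[j], reverse=True)
```

Sentences that share many words with other sentences end up with the largest 
weights, and a sentence that shares no words with the rest drops to the bottom.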

How to use:
1) Check out the 4.x branch, apply the patch, build, and run the solr/example
instance.
2) Download the first Reuters article corpus from:
http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.tar.gz
3) Unpack this into a directory.
4) Run the attached 'reuters.sh' script:
sh reuters.sh directory http://localhost:8983/solr/collection1
5) Wait several minutes.

Now go to http://localhost:8983/solr/collection1/browse?summary=true and look
at the large gray box marked 'Document Summary'. This has a table of statistics
about the analysis, the three most important sentences, and several of the most
important words in the documents. The sentences have the important tags in
italics.

The code is packaged as a search component and as an analysis handler. The
/browse demo uses the search component, and you can also post raw text to
http://localhost:8983/solr/collection1/analysis/summary. Here is a sample
command:
curl -s "http://localhost:8983/solr/analysis/summary?indent=true&echoParams=all&file=$FILE&wt=xml" \
  --data-binary @$FILE -H 'Content-type:application/xml'

This is an implementation of LSA-based document summarization. A short
explanation and a long evaluation are described in my blog, [Uncle Lance's
Ultra Whiz Bang|http://ultrawhizbang.blogspot.com], starting here:
[http://ultrawhizbang.blogspot.com/2012/09/document-summarization-with-lsa-1.html]
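
If curl is not handy, the same request can be built from Python's standard 
library. This is a sketch only: the function name is hypothetical, the 
parameters mirror the sample command above, and the /analysis/summary handler 
exists only on a Solr instance built with this patch.

```python
import urllib.parse
import urllib.request


def build_summary_request(base_url, file_name, body):
    """Build (but do not send) a POST of raw text to the summary handler."""
    params = urllib.parse.urlencode({
        "indent": "true",
        "echoParams": "all",
        "file": file_name,
        "wt": "xml",
    })
    return urllib.request.Request(
        base_url + "/analysis/summary?" + params,
        data=body,  # raw bytes of the document to summarize
        headers={"Content-type": "application/xml"},
        method="POST",
    )
```

Passing the returned object to urllib.request.urlopen() sends it; base_url 
would be something like http://localhost:8983/solr for the example instance.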


Lucene build & ivy problems

2012-10-21 Thread Lance Norskog

If I have all of the dependencies downloaded, how can I tell the build to skip 
checking the repositories? 

I'm working on a somewhat dodgy internet connection. I ran 'ant example' a 
hundred times. On the 101st, I had an internet outage and the Ivy stuff 
blocked. Ever after that the resolver hangs. I had to remove the home/.ivy2 
directory and start over. And now all of the dependencies are slowly 
downloading again over a dodgy internet cafe connection. 

Is there some flag to the ant build that says "just pretend everything is 
downloaded"?

[jira] [Commented] (LUCENE-4494) Add phoenetic algorithm Match Rating approach to lucene

2012-10-19 Thread Lance Norskog (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-4494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13480615#comment-13480615
 ] 

Lance Norskog commented on LUCENE-4494:
---

Cool! Is it this algorithm? 
[http://en.wikipedia.org/wiki/Match_rating_approach]
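
For comparison, this is how I read the encoding rules on that page, as a short 
Python sketch. It is not the code attached to this issue, the function name is 
made up, it covers only the encoding step (not the comparison rules), and the 
order in which the first two rules are applied is my assumption.

```python
VOWELS = set("AEIOU")


def mra_encode(name):
    """Encode a name with the Match Rating Approach encoding rules."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    # Rule 1: delete all vowels unless the vowel begins the word.
    kept = [letters[0]] + [c for c in letters[1:] if c not in VOWELS]
    # Rule 2: drop the second letter of any doubled letter.
    collapsed = []
    for c in kept:
        if not collapsed or collapsed[-1] != c:
            collapsed.append(c)
    code = "".join(collapsed)
    # Rule 3: reduce to six letters by joining the first three and last three.
    if len(code) > 6:
        code = code[:3] + code[-3:]
    return code
```

With these rules "Byrne" encodes to BYRN and "Catherine" to CTHRN, which 
matches the examples on the Wikipedia page.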



> Add phoenetic algorithm Match Rating approach to lucene
> ---
>
> Key: LUCENE-4494
> URL: https://issues.apache.org/jira/browse/LUCENE-4494
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Affects Versions: 4.0-ALPHA
>Reporter: Colm Rice
>Priority: Minor
> Fix For: 4.1
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> I want to add MatchRatingApproach algorithm to the Lucene project. 
> What I have at the moment is a class called 
> org.apache.lucene.analysis.phoenetic.MatchRatingApproach implementing 
> StringEncoder
> I have a pretty comprehensive test file located at: 
> org.apache.lucene.analysis.phonetic.MatchRatingApproachTests
> It doesn't exactly follow an existing pattern, so I'm going to need a bit of 
> advice here. Thanks! Feel free to email.
> FYI: It's my first contribution so be gentle :-) C# is my native language.
> Reference: http://en.wikipedia.org/wiki/Match_rating_approach


Re: broken links on the site

2012-10-11 Thread Lance Norskog

The Javadoc directories are directly under the apache.org/core and
apache.org/solr directories, and so to find the latest javadocs you
have to crawl the main html and hunt for them. Or, the code crawler
has to be maintained with the links for the current releases.

Would it be possible to move javadocs to a subdirectory? That way,
code crawlers can just crawl that directory and always find the latest
released javadocs. For example:

http://lucene.apache.org/releases/4_0_0/core/
tutorial.html etc.
http://lucene.apache.org/releases/4_0_0/core/api
http://lucene.apache.org/releases/4_0_0/solr/
tutorial.html etc.
http://lucene.apache.org/releases/4_0_0/solr/api

- Original Message -
| From: "Robert Muir" 
| To: dev@lucene.apache.org
| Sent: Thursday, October 11, 2012 5:17:25 PM
| Subject: Re: broken links on the site
|
| FYI I think we fixed these! the "hossman broken link detector" only
| finds 10 broken links on staging, all of which are false failures due
| to the way we deploy javadocs
| 
(http://wiki.apache.org/lucene-java/ReleaseTodo#Push_javadocs_to_the_CMS_production_tree):
|
| Found 10 broken links.
|
| http://lucene.staging.apache.org/solr/4_0_0/tutorial.html
| http://lucene.staging.apache.org/solr/api-3_6_1/
| http://lucene.staging.apache.org/core/4_0_0/index.html
| http://lucene.staging.apache.org/core/4_0_0/changes/Changes.html
| http://lucene.staging.apache.org/solr/4_0_0/
| http://lucene.staging.apache.org/core/3_6_1/gettingstarted.html
| http://lucene.staging.apache.org/core/3_6_1/index.html
| http://lucene.staging.apache.org/solr/4_0_0/changes/Changes.html
| http://lucene.staging.apache.org/solr/api-3_6_1/doc-files/tutorial.html
| http://lucene.staging.apache.org/core/4_0_0/demo/overview-summary.html
|
|
| On Thu, Oct 11, 2012 at 7:05 PM, Robert Muir 
| wrote:
| > Thanks a lot for running this!
| >
| >> ..and this uncovered some genuine broken links that we should try
| >> to fix at
| >> some point (see below) but unfortunately this really simple wget
| >> approach
| >> doesn't tell you *where* the broken link comes from -- just have
| >> to do a bit
| >> of intuition/grepping.
| >
| > Right but at least for now we can take stabs at fixing things and
| > then
| > see if the number goes down.
|

[jira] [Comment Edited] (LUCENE-3921) Add decompose compound Japanese Katakana token capability to Kuromoji

2012-10-06 Thread Lance Norskog (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-3921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13471121#comment-13471121
 ] 

Lance Norskog edited comment on LUCENE-3921 at 10/7/12 12:33 AM:
-

Statistical models and rule-based models always have a failure rate. When you 
use them you have to decide what to do about the failures. Attacking the 
failures with another model drives toward Zeno's Paradox. For Chinese-language 
search, breaking the failures into bigrams makes a lot of sense. The CJK bigram 
generator creates a massive number of bogus bigrams. Bogus bigrams cause bogus 
results from sloppy phrase searches.

Smart Chinese and Kuromoji are not systems for doing natural-language 
processing. They are systems for minimizing bogus bigrams. This allows sloppy 
phrase queries to find fewer bogus results. In my use case, Smart Chinese 
created only 2% (40k/1.8m) of the possible bigrams. [SOLR-3653] is the result 
of my experience in supporting searching Chinese legal documents. I have some 
useful numbers at the end of the page.
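
To make "bigrams" concrete, here is a minimal Python sketch of overlapping 
character bigrams, which is what you get when text is split without any 
dictionary segmentation. The function name is hypothetical and this is not the 
Lucene CJK bigram code, which handles more cases than this.

```python
def char_bigrams(text):
    """Return every overlapping two-character window of text, in order."""
    return [text[i:i + 2] for i in range(len(text) - 1)]
```

For the six-character string "トートバッグ" from this issue, that yields five 
bigrams, including "トバ", which straddles the boundary between the two words. 
That is the kind of bogus bigram a segmenter avoids.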



  was (Author: lancenorskog):
Statistical models and rule-based models always have a failure rate. When 
you use them you have to decide what to do about the failures. Attacking the 
failures with another model drives toward Zeno's Paradox. For Chinese language 
search, breaking the failures into bigrams makes a lot of sense.

Another way to look at this is that Smart Chinese and Kuromoji are systems for 
minimizing bogus bigrams. This allows phrase queries to function without 
finding bogus results. The CJK bigram creator generates bogus bigrams, which 
cause phrase queries to find bogus results. [SOLR-3653] is the result of my 
experience in supporting searching Chinese legal documents. I have some useful 
numbers at the end of the page.


  
> Add decompose compound Japanese Katakana token capability to Kuromoji
> -
>
> Key: LUCENE-3921
> URL: https://issues.apache.org/jira/browse/LUCENE-3921
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Affects Versions: 4.0-ALPHA
> Environment: CentOS 5, IPA Dictionary, run in "Search mode"
>Reporter: Kazuaki Hiraga
>  Labels: features
>
> Kuromoji, the Japanese morphological analyzer, cannot decompose every 
> Japanese Katakana compound token into sub-tokens. Some Katakana tokens are 
> decomposed, but not all of them. For instance, "トートバッグ" (tote bag) and 
> "ショルダーバッグ" (shoulder bag) are not decomposed into "トート バッグ" and 
> "ショルダー バッグ", although the IPA dictionary has an entry for "バッグ". I 
> would like the decompose feature to apply to every Katakana token whose 
> sub-tokens are in the dictionary, or else an option that forces decomposition 
> of every Katakana token.
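
The request above can be illustrated with a small dictionary-driven splitter. This is a rough sketch and not how Kuromoji works internally: the function name `decompose` and the three-entry dictionary are invented for the example, and a real IPA dictionary lookup would replace the plain Python set.

```python
from typing import List, Optional, Set


def decompose(token: str, dictionary: Set[str]) -> Optional[List[str]]:
    """Split token into dictionary entries; return None if no full split exists."""
    if not token:
        return []
    # Try the longest prefix first so longer dictionary words win.
    for end in range(len(token), 0, -1):
        prefix = token[:end]
        if prefix in dictionary:
            rest = decompose(token[end:], dictionary)
            if rest is not None:
                return [prefix] + rest
    return None


# Hypothetical mini-dictionary standing in for the IPA dictionary.
DICTIONARY = {"トート", "バッグ", "ショルダー"}

print(decompose("トートバッグ", DICTIONARY))
print(decompose("ショルダーバッグ", DICTIONARY))
```

A token that cannot be fully covered by dictionary entries comes back as None, so the caller can keep the original compound. The recursion has no memoization, which is fine for short tokens but would need a table for long inputs.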

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (LUCENE-3921) Add decompose compound Japanese Katakana token capability to Kuromoji

2012-10-06 Thread Lance Norskog (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-3921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13471121#comment-13471121
 ] 

Lance Norskog commented on LUCENE-3921:
---

Statistical models and rule-based models always have a failure rate. When you 
use them you have to decide what to do about the failures. Attacking the 
failures with another model drives toward Zeno's Paradox. For Chinese language 
search, breaking the failures into bigrams makes a lot of sense.

Another way to look at this is that Smart Chinese and Kuromoji are systems for 
minimizing bogus bigrams. This allows phrase queries to function without 
finding bogus results. The CJK bigram creator generates bogus bigrams, which 
cause phrase queries to find bogus results. [SOLR-3653] is the result of my 
experience in supporting searching Chinese legal documents. I have some useful 
numbers at the end of the page.



> Add decompose compound Japanese Katakana token capability to Kuromoji
> -
>
> Key: LUCENE-3921
> URL: https://issues.apache.org/jira/browse/LUCENE-3921
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Affects Versions: 4.0-ALPHA
> Environment: CentOS 5, IPA Dictionary, run in "Search mode"
>Reporter: Kazuaki Hiraga
>  Labels: features
>
> Kuromoji, the Japanese morphological analyzer, cannot decompose every 
> Japanese Katakana compound token into sub-tokens. Some Katakana tokens are 
> decomposed, but not all of them. For instance, "トートバッグ" (tote bag) and 
> "ショルダーバッグ" (shoulder bag) are not decomposed into "トート バッグ" and 
> "ショルダー バッグ", although the IPA dictionary has an entry for "バッグ". I 
> would like the decompose feature to apply to every Katakana token whose 
> sub-tokens are in the dictionary, or else an option that forces decomposition 
> of every Katakana token.


[jira] [Commented] (LUCENE-3922) Add Japanese Kanji number normalization to Kuromoji

2012-10-06 Thread Lance Norskog (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-3922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13471117#comment-13471117
 ] 

Lance Norskog commented on LUCENE-3922:
---

bq. On the other hand, I agree with Christian about not preserving leading 
zeros. So, "◯◯七" doesn't need to become "007".
This example shows why leading zeros should be preserved :)

There are different kinds of text search. Searching for media titles like James 
Bond movies is a very different thing from searching newspaper articles. You 
might want to find "◯◯七" as the Japanese-language release and "007" as the 
English-language release. These numbers are brands, not numbers. 
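
To make the point concrete, here is a small sketch of digit-by-digit transliteration, which keeps leading zeros, next to value-based conversion, which drops them. The function name and table are assumptions for illustration and are not taken from the LUCENE-3922 patch.

```python
# Positional Kanji digits only (no 十/百/千/万 units).
KANJI_DIGITS = {
    "\u25ef": "0",  # LARGE CIRCLE, the zero used in the example above
    "\u3007": "0",  # IDEOGRAPHIC NUMBER ZERO
    "一": "1", "二": "2", "三": "3", "四": "4", "五": "5",
    "六": "6", "七": "7", "八": "8", "九": "9",
}


def transliterate_digits(text: str) -> str:
    """Map each Kanji digit to its ASCII digit; leave other characters alone."""
    return "".join(KANJI_DIGITS.get(ch, ch) for ch in text)


brand = transliterate_digits("\u25ef\u25ef七")  # keeps the title-like form
value = str(int(brand))                          # numeric value, zeros lost
print(brand, value)
```

Indexing the transliterated string lets a search for the brand match both releases, while indexing only the integer value would collapse it to a bare 7.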

> Add Japanese Kanji number normalization to Kuromoji
> ---
>
> Key: LUCENE-3922
> URL: https://issues.apache.org/jira/browse/LUCENE-3922
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/analysis
>Affects Versions: 4.0-ALPHA
>Reporter: Kazuaki Hiraga
>  Labels: features
> Attachments: LUCENE-3922.patch
>
>
> Japanese people use Kanji numerals instead of Arabic numerals for writing 
> prices, addresses and so on, e.g. 12万4800円 (124,800 JPY), 二番町三ノ二 (3-2 
> Nibancho) and 十二月 (December). So, we would like to normalize those Kanji 
> numerals to Arabic numerals (I don't think we need to have a capability to 
> normalize to Kanji numerals).
>  
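
One way to picture the normalization requested above is the following sketch. It is an assumption about how such a converter could look, not the attached patch: the function name `kanji_to_int` is invented, it expects the caller to have already isolated the numeral (without 円, 月 and so on), and it only handles units up to 億.

```python
DIGITS = {"\u3007": 0, "一": 1, "二": 2, "三": 3, "四": 4,
          "五": 5, "六": 6, "七": 7, "八": 8, "九": 9}
SMALL_UNITS = {"十": 10, "百": 100, "千": 1000}
LARGE_UNITS = {"万": 10**4, "億": 10**8}


def kanji_to_int(numeral: str) -> int:
    """Convert a numeral made of ASCII/Kanji digits and Kanji units to an int."""
    total = 0    # sum of completed 万/億 groups
    section = 0  # value accumulated inside the current group
    number = 0   # digits seen since the last unit
    for ch in numeral:
        if ch.isascii() and ch.isdigit():
            number = number * 10 + int(ch)
        elif ch in DIGITS:
            number = number * 10 + DIGITS[ch]
        elif ch in SMALL_UNITS:
            # A bare unit such as 十 means 1 x 10.
            section += (number or 1) * SMALL_UNITS[ch]
            number = 0
        elif ch in LARGE_UNITS:
            total += (section + number) * LARGE_UNITS[ch]
            section = 0
            number = 0
        else:
            raise ValueError(f"unexpected character in numeral: {ch!r}")
    return total + section + number


print(kanji_to_int("12万4800"))  # the price example, without 円
print(kanji_to_int("十二"))      # the month example, without 月
```

Address-like strings such as 二番町三ノ二 would need a tokenizer to pick out the individual numerals first; that step is outside this sketch.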


[jira] [Commented] (LUCENE-3922) Add Japanese Kanji number normalization to Kuromoji

2012-10-04 Thread Lance Norskog (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-3922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13469936#comment-13469936
 ] 

Lance Norskog commented on LUCENE-3922:
---

Kazuaki, do you have any comments on this fix?

> Add Japanese Kanji number normalization to Kuromoji
> ---
>
> Key: LUCENE-3922
> URL: https://issues.apache.org/jira/browse/LUCENE-3922
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/analysis
>Affects Versions: 4.0-ALPHA
>Reporter: Kazuaki Hiraga
>  Labels: features
> Attachments: LUCENE-3922.patch
>
>
> Japanese people use Kanji numerals instead of Arabic numerals for writing 
> prices, addresses and so on, e.g. 12万4800円 (124,800 JPY), 二番町三ノ二 (3-2 
> Nibancho) and 十二月 (December). So, we would like to normalize those Kanji 
> numerals to Arabic numerals (I don't think we need to have a capability to 
> normalize to Kanji numerals).
>  


[jira] [Commented] (LUCENE-3921) Add decompose compound Japanese Katakana token capability to Kuromoji

2012-10-04 Thread Lance Norskog (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-3921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13469934#comment-13469934
 ] 

Lance Norskog commented on LUCENE-3921:
---

I have discovered a similar problem with the Smart Chinese toolkit. Would the 
same approach work for both languages? Would it be worth solving this problem 
with a generic tool rather than a language-specific one?

> Add decompose compound Japanese Katakana token capability to Kuromoji
> -
>
> Key: LUCENE-3921
> URL: https://issues.apache.org/jira/browse/LUCENE-3921
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Affects Versions: 4.0-ALPHA
> Environment: CentOS 5, IPA Dictionary, run in "Search mode"
>Reporter: Kazuaki Hiraga
>  Labels: features
>
> Kuromoji, the Japanese morphological analyzer, cannot decompose every 
> Japanese Katakana compound token into sub-tokens. Some Katakana tokens are 
> decomposed, but not all of them. For instance, "トートバッグ" (tote bag) and 
> "ショルダーバッグ" (shoulder bag) are not decomposed into "トート バッグ" and 
> "ショルダー バッグ", although the IPA dictionary has an entry for "バッグ". I 
> would like the decompose feature to apply to every Katakana token whose 
> sub-tokens are in the dictionary, or else an option that forces decomposition 
> of every Katakana token.


[jira] [Closed] (SOLR-3760) Build packaging of complex contrib packages just plain does not work

2012-10-04 Thread Lance Norskog (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-3760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lance Norskog closed SOLR-3760.
---

   Resolution: Fixed
Fix Version/s: 4.0

The Solr factories have all been moved into Lucene, and so the zig-zag 
dependency problem no longer exists. For the rest of the topic, some other time.

> Build packaging of complex contrib packages just plain does not work
> 
>
> Key: SOLR-3760
> URL: https://issues.apache.org/jira/browse/SOLR-3760
> Project: Solr
>  Issue Type: Improvement
>  Components: Build
>    Reporter: Lance Norskog
> Fix For: 4.0
>
>
> The build system packages Lucene libraries in the Solr war, but it does not 
> pack the libraries required by those Lucene libraries. The UIMA and 
> analysis-extras contrib packages have factories for the Lucene libraries.
> The net effect is that when solrconfig.xml includes <lib> directives for 
> dist/xxx-contribX-xxx.jar and solr/contrib/contribX/lib, loading fails: the 
> Lucene analyzer file inside the Solr war cannot find the library files in 
> solr/contrib/contribX/lib, because the classloader for the war does not see 
> the libraries from the <lib> directives.
> Two alternative fixes are presented below.


[jira] [Commented] (LUCENE-2899) Add OpenNLP Analysis capabilities as a module

2012-09-30 Thread Lance Norskog (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-2899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13466565#comment-13466565
 ] 

Lance Norskog commented on LUCENE-2899:
---

Thank you!

This worked when I posted it. There have been many changes in 4.x and trunk 
since then. For example, all of the tokenizer and filter factories moved to 
Lucene from Solr. I'm waiting until 4.0 is finished before I redo this patch. 




> Add OpenNLP Analysis capabilities as a module
> -
>
> Key: LUCENE-2899
> URL: https://issues.apache.org/jira/browse/LUCENE-2899
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/analysis
>Reporter: Grant Ingersoll
>Assignee: Grant Ingersoll
>Priority: Minor
> Attachments: LUCENE-2899.patch, LUCENE-2899.patch, LUCENE-2899.patch, 
> LUCENE-2899.patch, LUCENE-2899.patch, LUCENE-2899.patch, OpenNLPFilter.java, 
> OpenNLPTokenizer.java, opennlp_trunk.patch
>
>
> Now that OpenNLP is an ASF project and has a nice license, it would be nice 
> to have a submodule (under analysis) that exposed capabilities for it. Drew 
> Farris, Tom Morton and I have code that does:
> * Sentence Detection as a Tokenizer (could also be a TokenFilter, although it 
> would have to change slightly to buffer tokens)
> * NamedEntity recognition as a TokenFilter
> We are also planning a Tokenizer/TokenFilter that can put parts of speech as 
> either payloads (PartOfSpeechAttribute?) on a token or at the same position.
> I'd propose it go under:
> modules/analysis/opennlp


[jira] [Commented] (SOLR-3218) Range faceting support for CurrencyField


[ 
https://issues.apache.org/jira/browse/SOLR-3218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13465322#comment-13465322
 ] 

Lance Norskog commented on SOLR-3218:
-

+1 for this feature. It makes the currency type 10x more compelling.

> Range faceting support for CurrencyField
> 
>
> Key: SOLR-3218
> URL: https://issues.apache.org/jira/browse/SOLR-3218
> Project: Solr
>  Issue Type: Improvement
>  Components: Schema and Analysis
>Reporter: Jan Høydahl
> Fix For: 4.1
>
> Attachments: SOLR-3218-1.patch, SOLR-3218-2.patch, SOLR-3218.patch, 
> SOLR-3218.patch, SOLR-3218.patch
>
>
> Spinoff from SOLR-2202. Need to add range faceting capabilities for 
> CurrencyField


[jira] [Resolved] (SOLR-3898) Mouse-over help in Analysis Browser does nothing


 [ 
https://issues.apache.org/jira/browse/SOLR-3898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lance Norskog resolved SOLR-3898.
-

Resolution: Invalid

> Mouse-over help in Analysis Browser does nothing
> 
>
> Key: SOLR-3898
> URL: https://issues.apache.org/jira/browse/SOLR-3898
> Project: Solr
>  Issue Type: Improvement
>  Components: web gui
>    Reporter: Lance Norskog
>Priority: Minor
> Attachments: Screen-Shot-2012-09-27-at-12.06.58-PM.png, 
> Screen-Shot-2012-09-27-at-9.55.22-AM.png
>
>
> The Analysis UI has a mouse-over question mark for the acronyms shown for 
> every stage in the analysis pipeline. Clicking on this does nothing.
> I guess this is on the 'todo' list?


[jira] [Commented] (SOLR-3898) Mouse-over help in Analysis Browser does nothing


[ 
https://issues.apache.org/jira/browse/SOLR-3898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13465080#comment-13465080
 ] 

Lance Norskog commented on SOLR-3898:
-

I'm on Snow Leopard, with Safari Version 5.1.7 (6534.57.2). I do not know the 
versions or updates. I will close this.

> Mouse-over help in Analysis Browser does nothing
> 
>
> Key: SOLR-3898
> URL: https://issues.apache.org/jira/browse/SOLR-3898
> Project: Solr
>  Issue Type: Improvement
>  Components: web gui
>Reporter: Lance Norskog
>Priority: Minor
> Attachments: Screen-Shot-2012-09-27-at-12.06.58-PM.png, 
> Screen-Shot-2012-09-27-at-9.55.22-AM.png
>
>
> The Analysis UI has a mouse-over question mark for the acronyms shown for 
> every stage in the analysis pipeline. Clicking on this does nothing.
> I guess this is on the 'todo' list?


[jira] [Commented] (SOLR-3898) Mouse-over help in Analysis Browser does nothing


[ 
https://issues.apache.org/jira/browse/SOLR-3898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13464581#comment-13464581
 ] 

Lance Norskog commented on SOLR-3898:
-

This does not work in Safari. Oh well.

> Mouse-over help in Analysis Browser does nothing
> 
>
> Key: SOLR-3898
> URL: https://issues.apache.org/jira/browse/SOLR-3898
> Project: Solr
>  Issue Type: Improvement
>  Components: web gui
>Reporter: Lance Norskog
>Priority: Minor
> Attachments: Screen-Shot-2012-09-27-at-9.55.22-AM.png
>
>
> The Analysis UI has a mouse-over question mark for the acronyms shown for 
> every stage in the analysis pipeline. Clicking on this does nothing.
> I guess this is on the 'todo' list?


[jira] [Created] (SOLR-3898) Mouse-over help in Analysis Browser does nothing

2012-09-26 Thread Lance Norskog (JIRA)

Lance Norskog created SOLR-3898:
---

 Summary: Mouse-over help in Analysis Browser does nothing
 Key: SOLR-3898
 URL: https://issues.apache.org/jira/browse/SOLR-3898
 Project: Solr
  Issue Type: Improvement
  Components: web gui
Reporter: Lance Norskog
Priority: Minor


The Analysis UI has a mouse-over question mark for the acronyms shown for every 
stage in the analysis pipeline. Clicking on this does nothing.

I guess this is on the 'todo' list?


[jira] [Commented] (SOLR-3734) Forever loop in schema browser

2012-09-26 Thread Lance Norskog (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-3734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13464413#comment-13464413
 ] 

Lance Norskog commented on SOLR-3734:
-

Hi-

Yes, this works in branch_4x, using the schema I submitted. I do not have the 
ability to test whether it handles exceptions well. When you are writing new 
analyzer components, it is helpful for the UI to say "your code blew up".


> Forever loop in schema browser
> --
>
> Key: SOLR-3734
> URL: https://issues.apache.org/jira/browse/SOLR-3734
> Project: Solr
>  Issue Type: Bug
>  Components: Schema and Analysis, web gui
>Reporter: Lance Norskog
>Assignee: Stefan Matheis (steffkes)
> Attachments: SOLR-3734.patch, SOLR-3734.patch, 
> SOLR-3734_schema_browser_blocks_solr_conf_dir.zip
>
>
> When I start Solr with the attached conf directory, and hit the Schema 
> Browser, the loading circle spins permanently. 
> I don't know if the problem is in the UI or in Solr. The UI does not display 
> the Ajax solr calls, and I don't have a debugging proxy.


[jira] [Updated] (LUCENE-2510) migrate solr analysis factories to analyzers module


 [ 
https://issues.apache.org/jira/browse/LUCENE-2510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lance Norskog updated LUCENE-2510:
--

Comment: was deleted

(was: bq. We should open new issues for:
* Update the goddamn wiki

If you're going to move the walls, please update the blueprints :))

> migrate solr analysis factories to analyzers module
> ---
>
> Key: LUCENE-2510
> URL: https://issues.apache.org/jira/browse/LUCENE-2510
> Project: Lucene - Core
>  Issue Type: Task
>  Components: modules/analysis
>Affects Versions: 4.0-ALPHA
>Reporter: Robert Muir
>Assignee: Uwe Schindler
> Fix For: 4.0-BETA, 5.0
>
> Attachments: LUCENE-2510-movefactories.sh, 
> LUCENE-2510-movefactories.sh, LUCENE-2510-multitermcomponent.patch, 
> LUCENE-2510-multitermcomponent.patch, LUCENE-2510-parent-classes.patch, 
> LUCENE-2510-parent-classes.patch, LUCENE-2510-parent-classes.patch, 
> LUCENE-2510.patch, LUCENE-2510.patch, LUCENE-2510.patch, 
> LUCENE-2510-resourceloader-bw.patch, LUCENE-2510-simplify-tests.patch
>
>
> In LUCENE-2413 all TokenStreams were consolidated into the analyzers module.
> This is a good step, but I think the next step is to put the Solr factories 
> into the analyzers module, too.
> This would make analyzers artifacts plugins to both lucene and solr, with 
> benefits such as:
> * users could use the old analyzers module with solr, too. This is a good 
> step to use real library versions instead of Version for backwards compat.
> * analyzers modules such as smartcn and icu, that aren't currently available 
> to solr users due to large file sizes or dependencies, would be simple 
> optional plugins to solr and easily available to users that want them.
> Rough sketch in this thread: 
> http://www.lucidimagination.com/search/document/3465a0e55ba94d58/solr_and_analyzers_module
> Practically, I haven't looked much and don't really have a plan for how this 
> will work yet, so ideas are very welcome.


[jira] [Comment Edited] (LUCENE-2510) migrate solr analysis factories to analyzers module


[ 
https://issues.apache.org/jira/browse/LUCENE-2510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13461598#comment-13461598
 ] 

Lance Norskog edited comment on LUCENE-2510 at 9/24/12 3:47 PM:


bq. We should open new issues for:
* Update the goddamn wiki

If you're going to move the walls, please update the blueprints :)

  was (Author: lancenorskog):
bq. We should open new issues for:
* Update the goddamn wiki
* Add support to "solr.class" for classes under org.apache.lucene

If you're going to move the walls, please update the blueprints :)
  
> migrate solr analysis factories to analyzers module
> ---
>
> Key: LUCENE-2510
> URL: https://issues.apache.org/jira/browse/LUCENE-2510
> Project: Lucene - Core
>  Issue Type: Task
>  Components: modules/analysis
>Affects Versions: 4.0-ALPHA
>Reporter: Robert Muir
>Assignee: Uwe Schindler
> Fix For: 4.0-BETA, 5.0
>
> Attachments: LUCENE-2510-movefactories.sh, 
> LUCENE-2510-movefactories.sh, LUCENE-2510-multitermcomponent.patch, 
> LUCENE-2510-multitermcomponent.patch, LUCENE-2510-parent-classes.patch, 
> LUCENE-2510-parent-classes.patch, LUCENE-2510-parent-classes.patch, 
> LUCENE-2510.patch, LUCENE-2510.patch, LUCENE-2510.patch, 
> LUCENE-2510-resourceloader-bw.patch, LUCENE-2510-simplify-tests.patch
>
>
> In LUCENE-2413 all TokenStreams were consolidated into the analyzers module.
> This is a good step, but I think the next step is to put the Solr factories 
> into the analyzers module, too.
> This would make analyzers artifacts plugins to both lucene and solr, with 
> benefits such as:
> * users could use the old analyzers module with solr, too. This is a good 
> step to use real library versions instead of Version for backwards compat.
> * analyzers modules such as smartcn and icu, that aren't currently available 
> to solr users due to large file sizes or dependencies, would be simple 
> optional plugins to solr and easily available to users that want them.
> Rough sketch in this thread: 
> http://www.lucidimagination.com/search/document/3465a0e55ba94d58/solr_and_analyzers_module
> Practically, I haven't looked much and don't really have a plan for how this 
> will work yet, so ideas are very welcome.


[jira] [Commented] (LUCENE-2510) migrate solr analysis factories to analyzers module


[ 
https://issues.apache.org/jira/browse/LUCENE-2510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13461598#comment-13461598
 ] 

Lance Norskog commented on LUCENE-2510:
---

bq. We should open new issues for:
* Update the goddamn wiki
* Add support to "solr.class" for classes under org.apache.lucene

If you're going to move the walls, please update the blueprints :)

> migrate solr analysis factories to analyzers module
> ---
>
> Key: LUCENE-2510
> URL: https://issues.apache.org/jira/browse/LUCENE-2510
> Project: Lucene - Core
>  Issue Type: Task
>  Components: modules/analysis
>Affects Versions: 4.0-ALPHA
>Reporter: Robert Muir
>Assignee: Uwe Schindler
> Fix For: 4.0-BETA, 5.0
>
> Attachments: LUCENE-2510-movefactories.sh, 
> LUCENE-2510-movefactories.sh, LUCENE-2510-multitermcomponent.patch, 
> LUCENE-2510-multitermcomponent.patch, LUCENE-2510-parent-classes.patch, 
> LUCENE-2510-parent-classes.patch, LUCENE-2510-parent-classes.patch, 
> LUCENE-2510.patch, LUCENE-2510.patch, LUCENE-2510.patch, 
> LUCENE-2510-resourceloader-bw.patch, LUCENE-2510-simplify-tests.patch
>
>
> In LUCENE-2413 all TokenStreams were consolidated into the analyzers module.
> This is a good step, but I think the next step is to put the Solr factories 
> into the analyzers module, too.
> This would make analyzers artifacts plugins to both lucene and solr, with 
> benefits such as:
> * users could use the old analyzers module with solr, too. This is a good 
> step to use real library versions instead of Version for backwards compat.
> * analyzers modules such as smartcn and icu, that aren't currently available 
> to solr users due to large file sizes or dependencies, would be simple 
> optional plugins to solr and easily available to users that want them.
> Rough sketch in this thread: 
> http://www.lucidimagination.com/search/document/3465a0e55ba94d58/solr_and_analyzers_module
> Practically, I haven't looked much and don't really have a plan for how this 
> will work yet, so ideas are very welcome.


[jira] [Commented] (SOLR-3653) Custom bigramming filter for to handle Smart Chinese edge cases


[ 
https://issues.apache.org/jira/browse/SOLR-3653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13461573#comment-13461573
 ] 

Lance Norskog commented on SOLR-3653:
-

Another note: one trigram is the number 15. There are several conventions for 
representing integers, including regional quirks. There is no 'number 
canonicalizer' in the Smart Chinese toolkit. This could be a problem with 
formal documents: historical, government docs, treaties and the like.

[http://en.wikipedia.org/wiki/Chinese_numerals#Whole_numbers]

> Custom bigramming filter for to handle Smart Chinese edge cases
> ---
>
> Key: SOLR-3653
> URL: https://issues.apache.org/jira/browse/SOLR-3653
> Project: Solr
>  Issue Type: New Feature
>  Components: Schema and Analysis
>Reporter: Lance Norskog
> Attachments: SmartChineseType.pdf, SOLR-3653.patch, 
> translations_450.five2thirteen.txt, translations_first_500.quad.txt, 
> translations_first_500.trigrams.txt
>
>
> The "Smart" Simplified Chinese toolkit in lucene/analysis/smartcn does not 
> work in some edge cases. It fails to split certain words which were not part 
> of the dictionary or training corpus. 
> This patch supplies a bigramming class to handle these occasional mistakes. 
> The algorithm creates bigrams out of all "words" longer than two ideograms.
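
As a rough picture of what "bigrams out of all words longer than two ideograms" could mean, here is a minimal sketch. It assumes overlapping bigrams, which the description does not confirm, and the function name is invented; the actual filter in SOLR-3653.patch may behave differently.

```python
from typing import Iterable, List


def bigram_long_tokens(tokens: Iterable[str], max_len: int = 2) -> List[str]:
    """Pass short tokens through; replace longer ones with overlapping bigrams."""
    out: List[str] = []
    for tok in tokens:
        if len(tok) <= max_len:
            out.append(tok)
        else:
            # Overlapping pairs: ABCD -> AB, BC, CD (an assumption).
            out.extend(tok[i:i + 2] for i in range(len(tok) - 1))
    return out


print(bigram_long_tokens(["AB", "CDEF", "G"]))
```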


[jira] [Updated] (SOLR-3653) Custom bigramming filter for to handle Smart Chinese edge cases


 [ 
https://issues.apache.org/jira/browse/SOLR-3653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lance Norskog updated SOLR-3653:


Attachment: translations_450.five2thirteen.txt
translations_first_500.trigrams.txt
translations_first_500.quad.txt

> Custom bigramming filter for to handle Smart Chinese edge cases
> ---
>
> Key: SOLR-3653
> URL: https://issues.apache.org/jira/browse/SOLR-3653
> Project: Solr
>  Issue Type: New Feature
>  Components: Schema and Analysis
>    Reporter: Lance Norskog
> Attachments: SmartChineseType.pdf, SOLR-3653.patch, 
> translations_450.five2thirteen.txt, translations_first_500.quad.txt, 
> translations_first_500.trigrams.txt
>
>
> The "Smart" Simplified Chinese toolkit in lucene/analysis/smartcn does not 
> work in some edge cases. It fails to split certain words which were not part 
> of the dictionary or training corpus. 
> This patch supplies a bigramming class to handle these occasional mistakes. 
> The algorithm creates bigrams out of all "words" longer than two ideograms.


[jira] [Commented] (SOLR-3653) Custom bigramming filter for to handle Smart Chinese edge cases


[ 
https://issues.apache.org/jira/browse/SOLR-3653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13461572#comment-13461572
 ] 

Lance Norskog commented on SOLR-3653:
-

I ran some counts on a database of 300k Chinese legal documents. The index has 
a unigram field based on the StandardAnalyzer, a bigram field based on the CJK 
analyzer, and a Smart Chinese field. I pulled the terms for all of them and 
filtered for Chinese ideograms only. These are text unigrams, with 

* The unigram field had 55k terms. 
* The bigram field had 1.8 million terms. 
* The Smart Chinese field had 417k terms:
** unigrams: 9.6k
** bigrams: 40k
** trigrams: 14.6k
** four: 5.6k
** five: 300
** six: 70
** seven: 51
** eight: 19
** nine: 7
** ten: 2
** eleven: 3
** twelve: 2
** thirteen: 3

The 4+ ngrams are essentially parsing failures by the Smart Chinese tokenizer. 
I have attached three Google Translate versions of the longer ngrams. 
'translations_first_500.trigrams.txt' and 'translations_first_500.quad.txt' 
contain the most common 3-ideogram and 4-ideogram terms. They have a lot of 
phrases which should have been split. 'translations_450.five2thirteen.txt' 
contains 450 ngrams which are 5 ideograms or longer. The longer ones have a lot of formal 
geographical names, government organization names and official propaganda 
phrases, more as the length increases. 

For this corpus, based on the above breakdown and on other experience:
# CJK is a waste of disk space. Bigrams introduce a ton of noise.
# Unigrams might work well if you only do strict phrase searches. But searching 
for A, B, and C separately when given ABC is useless.
# If you search for raw country names, Smart Chinese lets you down when the 
document uses the formal name. 

Smart Chinese really does need to be split into bigrams. To cut bigram noise, I 
would take the database of bigrams that it generates, and then use these to 
guide splitting 3+ grams into bigrams. That is, if it ever generates AB, then 
the splitter turns ABCD into (AB CD). BC would be considered 'bigram noise'. 
Similarly, if Smart Chinese generates EF, then DEFG would become (D EF G).
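
The dictionary-guided split described above could be sketched roughly as 
follows. This is only an illustration in Python, not the code in 
SOLR-3653.patch: the function name, the left-to-right scan, and the rule for 
cutting leftover runs into consecutive pairs are all assumptions made for the 
example.

```python
def split_with_known_bigrams(token, known_bigrams):
    """Split a long token into 1- and 2-character pieces.

    Bigrams that the tokenizer has already produced elsewhere
    (known_bigrams) act as anchors; everything between anchors is
    treated as leftover text.
    """
    pieces = []
    pending = ""  # characters not yet covered by a known bigram

    def flush():
        nonlocal pending
        # Assumption: leftover runs are cut into consecutive pairs,
        # with a single trailing character if the run has odd length.
        for j in range(0, len(pending), 2):
            pieces.append(pending[j:j + 2])
        pending = ""

    i = 0
    while i < len(token):
        candidate = token[i:i + 2]
        if len(candidate) == 2 and candidate in known_bigrams:
            flush()                   # emit whatever preceded the anchor
            pieces.append(candidate)  # emit the anchor itself
            i += 2
        else:
            pending += token[i]
            i += 1
    flush()
    return pieces
```

With known bigrams {AB}, the token ABCD comes out as AB, CD; with {EF}, the 
token DEFG comes out as D, EF, G. The overlapping candidate BC is never 
emitted, which is the 'bigram noise' the comment wants to avoid.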

However, a good fallback would be to have two fields, Smart Chinese and 
unigrams, with Smart Chinese boosted upwards and unigrams used only for strict 
phrase search. With a high term count, bigrams are not helpful. You might even 
want to search Smart Chinese first, and then do unigram loose phrase search 
only if the recall is too low or the user is unhappy with the Smart Chinese 
results.
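
The "search Smart Chinese first, fall back to unigrams" idea might look 
something like the sketch below. Everything named here is hypothetical: 
search_smartcn and search_unigram_phrase stand in for whatever query code 
targets the two fields, and the min_hits threshold and slop value are 
arbitrary placeholders for "recall is too low" and "loose phrase search".

```python
def search_with_fallback(query, search_smartcn, search_unigram_phrase,
                         min_hits=10, slop=2):
    """Return Smart Chinese hits, topped up with loose unigram phrase hits.

    search_smartcn(query) and search_unigram_phrase(query, slop) are
    hypothetical callables that return ranked lists of document ids.
    """
    hits = list(search_smartcn(query))
    if len(hits) >= min_hits:
        return hits  # recall looks sufficient, skip the unigram field

    seen = set(hits)
    for doc_id in search_unigram_phrase(query, slop):
        if doc_id not in seen:  # Smart Chinese hits stay ranked first
            hits.append(doc_id)
            seen.add(doc_id)
    return hits
```

Appending the unigram hits after the Smart Chinese hits is one simple way to 
keep the Smart Chinese field "boosted upwards"; a real setup would more likely 
express this through field boosts in the query itself.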


> Custom bigramming filter for to handle Smart Chinese edge cases
> ---
>
> Key: SOLR-3653
> URL: https://issues.apache.org/jira/browse/SOLR-3653
> Project: Solr
>  Issue Type: New Feature
>      Components: Schema and Analysis
>Reporter: Lance Norskog
> Attachments: SmartChineseType.pdf, SOLR-3653.patch
>
>
> The "Smart" Simplified Chinese toolkit in lucene/analysis/smartcn does not 
> work in some edge cases. It fails to split certain words which were not part 
> of the dictionary or training corpus. 
> This patch supplies a bigramming class to handle these occasional mistakes. 
> The algorithm creates bigrams out of all "words" longer than two ideograms.


[jira] [Commented] (LUCENE-4345) Create a Classification module

2012-09-16 Thread Lance Norskog (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-4345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13456734#comment-13456734
 ] 

Lance Norskog commented on LUCENE-4345:
---

bq. I don't think this should be using payloads to pull POS tags: the purpose 
of payloads is when you need something stored in the actual index (and should 
be limited to e.g. a single byte), it's not type-safe but application-specific.
Yes, some NLP applications want actual payloads. For entity resolution you can 
have a UI add little icons for person, place, etc. In the OpenNLP patch it just 
seemed silly to add another Attribute type.

bq. If we think its useful for classifiers to limit the analysis to certain POS 
categories, then instead we should factor out a minimal POSAttribute 
sub-interface with something very generic like isNominal()/isVerbal() that can 
actually be implemented by different taggers with different tag sets across 
different languages.
There is a generic tag subset, with mapping lists from the most common tagsets 
for different languages. The mappings reduce each tagset to 12 POS tags. Adding 
this mapper to the OpenNLP patch is on my large TODO list. There is even a 
mapping set for the Twitter Parts-of-Speech tagger.
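
As a rough illustration of mapping a tagger-specific tagset down to a small 
generic one, and of the isNominal()/isVerbal() style of check suggested in the 
quote above, here is a minimal Python sketch. It assumes Penn Treebank input 
tags and covers only a handful of them; the coarse labels and function names 
are made up for the example and are not the 12-tag set or the mapping files 
referred to in the comment.

```python
# Illustrative subset of Penn Treebank tags mapped to coarse categories.
# The coarse labels are placeholders chosen for this sketch.
PENN_TO_COARSE = {
    "NN": "NOUN", "NNS": "NOUN", "NNP": "NOUN", "NNPS": "NOUN",
    "VB": "VERB", "VBD": "VERB", "VBG": "VERB",
    "VBN": "VERB", "VBP": "VERB", "VBZ": "VERB",
    "JJ": "ADJ", "JJR": "ADJ", "JJS": "ADJ",
    "RB": "ADV", "RBR": "ADV", "RBS": "ADV",
}


def coarse_tag(fine_tag, mapping=PENN_TO_COARSE):
    """Map a tagger-specific tag to a coarse category ('OTHER' if unmapped)."""
    return mapping.get(fine_tag, "OTHER")


def is_nominal(fine_tag):
    return coarse_tag(fine_tag) == "NOUN"


def is_verbal(fine_tag):
    return coarse_tag(fine_tag) == "VERB"
```

Supporting a different tagger or language would then mean supplying a 
different mapping table, while downstream filters only look at the coarse 
category.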

bq. This is currently how Kuromoji works, it has a POS-based stopfilter. these 
are trivial to write. I also added a filter to remove payloads. If you use a 
different Attribute for the analysis chain, then you need a 'change 
POSAttribute to PayloadAttribute' at the bottom of the analysis chain.
Yes, I added one also. Some of the Kuromoji Attributes should be pulled up into 
the generic set.

> Create a Classification module
> --
>
> Key: LUCENE-4345
> URL: https://issues.apache.org/jira/browse/LUCENE-4345
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Tommaso Teofili
>Assignee: Tommaso Teofili
>Priority: Minor
> Attachments: LUCENE-4345_2.patch, LUCENE-4345.patch, 
> SOLR-3700_2.patch, SOLR-3700.patch
>
>
> Lucene/Solr can host huge sets of documents containing lots of information in 
> fields so that these can be used as training examples (w/ features) in order 
> to very quickly create classifiers algorithms to use on new documents and / 
> or to provide an additional service.
> So the idea is to create a contrib module (called 'classification') to host a 
> ClassificationComponent that will use already seen data (the indexed 
> documents / fields) to classify new documents / text fragments.
> The first version will contain a (simplistic) Lucene based Naive Bayes 
> classifier but more implementations should be added in the future.


[jira] [Comment Edited] (LUCENE-4345) Create a Classification module


[ 
https://issues.apache.org/jira/browse/LUCENE-4345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13456053#comment-13456053
 ] 

Lance Norskog edited comment on LUCENE-4345 at 9/15/12 10:20 AM:
-

I recently did some related research in text analysis and found that limiting 
terms to nouns and verbs gave a 10-15% increase in all variations of the test.

So, filtering terms by Parts-of-Speech annotation will be very helpful. My 
OpenNLP patch includes a FilterPayloadsFilter, which keeps or removes terms 
based on a list of text payloads.

[http://ultrawhizbang.blogspot.com/2012/09/document-summarization-with-lsa-1.html]
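
The keep-or-remove behaviour described above can be illustrated with a short 
Python sketch. This is not the FilterPayloadsFilter from the patch: the 
function name, the (term, payload) tuple representation and the keep flag are 
assumptions made for the example.

```python
def filter_by_payload(tokens, payloads, keep=True):
    """Keep (or drop) tokens whose text payload is in `payloads`.

    tokens   -- iterable of (term, payload) pairs, e.g. ("dogs", "NNS")
    payloads -- the payload strings to match, e.g. POS tags
    keep     -- True: keep only matching tokens; False: remove them
    """
    wanted = set(payloads)
    return [(term, payload) for term, payload in tokens
            if (payload in wanted) == keep]
```

For example, passing the noun and verb tags with keep=True would limit a token 
stream to nouns and verbs, while keep=False would strip those same terms out.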

  was (Author: lancenorskog):
I recently did some related research in text analysis and found that 
limiting terms to nouns&verbs was a 10-15% increase in all variations of the 
test.

So, filtering terms from Parts-of-Speech annotation will be very helpful. In my 
OpenNLP patch is a FilterPayloadsFilter which keeps or rips out from a list of 
text payloads.

[http://ultrawhizbang.blogspot.com/2012/09/document-summarization-with-lsa-1.html]
  


[jira] [Commented] (LUCENE-4345) Create a Classification module


[ 
https://issues.apache.org/jira/browse/LUCENE-4345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13456053#comment-13456053
 ] 

Lance Norskog commented on LUCENE-4345:
---

I recently did some related research in text analysis and found that limiting 
terms to nouns&verbs was a 10-15% increase in all variations of the test.

So, filtering terms from Parts-of-Speech annotation will be very helpful. In my 
OpenNLP patch is a FilterPayloadsFilter which keeps or rips out from a list of 
text payloads.

[http://ultrawhizbang.blogspot.com/2012/09/document-summarization-with-lsa-1.html]



[jira] [Commented] (SOLR-3625) Solr conf class loader does not find indirect jars - regression


[ 
https://issues.apache.org/jira/browse/SOLR-3625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13455684#comment-13455684
 ] 

Lance Norskog commented on SOLR-3625:
-

I did not find a problem with the order of <lib> directives inside 
solrconfig.xml. All <lib> directives in solrconfig.xml seem to share one 
classloader. The problem happens when a Lucene jar refers to a third-party jar 
and Solr code outside the war tries to load the factory.

I have added a [detailed explanation of the 
problem|https://issues.apache.org/jira/browse/SOLR-3760?focusedCommentId=13455683&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13455683]
 to [SOLR-3760]
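
For readers unfamiliar with how a NoClassDefFoundError of this kind can arise, 
here is a toy Python model of Java's parent-first class loader delegation. It 
is a general illustration only: the ToyLoader class and the way classes are 
assigned to loaders in the usage note are invented for the example, and it is 
not a diagnosis of this issue.

```python
class ToyLoader:
    """Toy stand-in for a Java ClassLoader with parent-first delegation."""

    def __init__(self, name, classes, parent=None):
        self.name = name
        self.classes = set(classes)  # classes this loader can define itself
        self.parent = parent

    def find(self, class_name):
        """Return the loader that defines class_name, or None if not found."""
        if self.parent is not None:
            owner = self.parent.find(class_name)  # always ask the parent first
            if owner is not None:
                return owner
        return self if class_name in self.classes else None


def can_resolve(loader, class_name, dependency):
    """True if class_name, loaded via `loader`, can see `dependency`.

    A class resolves its references through the loader that defined it,
    and that loader delegates only upwards, never to its children.
    """
    owner = loader.find(class_name)
    return owner is not None and owner.find(dependency) is not None
```

If a parent loader defines a factory class while a child loader holds the jar 
containing the factory's dependency, can_resolve returns False, because the 
parent never looks in the child. Putting both in the same loader, or putting 
the dependency in the parent, makes it return True.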

> Solr conf class loader does not find indirect jars - regression
> ---
>
> Key: SOLR-3625
> URL: https://issues.apache.org/jira/browse/SOLR-3625
> Project: Solr
>  Issue Type: Bug
>  Components: Schema and Analysis
>Reporter: Lance Norskog
>Assignee: Hoss Man
> Fix For: 4.0, 5.0
>
>
> The SolrConf class loader does not find indirectly used jars from external 
> lib directories. This is a regression. It worked as of July 2, 2012, when I 
> posted the most recent OpenNLP patch ([LUCENE-2899]). Something has broken 
> since then.
> This regression is true in both 4.x and the trunk. Both worked on July 2, 
> 2012.
> I have a project (the OpenNLP plugin) which uses jars from three places: 
> # solr/contrib/opennlp/src 
> ** tokenizer and filter factories
> # solr/contrib/opennlp/lib 
> ** OpenNLP project libraries
> # lucene/analysis/opennlp/src 
> ** tokenizer and filter
> SolrConf can only find the OpenNLP project jars when I add them to the 
> solr.war libraries. It cannot find them from any of these directories: 
> {code}
> solr/example/solr/lib
> solr/example/solr/collection1/lib
> solr/contrib/opennlp/lib
> {code}
> Here are the relevant config file entries. From solrconfig.xml:
> {code}
>   
>   
> {code}
> (yes, it needs to be three dot-dot-slash, not two. See [SOLR-3624].)
> From schema.xml:
> {code}
> 
>  positionIncrementGap="100">
>   
>sentenceModel="opennlp/en-test-sent.bin"
>   tokenizerModel="opennlp/en-test-tokenizer.bin"
> />
>   
> 
> {code}
> Here is the log. {{opennlp.tools.sentdetect.SentenceModel}} is a class in the 
> OpenNLP jar.
> {code}
> INFO: Adding 
> 'file:/Users/lancenorskog/Documents/open/solr/trunk/solr/dist/apache-solr-opennlp-5.0-SNAPSHOT.jar'
>  to classloader
> Jul 15, 2012 6:17:37 PM org.apache.solr.core.SolrResourceLoader 
> replaceClassLoader
> INFO: Adding 
> 'file:/Users/lancenorskog/Documents/open/solr/trunk/solr/contrib/opennlp/lib/jwnl-1.3.3.jar'
>  to classloader
> Jul 15, 2012 6:17:37 PM org.apache.solr.core.SolrResourceLoader 
> replaceClassLoader
> INFO: Adding 
> 'file:/Users/lancenorskog/Documents/open/solr/trunk/solr/contrib/opennlp/lib/opennlp-maxent-3.0.2-incubating.jar'
>  to classloader
> Jul 15, 2012 6:17:37 PM org.apache.solr.core.SolrResourceLoader 
> replaceClassLoader
> INFO: Adding 
> 'file:/Users/lancenorskog/Documents/open/solr/trunk/solr/contrib/opennlp/lib/opennlp-tools-1.5.2-incubating.jar'
>  to classloader
> Jul 15, 2012 6:17:37 PM org.apache.solr.core.SolrConfig <init>
> INFO: Using Lucene MatchVersion: LUCENE_50
> Jul 15, 2012 6:17:37 PM org.apache.solr.core.SolrConfig <init>
> INFO: Loaded SolrConfig: solrconfig.xml
> Jul 15, 2012 6:17:37 PM org.apache.solr.schema.IndexSchema readSchema
> INFO: Reading Solr Schema
> Jul 15, 2012 6:17:37 PM org.apache.solr.schema.IndexSchema readSchema
> INFO: Schema name=example
> Jul 15, 2012 6:17:38 PM org.apache.solr.schema.IndexSchema readSchema
> INFO: unique key field: id
> Jul 15, 2012 6:17:38 PM org.apache.solr.schema.FileExchangeRateProvider reload
> INFO: Reloading exchange rates from file currency.xml
> Jul 15, 2012 6:17:38 PM org.apache.solr.schema.FileExchangeRateProvider reload
> INFO: Reloading exchange rates from file currency.xml
> Jul 15, 2012 6:17:38 PM org.apache.solr.common.SolrException log
> SEVERE: null:java.lang.NoClassDefFoundError: 
> opennlp/tools/sentdetect/SentenceModel
>   at 
> org.apache.lucene.analysis.opennlp.tools.OpenNLPOpsFactory.getSentenceModel(OpenNLPOpsFactory.java:60)
>   at 
> org.apache.solr.analysis.OpenNLPTokenizerFactory.inform(OpenNLPTokenizerFactory.java:90)
>   at 
> org.apache.solr.core.SolrResourceLoader.inform(S

[jira] [Commented] (SOLR-3760) Build packaging of complex contrib packages just plain does not work