[jira] [Commented] (OPENNLP-830) Huge runtime improvement on training (POS, Chunk, ...)

2016-10-28 Thread Joern Kottmann (JIRA)

[ 
https://issues.apache.org/jira/browse/OPENNLP-830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15615780#comment-15615780
 ] 

Joern Kottmann commented on OPENNLP-830:


Fast exp speeds things up by around 2 to 3% in my test case with the name 
finder; the improved beam search gives about the same speedup. 

> Huge runtime improvement on training (POS, Chunk, ...)
> --
>
> Key: OPENNLP-830
> URL: https://issues.apache.org/jira/browse/OPENNLP-830
> Project: OpenNLP
>  Issue Type: Improvement
>  Components: Machine Learning, POS Tagger
>Affects Versions: 1.6.0
> Environment: Any
>Reporter: Julien Subercaze
>Assignee: Joern Kottmann
>  Labels: performance
> Fix For: 1.6.1
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> opennlp.tools.ml.model.IndexHashTable is a custom-made hash table that is 
> used to store the mapping index. This hash table is heavily used in 
> opennlp.tools.ml.* (i.e. every model) and leads to disastrous performance.
> This hash table is probably legacy code and is highly inefficient. A 
> simple drop-in replacement by a java.util.HashMap wrapper solves the issue, 
> doesn't break compatibility and does not add any dependency.
> Training a POS tagger on a large dataset with custom tags, I see a factor of 5 
> improvement. It also seems to improve the training pipeline of all ML models.
> See: 
> https://github.com/jsubercaze/opennlp/blob/trunk/opennlp-tools/src/main/java/opennlp/tools/ml/model/IndexHashTable.java
> for a quick fix.
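
A drop-in replacement of the kind described above could look roughly like the following. This is a minimal sketch, not the code from the linked fix: the class name MapBackedIndex and its methods are hypothetical, and the real IndexHashTable API may differ.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a custom index hash table: it maps each key to
// its position in the array it was built from, backed by java.util.HashMap.
final class MapBackedIndex<T> {

    private final Map<T, Integer> indexOf;

    MapBackedIndex(T[] keys) {
        // Pre-size the map so it does not need to rehash while being filled.
        indexOf = new HashMap<>(Math.max(16, (int) (keys.length / 0.75f) + 1));
        for (int i = 0; i < keys.length; i++) {
            if (indexOf.put(keys[i], i) != null) {
                throw new IllegalArgumentException("Duplicate key: " + keys[i]);
            }
        }
    }

    // Returns the index of the key, or -1 if the key is unknown.
    int get(T key) {
        Integer index = indexOf.get(key);
        return index == null ? -1 : index;
    }

    int size() {
        return indexOf.size();
    }
}
```

For example, new MapBackedIndex<>(new String[] {"NN", "VB"}).get("VB") yields 1, and a lookup of an unknown key yields -1.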



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (OPENNLP-776) Model Objects should be Serializable

2016-10-28 Thread Joern Kottmann (JIRA)

[ 
https://issues.apache.org/jira/browse/OPENNLP-776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15615803#comment-15615803
 ] 

Joern Kottmann commented on OPENNLP-776:


I pushed this to branch 776. It would be really nice if you or someone else 
could review these changes. I tested this with Spark and the Doccat Model and 
it worked without any issues. 

> Model Objects should be Serializable
> 
>
> Key: OPENNLP-776
> URL: https://issues.apache.org/jira/browse/OPENNLP-776
> Project: OpenNLP
>  Issue Type: Improvement
>Affects Versions: tools-1.5.3
>Reporter: Tristan Nixon
>Assignee: Joern Kottmann
>Priority: Minor
>  Labels: features, patch
> Fix For: 1.6.1
>
> Attachments: externalizable.patch, 
> serializable-basemodel-joern.patch, serializable-basemodel.patch, 
> serialization_proxy.patch
>
>
> Marking model objects (ParserModel, SentenceModel, etc.) as Serializable can 
> enable a number of features offered by other Java frameworks (my own use case 
> is described below). You've already got a good mechanism for 
> (de-)serialization, but it cannot be leveraged by other frameworks without 
> implementing the Serializable interface. I'm attaching a patch to BaseModel 
> that implements the methods in the java.io.Externalizable interface as 
> wrappers to the existing (de-)serialization methods. This simple change can 
> open up a number of useful opportunities for integrating OpenNLP with other 
> frameworks.
> My use case is that I am incorporating OpenNLP into a Spark application. This 
> requires that components of the system be distributed between the driver and 
> worker nodes within the cluster. In order to do this, Spark uses the Java 
> serialization API to transmit objects between nodes. This is far more 
> efficient than instantiating models on each node independently.
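
The approach described above, implementing java.io.Externalizable as a thin wrapper around existing (de-)serialization methods, can be sketched as follows. ToyModel, its serialize and load methods, and the roundTrip helper are hypothetical stand-ins; this is not the attached patch and not the BaseModel API.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.Externalizable;
import java.io.IOException;
import java.io.InputStream;
import java.io.ObjectInput;
import java.io.ObjectInputStream;
import java.io.ObjectOutput;
import java.io.ObjectOutputStream;
import java.io.OutputStream;

// Hypothetical model class that already knows how to write itself to a stream.
final class ToyModel implements Externalizable {

    private String payload = "";

    // Externalizable requires a public no-argument constructor.
    public ToyModel() {
    }

    ToyModel(String payload) {
        this.payload = payload;
    }

    String payload() {
        return payload;
    }

    // Stand-in for an existing serialization method.
    void serialize(OutputStream out) throws IOException {
        DataOutputStream data = new DataOutputStream(out);
        data.writeUTF(payload);
        data.flush();
    }

    // Stand-in for an existing deserialization method.
    private void load(InputStream in) throws IOException {
        payload = new DataInputStream(in).readUTF();
    }

    // Wrapper: buffer the existing format and write it length-prefixed.
    @Override
    public void writeExternal(ObjectOutput out) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        serialize(buffer);
        byte[] bytes = buffer.toByteArray();
        out.writeInt(bytes.length);
        out.write(bytes);
    }

    // Wrapper: read the length-prefixed bytes and hand them to the loader.
    @Override
    public void readExternal(ObjectInput in) throws IOException {
        byte[] bytes = new byte[in.readInt()];
        in.readFully(bytes);
        load(new ByteArrayInputStream(bytes));
    }

    // Sends a model through Java serialization and back, as a framework would.
    static ToyModel roundTrip(ToyModel model) throws IOException, ClassNotFoundException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(model);
        }
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(buffer.toByteArray()))) {
            return (ToyModel) in.readObject();
        }
    }
}
```

The length prefix is a design choice of this sketch: it lets readExternal consume exactly the bytes that belong to the model, regardless of how the wrapped loader reads its input.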





[jira] [Created] (OPENNLP-871) Clean up code base for release

2016-10-28 Thread Joern Kottmann (JIRA)
Joern Kottmann created OPENNLP-871:
--

 Summary: Clean up code base for release
 Key: OPENNLP-871
 URL: https://issues.apache.org/jira/browse/OPENNLP-871
 Project: OpenNLP
  Issue Type: Improvement
Reporter: Joern Kottmann
Priority: Trivial
 Fix For: 1.6.1


The usual pre-release code cleanup with tools like Eclipse/IntelliJ should be 
performed.

Cleanups:
- Remove unused imports
- Remove unnecessary casts
- Remove trailing white spaces





[jira] [Created] (OPENNLP-872) Use try with resources where possible

2016-10-28 Thread Joern Kottmann (JIRA)
Joern Kottmann created OPENNLP-872:
--

 Summary: Use try with resources where possible
 Key: OPENNLP-872
 URL: https://issues.apache.org/jira/browse/OPENNLP-872
 Project: OpenNLP
  Issue Type: Improvement
Reporter: Joern Kottmann
Priority: Trivial
 Fix For: 1.6.1


Try-with-resources should be used wherever it is possible and currently 
missing. This will improve resource handling (including in places where it is 
currently not so great) and will make the code more readable.
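
The kind of change meant here can be illustrated with a small before/after pair. This is a sketch only; the class FirstLine and its methods are hypothetical and are not taken from the OpenNLP code base.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;

final class FirstLine {

    // Before: the reader is closed by hand in a finally block.
    static String readManually(Reader source) throws IOException {
        BufferedReader reader = new BufferedReader(source);
        try {
            return reader.readLine();
        } finally {
            reader.close();
        }
    }

    // After: try-with-resources closes the reader automatically, also when
    // readLine() throws, and keeps the method shorter.
    static String readWithResources(Reader source) throws IOException {
        try (BufferedReader reader = new BufferedReader(source)) {
            return reader.readLine();
        }
    }
}
```

Both methods return the same result; the second one leaves the cleanup to the language.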





[jira] [Assigned] (OPENNLP-872) Use try with resources where possible

2016-10-28 Thread Joern Kottmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/OPENNLP-872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joern Kottmann reassigned OPENNLP-872:
--

Assignee: Joern Kottmann

> Use try with resources where possible
> 
>
> Key: OPENNLP-872
> URL: https://issues.apache.org/jira/browse/OPENNLP-872
> Project: OpenNLP
>  Issue Type: Improvement
>Reporter: Joern Kottmann
>Assignee: Joern Kottmann
>Priority: Trivial
> Fix For: 1.6.1
>
>
> Try-with-resources should be used wherever it is possible and currently 
> missing. This will improve resource handling (including in places where it 
> is currently not so great) and will make the code more readable.





[jira] [Closed] (OPENNLP-872) Use try with resources where possible

2016-10-28 Thread Joern Kottmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/OPENNLP-872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joern Kottmann closed OPENNLP-872.
--
Resolution: Fixed






[jira] [Commented] (OPENNLP-861) Add Chi-Squared Data Indexer for Feature Selection

2016-10-28 Thread Joern Kottmann (JIRA)

[ 
https://issues.apache.org/jira/browse/OPENNLP-861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15616712#comment-15616712
 ] 

Joern Kottmann commented on OPENNLP-861:


It would be nice if we could get this contribution. Have you made any progress?

> Add Chi-Squared Data Indexer for Feature Selection
> --
>
> Key: OPENNLP-861
> URL: https://issues.apache.org/jira/browse/OPENNLP-861
> Project: OpenNLP
>  Issue Type: New Feature
>  Components: Machine Learning
>Affects Versions: 1.6.0
>Reporter: Joey Hong
>Priority: Minor
>  Labels: features
> Fix For: 1.6.1
>
>
> Text classification will naturally produce a lot of features. A lot of them 
> are independent of the category, and provide no real information gain in the 
> classification.
> The Chi-Squared feature selection method will allow features that do not pass 
> a threshold for dependency to be removed from the feature list, keeping the 
> feature list a reasonable size without significantly affecting the 
> classification accuracy.
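
The selection rule described above can be sketched with the standard chi-squared statistic for a 2x2 feature/category contingency table. This is only an illustration under that assumption; the class ChiSquared, its method names and the threshold parameter are hypothetical and do not describe the proposed data indexer.

```java
final class ChiSquared {

    // Chi-squared statistic for one feature and one category, where
    //   a = documents in the category that contain the feature
    //   b = documents outside the category that contain the feature
    //   c = documents in the category that lack the feature
    //   d = documents outside the category that lack the feature
    static double score(long a, long b, long c, long d) {
        double n = a + b + c + d;
        double denominator = (double) (a + b) * (a + c) * (b + d) * (c + d);
        if (denominator == 0.0) {
            // A row or column is empty, so the feature carries no signal.
            return 0.0;
        }
        double diff = (double) a * d - (double) b * c;
        return n * diff * diff / denominator;
    }

    // A feature is kept only if its dependency on the category reaches the threshold.
    static boolean keep(long a, long b, long c, long d, double threshold) {
        return score(a, b, c, d) >= threshold;
    }
}
```

A feature that occurs equally often inside and outside the category scores 0 and is dropped for any positive threshold, while a feature that occurs only inside the category scores high and is kept.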





[jira] [Commented] (OPENNLP-776) Model Objects should be Serializable

2016-10-28 Thread Tristan Nixon (JIRA)

[ 
https://issues.apache.org/jira/browse/OPENNLP-776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15616742#comment-15616742
 ] 

Tristan Nixon commented on OPENNLP-776:
---

Thanks, I think this patch looks good!
I have been using my previous patch in Spark for a while now. I'll add yours 
and give it a try.

Do you know when this will appear in a release? I've been using my own build 
of the lib in my project, but I'll switch to a standard build once it is 
available.






[jira] [Commented] (OPENNLP-776) Model Objects should be Serializable

2016-10-28 Thread Joern Kottmann (JIRA)

[ 
https://issues.apache.org/jira/browse/OPENNLP-776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15616758#comment-15616758
 ] 

Joern Kottmann commented on OPENNLP-776:


I have now started to work through everything we need to get done for the 
release. This issue is part of that, so if you give positive feedback here I 
will merge 776 into trunk. It is a bit difficult to predict when we will be 
done, but I hope it will be next month, or at least still this year.






[jira] [Commented] (OPENNLP-776) Model Objects should be Serializable

2016-10-28 Thread Tristan Nixon (JIRA)

[ 
https://issues.apache.org/jira/browse/OPENNLP-776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15616763#comment-15616763
 ] 

Tristan Nixon commented on OPENNLP-776:
---

Great, I'll give the patch a try ASAP and let you know how it goes.



