[jira] [Commented] (OPENNLP-776) Model Objects should be Serializable
[ https://issues.apache.org/jira/browse/OPENNLP-776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15616763#comment-15616763 ]

Tristan Nixon commented on OPENNLP-776:
---------------------------------------

Great, I'll give the patch a try ASAP and let you know how it goes.

> Model Objects should be Serializable
> ------------------------------------
>
> Key: OPENNLP-776
> URL: https://issues.apache.org/jira/browse/OPENNLP-776
> Project: OpenNLP
> Issue Type: Improvement
> Affects Versions: tools-1.5.3
> Reporter: Tristan Nixon
> Assignee: Joern Kottmann
> Priority: Minor
> Labels: features, patch
> Fix For: 1.6.1
>
> Attachments: externalizable.patch, serializable-basemodel-joern.patch, serializable-basemodel.patch, serialization_proxy.patch
>
> Marking model objects (ParserModel, SentenceModel, etc.) as Serializable can
> enable a number of features offered by other Java frameworks (my own use case
> is described below). You've already got a good mechanism for
> (de-)serialization, but it cannot be leveraged by other frameworks without
> implementing the Serializable interface. I'm attaching a patch to BaseModel
> that implements the methods in the java.io.Externalizable interface as
> wrappers to the existing (de-)serialization methods. This simple change can
> open up a number of useful opportunities for integrating OpenNLP with other
> frameworks.
> My use case is that I am incorporating OpenNLP into a Spark application. This
> requires that components of the system be distributed between the driver and
> worker nodes within the cluster. In order to do this, Spark uses the Java
> serialization API to transmit objects between nodes. This is far more
> efficient than instantiating models on each node independently.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
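The approach described in the issue, implementing java.io.Externalizable as a thin wrapper around serialization methods that already exist, might look roughly like the sketch below. SketchModel, serialize, load and roundTrip are hypothetical stand-ins invented for illustration; this is not OpenNLP's BaseModel and not the attached patch.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.Externalizable;
import java.io.IOException;
import java.io.InputStream;
import java.io.ObjectInput;
import java.io.ObjectInputStream;
import java.io.ObjectOutput;
import java.io.ObjectOutputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for a model class (assumption, not OpenNLP code).
class SketchModel implements Externalizable {
    private String payload = "";

    // Externalizable requires a public no-argument constructor.
    public SketchModel() { }

    SketchModel(String payload) { this.payload = payload; }

    String payload() { return payload; }

    // Stand-in for an existing custom serialization method.
    void serialize(OutputStream out) throws IOException {
        out.write(payload.getBytes(StandardCharsets.UTF_8));
    }

    // Stand-in for an existing custom deserialization method.
    void load(InputStream in) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        int read;
        while ((read = in.read(chunk)) != -1) {
            buffer.write(chunk, 0, read);
        }
        payload = new String(buffer.toByteArray(), StandardCharsets.UTF_8);
    }

    // Wrapper: write a length prefix, then the bytes from serialize().
    @Override
    public void writeExternal(ObjectOutput out) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        serialize(buffer);
        byte[] bytes = buffer.toByteArray();
        out.writeInt(bytes.length);
        out.write(bytes);
    }

    // Wrapper: read exactly the prefixed number of bytes and pass them to load().
    @Override
    public void readExternal(ObjectInput in) throws IOException {
        byte[] bytes = new byte[in.readInt()];
        in.readFully(bytes);
        load(new ByteArrayInputStream(bytes));
    }

    // Sends a model through standard Java serialization and back.
    public static SketchModel roundTrip(SketchModel model) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(model);
            }
            try (ObjectInputStream in = new ObjectInputStream(
                    new ByteArrayInputStream(bytes.toByteArray()))) {
                return (SketchModel) in.readObject();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        SketchModel copy = roundTrip(new SketchModel("model-bytes"));
        System.out.println(copy.payload());
    }
}
```

The length prefix is a design choice of this sketch: it lets readExternal consume exactly the bytes that belong to the model, so the wrapped loader never reads past its own data in the object stream.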
[jira] [Commented] (OPENNLP-776) Model Objects should be Serializable
[ https://issues.apache.org/jira/browse/OPENNLP-776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15616758#comment-15616758 ]

Joern Kottmann commented on OPENNLP-776:
----------------------------------------

I have now started working through everything we need to get done for the release. This issue is part of that, so if you give positive feedback here, I will merge 776 into trunk. It is a bit difficult to predict when we will be done, but I hope it will be next month, or at least still this year.
[jira] [Commented] (OPENNLP-776) Model Objects should be Serializable
[ https://issues.apache.org/jira/browse/OPENNLP-776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15616742#comment-15616742 ]

Tristan Nixon commented on OPENNLP-776:
---------------------------------------

Thanks, I think this patch looks good! I have been using my previous patch in Spark for a while now. I'll add yours and give it a try. Do you know when this will appear in a release? I've been using my own build of the lib in my project, but I'll switch to a standard build once it is available.
[jira] [Commented] (OPENNLP-861) Add Chi-Squared Data Indexer for Feature Selection
[ https://issues.apache.org/jira/browse/OPENNLP-861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15616712#comment-15616712 ]

Joern Kottmann commented on OPENNLP-861:
----------------------------------------

It would be nice if we could get this contribution. Have you made any progress?

> Add Chi-Squared Data Indexer for Feature Selection
> --------------------------------------------------
>
> Key: OPENNLP-861
> URL: https://issues.apache.org/jira/browse/OPENNLP-861
> Project: OpenNLP
> Issue Type: New Feature
> Components: Machine Learning
> Affects Versions: 1.6.0
> Reporter: Joey Hong
> Priority: Minor
> Labels: features
> Fix For: 1.6.1
>
> Text classification will naturally produce a lot of features. Many of them
> are independent of the category and provide no real information gain in the
> classification.
> The Chi-Squared feature selection method will allow features that do not pass
> a threshold for dependency to be removed from the feature list, keeping the
> feature list a reasonable size without significantly affecting the
> classification accuracy.
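As a rough illustration of the feature selection idea described in the issue, the sketch below scores each feature with the standard chi-squared statistic for a 2x2 feature/category contingency table and keeps only features at or above a threshold. ChiSquaredSketch, its method names, the count layout and the 3.841 cutoff (the 5% critical value for one degree of freedom) are assumptions made for this example; the proposed data indexer may work differently.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of OpenNLP.
class ChiSquaredSketch {

    // Chi-squared statistic for a 2x2 table of document counts:
    //   a = feature present, in category     b = feature present, other categories
    //   c = feature absent,  in category     d = feature absent,  other categories
    public static double chiSquared(long a, long b, long c, long d) {
        double n = a + b + c + d;
        double denominator = (double) (a + b) * (c + d) * (a + c) * (b + d);
        if (denominator == 0.0) {
            return 0.0; // a degenerate table carries no evidence of dependency
        }
        double diff = (double) a * d - (double) b * c;
        return n * diff * diff / denominator;
    }

    // Keeps the features whose score reaches the threshold, in input order.
    // Each value is a {a, b, c, d} count array as described above.
    public static List<String> select(Map<String, long[]> counts, double threshold) {
        List<String> kept = new ArrayList<>();
        for (Map.Entry<String, long[]> entry : counts.entrySet()) {
            long[] t = entry.getValue();
            if (chiSquared(t[0], t[1], t[2], t[3]) >= threshold) {
                kept.add(entry.getKey());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Map<String, long[]> counts = new LinkedHashMap<>();
        counts.put("the", new long[] {10, 10, 10, 10}); // spread evenly: independent
        counts.put("goal", new long[] {20, 0, 0, 20});  // occurs only in the category
        System.out.println(select(counts, 3.841));
    }
}
```

In this toy input, "the" is distributed evenly across categories and scores 0, while "goal" appears only in the target category and scores well above the cutoff, so only "goal" is kept.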
[jira] [Closed] (OPENNLP-872) Use try with resourcs where possible
[ https://issues.apache.org/jira/browse/OPENNLP-872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joern Kottmann closed OPENNLP-872.
----------------------------------
Resolution: Fixed

> Use try with resourcs where possible
> ------------------------------------
>
> Key: OPENNLP-872
> URL: https://issues.apache.org/jira/browse/OPENNLP-872
> Project: OpenNLP
> Issue Type: Improvement
> Reporter: Joern Kottmann
> Assignee: Joern Kottmann
> Priority: Trivial
> Fix For: 1.6.1
>
> Try-with-resources should be used wherever it is possible and currently missing.
> This will help to improve the resource handling (especially in places where it
> is not so great) and will make the code more readable.
[jira] [Assigned] (OPENNLP-872) Use try with resourcs where possible
[ https://issues.apache.org/jira/browse/OPENNLP-872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joern Kottmann reassigned OPENNLP-872:
--------------------------------------
Assignee: Joern Kottmann
[jira] [Created] (OPENNLP-872) Use try with resourcs where possible
Joern Kottmann created OPENNLP-872:
-----------------------------------

Summary: Use try with resourcs where possible
Key: OPENNLP-872
URL: https://issues.apache.org/jira/browse/OPENNLP-872
Project: OpenNLP
Issue Type: Improvement
Reporter: Joern Kottmann
Priority: Trivial
Fix For: 1.6.1

Try-with-resources should be used wherever it is possible and currently missing. This will help to improve the resource handling (especially in places where it is not so great) and will make the code more readable.
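For readers unfamiliar with the construct, the kind of change this issue asks for can be sketched as follows. The class and method names are invented for illustration and are not taken from the OpenNLP code base; the first method shows the manual close in a finally block, the second the equivalent try-with-resources form.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical example class, not part of OpenNLP.
class TryWithResourcesSketch {

    // Before: the reader is closed by hand in a finally block.
    public static String firstLineLegacy(Reader source) {
        BufferedReader reader = new BufferedReader(source);
        try {
            return reader.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            try {
                reader.close();
            } catch (IOException ignored) {
                // a failure while closing is swallowed here
            }
        }
    }

    // After: try-with-resources closes the reader automatically,
    // whether the body returns normally or throws.
    public static String firstLine(Reader source) {
        try (BufferedReader reader = new BufferedReader(source)) {
            return reader.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(firstLine(new StringReader("first\nsecond")));
    }
}
```

Both methods return the same result for the same input. The second is shorter, and an exception thrown while closing is either propagated or attached as a suppressed exception instead of being silently dropped.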
[jira] [Created] (OPENNLP-871) Clean up code base for release
Joern Kottmann created OPENNLP-871:
-----------------------------------

Summary: Clean up code base for release
Key: OPENNLP-871
URL: https://issues.apache.org/jira/browse/OPENNLP-871
Project: OpenNLP
Issue Type: Improvement
Reporter: Joern Kottmann
Priority: Trivial
Fix For: 1.6.1

The usual pre-release code cleanup with tools like Eclipse/IntelliJ should be performed.

Cleanups:
- Remove unused imports
- Remove unnecessary casts
- Remove trailing white spaces
[jira] [Commented] (OPENNLP-776) Model Objects should be Serializable
[ https://issues.apache.org/jira/browse/OPENNLP-776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15615803#comment-15615803 ]

Joern Kottmann commented on OPENNLP-776:
----------------------------------------

I pushed this to branch 776. It would be really nice if you or someone else could review these changes. I tested this with Spark and the Doccat Model and it worked without any issues.
[jira] [Commented] (OPENNLP-830) Huge runtime improvement on training (POS, Chunk, ...)
[ https://issues.apache.org/jira/browse/OPENNLP-830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15615780#comment-15615780 ]

Joern Kottmann commented on OPENNLP-830:
----------------------------------------

Fast exp speeds things up by around 2 to 3% in my test case with the name finder; the same goes for the improved beam search.

> Huge runtime improvement on training (POS, Chunk, ...)
> ------------------------------------------------------
>
> Key: OPENNLP-830
> URL: https://issues.apache.org/jira/browse/OPENNLP-830
> Project: OpenNLP
> Issue Type: Improvement
> Components: Machine Learning, POS Tagger
> Affects Versions: 1.6.0
> Environment: Any
> Reporter: Julien Subercaze
> Assignee: Joern Kottmann
> Labels: performance
> Fix For: 1.6.1
>
> Original Estimate: 1h
> Remaining Estimate: 1h
>
> opennlp.tools.ml.model.IndexHashTable is a custom-made Hashtable that is used
> to store the mapping index. This Hashtable is heavily used in opennlp.tools.ml.*
> (i.e. every model) and leads to disastrous performance.
> This Hashtable is probably legacy code and is highly inefficient. A
> simple drop-in replacement by a java.util.HashMap wrapper solves the issue,
> doesn't break compatibility and does not add any dependency.
> Training a POS tagger on a large dataset with custom tags, I see a factor-of-5
> improvement. It also seems to improve the training pipeline of all ML models.
> See:
> https://github.com/jsubercaze/opennlp/blob/trunk/opennlp-tools/src/main/java/opennlp/tools/ml/model/IndexHashTable.java
> for a quick fix.
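The "java.util.HashMap wrapper" idea from the issue can be pictured with a minimal sketch like the one below. HashMapIndexSketch and its methods are hypothetical and written only for this illustration; the class does not reproduce the API of opennlp.tools.ml.model.IndexHashTable or the linked quick fix. It assumes the table maps each key to its position in an array and reports -1 for unknown keys.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical wrapper, not the OpenNLP class: maps each key to its array index.
class HashMapIndexSketch<T> {
    private final Map<T, Integer> indexByKey;

    // Builds the mapping key -> position in the given array.
    public HashMapIndexSketch(T[] keys) {
        // Pre-size the map so that filling it does not trigger rehashing.
        indexByKey = new HashMap<>(Math.max(16, keys.length * 2));
        for (int i = 0; i < keys.length; i++) {
            if (indexByKey.put(keys[i], i) != null) {
                throw new IllegalArgumentException("Duplicate key: " + keys[i]);
            }
        }
    }

    // Returns the index of the key, or -1 if the key is unknown.
    public int get(T key) {
        Integer index = indexByKey.get(key);
        return index == null ? -1 : index;
    }

    public int size() {
        return indexByKey.size();
    }

    public static void main(String[] args) {
        HashMapIndexSketch<String> tags =
            new HashMapIndexSketch<>(new String[] {"NN", "VB", "DT"});
        System.out.println(tags.get("VB") + " " + tags.get("XX") + " " + tags.size());
    }
}
```

Keeping the lookup behind a small wrapper rather than exposing the HashMap directly is what would allow callers of an existing table class to stay unchanged, which is the compatibility point the issue makes.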