[jira] [Created] (OPENNLP-777) Naive Bayesian Classifier

2015-05-19 Thread Cohan Sujay Carlos (JIRA)
Cohan Sujay Carlos created OPENNLP-777:
--

         Summary: Naive Bayesian Classifier
             Key: OPENNLP-777
             URL: https://issues.apache.org/jira/browse/OPENNLP-777
         Project: OpenNLP
      Issue Type: New Feature
      Components: Machine Learning
Affects Versions: 1.6.0
     Environment: J2SE 1.5 and above
        Reporter: Cohan Sujay Carlos
        Priority: Minor
         Fix For: 1.6.0


I thought it would be nice to have a Naive Bayesian classifier in OpenNLP (it 
lacks one at present).

Implementation details:  We have a production-hardened piece of Java code for a 
multinomial Naive Bayesian classifier (with default Laplace smoothing) that 
we'd like to contribute.  The code is Java 1.5 compatible.  I'd have to write 
an adapter to make the interface compatible with the ME classifier in OpenNLP.  
I expect the patch to be available 1 to 3 weeks from now.
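
For readers unfamiliar with the technique named above, here is a minimal sketch of a multinomial Naive Bayes classifier with add-one (Laplace) smoothing. It is not the code offered for contribution: the class name NaiveBayesSketch and its methods are made up for illustration, and it uses Java 8 conveniences rather than staying Java 1.5 compatible.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Illustrative multinomial Naive Bayes with add-one (Laplace) smoothing. */
class NaiveBayesSketch {
    private final Map<String, Integer> docsPerLabel = new HashMap<>();
    private final Map<String, Map<String, Integer>> tokenCounts = new HashMap<>();
    private final Map<String, Integer> tokensPerLabel = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();
    private int totalDocs = 0;

    /** Adds one labelled document, given as a bag of tokens. */
    public void train(String label, String[] tokens) {
        totalDocs++;
        docsPerLabel.merge(label, 1, Integer::sum);
        Map<String, Integer> counts =
                tokenCounts.computeIfAbsent(label, k -> new HashMap<>());
        for (String token : tokens) {
            counts.merge(token, 1, Integer::sum);
            tokensPerLabel.merge(label, 1, Integer::sum);
            vocabulary.add(token);
        }
    }

    /**
     * Returns log P(label) plus the sum of log P(token | label).
     * The label must have been seen during training.
     */
    public double logScore(String label, String[] tokens) {
        double score = Math.log(docsPerLabel.get(label) / (double) totalDocs);
        Map<String, Integer> counts = tokenCounts.get(label);
        int labelTotal = tokensPerLabel.getOrDefault(label, 0);
        for (String token : tokens) {
            int count = counts.getOrDefault(token, 0);
            // Laplace smoothing: add 1 to each count and |V| to the denominator,
            // so a token unseen for this label never yields log(0).
            score += Math.log((count + 1.0) / (labelTotal + vocabulary.size()));
        }
        return score;
    }

    /** Returns the label with the highest log score, or null if untrained. */
    public String classify(String[] tokens) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String label : docsPerLabel.keySet()) {
            double score = logScore(label, tokens);
            if (score > bestScore) {
                bestScore = score;
                best = label;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        NaiveBayesSketch nb = new NaiveBayesSketch();
        nb.train("sports", new String[] {"ball", "goal", "team"});
        nb.train("tech", new String[] {"code", "java", "bug"});
        System.out.println(nb.classify(new String[] {"goal", "ball"}));
    }
}
```

Scores are kept in log space so that long documents do not underflow to zero when many small probabilities are multiplied.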

Below is the email trail of a discussion in the dev mailing list around this 
dated May 19th, 2015.


Tommaso Teofili (via opennlp.apache.org) to dev:

Hi Cohan,

I think that'd be a very valuable contribution, as NB is one of the
foundational algorithms, often used as a baseline for comparisons.
It would be good if you could create a Jira issue and provide more details
about the implementation and, eventually, a patch.

Thanks and regards,
Tommaso



2015-05-19 9:57 GMT+02:00 Cohan Sujay Carlos 

> I have a question for the OpenNLP project team.
>
> I was wondering if there is a Naive Bayesian classifier implementation in
> OpenNLP that I've not come across, or if there are plans to implement one.
>
> If it is the latter, I should love to contribute an implementation.
>
> There is an ME classifier already available in OpenNLP, of course, but I
> felt that there was an unmet need for a Naive Bayesian (NB) classifier
> implementation to be offered as well.
>
> An NB classifier could be bootstrapped up with partially labelled training
> data as explained in the Nigam, McCallum, et al paper of 2000 "Text
> Classification from Labeled and Unlabeled Documents using EM".
>
> So, if there isn't an NB code base out there already, I'd be happy to
> contribute a very solid implementation that we've used in production for a
> good 5 years.
>
> I'd have to adapt it to load the same training data format as the ME
> classifier, but I guess that shouldn't be very difficult to do.
>
> I was wondering if there was some interest in adding an NB implementation
> and, if there is, I'd love to know whom I could coordinate with.
>
> Cohan Sujay Carlos
> CEO, Aiaioo Labs, India



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (OPENNLP-776) Model Objects should be Serializable

2015-05-19 Thread Tristan Nixon (JIRA)

[ 
https://issues.apache.org/jira/browse/OPENNLP-776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14550561#comment-14550561
 ] 

Tristan Nixon commented on OPENNLP-776:
---

You're totally welcome! Let me know when this gets merged into a release, so I 
can update my project and get rid of my custom build.

> Model Objects should be Serializable
> ------------------------------------
>
>              Key: OPENNLP-776
>              URL: https://issues.apache.org/jira/browse/OPENNLP-776
>          Project: OpenNLP
>       Issue Type: Improvement
>       Components: Formats
> Affects Versions: tools-1.5.3
>         Reporter: Tristan Nixon
>         Priority: Minor
>           Labels: features, patch
>      Attachments: BaseModel-serialization.patch
>
>
> Marking model objects (ParserModel, SentenceModel, etc.) as Serializable can 
> enable a number of features offered by other Java frameworks (my own use case 
> is described below). You've already got a good mechanism for 
> (de-)serialization, but it cannot be leveraged by other frameworks without 
> implementing the Serializable interface. I'm attaching a patch to BaseModel 
> that implements the methods in the java.io.Externalizable interface as 
> wrappers to the existing (de-)serialization methods. This simple change can 
> open up a number of useful opportunities for integrating OpenNLP with other 
> frameworks.
> My use case is that I am incorporating OpenNLP into a Spark application. This 
> requires that components of the system be distributed between the driver and 
> worker nodes within the cluster. In order to do this, Spark uses the Java 
> serialization API to transmit objects between nodes. This is far more 
> efficient than instantiating models on each node independently.
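
The wrapper idea in the description above can be sketched roughly as follows. ToyModel, serialize and load are hypothetical stand-ins for a model class and its existing stream-based (de-)serialization; this is not OpenNLP's BaseModel and not the attached patch. Only the java.io calls are real.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.Externalizable;
import java.io.IOException;
import java.io.InputStream;
import java.io.ObjectInput;
import java.io.ObjectInputStream;
import java.io.ObjectOutput;
import java.io.ObjectOutputStream;
import java.io.OutputStream;

/** Hypothetical stand-in for a model class; NOT OpenNLP's BaseModel. */
class ToyModel implements Externalizable {
    private String language;

    /** Externalizable requires a public no-arg constructor. */
    public ToyModel() { }

    public ToyModel(String language) {
        this.language = language;
    }

    public String getLanguage() {
        return language;
    }

    /** Stands in for the model's existing stream-based writer. */
    public void serialize(OutputStream out) throws IOException {
        DataOutputStream data = new DataOutputStream(out);
        data.writeUTF(language);
        data.flush();
    }

    /** Stands in for the model's existing stream-based loader. */
    private void load(InputStream in) throws IOException {
        language = new DataInputStream(in).readUTF();
    }

    @Override
    public void writeExternal(ObjectOutput out) throws IOException {
        // Delegate to the existing writer, then length-prefix the bytes.
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        serialize(buffer);
        byte[] bytes = buffer.toByteArray();
        out.writeInt(bytes.length);
        out.write(bytes);
    }

    @Override
    public void readExternal(ObjectInput in) throws IOException {
        // Read back exactly the bytes written above and hand them to the loader.
        byte[] bytes = new byte[in.readInt()];
        in.readFully(bytes);
        load(new ByteArrayInputStream(bytes));
    }
}

class ExternalizableDemo {
    /** Pushes a model through Java serialization and returns the copy. */
    static ToyModel roundTrip(ToyModel model) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(model);
            }
            try (ObjectInputStream in = new ObjectInputStream(
                    new ByteArrayInputStream(bytes.toByteArray()))) {
                return (ToyModel) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip(new ToyModel("en")).getLanguage());
    }
}
```

The length prefix keeps the wrapped loader from reading past the model's own bytes in the object stream.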





[jira] [Updated] (OPENNLP-776) Model Objects should be Serializable

2015-05-19 Thread Tristan Nixon (JIRA)

 [ 
https://issues.apache.org/jira/browse/OPENNLP-776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tristan Nixon updated OPENNLP-776:
--
Attachment: model-constructors.patch

I realized that for automatic de-serialization, all models need no-argument 
constructors. See attached.
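
The constructor requirement comes from java.io.Externalizable itself: on de-serialization the runtime instantiates the class through its public no-arg constructor before calling readExternal. A small sketch, using made-up classes rather than any OpenNLP model, shows a round trip with and without such a constructor. When the constructor is missing, the read side fails with a java.io.InvalidClassException.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.Externalizable;
import java.io.IOException;
import java.io.ObjectInput;
import java.io.ObjectInputStream;
import java.io.ObjectOutput;
import java.io.ObjectOutputStream;

/** Hypothetical class that has the public no-arg constructor. */
class SizedWithNoArg implements Externalizable {
    private int size;

    public SizedWithNoArg() { }

    public SizedWithNoArg(int size) {
        this.size = size;
    }

    @Override
    public void writeExternal(ObjectOutput out) throws IOException {
        out.writeInt(size);
    }

    @Override
    public void readExternal(ObjectInput in) throws IOException {
        size = in.readInt();
    }
}

/** Same hypothetical class, but with no public no-arg constructor. */
class SizedWithoutNoArg implements Externalizable {
    private int size;

    public SizedWithoutNoArg(int size) {
        this.size = size;
    }

    @Override
    public void writeExternal(ObjectOutput out) throws IOException {
        out.writeInt(size);
    }

    @Override
    public void readExternal(ObjectInput in) throws IOException {
        size = in.readInt();
    }
}

class NoArgConstructorDemo {
    /** Returns "ok" if the round trip works, else the exception's simple name. */
    static String tryRoundTrip(Object obj) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(obj);
            }
            try (ObjectInputStream in = new ObjectInputStream(
                    new ByteArrayInputStream(bytes.toByteArray()))) {
                in.readObject();
            }
            return "ok";
        } catch (IOException | ClassNotFoundException e) {
            return e.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        System.out.println(tryRoundTrip(new SizedWithNoArg(3)));
        System.out.println(tryRoundTrip(new SizedWithoutNoArg(3)));
    }
}
```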



[jira] [Commented] (OPENNLP-776) Model Objects should be Serializable

2015-05-19 Thread Joern Kottmann (JIRA)

[ 
https://issues.apache.org/jira/browse/OPENNLP-776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14550576#comment-14550576
 ] 

Joern Kottmann commented on OPENNLP-776:


Having no-arg constructors on all those models is not nice.

Can you please elaborate on this:
" This is far more efficient than instantiating models on each node 
independently."

How can the proposed patch make that more efficient? The models still need to 
be created, and that is actually done using the existing serialization support.



[jira] [Commented] (OPENNLP-777) Naive Bayesian Classifier

2015-05-19 Thread Joern Kottmann (JIRA)

[ 
https://issues.apache.org/jira/browse/OPENNLP-777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14550578#comment-14550578
 ] 

Joern Kottmann commented on OPENNLP-777:


Yes, that would be really nice to have in OpenNLP!

> Naive Bayesian Classifier
> -------------------------
>
>                Key: OPENNLP-777
>                URL: https://issues.apache.org/jira/browse/OPENNLP-777
>            Project: OpenNLP
>         Issue Type: New Feature
>         Components: Machine Learning
>   Affects Versions: 1.6.0
>        Environment: J2SE 1.5 and above
>           Reporter: Cohan Sujay Carlos
>           Priority: Minor
>             Labels: NBClassifier, bayes, bayesian, classifier, multinomial, naive
>            Fix For: 1.6.0
>
>  Original Estimate: 504h
> Remaining Estimate: 504h





[jira] [Commented] (OPENNLP-776) Model Objects should be Serializable

2015-05-19 Thread Joern Kottmann (JIRA)

[ 
https://issues.apache.org/jira/browse/OPENNLP-776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14550584#comment-14550584
 ] 

Joern Kottmann commented on OPENNLP-776:


The models could be sub-classed in a user project to implement the 
java.io.Externalizable interface.
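
One way to keep the serialization concern entirely in user code is sketched below. Note that it swaps in a different technique from the one suggested above: a Serializable holder that wraps the model instead of a subclass that implements Externalizable, which avoids needing a no-arg constructor on the model. PlainModel and ModelHolder are hypothetical names, and PlainModel only imitates a model with stream-based (de-)serialization.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.OutputStream;
import java.io.Serializable;

/** Hypothetical library model: stream-based I/O, not Serializable. */
class PlainModel {
    private final String payload;

    PlainModel(String payload) {
        this.payload = payload;
    }

    PlainModel(InputStream in) throws IOException {
        this.payload = new DataInputStream(in).readUTF();
    }

    void serialize(OutputStream out) throws IOException {
        DataOutputStream data = new DataOutputStream(out);
        data.writeUTF(payload);
        data.flush();
    }

    String getPayload() {
        return payload;
    }
}

/** User-side holder that makes the model travel through Java serialization. */
class ModelHolder implements Serializable {
    private static final long serialVersionUID = 1L;

    // transient: the model is written by hand in writeObject below.
    private transient PlainModel model;

    ModelHolder(PlainModel model) {
        this.model = model;
    }

    PlainModel get() {
        return model;
    }

    private void writeObject(ObjectOutputStream out) throws IOException {
        out.defaultWriteObject();
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        model.serialize(buffer);
        byte[] bytes = buffer.toByteArray();
        out.writeInt(bytes.length);
        out.write(bytes);
    }

    private void readObject(ObjectInputStream in)
            throws IOException, ClassNotFoundException {
        in.defaultReadObject();
        byte[] bytes = new byte[in.readInt()];
        in.readFully(bytes);
        model = new PlainModel(new ByteArrayInputStream(bytes));
    }
}

class HolderDemo {
    /** Pushes a holder through Java serialization and returns the copy. */
    static ModelHolder roundTrip(ModelHolder holder) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(holder);
            }
            try (ObjectInputStream in = new ObjectInputStream(
                    new ByteArrayInputStream(bytes.toByteArray()))) {
                return (ModelHolder) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        ModelHolder copy = roundTrip(new ModelHolder(new PlainModel("weights")));
        System.out.println(copy.get().getPayload());
    }
}
```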



[jira] [Commented] (OPENNLP-776) Model Objects should be Serializable

2015-05-19 Thread Tristan Nixon (JIRA)

[ 
https://issues.apache.org/jira/browse/OPENNLP-776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14550604#comment-14550604
 ] 

Tristan Nixon commented on OPENNLP-776:
---

It does not make the (de-)serialization process more efficient. It allows me to 
use a model as a "broadcast variable" which means it is de-serialized once on 
each worker node, and can then be re-used for all work on that node. Otherwise, 
it may need to be de-serialized multiple times, adding quite a bit of overhead 
to the application.
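
The difference described above can be illustrated without Spark. The sketch below uses no Spark API at all: CostlyModel is a hypothetical class whose constructor stands in for one de-serialization, and the two methods simply count how many loads occur when every task loads its own copy versus when one copy is shared by all tasks on a node.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical model; constructing it stands in for one de-serialization. */
class CostlyModel {
    CostlyModel(AtomicInteger loadCounter) {
        loadCounter.incrementAndGet();
    }

    /** Placeholder for real work done with the model. */
    int process(String text) {
        return text.length();
    }
}

class BroadcastSketch {
    /** Every task loads its own copy: one load per task. */
    static int loadsWhenLoadedPerTask(String[] tasks) {
        AtomicInteger loads = new AtomicInteger();
        for (String task : tasks) {
            new CostlyModel(loads).process(task);
        }
        return loads.get();
    }

    /** One copy is loaded up front and re-used by all tasks on the node. */
    static int loadsWhenShared(String[] tasks) {
        AtomicInteger loads = new AtomicInteger();
        CostlyModel shared = new CostlyModel(loads);
        for (String task : tasks) {
            shared.process(task);
        }
        return loads.get();
    }

    public static void main(String[] args) {
        String[] tasks = {"first sentence", "second sentence", "third sentence"};
        System.out.println(loadsWhenLoadedPerTask(tasks));
        System.out.println(loadsWhenShared(tasks));
    }
}
```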



[jira] [Commented] (OPENNLP-777) Naive Bayesian Classifier

2015-05-19 Thread Haider Ali (JIRA)

[ 
https://issues.apache.org/jira/browse/OPENNLP-777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14551168#comment-14551168
 ] 

Haider Ali commented on OPENNLP-777:


i also wan to contribute to Naive Bayesian Classifier 



[jira] [Comment Edited] (OPENNLP-777) Naive Bayesian Classifier

2015-05-19 Thread Haider Ali (JIRA)

[ 
https://issues.apache.org/jira/browse/OPENNLP-777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14551168#comment-14551168
 ] 

Haider Ali edited comment on OPENNLP-777 at 5/19/15 8:33 PM:
-

I also want to contribute to the Naive Bayesian Classifier.


was (Author: haider.ali):
i also wan to contribute to Naive Bayesian Classifier 
