Re: DocumentSample in Doccat

2014-04-17 Thread Mark G
Another general doccat thought I had is this. In my projects that use
Doccat, I created a class called SampleCollection, which simply wrapped a
list but also provided a method that returned the samples
as a DoccatModel (using a properly formatted ByteArrayInputStream of the
doccat training format of all the samples). This worked out well because I
stored all the samples in a database, and users could CRUD samples for
different categories. There was a MapReduce job that, at job startup, read
the samples from the database into the SampleCollection, dynamically
generated the model, and then used the model to classify all the texts
across the cluster, so every MR job ran the latest and greatest model based
on current samples. Not sure if we're interested in something like that,
but I see several questions on Stack Overflow asking about iterative model
building, and a SampleCollection that returns a model has worked for me. I
also created a SampleCRUD interface that abstracts storage and retrieval of
the samples; I had Postgres and Accumulo impls for sample storage.
Just a thought. I know this can get very specific and complicated, but I
thought we may be able to find a middle ground by providing a framework and
some generic impls.
MG
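
For readers who want to picture the idea above, here is a minimal sketch, not the code described in the message. It assumes the doccat training format is one document per line, with the category first and the text after it. The method names on SampleCRUD, the InMemorySampleCRUD class, and the buildModel signature are all hypothetical. The trainer is passed in as a plain function so the sketch stays independent of any particular OpenNLP version.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

/** Hypothetical storage abstraction; a database-backed impl would offer the same methods. */
interface SampleCRUD {
    void create(String category, String text);
    List<String[]> readAll(); // each element is {category, text}
    void deleteCategory(String category);
}

/** In-memory stand-in for a Postgres or Accumulo implementation. */
class InMemorySampleCRUD implements SampleCRUD {
    private final List<String[]> rows = new ArrayList<>();

    public void create(String category, String text) {
        rows.add(new String[] {category, text});
    }

    public List<String[]> readAll() {
        return new ArrayList<>(rows);
    }

    public void deleteCategory(String category) {
        rows.removeIf(r -> r[0].equals(category));
    }
}

/** Wraps the samples and renders them one document per line. */
class SampleCollection {
    private final List<String[]> samples;

    SampleCollection(SampleCRUD store) {
        this.samples = store.readAll(); // snapshot taken at construction time
    }

    /** Each line: category, a space, then the text with runs of whitespace collapsed. */
    String toTrainingText() {
        StringBuilder sb = new StringBuilder();
        for (String[] s : samples) {
            String text = s[1].replaceAll("\\s+", " ").trim();
            sb.append(s[0]).append(' ').append(text).append('\n');
        }
        return sb.toString();
    }

    InputStream toTrainingStream() {
        return new ByteArrayInputStream(toTrainingText().getBytes(StandardCharsets.UTF_8));
    }

    /** The caller supplies the trainer, for example a function wrapping the doccat trainer. */
    <M> M buildModel(Function<InputStream, M> trainer) {
        return trainer.apply(toTrainingStream());
    }
}
```

A job would build the collection from the store at startup and call buildModel once, so each run sees the samples as they were at that moment. Category names must not contain whitespace in this sketch, since the first token of each line is read as the category.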




Re: DocumentSample in Doccat

2014-04-17 Thread William Colen
Yes, I don't see how to represent the sentences and paragraphs.

+1 for the generic Map as suggested by Mark. We already have such things in
other sample classes, like NameSample and the POSSample.

A use case: the 20news corpus is a collection of articles, and each article
contains fields like "From", "Subject", "Organization". Mahout, which
includes a formatter for this corpus, concatenates them all into the text
field, but I think we could improve accuracy by handling this metadata in a
separate feature generator.
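
To make the proposal concrete, here is a minimal sketch of a sample that carries a generic metadata Map, with a separate generator that turns the metadata fields into features. Every class and method name below (MetaDocumentSample, MetadataFeatureGenerator, BagOfWordsGenerator, extractFeatures) is hypothetical and chosen for illustration; none of it is the existing Doccat API.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sample type: category and tokens, plus a free-form metadata map. */
class MetaDocumentSample {
    final String category;
    final List<String> tokens;
    final Map<String, Object> extraInformation;

    MetaDocumentSample(String category, List<String> tokens, Map<String, Object> extra) {
        this.category = category;
        this.tokens = Collections.unmodifiableList(new ArrayList<>(tokens));
        // A missing map is treated as "no metadata", so existing callers are unaffected.
        this.extraInformation = (extra == null)
            ? Collections.<String, Object>emptyMap()
            : Collections.unmodifiableMap(new LinkedHashMap<>(extra));
    }
}

/** Hypothetical generator: one feature per metadata field, kept apart from the text features. */
class MetadataFeatureGenerator {
    List<String> extractFeatures(MetaDocumentSample sample) {
        List<String> features = new ArrayList<>();
        for (Map.Entry<String, Object> e : sample.extraInformation.entrySet()) {
            features.add("meta:" + e.getKey() + "=" + e.getValue());
        }
        return features;
    }
}

/** Plain bag-of-words features over the text only. */
class BagOfWordsGenerator {
    List<String> extractFeatures(MetaDocumentSample sample) {
        List<String> features = new ArrayList<>();
        for (String token : sample.tokens) {
            features.add("bow=" + token);
        }
        return features;
    }
}
```

For a 20news article, the "From" and "Organization" headers would go into the map rather than into the token list, so the two generators can be enabled or weighted independently.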




Re: DocumentSample in Doccat

2014-04-17 Thread Tech mail
I agree, this goes back to the concept of having a "document" model...
I know that in the prod systems where I've used doccat, storing sentences and
paragraphs wouldn't make sense; people usually have their own domain model for
that. I still feel that augmenting the DocumentSample object with a generic
Map would be helpful in some cases and not constraining.

> On Apr 17, 2014, at 6:35 AM, Jörn Kottmann  wrote:
> 
>> On 04/15/2014 07:45 PM, William Colen wrote:
>> Hello,
>> 
>> I've been working with the Doccat module and I am wondering if we could
>> improve its data structure for the 1.6.0 release.
>> 
>> Today the DocumentSample has the following attributes:
>> 
>> - String category
>> - List text
>> 
>> I would suggest adding an attribute to hold metadata, or additional
>> contexts information. What do you think?
> 
> Right now the training format contains these two fields per line.
> Do you want to change the format as well?
> 
>> Also, what do you think of including sentences and paragraph information? I
>> don't know if there is anything a feature generator can extract from it to
>> improve the classification.
> 
> I guess we only want to do that if there is a use case for it. It will make
> the processing for the clients more complex, since they then would have to
> provide sentences and paragraphs compared to just a piece of text.
> 
> Jörn


Re: DocumentSample in Doccat

2014-04-17 Thread Jörn Kottmann

On 04/15/2014 07:45 PM, William Colen wrote:

> Hello,
>
> I've been working with the Doccat module and I am wondering if we could
> improve its data structure for the 1.6.0 release.
>
> Today the DocumentSample has the following attributes:
>
> - String category
> - List text
>
> I would suggest adding an attribute to hold metadata, or additional
> contexts information. What do you think?

Right now the training format contains these two fields per line.
Do you want to change the format as well?

> Also, what do you think of including sentences and paragraph information? I
> don't know if there is anything a feature generator can extract from it to
> improve the classification.

I guess we only want to do that if there is a use case for it. It will
make the processing for the clients more complex, since they then would
have to provide sentences and paragraphs compared to just a piece of text.

Jörn


Re: End of line whitespaces in Eclipse

2014-04-17 Thread Jörn Kottmann

Hello,

In Eclipse you can configure actions which are run when a file is saved.
One of these actions removes whitespace at the end of a line.

You can disable it under:
Window -> Preferences -> Java -> Editor -> Save Actions

HTH,
Jörn

On 04/11/2014 12:58 AM, William Colen wrote:

> When I save a .java file in Eclipse, it removes the end-of-line
> whitespace. I am using the formatter from
> http://opennlp.apache.org/code-formatter/OpenNLP-Eclipse-Formatter.xml
>
> This is causing lots of changes in files where I actually needed to change
> only one line. Does anybody know how I can avoid it?
>
> Thank you,
> William