Yes, I don't see how to represent the sentences and paragraphs.

+1 for the generic Map as suggested by Mark. We already have such things in
other sample classes, like NameSample and the POSSample.

A use case: the 20news corpus is a collection of articles, and each article
contains fields like "From", "Subject", "Organization". Mahout, which
includes a formatter for this corpus, concatenates them all into the text
field, but I think we could improve accuracy by handling this metadata in a
separate feature generator.
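
To make the idea concrete, here is a rough Java sketch of a sample that
carries a generic Map next to the category and text, plus a feature generator
that reads only the metadata. It is an illustration under assumptions:
DocumentSampleWithMetadata and MetadataFeatureGenerator are hypothetical
names, not existing OpenNLP classes, and the "meta:" feature prefix is made up.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;

/** Hypothetical sketch, not the real opennlp.tools.doccat.DocumentSample. */
final class DocumentSampleWithMetadata {
    private final String category;
    private final List<String> text;
    private final Map<String, Object> extraInformation;

    DocumentSampleWithMetadata(String category, List<String> text,
                               Map<String, Object> extraInformation) {
        this.category = Objects.requireNonNull(category, "category");
        // Defensive copies keep the sample immutable.
        this.text = Collections.unmodifiableList(new ArrayList<>(text));
        this.extraInformation = extraInformation == null
            ? Collections.<String, Object>emptyMap()
            : Collections.unmodifiableMap(new LinkedHashMap<>(extraInformation));
    }

    String getCategory() { return category; }

    List<String> getText() { return text; }

    Map<String, Object> getExtraInformation() { return extraInformation; }
}

/** Hypothetical feature generator that only looks at the metadata map. */
final class MetadataFeatureGenerator {
    private MetadataFeatureGenerator() { }

    static List<String> extractFeatures(DocumentSampleWithMetadata sample) {
        List<String> features = new ArrayList<>();
        // One feature per metadata field, e.g. "meta:Subject=Re: orbit".
        for (Map.Entry<String, Object> e : sample.getExtraInformation().entrySet()) {
            features.add("meta:" + e.getKey() + "=" + e.getValue());
        }
        return features;
    }
}
```

With something like this, a 20news formatter could put "From", "Subject" and
"Organization" into the map instead of appending them to the text, and the
bag-of-words generator would keep working on the text alone.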


2014-04-17 8:37 GMT-03:00 Tech mail <giaconiam...@gmail.com>:

> I agree, this goes back to the concept of having a "document" model...
> I know in the prod systems where I've used doccat, storing sentences and
> paragraphs wouldn't make sense; people usually have their own domain model
> for that. I still feel that augmenting the DocumentSample object with a
> generic Map would be helpful in some cases and not constraining.
>
> Sent from my iPhone
>
> > On Apr 17, 2014, at 6:35 AM, Jörn Kottmann <kottm...@gmail.com> wrote:
> >
> >> On 04/15/2014 07:45 PM, William Colen wrote:
> >> Hello,
> >>
> >> I've been working with the Doccat module and I am wondering if we could
> >> improve its data structure for the 1.6.0 release.
> >>
> >> Today the DocumentSample has the following attributes:
> >>
> >> - String category
> >> - List<String> text
> >>
> >> I would suggest adding an attribute to hold metadata, or additional
> >> context information. What do you think?
> >
> > Right now the training format contains these two fields per line.
> > Do you want to change the format as well?
> >
> >> Also, what do you think of including sentence and paragraph information?
> >> I don't know if there is anything a feature generator can extract from it
> >> to improve the classification.
> >
> > I guess we only want to do that if there is a use case for it. It will
> > make the processing for the clients more complex, since they then would
> > have to provide sentences and paragraphs compared to just a piece of text.
> >
> > Jörn
>
