Yes, I don't see how to represent the sentences and paragraphs either. +1 for the generic Map as suggested by Mark. We already have something similar in other sample classes, such as NameSample and POSSample.
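To make the Map idea a bit more concrete, here is a rough sketch of what I have in mind. Everything in it is a placeholder of mine and not existing OpenNLP API: the class names DocSampleSketch and MetadataFeatureSketch, the field name extraInformation, and the extractFeatures helper are all assumptions for illustration only.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch only, NOT the real OpenNLP DocumentSample.
class DocSampleSketch {
    private final String category;
    private final List<String> text;
    // Free-form metadata; the name "extraInformation" is an assumption.
    private final Map<String, Object> extraInformation;

    DocSampleSketch(String category, List<String> text,
                    Map<String, Object> extraInformation) {
        this.category = category;
        this.text = Collections.unmodifiableList(new ArrayList<String>(text));
        if (extraInformation == null) {
            this.extraInformation = Collections.emptyMap();
        } else {
            this.extraInformation = Collections.unmodifiableMap(
                new LinkedHashMap<String, Object>(extraInformation));
        }
    }

    // Two-argument form, so callers with only category and text are unaffected.
    DocSampleSketch(String category, List<String> text) {
        this(category, text, null);
    }

    String getCategory() { return category; }
    List<String> getText() { return text; }
    Map<String, Object> getExtraInformation() { return extraInformation; }
}

// Hypothetical helper: turns one metadata entry into string features,
// kept apart from the features generated from the document text.
class MetadataFeatureSketch {
    static List<String> extractFeatures(DocSampleSketch sample, String key) {
        List<String> features = new ArrayList<String>();
        Object value = sample.getExtraInformation().get(key);
        if (value != null) {
            // One feature per whitespace-separated token, prefixed with the key.
            for (String token : value.toString().trim().split("\\s+")) {
                if (!token.isEmpty()) {
                    features.add(key + "=" + token.toLowerCase(Locale.ROOT));
                }
            }
        }
        return features;
    }
}
```

With a sample whose map holds "Subject" mapped to "Moon Base", extractFeatures(sample, "Subject") would yield Subject=moon and Subject=base. Samples built without a map get an empty one, so nothing changes for clients that only provide category and text.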
A use case: the 20news corpus is a collection of articles, and each article contains fields like "From", "Subject", "Organization". Mahout, which includes a formatter for this corpus, concatenates them all into the text field, but I think we could improve accuracy by handling this metadata in a separate feature generator.

2014-04-17 8:37 GMT-03:00 Tech mail <giaconiam...@gmail.com>:

> I agree, this goes back to the concept of having a "document" model...
> I know in the prod systems I've used doccat, storing sentences and
> paragraphs wouldn't make sense, people usually have their own domain model
> for that. I still feel like if we augment the documentsample object with a
> generic Map it would be helpful in some cases and not constraining
>
> Sent from my iPhone
>
> > On Apr 17, 2014, at 6:35 AM, Jörn Kottmann <kottm...@gmail.com> wrote:
> >
> >> On 04/15/2014 07:45 PM, William Colen wrote:
> >> Hello,
> >>
> >> I've been working with the Doccat module and I am wondering if we could
> >> improve its data structure for the 1.6.0 release.
> >>
> >> Today the DocumentSample has the following attributes:
> >>
> >> - String category
> >> - List<String> text
> >>
> >> I would suggest adding an attribute to hold metadata, or additional
> >> contexts information. What do you think?
> >
> > Right now the training format contains these two fields per line.
> > Do you want to change the format as well?
> >
> >> Also, what do you think of including sentences and paragraph information? I
> >> don't know if there is anything a feature generator can extract from it to
> >> improve the classification.
> >
> > I guess we only want to do that if there is a use case for it. It will
> > make the processing for the clients more complex, since they then would
> > have to provide sentences and paragraphs compared to just a piece of text.
> >
> > Jörn