On 04/15/2014 07:45 PM, William Colen wrote:
Hello,

I've been working with the Doccat module and I am wondering if we could
improve its data structure for the 1.6.0 release.

Today the DocumentSample has the following attributes:

- String category
- List<String> text

I would suggest adding an attribute to hold metadata, or additional
context information. What do you think?

Right now the training format contains these two fields per line.
Do you want to change the format as well?
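
To make the question concrete, here is a minimal sketch of what a sample
with an extra metadata attribute could look like. Everything in it is an
assumption for illustration: SampleSketch is a hypothetical class (not the
actual DocumentSample), the metadata field and its Map<String, Object> type
are made up, and the line layout "category token token ..." is only my
reading of "two fields per line". In this sketch the metadata is passed in
separately, so the training format itself would not have to change.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch only; this is NOT the OpenNLP DocumentSample class.
final class SampleSketch {
    final String category;
    final List<String> text;
    // Assumed new attribute: free-form metadata or additional context.
    final Map<String, Object> metadata;

    SampleSketch(String category, List<String> text, Map<String, Object> metadata) {
        this.category = category;
        // Defensive copies so the sample cannot be changed after construction.
        this.text = Collections.unmodifiableList(new ArrayList<>(text));
        this.metadata = Collections.unmodifiableMap(new HashMap<>(metadata));
    }

    // Parses one line of the assumed "category token token ..." layout.
    // The metadata is supplied by the caller, not read from the line.
    static SampleSketch parseLine(String line, Map<String, Object> metadata) {
        String[] fields = line.trim().split("\\s+");
        if (fields.length < 2) {
            throw new IllegalArgumentException(
                "Expected a category followed by text: " + line);
        }
        List<String> tokens = Arrays.asList(fields).subList(1, fields.length);
        return new SampleSketch(fields[0], tokens, metadata);
    }

    public static void main(String[] args) {
        Map<String, Object> meta = new HashMap<>();
        meta.put("source", "newswire"); // made-up metadata entry
        SampleSketch sample = parseLine("sports The match ended in a draw", meta);
        System.out.println(sample.category + " " + sample.text + " " + sample.metadata);
    }
}
```

A feature generator could then read the metadata map if it is present and
ignore it otherwise, which would keep existing clients working.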

Also, what do you think of including sentence and paragraph information? I
don't know if there is anything a feature generator can extract from it to
improve the classification.

I guess we only want to do that if there is a use case for it. It would
make processing more complex for the clients, since they would then have
to provide sentences and paragraphs rather than just a piece of text.

Jörn
