Re: DocumentSample in Doccat

2014-04-24 Thread Tech mail
I think it might also be true that the FeatureGenerator interface in doccat is
different from the others. Also, I don't think the TokenNameFinder interface
has a probs() method, which has always made me use the ME implementation
directly.
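
To make the interface point concrete, here is a minimal sketch of the two
conveniences raised in the quoted messages: a category-to-probability map and
a single-call "best category" lookup. The nested Categorizer interface is a
hypothetical stand-in that only mirrors the three calls used in the quoted
categoriesAsMap() snippet; it is not the OpenNLP DocumentCategorizer itself,
and the class and helper names are assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class CategorizerSketch {

    /** Hypothetical stand-in for the doccat calls used in the thread. */
    public interface Categorizer {
        double[] categorize(String text);
        String getCategory(int index);
        int getNumberOfCategories();
    }

    /** Returns category -> probability, keeping the model's category order. */
    public static Map<String, Double> categoriesAsMap(Categorizer categorizer, String text) {
        double[] probs = categorizer.categorize(text);
        Map<String, Double> dist = new LinkedHashMap<String, Double>();
        for (int i = 0; i < categorizer.getNumberOfCategories(); i++) {
            dist.put(categorizer.getCategory(i), probs[i]);
        }
        return dist;
    }

    /** Single-call convenience: categorize, then pick the most probable category. */
    public static String bestCategory(Categorizer categorizer, String text) {
        String best = null;
        double bestProb = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Double> e : categoriesAsMap(categorizer, text).entrySet()) {
            if (e.getValue() > bestProb) {
                bestProb = e.getValue();
                best = e.getKey();
            }
        }
        return best;
    }

    /** Toy categorizer with fixed outputs, only to exercise the helpers. */
    public static Categorizer fixed(final String[] categories, final double[] probs) {
        return new Categorizer() {
            public double[] categorize(String text) { return probs.clone(); }
            public String getCategory(int index) { return categories[index]; }
            public int getNumberOfCategories() { return categories.length; }
        };
    }

    public static void main(String[] args) {
        Categorizer toy = fixed(new String[] {"sports", "politics"}, new double[] {0.3, 0.7});
        System.out.println(categoriesAsMap(toy, "any text"));
        System.out.println(bestCategory(toy, "any text"));
    }
}
```

Keeping both helpers outside the interface, as static methods, is one way to
try the idea without changing any existing implementation.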

> On Apr 24, 2014, at 6:54 PM, William Colen wrote:
> 
> Yes, it looks nice. Maybe we should redo the whole DocumentCategorizer
> interface. It is different from other tools: for example, we can't get the
> best category of a document with a single call; we need to use two
> methods.
> 
> 
> 
> 2014-04-24 18:43 GMT-03:00 Mark G:
> 
>> William, that map looks good to me.
>> In my current project I find this method convenient for getting back the
>> probs over the categories in the model as a Map. Let me know if there's
>> anything wrong with it :)
>> 
>> public Map<String, Double> categoriesAsMap(String text) {
>>   // Maps each category name in the model to its probability for the text.
>>   Map<String, Double> probDist = new HashMap<String, Double>();
>> 
>>   double[] categorize = categorize(text);
>>   int catSize = getNumberOfCategories();
>>   for (int i = 0; i < catSize; i++) {
>>     String category = getCategory(i);
>>     probDist.put(category, categorize[getIndex(category)]);
>>   }
>>   return probDist;
>> }
>> 
>> Perhaps we should consider adding this method to abstract some
>> details... just a thought.
>> 
>> 
>> 
>> 
>> 
>> On Thu, Apr 24, 2014 at 3:56 PM, William Colen wrote:
>> 
>>> What do you think of adding the following field to the DocumentSample?
>>> 
>>> Map extraInformation
>>> 
>>> 
>>> Also, we could add the following methods to the DocumentCategorizer
>>> interface:
>>> 
>>> public double[] categorize(String text[], Map extraInformation);
>>> public double[] categorize(String documentText, Map extraInformation);
>>> 
>>> Any opinion?
>>> 
>>> Thank you,
>>> William
>>> 
>>> 
>>> 2014-04-17 10:39 GMT-03:00 Mark G:
>>> 
>>>> Another general doccat thought I had is this. In my projects that use
>>>> Doccat, I created a class called SampleCollection, which simply wrapped a
>>>> list but then provided a method that returned the samples as a
>>>> DoccatModel (using a properly formatted ByteArrayInputStream of the
>>>> doccat training format of all the samples). This worked out well because
>>>> I stored all the samples in a database, and users could CRUD samples for
>>>> different categories. There was a MapReduce job that at job startup read
>>>> in the samples from the database into the SampleCollection, dynamically
>>>> generated the model, and then used the model to classify all the texts
>>>> across the cluster, so every MR job ran the latest and greatest model
>>>> based on current samples. Not sure if we're interested in something like
>>>> that, but I see several questions on Stack Overflow asking about
>>>> iterative model building, and a SampleCollection that returns a model has
>>>> worked for me. I also created a SampleCRUD interface that abstracts
>>>> storage and retrieval of the samples; I had a Postgres and an Accumulo
>>>> impl for sample storage.
>>>> Just a thought. I know this can get very specific and complicated, though
>>>> we may be able to find a middle ground by providing a framework and some
>>>> generic impls.
>>>> MG
>>>> 
>>>> 
>>>> On Thu, Apr 17, 2014 at 8:28 AM, William Colen <william.co...@gmail.com> wrote:
>>>> 
>>>>> Yes, I don't see how to represent the sentences and paragraphs.
>>>>> 
>>>>> +1 for the generic Map as suggested by Mark. We already have such things
>>>>> in other sample classes, like NameSample and the POSSample.
>>>>> 
>>>>> A use case: the 20news corpus is a collection of articles, and each
>>>>> article contains fields like "From", "Subject", "Organization". Mahout,
>>>>> which includes a formatter for this corpus, concatenates it all into the
>>>>> text field, but I think we could improve accuracy by handling this
>>>>> metadata in a separate feature 

Re: DocumentSample in Doccat

2014-04-17 Thread Tech mail
I agree, this goes back to the concept of having a "document" model...
I know that in the production systems where I've used doccat, storing sentences
and paragraphs wouldn't make sense; people usually have their own domain model
for that. I still feel that augmenting the DocumentSample object with a generic
Map would be helpful in some cases and not constraining.
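
As a rough illustration of "helpful in some cases and not constraining", the
sketch below generates features from the tokens alone when no map is supplied
and adds one feature per map entry when it is. The class name, the method
name, the Map<String, Object> type, and the "bow=" and "meta:" prefixes are
all assumptions for this example, not an existing OpenNLP feature generator.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class MetadataFeatureSketch {

    /**
     * Builds one bag-of-words feature per token and, if extra information is
     * present, one additional feature per map entry. A null map is allowed,
     * so callers that have no metadata are not forced to provide any.
     */
    public static List<String> extractFeatures(String[] tokens,
                                               Map<String, Object> extraInformation) {
        List<String> features = new ArrayList<String>();
        for (String token : tokens) {
            features.add("bow=" + token);
        }
        if (extraInformation != null) {
            // Copy into a TreeMap so the feature order is deterministic.
            Map<String, Object> sorted = new TreeMap<String, Object>(extraInformation);
            for (Map.Entry<String, Object> e : sorted.entrySet()) {
                features.add("meta:" + e.getKey() + "=" + e.getValue());
            }
        }
        return features;
    }

    public static void main(String[] args) {
        Map<String, Object> extra = new TreeMap<String, Object>();
        extra.put("Subject", "hockey");
        System.out.println(extractFeatures(new String[] {"great", "game"}, null));
        System.out.println(extractFeatures(new String[] {"great", "game"}, extra));
    }
}
```

Keeping metadata in its own feature namespace means a field value such as a
"Subject" word is not confused with the same word in the body text.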

> On Apr 17, 2014, at 6:35 AM, Jörn Kottmann wrote:
> 
>> On 04/15/2014 07:45 PM, William Colen wrote:
>> Hello,
>> 
>> I've been working with the Doccat module and I am wondering if we could
>> improve its data structure for the 1.6.0 release.
>> 
>> Today the DocumentSample has the following attributes:
>> 
>> - String category
>> - List<String> text
>> 
>> I would suggest adding an attribute to hold metadata, or additional
>> context information. What do you think?
> 
> Right now the training format contains these two fields per line.
> Do you want to change the format as well?
> 
>> Also, what do you think of including sentences and paragraph information? I
>> don't know if there is anything a feature generator can extract from it to
>> improve the classification.
> 
> I guess we only want to do that if there is a use case for it. It will make
> the processing for the clients more complex, since they would then have to
> provide sentences and paragraphs instead of just a piece of text.
> 
> Jörn


Re: DocumentSample in Doccat

2014-04-15 Thread Tech mail
William, in the last project where I used doccat, I extended DocumentSample
and just added a generic Map to hold additional key-value pairs. Perhaps adding
that to the baseline would be natural.
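
A minimal sketch of that kind of extension follows. BasicSample is a
hypothetical stand-in holding only the two fields listed in the quoted message
(category and text); it is not the real DocumentSample class. ExtendedSample,
its constructor argument order, and the Map<String, Object> type are likewise
assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ExtendedSampleSketch {

    /** Hypothetical stand-in with the two existing fields: category and text. */
    public static class BasicSample {
        private final String category;
        private final List<String> text;

        public BasicSample(String category, String... text) {
            this.category = category;
            // Copy the tokens so later changes to the caller's array are not seen.
            this.text = Collections.unmodifiableList(Arrays.asList(text.clone()));
        }

        public String getCategory() { return category; }
        public List<String> getText() { return text; }
    }

    /** Adds a generic, read-only map of extra key-value pairs. */
    public static class ExtendedSample extends BasicSample {
        private final Map<String, Object> extraInformation;

        public ExtendedSample(String category, Map<String, Object> extra, String... text) {
            super(category, text);
            // A null map is treated as "no extra information".
            this.extraInformation = extra == null
                ? Collections.<String, Object>emptyMap()
                : Collections.unmodifiableMap(new HashMap<String, Object>(extra));
        }

        public Map<String, Object> getExtraInformation() { return extraInformation; }
    }

    public static void main(String[] args) {
        Map<String, Object> extra = new HashMap<String, Object>();
        extra.put("Subject", "hockey");
        ExtendedSample sample = new ExtendedSample("sports", extra, "great", "game");
        System.out.println(sample.getCategory() + " " + sample.getText()
            + " " + sample.getExtraInformation());
    }
}
```

Because the map is optional and copied defensively, code that only knows about
category and text keeps working unchanged.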

> On Apr 15, 2014, at 11:45 AM, William Colen wrote:
> 
> Hello,
> 
> I've been working with the Doccat module and I am wondering if we could
> improve its data structure for the 1.6.0 release.
> 
> Today the DocumentSample has the following attributes:
> 
> - String category
> - List<String> text
> 
> I would suggest adding an attribute to hold metadata, or additional
> context information. What do you think?
> 
> Also, what do you think of including sentences and paragraph information? I
> don't know if there is anything a feature generator can extract from it to
> improve the classification.
> 
> Thank you,
> William