Re: DocumentSample in Doccat

Mark G Thu, 24 Apr 2014 17:51:24 -0700

William, here is another thought: we could include something like this to
return a map sorted in descending order with the best score on top, so you
can call categoriesAsSortedMap("").firstEntry() to get the best score (which
can be the same for more than one category, hence the Set as the value).


  public NavigableMap<Double, Set<String>> categoriesAsSortedMap(String text) {
    // Descending view, so firstEntry() holds the highest score.
    NavigableMap<Double, Set<String>> descendingMap =
        new TreeMap<Double, Set<String>>().descendingMap();
    double[] categorize = categorize(text);
    int catSize = getNumberOfCategories();
    for (int i = 0; i < catSize; i++) {
      String category = getCategory(i);
      double score = categorize[getIndex(category)];
      if (descendingMap.containsKey(score)) {
        // Another category already has this exact score; group them.
        descendingMap.get(score).add(category);
      } else {
        Set<String> newset = new HashSet<>();
        newset.add(category);
        descendingMap.put(score, newset);
      }
    }
    return descendingMap;
  }
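
For anyone who wants to try the grouping logic outside of Doccat, here is a
minimal standalone sketch. It assumes the per-category scores are already
available as a Map<String, Double>; the class name SortedScoresSketch and the
method groupByScoreDescending are made up for illustration and are not part
of OpenNLP.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.NavigableMap;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper, not an OpenNLP class.
final class SortedScoresSketch {

  // Groups category names by score, highest score first.
  // 'scores' stands in for the category/probability pairs a categorizer returns.
  static NavigableMap<Double, Set<String>> groupByScoreDescending(
      Map<String, Double> scores) {
    NavigableMap<Double, Set<String>> byScore =
        new TreeMap<Double, Set<String>>().descendingMap();
    for (Map.Entry<String, Double> e : scores.entrySet()) {
      Set<String> names = byScore.get(e.getValue());
      if (names == null) {
        // First category seen with this score.
        names = new TreeSet<String>();
        byScore.put(e.getValue(), names);
      }
      names.add(e.getKey());
    }
    return byScore;
  }

  public static void main(String[] args) {
    Map<String, Double> scores = new LinkedHashMap<String, Double>();
    scores.put("sports", 0.2);
    scores.put("politics", 0.4);
    scores.put("tech", 0.4);
    // The first entry carries the best score and every category tied on it.
    System.out.println(groupByScoreDescending(scores).firstEntry());
  }
}
```

Ties only group together when the double values are exactly equal, which is
the same behaviour as the containsKey check in the method above.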


On Thu, Apr 24, 2014 at 7:04 PM, Tech mail <giaconiam...@gmail.com> wrote:

> I think it might also be true that the FeatureGenerator interface in
> doccat is different from the others. Also, I don't think the
> TokenNameFinder interface has a probs() method, which has always made me
> use the ME impl directly.
>
> Sent from my iPhone
>
> > On Apr 24, 2014, at 6:54 PM, William Colen <william.co...@gmail.com> wrote:
> >
> > Yes, it looks nice. Maybe we should redo the whole DocumentCategorizer
> > interface. It is different from the other tools: for example, we can't
> > get the best category of a document with a single call; we need to use
> > two methods.
> >
> >
> >
> > 2014-04-24 18:43 GMT-03:00 Mark G <ma...@apache.org>:
> >
> >> William, that map looks good to me.
> >> In my current project I find this method convenient for getting back the
> >> probs over the categories in the model as a Map. Let me know if there's
> >> anything wrong with it :)
> >>
> >> public Map<String, Double> categoriesAsMap(String text) {
> >>   Map<String, Double> probDist = new HashMap<String, Double>();
> >>
> >>   double[] categorize = categorize(text);
> >>   int catSize = getNumberOfCategories();
> >>   for (int i = 0; i < catSize; i++) {
> >>     String category = getCategory(i);
> >>     probDist.put(category, categorize[getIndex(category)]);
> >>   }
> >>   return probDist;
> >> }
> >>
> >> perhaps we should consider adding this method to abstract some
> >> details....just a thought
> >>
> >>
> >>
> >>
> >>
> >> On Thu, Apr 24, 2014 at 3:56 PM, William Colen <william.co...@gmail.com> wrote:
> >>
> >>> What do you think of adding the following field to the DocumentSample?
> >>>
> >>> Map<String, Object> extraInformation
> >>>
> >>>
> >>> Also, we could add the following methods to the DocumentCategorizer
> >>> interface:
> >>>
> >>> public double[] categorize(String text[], Map<String, Object> extraInformation);
> >>> public double[] categorize(String documentText, Map<String, Object> extraInformation);
> >>>
> >>> Any opinion?
> >>>
> >>> Thank you,
> >>> William
> >>>
> >>>
> >>> 2014-04-17 10:39 GMT-03:00 Mark G <giaconiam...@gmail.com>:
> >>>
> >>>> Another general doccat thought I had is this. In my projects that use
> >>>> Doccat, I created a class called SampleCollection, which simply wrapped
> >>>> a List<DocumentSample> but also provided a method that returned the
> >>>> samples as a DoccatModel (using a properly formatted
> >>>> ByteArrayInputStream of the doccat training format of all the samples).
> >>>> This worked out well because I stored all the samples in a database, and
> >>>> users could CRUD samples for different categories. There was a MapReduce
> >>>> job that, at job startup, read the samples from the database into the
> >>>> SampleCollection, dynamically generated the model, and then used the
> >>>> model to classify all the texts across the cluster, so every MR job ran
> >>>> the latest and greatest model based on the current samples. Not sure if
> >>>> we're interested in something like that, but I see several questions on
> >>>> Stack Overflow asking about iterative model building, and a
> >>>> SampleCollection that returns a Model has worked for me. I also created
> >>>> a SampleCRUD interface that abstracts storage and retrieval of the
> >>>> samples; I had a Postgres and an Accumulo impl for sample storage.
> >>>> Just a thought. I know this can get very specific and complicated, but
> >>>> I thought we might be able to find a middle ground by providing a
> >>>> framework and some generic impls.
> >>>> MG
> >>>>
> >>>>
> >>>> On Thu, Apr 17, 2014 at 8:28 AM, William Colen <william.co...@gmail.com> wrote:
> >>>>
> >>>>> Yes, I don't see how to represent the sentences and paragraphs.
> >>>>>
> >>>>> +1 for the generic Map as suggested by Mark. We already have such
> >>>>> things in other sample classes, like NameSample and POSSample.
> >>>>>
> >>>>> A use case: the 20news corpus is a collection of articles, and each
> >>>>> article contains fields like "From", "Subject", "Organization". Mahout,
> >>>>> which includes a formatter for this corpus, concatenates it all into
> >>>>> the text field, but I think we could improve accuracy by handling this
> >>>>> metadata in a separate feature generator.
> >>>>>
> >>>>>
> >>>>> 2014-04-17 8:37 GMT-03:00 Tech mail <giaconiam...@gmail.com>:
> >>>>>
> >>>>>> I agree, this goes back to the concept of having a "document" model...
> >>>>>> I know that in the prod systems where I've used doccat, storing
> >>>>>> sentences and paragraphs wouldn't make sense; people usually have
> >>>>>> their own domain model for that. I still feel that augmenting the
> >>>>>> DocumentSample object with a generic Map would be helpful in some
> >>>>>> cases and not constraining.
> >>>>>>
> >>>>>> Sent from my iPhone
> >>>>>>
> >>>>>>> On Apr 17, 2014, at 6:35 AM, Jörn Kottmann <kottm...@gmail.com> wrote:
> >>>>>>>
> >>>>>>>> On 04/15/2014 07:45 PM, William Colen wrote:
> >>>>>>>> Hello,
> >>>>>>>>
> >>>>>>>> I've been working with the Doccat module and I am wondering if we
> >>>>>>>> could improve its data structure for the 1.6.0 release.
> >>>>>>>>
> >>>>>>>> Today the DocumentSample has the following attributes:
> >>>>>>>>
> >>>>>>>> - String category
> >>>>>>>> - List<String> text
> >>>>>>>>
> >>>>>>>> I would suggest adding an attribute to hold metadata, or additional
> >>>>>>>> context information. What do you think?
> >>>>>>>
> >>>>>>> Right now the training format contains these two fields per line.
> >>>>>>> Do you want to change the format as well?
> >>>>>>>
> >>>>>>>> Also, what do you think of including sentence and paragraph
> >>>>>>>> information? I don't know if there is anything a feature generator
> >>>>>>>> can extract from it to improve the classification.
> >>>>>>>
> >>>>>>> I guess we only want to do that if there is a use case for it. It
> >>>>>>> will make the processing more complex for the clients, since they
> >>>>>>> would then have to provide sentences and paragraphs instead of just
> >>>>>>> a piece of text.
> >>>>>>>
> >>>>>>> Jörn
> >>
>
