[ 
https://issues.apache.org/jira/browse/MAHOUT-299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12836224#action_12836224
 ] 

Drew Farris commented on MAHOUT-299:
------------------------------------

bq. I'd not throw RuntimeException - IllegalStateException is my favorite for a 
"this can't happen" situation. 

Just wanted to check on this. I think the pattern below is the right one for 
handling exceptions thrown by OutputCollector inside an ObjectIntProcedure 
during a map or reduce phase. Does this look good to everyone else? (I'm not 
sure how the discussion on the list about this ended up.)

{code}
    OpenObjectIntHashMap<String> ngrams = new OpenObjectIntHashMap<String>(...);

    // populate ngrams map here and then...

    try {
      ngrams.forEachPair(new ObjectIntProcedure<String>() {
        @Override
        public boolean apply(String term, int frequency) {
          // obtain components, the leading (n-1)gram and the trailing unigram.
          // TODO: fix for non-whitespace delimited languages.
          int i = term.lastIndexOf(' ');
          if (i != -1) { // bigram, trigram etc
            Gram ngram = new Gram(term, frequency, Gram.Type.NGRAM);
            try {
              collector.collect(
                  new Gram(term.substring(0, i), frequency, Gram.Type.HEAD), ngram);
              collector.collect(
                  new Gram(term.substring(i + 1), frequency, Gram.Type.TAIL), ngram);
            } catch (IOException e) {
              throw new IllegalStateException(e);
            }
          }
          return true;
        }
      });
    }
    catch (IllegalStateException ise) {
      // catch and re-throw original exceptions from the procedures.
      if (ise.getCause() instanceof IOException) {
        throw (IOException) ise.getCause();
      }
      else {
        // wasn't what was expected, so re-throw
        throw ise;
      }
    }
{code}
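
For anyone who wants to try the wrap-and-unwrap pattern without a Mahout or Hadoop classpath, here is a minimal self-contained sketch. The names PairProcedure, Sink, forEachPair and emitAll are hypothetical stand-ins (assumptions for illustration, not the real ObjectIntProcedure, OutputCollector or OpenObjectIntHashMap APIs); only the exception handling mirrors the snippet above.

```java
import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for a callback whose apply() cannot declare checked exceptions.
interface PairProcedure {
  boolean apply(String term, int frequency);
}

// Hypothetical stand-in for a collector whose collect() may throw IOException.
interface Sink {
  void collect(String key, int value) throws IOException;
}

final class WrapAndUnwrap {

  // Visits every entry until the procedure returns false.
  static void forEachPair(Map<String, Integer> map, PairProcedure procedure) {
    for (Map.Entry<String, Integer> e : map.entrySet()) {
      if (!procedure.apply(e.getKey(), e.getValue())) {
        break;
      }
    }
  }

  // Emits every pair to the sink; an IOException raised inside the callback
  // is tunnelled out as an IllegalStateException and restored here.
  static void emitAll(Map<String, Integer> ngrams, final Sink sink) throws IOException {
    try {
      forEachPair(ngrams, new PairProcedure() {
        @Override
        public boolean apply(String term, int frequency) {
          try {
            sink.collect(term, frequency);
          } catch (IOException e) {
            // apply() has no throws clause, so wrap the checked exception.
            throw new IllegalStateException(e);
          }
          return true;
        }
      });
    } catch (IllegalStateException ise) {
      if (ise.getCause() instanceof IOException) {
        // Unwrap and re-throw the original checked exception.
        throw (IOException) ise.getCause();
      }
      // Not one of ours, so re-throw unchanged.
      throw ise;
    }
  }

  public static void main(String[] args) {
    Map<String, Integer> ngrams = new LinkedHashMap<String, Integer>();
    ngrams.put("new york", 3);
    try {
      emitAll(ngrams, new Sink() {
        @Override
        public void collect(String key, int value) throws IOException {
          throw new IOException("simulated write failure");
        }
      });
    } catch (IOException e) {
      System.out.println("caller sees: " + e.getMessage());
    }
  }
}
```

One caveat with this approach: an IllegalStateException thrown for an unrelated reason that happens to carry an IOException as its cause would also be unwrapped. A dedicated private wrapper exception type would avoid that ambiguity.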

> Collocations: improve performance by making Gram BinaryComparable
> -----------------------------------------------------------------
>
>                 Key: MAHOUT-299
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-299
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Utils
>    Affects Versions: 0.3
>            Reporter: Drew Farris
>            Priority: Minor
>             Fix For: 0.3
>
>         Attachments: MAHOUT-299.patch
>
>
> Robin's profiling indicated that a large portion of a run was spent in 
> readFields() in Gram due to the deserialization occurring as a part of Gram 
> comparisons for sorting. He pointed me to BinaryComparable and the 
> implementation in Text.
> Like Text, in this new implementation, Gram stores its string in binary form. 
> When encoding the string at construction time we allocate an extra 
> character's worth of data to hold the Gram type information. When sorting 
> Grams, the binary arrays are compared instead of deserializing and comparing 
> fields.
>  
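
The idea in the description above can be sketched roughly as follows. This is not the code from the attached patch: the class name BinaryGramSketch, the choice to put the type in the first byte, and the type characters 'h' and 't' are all assumptions made for illustration. The point is only that two encoded grams can be ordered by comparing their byte arrays directly, with no deserialization.

```java
import java.nio.charset.StandardCharsets;

final class BinaryGramSketch {

  // Encodes a gram as one type byte followed by the UTF-8 bytes of its text.
  // Placing the type first is an assumption; it makes grams sort by type, then text.
  static byte[] encode(char type, String text) {
    byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
    byte[] out = new byte[utf8.length + 1];
    out[0] = (byte) type;
    System.arraycopy(utf8, 0, out, 1, utf8.length);
    return out;
  }

  // Lexicographic comparison of two byte arrays, treating bytes as unsigned.
  static int compareBytes(byte[] a, byte[] b) {
    int n = Math.min(a.length, b.length);
    for (int i = 0; i < n; i++) {
      int x = a[i] & 0xff;
      int y = b[i] & 0xff;
      if (x != y) {
        return x - y;
      }
    }
    // Equal on the shared prefix: the shorter array sorts first.
    return a.length - b.length;
  }
}
```

With this layout, sorting never needs to rebuild a String or read individual fields; the comparator works on the serialized form alone.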

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.