[ https://issues.apache.org/jira/browse/MAHOUT-299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12836224#action_12836224 ]
Drew Farris commented on MAHOUT-299:
------------------------------------

bq. I'd not throw RuntimeException - IllegalStateException is my favorite for a "this can't happen" situation.

Just wanted to check on this. I think the pattern below is the right one to use to catch exceptions thrown inside ObjectIntProcedures (here, the IOException from OutputCollector) in a map or reduce phase. Does this look good to everyone else? (I'm not sure how the discussion on the list about this ended up.)

{code}
OpenObjectIntHashMap<String> ngrams = new OpenObjectIntHashMap<String>(...);
// populate ngrams map here and then...
try {
  ngrams.forEachPair(new ObjectIntProcedure<String>() {
    @Override
    public boolean apply(String term, int frequency) {
      // obtain components, the leading (n-1)gram and the trailing unigram.
      int i = term.lastIndexOf(' '); // TODO: fix for non-whitespace delimited languages.
      if (i != -1) { // bigram, trigram etc
        Gram ngram = new Gram(term, frequency, Gram.Type.NGRAM);
        try {
          collector.collect(new Gram(term.substring(0, i), frequency, Gram.Type.HEAD), ngram);
          collector.collect(new Gram(term.substring(i + 1), frequency, Gram.Type.TAIL), ngram);
        } catch (IOException e) {
          // apply() cannot declare checked exceptions, so wrap and rethrow.
          throw new IllegalStateException(e);
        }
      }
      return true;
    }
  });
} catch (IllegalStateException ise) {
  // catch and re-throw original exceptions from the procedures.
  if (ise.getCause() instanceof IOException) {
    throw (IOException) ise.getCause();
  } else {
    // wasn't what was expected, so re-throw
    throw ise;
  }
}
{code}

> Collocations: improve performance by making Gram BinaryComparable
> -----------------------------------------------------------------
>
>                 Key: MAHOUT-299
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-299
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Utils
>    Affects Versions: 0.3
>            Reporter: Drew Farris
>            Priority: Minor
>             Fix For: 0.3
>
>         Attachments: MAHOUT-299.patch
>
>
> Robin's profiling indicated that a large portion of a run was spent in
> readFields() in Gram due to the deserialization occurring as a part of Gram
> comparisons for sorting. He pointed me to BinaryComparable and the
> implementation in Text.
> Like Text, in this new implementation, Gram stores its string in binary form.
> When encoding the string at construction time we allocate an extra
> character's worth of data to hold the Gram type information. When sorting
> Grams, the binary arrays are compared instead of deserializing and comparing
> fields.
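
For anyone following along, the binary comparison described in the issue summary can be sketched roughly as below. This is a standalone illustration, not the code in MAHOUT-299.patch. The class name BinaryGramSketch, the type tag values, and the choice to put the type tag in the first byte are all assumptions made for the example, and it does not use Hadoop's BinaryComparable at all.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: a gram encoded as one type byte plus the UTF-8 term bytes.
final class BinaryGramSketch {
    // Assumed tag values; the real Gram.Type encoding may differ.
    static final byte HEAD = 0;
    static final byte TAIL = 1;
    static final byte NGRAM = 2;

    // Reserve one extra byte for the type tag (assumed to be byte 0 here),
    // then copy the UTF-8 bytes of the term after it.
    static byte[] encode(byte type, String term) {
        byte[] text = term.getBytes(StandardCharsets.UTF_8);
        byte[] out = new byte[text.length + 1];
        out[0] = type;
        System.arraycopy(text, 0, out, 1, text.length);
        return out;
    }

    // Recover the term without touching the type byte.
    static String decodeTerm(byte[] encoded) {
        return new String(encoded, 1, encoded.length - 1, StandardCharsets.UTF_8);
    }

    // Lexicographic comparison of the raw bytes, treating each byte as unsigned.
    // No fields are deserialized in order to compare two grams.
    static int compareBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int x = a[i] & 0xff;
            int y = b[i] & 0xff;
            if (x != y) {
                return x - y;
            }
        }
        // All shared bytes are equal, so the shorter array sorts first.
        return a.length - b.length;
    }
}
```

With the tag in the leading byte, as assumed here, grams of the same type sort next to each other and ties are broken by the term bytes. Putting the tag at the end instead would order by term first, so the position is a design choice that affects sort order.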