[ https://issues.apache.org/jira/browse/NUTCH-631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12635259#action_12635259 ]
Edward Quick commented on NUTCH-631:
------------------------------------

I had the same issue and after applying this fix, it didn't work. I got the following error:

2008-09-28 12:25:57,736 INFO indexer.Indexer - Indexing [http://cap-wiki.uk.ba.com/twiki/bin/view/WCE/WebTopicList] with analyzer [EMAIL PROTECTED] (null)
2008-09-28 12:25:57,755 WARN mapred.LocalJobRunner - job_local_16
java.util.NoSuchElementException
        at java.util.TreeMap.key(TreeMap.java:433)
        at java.util.TreeMap.firstKey(TreeMap.java:287)
        at java.util.TreeSet.first(TreeSet.java:407)
        at java.util.Collections$UnmodifiableSortedSet.first(Collections.java:1114)
        at org.apache.nutch.indexer.more.MoreIndexingFilter.addType(MoreIndexingFilter.java:207)
        at org.apache.nutch.indexer.more.MoreIndexingFilter.filter(MoreIndexingFilter.java:90)
        at org.apache.nutch.indexer.IndexingFilters.filter(IndexingFilters.java:111)
        at org.apache.nutch.indexer.Indexer.reduce(Indexer.java:249)
        at org.apache.nutch.indexer.Indexer.reduce(Indexer.java:52)
        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:391)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:201)

> MoreIndexingFilter fails with NoSuchElementException
> ----------------------------------------------------
>
>                 Key: NUTCH-631
>                 URL: https://issues.apache.org/jira/browse/NUTCH-631
>             Project: Nutch
>          Issue Type: Bug
>          Components: indexer
>    Affects Versions: 1.0.0
>         Environment: Verified on CentOS and OSX
>            Reporter: Stefan Will
>             Fix For: 1.0.0
>
>
> I did a simple crawl and started the indexer with the index-more plugin
> activated.
> The index job fails with the following stack trace in the task log:
>
> java.util.NoSuchElementException
>         at java.util.TreeMap.key(TreeMap.java:433)
>         at java.util.TreeMap.firstKey(TreeMap.java:287)
>         at java.util.TreeSet.first(TreeSet.java:407)
>         at java.util.Collections$UnmodifiableSortedSet.first(Collections.java:1114)
>         at org.apache.nutch.indexer.more.MoreIndexingFilter.addType(MoreIndexingFilter.java:207)
>         at org.apache.nutch.indexer.more.MoreIndexingFilter.filter(MoreIndexingFilter.java:90)
>         at org.apache.nutch.indexer.IndexingFilters.filter(IndexingFilters.java:111)
>         at org.apache.nutch.indexer.Indexer.reduce(Indexer.java:249)
>         at org.apache.nutch.indexer.Indexer.reduce(Indexer.java:52)
>         at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:333)
>         at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:164)
>
> I traced this down to the part in MoreIndexingFilter where the mime type is
> split into primary type and subtype for indexing:
>
>     contentType = mimeType.getName();
>     String primaryType = mimeType.getSuperType().getName();
>     String subType = mimeType.getSubTypes().first().getName();
>
> Apparently Tika does not have a subtype for text/html. Furthermore, the
> supertype for text/html is set as application/octet-stream, which I doubt is
> what we want indexed. Don't we want primaryType to be "text" and subType to
> be "html"?
>
> So I changed the code to:
>
>     contentType = mimeType.getName();
>     String[] split = contentType.split("/");
>     String primaryType = split[0];
>     String subType = (split.length > 1) ? split[1] : null;
>
> This does what I think it should do, but perhaps I'm missing something?
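
For anyone who wants to try the two behaviours from the quoted description outside of Nutch, here is a minimal sketch that uses only the JDK. It is not the MoreIndexingFilter code: the class and method names (MimeSplitSketch, splitType, firstOnEmptySetThrows) are made up for illustration, and an empty TreeSet stands in for a Tika mime type with no registered subtypes, which is an assumption based on the stack trace above.

```java
import java.util.Collections;
import java.util.NoSuchElementException;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical stand-alone sketch; not part of Nutch or Tika.
public class MimeSplitSketch {

    // Mirrors the split proposed in the issue: "text/html" -> {"text", "html"}.
    // The second element is null when the name contains no "/".
    static String[] splitType(String contentType) {
        String[] split = contentType.split("/");
        String primaryType = split[0];
        String subType = (split.length > 1) ? split[1] : null;
        return new String[] { primaryType, subType };
    }

    // Assumption: a type with no subtypes behaves like an empty sorted set.
    // Calling first() on it throws, as in the reported stack trace.
    static boolean firstOnEmptySetThrows() {
        SortedSet<String> noSubTypes =
                Collections.unmodifiableSortedSet(new TreeSet<String>());
        try {
            noSubTypes.first();
            return false;
        } catch (NoSuchElementException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        String[] parts = splitType("text/html");
        System.out.println("primaryType=" + parts[0] + ", subType=" + parts[1]);
        System.out.println("first() on empty set throws: " + firstOnEmptySetThrows());
    }
}
```

Splitting the name string avoids the sorted-set lookup entirely, so it cannot raise NoSuchElementException. Whether a null subType is acceptable to the rest of the indexing code is a separate question that this sketch does not answer.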