[ https://issues.apache.org/jira/browse/NUTCH-631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Edward Quick updated NUTCH-631:
-------------------------------

    Comment: was deleted

> MoreIndexingFilter fails with NoSuchElementException
> ----------------------------------------------------
>
>                 Key: NUTCH-631
>                 URL: https://issues.apache.org/jira/browse/NUTCH-631
>             Project: Nutch
>          Issue Type: Bug
>          Components: indexer
>    Affects Versions: 1.0.0
>         Environment: Verified on CentOS and OSX
>            Reporter: Stefan Will
>             Fix For: 1.0.0
>
>
> I did a simple crawl and started the indexer with the index-more plugin
> activated. The index job fails with the following stack trace in the task log:
>
> java.util.NoSuchElementException
>     at java.util.TreeMap.key(TreeMap.java:433)
>     at java.util.TreeMap.firstKey(TreeMap.java:287)
>     at java.util.TreeSet.first(TreeSet.java:407)
>     at java.util.Collections$UnmodifiableSortedSet.first(Collections.java:1114)
>     at org.apache.nutch.indexer.more.MoreIndexingFilter.addType(MoreIndexingFilter.java:207)
>     at org.apache.nutch.indexer.more.MoreIndexingFilter.filter(MoreIndexingFilter.java:90)
>     at org.apache.nutch.indexer.IndexingFilters.filter(IndexingFilters.java:111)
>     at org.apache.nutch.indexer.Indexer.reduce(Indexer.java:249)
>     at org.apache.nutch.indexer.Indexer.reduce(Indexer.java:52)
>     at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:333)
>     at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:164)
>
> I traced this down to the part in MoreIndexingFilter where the mime type is
> split into primary type and subtype for indexing:
>
>     contentType = mimeType.getName();
>     String primaryType = mimeType.getSuperType().getName();
>     String subType = mimeType.getSubTypes().first().getName();
>
> Apparently Tika does not have a subtype for text/html. Furthermore, the
> supertype for text/html is set as application/octet-stream, which I doubt is
> what we want indexed. Don't we want primaryType to be "text" and subType to
> be "html"?
> So I changed the code to:
>
>     contentType = mimeType.getName();
>     String[] split = contentType.split("/");
>     String primaryType = split[0];
>     String subType = (split.length > 1) ? split[1] : null;
>
> This does what I think it should do, but perhaps I'm missing something?
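
The split-based approach proposed in the report can be tried in isolation, without Nutch or Tika on the classpath. The following is only a minimal sketch of that idea: the class name ContentTypeSplit and the helper splitContentType are hypothetical and are not part of MoreIndexingFilter. It assumes the input is a bare type name such as "text/html" with no parameters (no "; charset=..." suffix).

```java
// Hypothetical stand-alone sketch; not the actual MoreIndexingFilter code.
class ContentTypeSplit {

    /**
     * Splits a MIME type name into {primaryType, subType}.
     * subType is null when the name contains no "/".
     */
    public static String[] splitContentType(String contentType) {
        // A limit of 2 splits on the first "/" only.
        String[] split = contentType.split("/", 2);
        String primaryType = split[0];
        String subType = (split.length > 1) ? split[1] : null;
        return new String[] { primaryType, subType };
    }

    public static void main(String[] args) {
        String[] html = splitContentType("text/html");
        System.out.println(html[0] + " / " + html[1]);   // prints "text / html"

        String[] bare = splitContentType("unknown");
        System.out.println(bare[0] + " / " + bare[1]);   // prints "unknown / null"
    }
}
```

Because the name is split directly, no lookup into a possibly empty set of subtypes is needed, so the NoSuchElementException from the stack trace cannot occur on this path. A caller would still need to handle the null subType.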