[ https://issues.apache.org/jira/browse/NUTCH-631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12635322#action_12635322 ]
Edward Quick commented on NUTCH-631:
------------------------------------

Even after applying the fix above I still ran into problems.

$ search/bin/nutch index crawled/newindexes crawled/crawldb crawled/linkdb crawled/segments/*
..
..
Indexing [http://cap-wiki.somedomain.com/twiki/bin/view/WCE/WikiHintsAndTips] with analyzer [EMAIL PROTECTED] (null)
29-Sep-2008 10:44:17 org.apache.nutch.util.MimeUtil forName
WARNING: Exception getting mime type by name: [text/html; charset=iso-8859-15]: Message: Invalid media type name: text/html; charset=iso-8859-15
Indexing [http://cap-wiki.somedomain.com/twiki/bin/view/WCE/WorldCargo] with analyzer [EMAIL PROTECTED] (null)
29-Sep-2008 10:44:17 org.apache.nutch.util.MimeUtil forName
WARNING: Exception getting mime type by name: [text/html; charset=iso-8859-15]: Message: Invalid media type name: text/html; charset=iso-8859-15
Indexing [http://cap-wiki.somedomain.com/twiki/bin/view/WCE/WrongJVMArgsPickedUp] with analyzer [EMAIL PROTECTED] (null)
29-Sep-2008 10:44:17 org.apache.nutch.util.MimeUtil forName
WARNING: Exception getting mime type by name: [text/html; charset=iso-8859-15]: Message: Invalid media type name: text/html; charset=iso-8859-15
Indexing [http://cap-wiki.somedomain.com/twiki/bin/view/WCE/XPHintsandTips] with analyzer [EMAIL PROTECTED] (null)
Indexer: java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1062)
        at org.apache.nutch.indexer.Indexer.index(Indexer.java:311)
        at org.apache.nutch.indexer.Indexer.run(Indexer.java:333)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.nutch.indexer.Indexer.main(Indexer.java:316)

In hadoop.log:

2008-09-29 10:44:17,686 INFO indexer.Indexer - Indexing [http://cap-wiki.somedomain.com/twiki/bin/view/WCE/WrongJVMArgsPickedUp] with analyzer [EMAIL PROTECTED] (null)
2008-09-29 10:44:17,688 INFO indexer.Indexer - Indexing [http://cap-wiki.somedomain.com/twiki/bin/view/WCE/XPHintsandTips] with analyzer [EMAIL PROTECTED] (null)
2008-09-29 10:44:17,699 WARN mapred.LocalJobRunner - job_local_1
java.util.NoSuchElementException
        at java.util.TreeMap.key(TreeMap.java:433)
        at java.util.TreeMap.firstKey(TreeMap.java:287)
        at java.util.TreeSet.first(TreeSet.java:407)
        at java.util.Collections$UnmodifiableSortedSet.first(Collections.java:1114)
        at org.apache.nutch.indexer.more.MoreIndexingFilter.addType(MoreIndexingFilter.java:207)
        at org.apache.nutch.indexer.more.MoreIndexingFilter.filter(MoreIndexingFilter.java:90)
        at org.apache.nutch.indexer.IndexingFilters.filter(IndexingFilters.java:111)
        at org.apache.nutch.indexer.Indexer.reduce(Indexer.java:249)
        at org.apache.nutch.indexer.Indexer.reduce(Indexer.java:52)
        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:391)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:201)
2008-09-29 10:44:18,247 FATAL indexer.Indexer - Indexer: java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1062)
        at org.apache.nutch.indexer.Indexer.index(Indexer.java:311)
        at org.apache.nutch.indexer.Indexer.run(Indexer.java:333)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.nutch.indexer.Indexer.main(Indexer.java:316)

> MoreIndexingFilter fails with NoSuchElementException
> ----------------------------------------------------
>
>                 Key: NUTCH-631
>                 URL: https://issues.apache.org/jira/browse/NUTCH-631
>             Project: Nutch
>          Issue Type: Bug
>          Components: indexer
>    Affects Versions: 1.0.0
>         Environment: Verified on CentOS and OSX
>            Reporter: Stefan Will
>             Fix For: 1.0.0
>
>
> I did a simple crawl and started the indexer with the index-more plugin activated. The index job fails with the following stack trace in the task log:
> java.util.NoSuchElementException
>         at java.util.TreeMap.key(TreeMap.java:433)
>         at java.util.TreeMap.firstKey(TreeMap.java:287)
>         at java.util.TreeSet.first(TreeSet.java:407)
>         at java.util.Collections$UnmodifiableSortedSet.first(Collections.java:1114)
>         at org.apache.nutch.indexer.more.MoreIndexingFilter.addType(MoreIndexingFilter.java:207)
>         at org.apache.nutch.indexer.more.MoreIndexingFilter.filter(MoreIndexingFilter.java:90)
>         at org.apache.nutch.indexer.IndexingFilters.filter(IndexingFilters.java:111)
>         at org.apache.nutch.indexer.Indexer.reduce(Indexer.java:249)
>         at org.apache.nutch.indexer.Indexer.reduce(Indexer.java:52)
>         at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:333)
>         at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:164)
> I traced this down to the part in MoreIndexingFilter where the mime type is split into primary type and subtype for indexing:
>     contentType = mimeType.getName();
>     String primaryType = mimeType.getSuperType().getName();
>     String subType = mimeType.getSubTypes().first().getName();
> Apparently Tika does not have a subtype for text/html. Furthermore, the supertype for text/html is set to application/octet-stream, which I doubt is what we want indexed. Don't we want primaryType to be "text" and subType to be "html"?
> So I changed the code to:
>     contentType = mimeType.getName();
>     String[] split = contentType.split("/");
>     String primaryType = split[0];
>     String subType = (split.length > 1) ? split[1] : null;
> This does what I think it should do, but perhaps I'm missing something?

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
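For readers following along, below is a small standalone sketch of the split-based approach proposed in the issue description. It is only an illustration, not the committed NUTCH-631 patch: the class and method names are invented here, and stripping "; charset=..." style parameters before splitting is an assumption prompted by the MimeUtil warnings in the comment above, not something the original report does.

// Hypothetical, self-contained sketch of the proposed split-based approach.
// Not the actual MoreIndexingFilter code; the names and the parameter-stripping
// step are assumptions for illustration only.
public class ContentTypeSplitSketch {

    /** Returns {primaryType, subType}; subType is null when no "/" is present. */
    static String[] splitContentType(String contentType) {
        if (contentType == null) {
            return new String[] { null, null };
        }
        // Drop parameters such as "; charset=iso-8859-15", which the logs above
        // show being rejected when passed to the mime-type lookup verbatim.
        int semicolon = contentType.indexOf(';');
        String bare = (semicolon >= 0 ? contentType.substring(0, semicolon) : contentType).trim();
        String[] split = bare.split("/");
        String primaryType = split[0];
        String subType = (split.length > 1) ? split[1] : null;
        return new String[] { primaryType, subType };
    }

    public static void main(String[] args) {
        // The content type seen repeatedly in the logs above.
        String[] parts = splitContentType("text/html; charset=iso-8859-15");
        System.out.println("primaryType=" + parts[0] + " subType=" + parts[1]);
    }
}

Run against "text/html; charset=iso-8859-15", this prints primaryType=text subType=html, which matches the values the description argues should be indexed.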