[ https://issues.apache.org/jira/browse/NUTCH-1461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13446487#comment-13446487 ]
Christian Johnsson edited comment on NUTCH-1461 at 9/1/12 10:28 AM:
--------------------------------------------------------------------

Sure, this one should do the trick.
I also changed the following in regex-urlfilter.txt; it should filter out invalid domain names.

+^(http|https)\://[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(:[a-zA-Z0-9]*)?/?([a-zA-Z0-9\-\._\?\,\'/\\\+&%\$#\=~])*[^\.\,\)\(\s]$

#accept anything else
#+.

Good luck :-)

      was (Author: mr.johnsson):
    Sure, this one should do the trick.
I also added the following line in regex-urlfilter.txt; it should filter out invalid domain names.

+^(http|https)\://[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(:[a-zA-Z0-9]*)?/?([a-zA-Z0-9\-\._\?\,\'/\\\+&%\$#\=~])*[^\.\,\)\(\s]$

#accept anything else
#+.

Good luck :-)

> Problem with TableUtil
> ----------------------
>
>                 Key: NUTCH-1461
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1461
>             Project: Nutch
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: nutchgora
>         Environment: Debian / CDH3 / Nutch 2.0 Release
>            Reporter: Christian Johnsson
>         Attachments: TabelUtil_Fix.patch
>
>
> Affects parse and updatedb.
> I think I got some malformed URLs into HBase, but I can't find them.
> It generates this error, though. If I empty HBase and restart, it runs for a
> couple of million indexed pages and then the error comes up again. Any tips on
> how to locate which row in the table generates this error?
> 2012-08-28 01:48:10,871 WARN org.apache.hadoop.mapred.Child: Error running child
> java.lang.ArrayIndexOutOfBoundsException: 1
> 	at org.apache.nutch.util.TableUtil.unreverseUrl(TableUtil.java:98)
> 	at org.apache.nutch.parse.ParserJob$ParserMapper.map(ParserJob.java:102)
> 	at org.apache.nutch.parse.ParserJob$ParserMapper.map(ParserJob.java:76)
> 	at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
> 	at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:647)
> 	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:323)
> 	at org.apache.hadoop.mapred.Child$4.run(Child.java:266)
> 	at java.security.AccessController.doPrivileged(Native Method)
> 	at javax.security.auth.Subject.doAs(Subject.java:396)
> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1278)
> 	at org.apache.hadoop.mapred.Child.main(Child.java:260)
> 2012-08-28 01:48:10,875 INFO org.apache.hadoop.mapred.Task: Runnning cleanup for the task

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
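
The regex-urlfilter.txt accept rule from the comment can be sanity-checked outside Nutch before a crawl. Below is a minimal sketch, not Nutch code: UrlFilterCheck and accepts() are hypothetical names, and the use of find() is an assumption about how the regex filter applies a rule. The leading '+' of the rule (the accept marker) is dropped, and each backslash is doubled for the Java string literal. With the catch-all "+." commented out, the assumption is that a URL matching no rule is rejected.

```java
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper for trying the rule by hand; not part of Nutch.
final class UrlFilterCheck {

    // The accept rule from regex-urlfilter.txt, without its leading '+'.
    // Every backslash of the original rule is doubled for the Java literal.
    private static final Pattern ACCEPT = Pattern.compile(
        "^(http|https)\\://[a-zA-Z0-9\\-\\.]+\\.[a-zA-Z]{2,3}(:[a-zA-Z0-9]*)?/?"
        + "([a-zA-Z0-9\\-\\._\\?\\,\\'/\\\\\\+&%\\$#\\=~])*[^\\.\\,\\)\\(\\s]$");

    // Assumption: a rule applies when the pattern is found in the URL.
    static boolean accepts(String url) {
        return ACCEPT.matcher(url).find();
    }

    public static void main(String[] args) {
        // Sample URLs chosen for illustration only.
        List<String> samples = List.of(
            "http://example.com/page",
            "http://nodots/",
            "ftp://example.com/");
        for (String url : samples) {
            System.out.println(url + " -> " + (accepts(url) ? "accept" : "reject"));
        }
    }
}
```

Running a handful of known-bad URLs from the crawl through accepts() shows whether the rule would have kept them out of HBase in the first place.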