[jira] [Commented] (NUTCH-585) [PARSE-HTML plugin] Block certain parts of HTML code from being indexed
[ https://issues.apache.org/jira/browse/NUTCH-585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13255502#comment-13255502 ] Markus Jelsma commented on NUTCH-585:

This issue is not going to be part of Nutch 1.5, which is likely to be released very soon. However, you can download the patch and see if it works for you; the plugin builds fine for 1.4, 1.5 and the to-be 1.6-SNAPSHOT.

[PARSE-HTML plugin] Block certain parts of HTML code from being indexed
Key: NUTCH-585 URL: https://issues.apache.org/jira/browse/NUTCH-585 Project: Nutch Issue Type: Improvement Affects Versions: 0.9.0 Environment: All operating systems Reporter: Andrea Spinelli Assignee: Markus Jelsma Priority: Minor Fix For: 1.6 Attachments: blacklist_whitelist_plugin.patch, nutch-585-excludeNodes.patch, nutch-585-jostens-excludeDIVs.patch

We are using Nutch to index our own web sites; we would like not to index certain parts of our pages, because we know they are not relevant (for instance, there are several links to change the background color) and generate spurious matches. We have modified the plugin so that it ignores HTML code between certain HTML comments, like <!-- START-IGNORE --> ... ignored part ... <!-- STOP-IGNORE -->. We feel this might be useful to someone else, maybe factorizing the comment strings as constants in the configuration files (say parser.html.ignore.start and parser.html.ignore.stop in nutch-site.xml). We are almost ready to contribute our code snippet. Looking forward to any expression of interest - or to an explanation of why what we are doing is plain wrong!
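The approach described above can be sketched as a DOM walk that toggles a skip flag on the marker comments. A minimal illustration, not the attached patch: NodeWalker is Nutch's own DOM iterator, and sb is assumed to be the StringBuffer that collects indexable text.

{code}
boolean ignoring = false;
NodeWalker walker = new NodeWalker(root); // org.apache.nutch.util.NodeWalker
while (walker.hasNext()) {
  Node node = walker.nextNode();
  if (node.getNodeType() == Node.COMMENT_NODE) {
    String marker = node.getNodeValue().trim();
    if ("START-IGNORE".equals(marker)) ignoring = true;       // marker strings could be read from
    else if ("STOP-IGNORE".equals(marker)) ignoring = false;  // parser.html.ignore.start/stop
  } else if (!ignoring && node.getNodeType() == Node.TEXT_NODE) {
    sb.append(node.getNodeValue()); // only text outside the ignored span is collected
  }
}
{code}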
[jira] [Commented] (NUTCH-585) [PARSE-HTML plugin] Block certain parts of HTML code from being indexed
[ https://issues.apache.org/jira/browse/NUTCH-585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13255528#comment-13255528 ] Markus Jelsma commented on NUTCH-585:

You should take the latest patch: blacklist_whitelist_plugin.patch. It contains example config etc. Please let us know if you get it to work.
[jira] [Commented] (NUTCH-1339) Default URL normalization rules to remove page anchors completely
[ https://issues.apache.org/jira/browse/NUTCH-1339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13255936#comment-13255936 ] Markus Jelsma commented on NUTCH-1339:

The anchor is still removed by the BasicURLNormalizer. We worked around the problem for the AJAXNormalizer by simply changing the normalizer order: first the AJAXNormalizer and then everything else. But when indexing, first run the BasicNormalizer (if enabled) and only then the AJAXNormalizer.

Default URL normalization rules to remove page anchors completely
Key: NUTCH-1339 URL: https://issues.apache.org/jira/browse/NUTCH-1339 Project: Nutch Issue Type: Bug Affects Versions: nutchgora, 1.6 Reporter: Sebastian Nagel Attachments: NUTCH-1339-2.patch, NUTCH-1339.patch

The default rules of URLNormalizerRegex remove the anchor up to the first occurrence of ? or &. The remaining part of the anchor is kept, which may cause a large, possibly infinite number of outlinks when the same document is fetched again and again with different URLs; see http://www.mail-archive.com/user%40nutch.apache.org/msg05940.html Parameters in inner-page anchors are a common practice in AJAX web sites. Currently, crawling AJAX content is not supported (NUTCH-1323).
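For anyone who wants the stricter behaviour right away, a sketch of a regex-normalize.xml rule that drops the anchor and everything after it (assuming the rule format used by urlnormalizer-regex):

{code}
<regex>
  <!-- remove the page anchor completely, including any parameters inside it -->
  <pattern>#.*</pattern>
  <substitution></substitution>
</regex>
{code}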
[jira] [Commented] (NUTCH-1234) Upgrade to Tika 1.1
[ https://issues.apache.org/jira/browse/NUTCH-1234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13244147#comment-13244147 ] Markus Jelsma commented on NUTCH-1234:

Excellent! I'll remember next time! Thanks :)

Upgrade to Tika 1.1
Key: NUTCH-1234 URL: https://issues.apache.org/jira/browse/NUTCH-1234 Project: Nutch Issue Type: Task Reporter: Markus Jelsma Assignee: Markus Jelsma Fix For: 1.5 Attachments: NUTCH-1234-1.5-1.patch
[jira] [Commented] (NUTCH-1321) IDNNormalizer
[ https://issues.apache.org/jira/browse/NUTCH-1321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13242219#comment-13242219 ] Markus Jelsma commented on NUTCH-1321:

...or, we could do a toUnicode for outlinks or directly in the fetcher. This also makes sense because as ASCII these URL's are longer, sometimes much longer. This can stir up trouble for filters that partly rely on string length. If both conversions are implemented in the fetcher or protocol library then we don't have to worry about it, and we have better logging in the fetcher!

IDNNormalizer
Key: NUTCH-1321 URL: https://issues.apache.org/jira/browse/NUTCH-1321 Project: Nutch Issue Type: New Feature Reporter: Markus Jelsma Assignee: Markus Jelsma

Right now, IDN's are indexed as ASCII. An IDNNormalizer is to be used with an indexer so it will convert ASCII URL's to their proper Unicode equivalent.
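Both conversions are available in the JDK; a minimal sketch (note that java.net.IDN operates on host names, so the host has to be extracted from the URL first):

{code}
import java.net.IDN;

String ascii   = IDN.toASCII("例子.測試");                  // -> xn--fsqu00a.xn--g6w251d
String unicode = IDN.toUnicode("xn--fsqu00a.xn--g6w251d"); // -> 例子.測試
{code}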
[jira] [Commented] (NUTCH-1234) Upgrade to Tika 1.1
[ https://issues.apache.org/jira/browse/NUTCH-1234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13242239#comment-13242239 ] Markus Jelsma commented on NUTCH-1234:

Julien or Chris, can either of you check this out? I'm wasting time and gaining frustration! I cannot get it to work :)
[jira] [Commented] (NUTCH-1024) Dynamically set fetchInterval by MIME-type
[ https://issues.apache.org/jira/browse/NUTCH-1024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13241090#comment-13241090 ] Markus Jelsma commented on NUTCH-1024:

I'll change the legacy sys.out to logging. HttpHeaders doesn't have Text representations of the strings but I'll be happy to add them if you want.

Dynamically set fetchInterval by MIME-type
Key: NUTCH-1024 URL: https://issues.apache.org/jira/browse/NUTCH-1024 Project: Nutch Issue Type: New Feature Components: generator Reporter: Markus Jelsma Assignee: Markus Jelsma Priority: Minor Fix For: 1.5 Attachments: AdaptiveFetchSchedule.patch, MimeAdaptiveFetchSchedule.java, NUTCH-1024-1.5-1.patch, Nutch.patch, adaptive-mimetypes.txt

Add a facility to configure default or fixed fetchInterval values by MIME-type. This is useful for conserving resources for files that are known to change frequently, or never, and everything in between.
* simple key\tvalue\n configuration file
* only set fetchInterval for new documents
* keep max fetchInterval fixed by current config
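A hypothetical example of the key\tvalue\n file described above (file name, types and intervals are invented for illustration):

{code}
# adaptive-mimetypes: MIME type <TAB> fetchInterval in seconds for new documents
text/html	86400
application/pdf	2592000
image/png	7776000
{code}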
[jira] [Commented] (NUTCH-1320) IndexChecker and ParseChecker choke on IDN's
[ https://issues.apache.org/jira/browse/NUTCH-1320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13241225#comment-13241225 ] Markus Jelsma commented on NUTCH-1320:

Somewhere down the line IDN's enter the CrawlDB in ASCII, so there is no problem there, but these tools lack conversion. The filter and normalizer checker tools would also benefit. This also suggests the need for an IDNNormalizer that does toUnicode when indexing; you don't want http://xn--*/ URL's in your index.

IndexChecker and ParseChecker choke on IDN's
Key: NUTCH-1320 URL: https://issues.apache.org/jira/browse/NUTCH-1320 Project: Nutch Issue Type: Bug Affects Versions: 1.4 Reporter: Markus Jelsma Assignee: Markus Jelsma Fix For: 1.5 Attachments: NUTCH-1320-1.5-1.patch

These handy debug tools do not handle IDN's and throw an NPE: bin/nutch parsechecker http://例子.測試/%E9%A6%96%E9%A0%81
{code}
Exception in thread "main" java.lang.NullPointerException
  at org.apache.nutch.indexer.IndexingFiltersChecker.run(IndexingFiltersChecker.java:71)
  at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
  at org.apache.nutch.indexer.IndexingFiltersChecker.main(IndexingFiltersChecker.java:116)
{code}
[jira] [Commented] (NUTCH-1024) Dynamically set fetchInterval by MIME-type
[ https://issues.apache.org/jira/browse/NUTCH-1024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13241229#comment-13241229 ] Markus Jelsma commented on NUTCH-1024:

I'll fix the logging, this is old code. The inc and dec rate directives are already in nutch-default but the mime-file and the file itself are missing.
[jira] [Commented] (NUTCH-1024) Dynamically set fetchInterval by MIME-type
[ https://issues.apache.org/jira/browse/NUTCH-1024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13240430#comment-13240430 ] Markus Jelsma commented on NUTCH-1024:

Thoughts? I'd like to send this one in.
[jira] [Commented] (NUTCH-1300) Indexer to normalize URL's
[ https://issues.apache.org/jira/browse/NUTCH-1300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13239418#comment-13239418 ] Markus Jelsma commented on NUTCH-1300:

I think a scope 'index' makes sense. It would make building a two-way normalizer a bit easier. Command-line options can be added, but you can use the -D option as well.

Indexer to normalize URL's
Key: NUTCH-1300 URL: https://issues.apache.org/jira/browse/NUTCH-1300 Project: Nutch Issue Type: New Feature Components: indexer Reporter: Markus Jelsma Assignee: Markus Jelsma Priority: Minor Fix For: 1.5 Attachments: NUTCH-1300-1.5-1.patch

Indexers should be able to normalize URL's. This is useful when a new normalizer is applied to the entire CrawlDB. Without it, some or all records in a segment cannot be indexed at all.
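A sketch of how such a scope could be used from the indexer; the scope name "indexer" is hypothetical until the patch settles on one:

{code}
// normalize with an indexer-specific scope before writing the document
URLNormalizers normalizers = new URLNormalizers(getConf(), "indexer");
String normalized = normalizers.normalize(url, "indexer"); // throws MalformedURLException
{code}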
[jira] [Commented] (NUTCH-1317) Max content length by MIME-type
[ https://issues.apache.org/jira/browse/NUTCH-1317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13233970#comment-13233970 ] Markus Jelsma commented on NUTCH-1317:

I am not sure about the root of the problem. We only use Tika for parsing PDF and (X)HTML and rely on Boilerpipe. Some HTML pages are quite a thing, full of stuff or endless tables; you'll press page down over a hundred times to scroll to the bottom. I've not tested all bad URL's but I think Tika does the job eventually; if not, I'll file a ticket. Most I tested work, given enough time. HTML pages that take more than one second to parse are considered bad; it should be less than 50ms on average. Those that are bad usually contain too many elements and are large in size.

Max content length by MIME-type
Key: NUTCH-1317 URL: https://issues.apache.org/jira/browse/NUTCH-1317 Project: Nutch Issue Type: Improvement Components: parser Affects Versions: 1.4 Reporter: Markus Jelsma Assignee: Markus Jelsma Fix For: 1.5

The good old http.content.length directive is not sufficient in large internet crawls. For example, a 5MB PDF file may be parsed without issues but a 5MB HTML file may time out.
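Since no patch is attached yet, one possible shape for such a configuration, sketched with invented property names analogous to the global directive:

{code}
# hypothetical per-MIME overrides for the global content length limit
parser.content.limit.text/html	1048576
parser.content.limit.application/pdf	10485760
{code}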
[jira] [Commented] (NUTCH-1315) reduce speculation on but ParseOutputFormat doesn't name output files correctly?
[ https://issues.apache.org/jira/browse/NUTCH-1315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13232914#comment-13232914 ] Markus Jelsma commented on NUTCH-1315:

Speculative task execution is enabled by default but the fetch and index jobs disable it. We have disabled speculative execution altogether at some point, only because we need those slots to be free for other jobs. Should extended OutputFormat's take care of this? It isn't clear in MapRed's API docs whether this is a problem. The name parameter is to be unique for the task's part of the output for the entire job, which it is. Wouldn't including a task ID in the output name cause a mess in the final output? In the meantime I would indeed disable speculative execution. In my opinion and experience with Nutch and other jobs it's not really worth it: it takes empty slots that you could use for other jobs, and if there are no other jobs it still takes additional CPU cycles, RAM and disk I/O for a few seconds. I must add that our network is homogeneous (fallacy) and all nodes have almost equal load.

reduce speculation on but ParseOutputFormat doesn't name output files correctly?
Key: NUTCH-1315 URL: https://issues.apache.org/jira/browse/NUTCH-1315 Project: Nutch Issue Type: Bug Components: parser Affects Versions: 1.4 Environment: ubuntu 64bit, hadoop 1.0.1, 3 Node Cluster, segment size 1.5M urls Reporter: Rafael Labels: hadoop, hdfs

From time to time the Reducer log contains the following and one tasktracker gets blacklisted.
{code}
org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException: failed to create file /user/test/crawl/segments/20120316065507/parse_text/part-1/data for DFSClient_attempt_201203151054_0028_r_01_1 on client xx.x.xx.xx.10, because this file is already being created by DFSClient_attempt_201203151054_0028_r_01_0 on xx.xx.xx.9
  at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.recoverLeaseInternal(FSNamesystem.java:1404)
  at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInternal(FSNamesystem.java:1244)
  at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:1186)
  at org.apache.hadoop.hdfs.server.namenode.NameNode.create(NameNode.java:628)
  at sun.reflect.GeneratedMethodAccessor8.invoke(Unknown Source)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
  at java.lang.reflect.Method.invoke(Method.java:597)
  at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:563)
  at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1388)
  at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1384)
  at java.security.AccessController.doPrivileged(Native Method)
  at javax.security.auth.Subject.doAs(Subject.java:396)
  at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1093)
  at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1382)
  at org.apache.hadoop.ipc.Client.call(Client.java:1066)
  at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:225)
  at $Proxy2.create(Unknown Source)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
  at java.lang.reflect.Method.invoke(Method.java:597)
  at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
  at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
  at
  $Proxy2.create(Unknown Source)
  at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.<init>(DFSClient.java:3245)
  at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:713)
  at org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:182)
  at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:555)
  at org.apache.hadoop.io.SequenceFile$RecordCompressWriter.<init>(SequenceFile.java:1132)
  at org.apache.hadoop.io.SequenceFile.createWriter(SequenceFile.java:397)
  at org.apache.hadoop.io.SequenceFile.createWriter(SequenceFile.java:354)
  at org.apache.hadoop.io.SequenceFile.createWriter(SequenceFile.java:476)
  at org.apache.hadoop.io.MapFile$Writer.<init>(MapFile.java:157)
  at org.apache.hadoop.io.MapFile$Writer.<init>(MapFile.java:134)
  at org.apache.hadoop.io.MapFile$Writer.<init>(MapFile.java:92)
  at
{code}
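For anyone hitting this, the workaround mentioned in the comment is a per-job setting; a minimal sketch for Hadoop 1.x (pre-YARN property names), assuming the job is built inside a Tool with getConf():

{code}
JobConf job = new JobConf(getConf());
// disable speculative execution for both phases of this job
job.setBoolean("mapred.map.tasks.speculative.execution", false);
job.setBoolean("mapred.reduce.tasks.speculative.execution", false);
{code}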
[jira] [Commented] (NUTCH-1311) Add response headers to datastore for the protocol-httpclient plugin
[ https://issues.apache.org/jira/browse/NUTCH-1311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13231172#comment-13231172 ] Markus Jelsma commented on NUTCH-1311:

Hi, HTTP response headers are available as Content metadata in trunk.

Add response headers to datastore for the protocol-httpclient plugin
Key: NUTCH-1311 URL: https://issues.apache.org/jira/browse/NUTCH-1311 Project: Nutch Issue Type: Bug Components: fetcher Affects Versions: nutchgora, 1.5 Reporter: Dan Rosher Priority: Minor Fix For: nutchgora Attachments: NUTCH-1311.patch

Response headers need to be added to the page so this plugin can store them in the datastore.
[jira] [Commented] (NUTCH-1314) Impose a limit on the length of outlink target urls
[ https://issues.apache.org/jira/browse/NUTCH-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13231368#comment-13231368 ] Markus Jelsma commented on NUTCH-1314:

This should then also work for the Tika parser and the OutlinkExtractor, I think. Parse-html is similar to parse-tika: if there are no outlinks obtained by getOutlinks in DOMContentUtils, then the OutlinkExtractor is used.

Impose a limit on the length of outlink target urls
Key: NUTCH-1314 URL: https://issues.apache.org/jira/browse/NUTCH-1314 Project: Nutch Issue Type: Improvement Reporter: Ferdy Galema Attachments: NUTCH-1314.patch

In the past we have encountered situations where crawling specific broken sites resulted in ridiculously long urls that caused the stalling of tasks. The regex plugins (normalizing/filtering) processed single urls for hours, if not hanging indefinitely. My suggestion is to limit the outlink url target length as soon as possible. It is a configurable limit; the default is 3000. This should be reasonably long enough for most uses, but sufficiently strict to make sure the regex plugins do not choke on urls that are too long. Please see the attached patch for the Nutchgora implementation. I'd like to hear what you think about this.
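The proposed guard could look roughly like this at outlink-collection time; the property name is invented for illustration:

{code}
// skip overly long outlink targets before they ever reach the regex plugins
int maxLen = conf.getInt("parser.outlink.url.max.length", 3000); // hypothetical key
if (toUrl.length() > maxLen) {
  continue; // drop the outlink instead of letting a normalizer choke on it
}
{code}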
[jira] [Commented] (NUTCH-1310) Nutch to send HTTP-accept header
[ https://issues.apache.org/jira/browse/NUTCH-1310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13229254#comment-13229254 ] Markus Jelsma commented on NUTCH-1310:

Any idea on how to resolve this? Suggestions for code location and header value?

Nutch to send HTTP-accept header
Key: NUTCH-1310 URL: https://issues.apache.org/jira/browse/NUTCH-1310 Project: Nutch Issue Type: Bug Affects Versions: 1.4 Reporter: Markus Jelsma Fix For: 1.5

Nutch does not send an HTTP-accept header with its requests. This is usually not a problem but some firewalls do not like it and will reject the request.
[jira] [Commented] (NUTCH-1310) Nutch to send HTTP-accept header
[ https://issues.apache.org/jira/browse/NUTCH-1310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13229295#comment-13229295 ] Markus Jelsma commented on NUTCH-1310:

Ah, yes, that should work out just fine. Thanks for pointing me to it!
[jira] [Commented] (NUTCH-1305) Domain(blacklist)URLFilter to trim entries
[ https://issues.apache.org/jira/browse/NUTCH-1305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13225209#comment-13225209 ] Markus Jelsma commented on NUTCH-1305:

Thanks Lewis.

Domain(blacklist)URLFilter to trim entries
Key: NUTCH-1305 URL: https://issues.apache.org/jira/browse/NUTCH-1305 Project: Nutch Issue Type: Bug Affects Versions: 1.4 Reporter: Markus Jelsma Assignee: Markus Jelsma Priority: Minor Fix For: 1.5 Attachments: NUTCH-1305-1.5-1.patch

Both filters should handle entries with trailing whitespace.
[jira] [Commented] (NUTCH-1282) linkdb scalability
[ https://issues.apache.org/jira/browse/NUTCH-1282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13221598#comment-13221598 ] Markus Jelsma commented on NUTCH-1282:

There is an issue for that. In my opinion, with that issue implemented the current linkdb can be deprecated. Please check NUTCH-1181 if you have a patch for this.

linkdb scalability
Key: NUTCH-1282 URL: https://issues.apache.org/jira/browse/NUTCH-1282 Project: Nutch Issue Type: Improvement Components: linkdb Affects Versions: 1.4 Reporter: behnam nikbakht

As described in NUTCH-1054, the linkdb is optional in solrindex; its only use is for anchors and it has no impact on scoring. It seems the size of the linkdb in an incremental crawl grows very fast, making it unscalable for huge web sites. So there are two choices: one, drop invertlinks and the linkdb from the crawl; or two, make invertlinks scalable. Invertlinks runs 2 jobs: the first constructs a new linkdb from newly parsed segments, and the second merges the new linkdb with the old linkdb. The second job is unscalable, and we can avoid it with this change in solrIndex: in the class IndexerMapReduce, reduce method, if fetchDatum == null or dbDatum == null or parseText == null or parseData == null, then add the anchor to the doc and update Solr (no insert). Some changes are also required to NutchDocument.
[jira] [Commented] (NUTCH-1258) MoreIndexingFilter should be able to read Content-Type from both parse metadata and content metadata
[ https://issues.apache.org/jira/browse/NUTCH-1258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13220080#comment-13220080 ] Markus Jelsma commented on NUTCH-1258:

The patch won't apply; it complains about being malformed. Also, the Writable class is not imported for some reason. It seems to work. Want me to commit?

MoreIndexingFilter should be able to read Content-Type from both parse metadata and content metadata
Key: NUTCH-1258 URL: https://issues.apache.org/jira/browse/NUTCH-1258 Project: Nutch Issue Type: Improvement Components: indexer Affects Versions: 1.4 Reporter: Markus Jelsma Assignee: Markus Jelsma Priority: Minor Fix For: 1.5 Attachments: NUTCH-1258-1.5-1.patch, NUTCH-1258-v2.patch

The MoreIndexingFilter reads the Content-Type from parse metadata. However, this usually contains a lot of crap because web developers can set it to anything they like. The filter must be able to read the Content-Type field from content metadata as well, because that contains the type detected by Tika's Detector.
[jira] [Commented] (NUTCH-945) Indexing to multiple SOLR Servers
[ https://issues.apache.org/jira/browse/NUTCH-945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13219079#comment-13219079 ] Markus Jelsma commented on NUTCH-945:

Perhaps good to know for other readers: the patches submitted by Sujit are for the Nutch Gora branch.

Indexing to multiple SOLR Servers
Key: NUTCH-945 URL: https://issues.apache.org/jira/browse/NUTCH-945 Project: Nutch Issue Type: Improvement Components: indexer Affects Versions: 1.2 Reporter: Charan Malemarpuram Attachments: MurmurHashPartitioner.java, NonPartitioningPartitioner.java, patch-NUTCH-945.txt

It would be nice to have a default Indexer in Nutch which can submit docs to multiple SOLR servers. Partitioning is always the question when writing to multiple SOLR servers. Default partitioning can be a simple hashcode-based distribution with additional hooks for customization.
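A minimal sketch of the hashcode-based distribution mentioned in the description, not the attached MurmurHashPartitioner; the server array and document are illustrative:

{code}
// pick one of N Solr servers deterministically per URL so that a
// document always lands on the same shard
int shard = (url.hashCode() & Integer.MAX_VALUE) % solrServers.length;
solrServers[shard].add(doc); // solrServers: SolrServer[], doc: SolrInputDocument
{code}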
[jira] [Commented] (NUTCH-1289) In distributed mode URL's are not partitioned
[ https://issues.apache.org/jira/browse/NUTCH-1289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13217157#comment-13217157 ] Markus Jelsma commented on NUTCH-1289:

In trunk, records of the same queue end up in the same fetch list, which corresponds to a single mapper.

In distributed mode URL's are not partitioned
Key: NUTCH-1289 URL: https://issues.apache.org/jira/browse/NUTCH-1289 Project: Nutch Issue Type: Bug Components: fetcher Affects Versions: nutchgora Reporter: Dan Rosher Fix For: nutchgora Attachments: NUTCH-1289.patch

In distributed mode URL's are not partitioned to a specific machine, which means the politeness policy is voided.
[jira] [Commented] (NUTCH-965) Skip parsing for truncated documents
[ https://issues.apache.org/jira/browse/NUTCH-965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13215507#comment-13215507 ] Markus Jelsma commented on NUTCH-965:

Has this been fixed now?

Skip parsing for truncated documents
Key: NUTCH-965 URL: https://issues.apache.org/jira/browse/NUTCH-965 Project: Nutch Issue Type: Improvement Components: parser Reporter: Alexis Assignee: Lewis John McGibbney Fix For: nutchgora, 1.5 Attachments: NUTCH-965-v2.patch, NUTCH-965-v3-nutchgora.txt, NUTCH-965-v3-trunk.txt, parserJob.patch

The issue you're likely to run into when parsing truncated FLV files is described here: http://www.mail-archive.com/user@nutch.apache.org/msg01880.html The parser library gets stuck in an infinite loop as it encounters corrupted data, due to, for example, truncating big binary files at fetch time.
[jira] [Commented] (NUTCH-965) Skip parsing for truncated documents
[ https://issues.apache.org/jira/browse/NUTCH-965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13215561#comment-13215561 ] Markus Jelsma commented on NUTCH-965:

Hi Ferdy, with a parsing fetcher on trunk we see the ParseStatus.success counter rarely being incremented. A test crawl successfully fetches 10,000 records but the success counter hangs around 15 records. Most, if not all, fetched pages are well below the truncating threshold. Cheers
[jira] [Commented] (NUTCH-965) Skip parsing for truncated documents
[ https://issues.apache.org/jira/browse/NUTCH-965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13215585#comment-13215585 ] Markus Jelsma commented on NUTCH-965:

Hmm, cleaning and rebuilding the job fixes that issue here. Please ignore :)
[jira] [Commented] (NUTCH-1283) Radically update all Solr configuration in Nutchgora
[ https://issues.apache.org/jira/browse/NUTCH-1283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13211732#comment-13211732 ] Markus Jelsma commented on NUTCH-1283:

1.4 is the schema version of Solr 3.5. It is up to date.

Radically update all Solr configuration in Nutchgora
Key: NUTCH-1283 URL: https://issues.apache.org/jira/browse/NUTCH-1283 Project: Nutch Issue Type: Improvement Components: indexer Affects Versions: nutchgora Reporter: Lewis John McGibbney Fix For: nutchgora

We're currently running with a schema which states it's 1.4 :0| There should be better support for the newer stuff going on over in Solrland. This issue should track those improvements entirely.
[jira] [Commented] (NUTCH-1246) Upgrade to Hadoop 1.0.0
[ https://issues.apache.org/jira/browse/NUTCH-1246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13210339#comment-13210339 ] Markus Jelsma commented on NUTCH-1246:

Hmm, the jackson dep is still there but it should be removed, as it is properly included with the deps of 1.0.0.

Upgrade to Hadoop 1.0.0
Key: NUTCH-1246 URL: https://issues.apache.org/jira/browse/NUTCH-1246 Project: Nutch Issue Type: Improvement Affects Versions: nutchgora, 1.5 Reporter: Julien Nioche
[jira] [Commented] (NUTCH-1210) DomainBlacklistFilter
[ https://issues.apache.org/jira/browse/NUTCH-1210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13209436#comment-13209436 ] Markus Jelsma commented on NUTCH-1210:

Sure, will include a sample file and change ivy's include path. That property is also not included for the regular domainfilter. I don't think users are likely to change the value.

DomainBlacklistFilter
Key: NUTCH-1210 URL: https://issues.apache.org/jira/browse/NUTCH-1210 Project: Nutch Issue Type: New Feature Reporter: Markus Jelsma Assignee: Markus Jelsma Fix For: 1.5 Attachments: NUTCH-1210-1.5-1.patch

The current DomainFilter acts as a white list. We also need a filter that acts as a black list so we can allow TLD's and/or domains with DomainFilter but blacklist specific subdomains. If we would patch the current DomainFilter for this behaviour, it would break current semantics such as its precedence. Therefore I would propose a new filter instead.
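For the record, such a sample file would follow the same one-entry-per-line format as the regular domain filter; the file name and entries below are invented for illustration:

{code}
# domainblacklist-urlfilter.txt: suffixes, domains or hosts to reject
nl
example.org
sub.example.com
{code}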
[jira] [Commented] (NUTCH-1129) Any23 Nutch plugin
[ https://issues.apache.org/jira/browse/NUTCH-1129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13208309#comment-13208309 ] Markus Jelsma commented on NUTCH-1129:

This is a parser plugin, right? How will this work if we, for example, would like to parse microdata with Any23 and use Tika's BoilerpipeContentHandler for extraction? In the current BP patch we use multiple content handlers to parse all in one go, so I wonder if this could be implemented as such. Please correct me when wrong :)

Any23 Nutch plugin
Key: NUTCH-1129 URL: https://issues.apache.org/jira/browse/NUTCH-1129 Project: Nutch Issue Type: New Feature Components: parser Reporter: Lewis John McGibbney Assignee: Lewis John McGibbney Priority: Minor Fix For: 1.5 Attachments: NUTCH-1129.patch

This plugin should build on the Any23 library to provide us with a plugin which extracts RDF data from HTTP and file resources. Although as of writing Any23 is not part of the ASF, the project is working towards integration into the Apache Incubator. Once the project proves its value, this would be an excellent addition to the Nutch 1.X codebase.
[jira] [Commented] (NUTCH-1262) Map `duplicating` content-types to a single type
[ https://issues.apache.org/jira/browse/NUTCH-1262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13208461#comment-13208461 ] Markus Jelsma commented on NUTCH-1262:

Is this issue still subject to debate? Opinions?

Map `duplicating` content-types to a single type
Key: NUTCH-1262 URL: https://issues.apache.org/jira/browse/NUTCH-1262 Project: Nutch Issue Type: Improvement Affects Versions: 1.4 Reporter: Markus Jelsma Assignee: Markus Jelsma Priority: Minor Fix For: 1.5 Attachments: NUTCH-1262-1.5-1.patch

Similar or duplicating content-types can end up differently in an index. With, for example, both application/xhtml+xml and text/html it is impossible to use a single filter to select `web pages`. See also: http://lucene.472066.n3.nabble.com/application-xhtml-xml-gt-text-html-td3699942.html
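A sketch of the idea as it might look in an indexing filter, with an invented alias table (the attached patch may differ):

{code}
// collapse equivalent MIME types onto one canonical value before indexing
Map<String, String> aliases = new HashMap<String, String>();
aliases.put("application/xhtml+xml", "text/html");
String canonical = aliases.containsKey(detected) ? aliases.get(detected) : detected;
doc.add("type", canonical); // doc is assumed to be the NutchDocument being built
{code}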
[jira] [Commented] (NUTCH-1259) Store detected content type in crawldatum metadata
[ https://issues.apache.org/jira/browse/NUTCH-1259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13207746#comment-13207746 ] Markus Jelsma commented on NUTCH-1259:

Splendid work my friend! The fetcher runs smoothly again! I'll check out your patch for NUTCH-1258 this week. But what about segments fetched with and without this new feature and the db.parsemeta.to.crawldb=Content-Type property? I assume I'd have to update the segments from before this change with the property enabled, and update the segments fetched with this feature without the db.parsemeta.to.crawldb property.

Store detected content type in crawldatum metadata
Key: NUTCH-1259 URL: https://issues.apache.org/jira/browse/NUTCH-1259 Project: Nutch Issue Type: Bug Components: parser Affects Versions: 1.4 Reporter: Markus Jelsma Assignee: Julien Nioche Fix For: 1.5 Attachments: NUTCH-1259-1.5-1.patch

The MIME-type detected by Tika's Detect() API is never added to a Parse's ContentMetaData or ParseMetaData. Because of this, bad Content-Types will end up in the documents.
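The property mentioned above is set in nutch-site.xml; a sketch for readers following along:

{code}
<property>
  <name>db.parsemeta.to.crawldb</name>
  <value>Content-Type</value>
  <!-- comma-separated parse metadata keys to copy into the CrawlDb -->
</property>
{code}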
[jira] [Commented] (NUTCH-1259) TikaParser should not add Content-Type from HTTP Headers to Nutch Metadata
[ https://issues.apache.org/jira/browse/NUTCH-1259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13206142#comment-13206142 ] Markus Jelsma commented on NUTCH-1259:

Great. Thanks!
[jira] [Commented] (NUTCH-1259) TikaParser should not add Content-Type from HTTP Headers to Nutch Metadata
[ https://issues.apache.org/jira/browse/NUTCH-1259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13205413#comment-13205413 ] Markus Jelsma commented on NUTCH-1259:

Sounds good! We already store the Content-Type in the CrawlDatum's metadata for NUTCH-1024 via db.parsemeta.to.crawldb. Wouldn't it be better to store it in the CrawlDatum object itself, just like the signature? Then someone cannot override it by accident.
[jira] [Commented] (NUTCH-1259) TikaParser should not add Content-Type from HTTP Headers to Nutch Metadata
[ https://issues.apache.org/jira/browse/NUTCH-1259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13205548#comment-13205548 ] Markus Jelsma commented on NUTCH-1259:

NUTCH-1024 relies on the Content-Type being added to crawldatum metadata via db.parsemeta.to.crawldb. Anyway, I agree. Will you open another issue? Have a nice weekend :)
[jira] [Commented] (NUTCH-1259) TikaParser should not add Content-Type from HTTP Headers to Nutch Metadata
[ https://issues.apache.org/jira/browse/NUTCH-1259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13204556#comment-13204556 ] Markus Jelsma commented on NUTCH-1259:

Hi, consider the following URL that produces bad output. This URL is not the only one producing bad output; we've seen countless examples that produce funky values in both content meta and parse meta, or no value at all. http://kam.mff.cuni.cz/conferences/GraDR/

The current Nutch trunk shows us the following metadata for this URL, obtained via parsechecker with only parse-tika enabled:
{code}
Content Metadata:
Vary=negotiate,accept,Accept-Encoding
Date=Thu, 09 Feb 2012 14:37:47 GMT
Content-Length=4911
TCN=choice
Content-Encoding=gzip
Content-Location=index.html.bak
Content-Type=application/x-trash
Connection=close
Accept-Ranges=bytes
Server=Apache/2.2.9 (Debian) mod_auth_kerb/5.3 PHP/5.2.6-1+lenny14 with Suhosin-Patch mod_ssl/2.2.9 OpenSSL/0.9.8g

Parse Metadata:
Content-Encoding=ISO-8859-1
{code}
It's an application/x-trash according to content meta, and no data is available in parse meta. But it's just an ordinary HTML page. This cannot be right; from an index point of view we will never know that this is an HTML page. With this patch enabled we get the following output:
{code}
Content Metadata:
Vary=negotiate,accept,Accept-Encoding
Date=Thu, 09 Feb 2012 14:40:15 GMT
Content-Length=4911
TCN=choice
Content-Encoding=gzip
Content-Location=index.html.bak
Content-Type=application/x-trash
Connection=close
Accept-Ranges=bytes
Server=Apache/2.2.9 (Debian) mod_auth_kerb/5.3 PHP/5.2.6-1+lenny14 with Suhosin-Patch mod_ssl/2.2.9 OpenSSL/0.9.8g

Parse Metadata:
Content-Encoding=ISO-8859-1
Content-Type=text/html
{code}
For us this solves all problems, as we now rely only on Tika's MIME-detector and store its result in parse meta. The value in content meta cannot be trusted. It's the same as with languages: when we do not use Tika to detect the language we get all sorts of crap. Since the upgrade to Tika 1.0 and with NUTCH-1230 we obtain the detected MIME-type, but it was not added to the parse meta. Now it is. Do you have another suggestion?
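The detection step referred to above boils down to a single call on Tika's facade; a minimal sketch, where content is assumed to be the fetched Nutch Content object:

{code}
import org.apache.tika.Tika;

Tika tika = new Tika();
// detect from the magic bytes plus the resource name, ignoring the HTTP header
String type = tika.detect(content.getContent(), url);
{code}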
[jira] [Commented] (NUTCH-1269) Generate main problems
[ https://issues.apache.org/jira/browse/NUTCH-1269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13203488#comment-13203488 ] Markus Jelsma commented on NUTCH-1269:

It won't apply to trunk; all hunks fail. Anyway, this issue looks like NUTCH-1074. Segment sizes are uniform and the correct number of records per queue ends up in a segment. I think this duplicates NUTCH-1074, which was fixed for 1.4. What Nutch are you using, Behnam?

Generate main problems
Key: NUTCH-1269 URL: https://issues.apache.org/jira/browse/NUTCH-1269 Project: Nutch Issue Type: Improvement Components: generator Affects Versions: 1.4 Environment: software Reporter: behnam nikbakht Labels: Generate, MaxHostCount, MaxNumSegments Attachments: NUTCH-1269.patch

There are some problems with the current Generate method with the maxNumSegments and maxHostCount options:
1. the sizes of the generated segments are different
2. with the maxHostCount option, it is unclear whether it was applied or not
3. urls from one host are distributed non-uniformly between segments

We change Generator.java as described below, in the Selector class:
{code}
private int maxNumSegments;
private int segmentSize;
private int maxHostCount;

public void configure(JobConf job) {
  ...
  maxNumSegments = job.getInt(GENERATOR_MAX_NUM_SEGMENTS, 1);
  segmentSize = (int) job.getInt(GENERATOR_TOP_N, 1000) / maxNumSegments;
  maxHostCount = job.getInt(GENERATE_MAX_PER_HOST, 100);
  ...
}

public void reduce(FloatWritable key, Iterator<SelectorEntry> values,
    OutputCollector<FloatWritable, SelectorEntry> output, Reporter reporter)
    throws IOException {
  int limit2 = (int) ((limit * 3) / 2);
  while (values.hasNext()) {
    if (count == limit)
      break;
    if (count % segmentSize == 0) {
      if (currentsegmentnum < maxNumSegments - 1) {
        currentsegmentnum++;
      } else
        currentsegmentnum = 0;
    }
    boolean full = true;
    for (int jk = 0; jk < maxNumSegments; jk++) {
      if (segCounts[jk] < segmentSize) {
        full = false;
      }
    }
    if (full) {
      break;
    }
    SelectorEntry entry = values.next();
    Text url = entry.url;
    String urlString = url.toString();
    URL u = null;
    String hostordomain = null;
    try {
      if (normalise && normalizers != null) {
        urlString = normalizers.normalize(urlString,
            URLNormalizers.SCOPE_GENERATE_HOST_COUNT);
      }
      u = new URL(urlString);
      if (byDomain) {
        hostordomain = URLUtil.getDomainName(u);
      } else {
        hostordomain = new URL(urlString).getHost();
      }
      hostordomain = hostordomain.toLowerCase();
      boolean countLimit = true;
      // only filter if we are counting hosts or domains
      // hostCount {a,b,c,d} means that from this host there are a urls in
      // segment 0, b urls in segment 1, and so on
      int[] hostCount = hostCounts.get(hostordomain);
      if (hostCount == null) {
        hostCount = new int[maxNumSegments];
        for (int kl = 0; kl < hostCount.length; kl++)
          hostCount[kl] = 0;
        hostCounts.put(hostordomain, hostCount);
      }
      // pick the segment that currently holds the fewest urls of this host
      int selectedSeg = currentsegmentnum;
      int minCount = hostCount[selectedSeg];
      for (int jk = 0; jk < maxNumSegments; jk++) {
        if (hostCount[jk] < minCount) {
          minCount = hostCount[jk];
          selectedSeg = jk;
        }
      }
      if (hostCount[selectedSeg] <= maxHostCount) {
        count++;
        entry.segnum = new IntWritable(selectedSeg);
        hostCount[selectedSeg]++;
        output.collect(key, entry);
      }
    } catch (Exception e) {
      LOG.warn("Malformed URL: '" + urlString + "', skipping ("
          + StringUtils.stringifyException(e) + ")");
    }
  }
}
{code}
[jira] [Commented] (NUTCH-1269) Generate main problems
[ https://issues.apache.org/jira/browse/NUTCH-1269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13203501#comment-13203501 ] Markus Jelsma commented on NUTCH-1269:

Ah, yes, I understand now. Your patch is an attempt to spread the host (or domain) limit over all generated segments. Interesting. Can you provide a patch that works with trunk and has this feature enabled via configuration?
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
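The balancing step at the heart of the NUTCH-1269 patch above can be distilled to one decision: for each URL, consult the host's per-segment counters and pick the least-filled segment. A minimal sketch of just that step (hypothetical helper, not code from the patch):
{code}
// Given hostCount[i] = number of URLs already assigned to segment i for
// this host, return the least-filled segment, so URLs from one host are
// spread uniformly over all generated segments.
static int pickSegment(int[] hostCount) {
  int selected = 0;
  for (int i = 1; i < hostCount.length; i++) {
    if (hostCount[i] < hostCount[selected]) {
      selected = i;
    }
  }
  return selected;
}
{code}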
[jira] [Commented] (NUTCH-1005) Index headings plugin
[ https://issues.apache.org/jira/browse/NUTCH-1005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13202219#comment-13202219 ] Markus Jelsma commented on NUTCH-1005: -- I'll commit this one shortly if there are no objections. Thanks. Index headings plugin - Key: NUTCH-1005 URL: https://issues.apache.org/jira/browse/NUTCH-1005 Project: Nutch Issue Type: New Feature Components: indexer, parser Reporter: Markus Jelsma Assignee: Markus Jelsma Priority: Minor Fix For: 1.5 Attachments: HeadingsIndexingFilter.java, HeadingsParseFilter.java, NUTCH-1005-1.4-2.patch, NUTCH-1005-1.4-3.patch, NUTCH-1005-1.5-4.patch, NUTCH-1005-1.5-5.patch Very simple plugin for extracting and indexing a comma separated list of headings via the headings configuration directive. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1266) Subcollection to optionally write to configured fields
[ https://issues.apache.org/jira/browse/NUTCH-1266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13202220#comment-13202220 ] Markus Jelsma commented on NUTCH-1266: -- comments? Subcollection to optionally write to configured fields -- Key: NUTCH-1266 URL: https://issues.apache.org/jira/browse/NUTCH-1266 Project: Nutch Issue Type: Improvement Components: indexer Reporter: Markus Jelsma Assignee: Markus Jelsma Fix For: 1.5 Attachments: NUTCH-1266-1.5-1.patch The subcollection plugin writes the contents of the name element of a given subcollection to the subcollection field. There are cases in which writing to fields other than subcollection is useful. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1210) DomainBlacklistFilter
[ https://issues.apache.org/jira/browse/NUTCH-1210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13202221#comment-13202221 ] Markus Jelsma commented on NUTCH-1210: -- I'll send this one in if there are no objections. DomainBlacklistFilter - Key: NUTCH-1210 URL: https://issues.apache.org/jira/browse/NUTCH-1210 Project: Nutch Issue Type: New Feature Reporter: Markus Jelsma Assignee: Markus Jelsma Fix For: 1.5 Attachments: NUTCH-1210-1.5-1.patch The current DomainFilter acts as a white list. We also need a filter that acts as a black list so we can allow TLDs and/or domains with DomainFilter but blacklist specific subdomains. If we were to patch the current DomainFilter for this behaviour it would break current semantics such as its precedence. Therefore I would propose a new filter instead. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1266) Subcollection to optionally write to configured fields
[ https://issues.apache.org/jira/browse/NUTCH-1266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13202378#comment-13202378 ] Markus Jelsma commented on NUTCH-1266: -- I'll commit this one in a few hours unless there are objections. Subcollection to optionally write to configured fields -- Key: NUTCH-1266 URL: https://issues.apache.org/jira/browse/NUTCH-1266 Project: Nutch Issue Type: Improvement Components: indexer Reporter: Markus Jelsma Assignee: Markus Jelsma Fix For: 1.5 Attachments: NUTCH-1266-1.5-1.patch The subcollection plugin writes the contents of the name element of a given subcollection to the subcollection field. There are cases in which writing to fields other than subcollection is useful. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1259) TikaParser should not add Content-Type from HTTP Headers to Nutch Metadata
[ https://issues.apache.org/jira/browse/NUTCH-1259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13202460#comment-13202460 ] Markus Jelsma commented on NUTCH-1259: -- I'll comment on it myself then: the code above fixes the issue and adds a proper content-type to parsemeta. Consider the following URL with a very bad content-type: http://kam.mff.cuni.cz/conferences/GraDR/ I'll upload a patch in a minute that sets the detected content type in the metadata instead TikaParser should not add Content-Type from HTTP Headers to Nutch Metadata -- Key: NUTCH-1259 URL: https://issues.apache.org/jira/browse/NUTCH-1259 Project: Nutch Issue Type: Bug Components: parser Affects Versions: 1.4 Reporter: Markus Jelsma Assignee: Markus Jelsma Fix For: 1.5 The MIME-type detected by Tika's Detect() API is never added to a Parse's ContentMetaData or ParseMetaData. Because of this bad Content-Types will end up in the documents. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
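For context on the NUTCH-1259 fix discussed above, a minimal sketch of how a reliable type can be obtained from the raw bytes via Tika's detector; this is illustrative only (method and variable names are assumptions, not the attached patch):
{code}
import java.io.ByteArrayInputStream;
import java.io.IOException;
import org.apache.tika.config.TikaConfig;
import org.apache.tika.detect.Detector;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.mime.MediaType;

// Detect the content type from the bytes themselves instead of trusting
// the (often bogus) HTTP Content-Type header.
static String detectType(byte[] raw, String url) throws IOException {
  Metadata md = new Metadata();
  md.set(Metadata.RESOURCE_NAME_KEY, url); // name hint helps detection
  Detector detector = TikaConfig.getDefaultConfig().getDetector();
  MediaType type = detector.detect(new ByteArrayInputStream(raw), md);
  return type.toString(); // e.g. "text/html"
}
{code}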
[jira] [Commented] (NUTCH-1259) TikaParser should not add Content-Type from HTTP Headers to Nutch Metadata
[ https://issues.apache.org/jira/browse/NUTCH-1259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13202483#comment-13202483 ] Markus Jelsma commented on NUTCH-1259: -- You're right. But since you're most of the time the only person reviewing, and this issue has your attention now, what is your opinion on this problem? ;) TikaParser should not add Content-Type from HTTP Headers to Nutch Metadata -- Key: NUTCH-1259 URL: https://issues.apache.org/jira/browse/NUTCH-1259 Project: Nutch Issue Type: Bug Components: parser Affects Versions: 1.4 Reporter: Markus Jelsma Assignee: Markus Jelsma Fix For: 1.5 Attachments: NUTCH-1259-1.5-1.patch The MIME-type detected by Tika's Detect() API is never added to a Parse's ContentMetaData or ParseMetaData. Because of this, bad Content-Types will end up in the documents. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1264) Configurable indexing plugin (index-extra)
[ https://issues.apache.org/jira/browse/NUTCH-1264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13201303#comment-13201303 ] Markus Jelsma commented on NUTCH-1264: -- +1 Didn't manage to test last week but it works like a charm now! I'll upload a headings plugin without indexing that works with this plugin. Configurable indexing plugin (index-extra) --- Key: NUTCH-1264 URL: https://issues.apache.org/jira/browse/NUTCH-1264 Project: Nutch Issue Type: Improvement Components: indexer Affects Versions: 1.5 Reporter: Julien Nioche Attachments: NUTCH-1264-trunk.patch We currently have several plugins already distributed or proposed which do very comparable things : - parse-meta [NUTCH-809] to generate metadata fields in parse-metadata and index them - headings [NUTCH-1005] to generate headings fields in parse-metadata and index them - index-extra [NUTCH-422] to index configurable fields - urlmeta [NUTCH-855] to propagate metadata from the seeds to the outlinks and index them - index-static [NUTCH-940] to generate configurable static fields All these plugins have in common that they allow to extract information from various sources and generate fields from them and are largely redundant. Instead this issue proposes to have a single plugin allowing to generate configurable fields from : - static values - parse metadata - content metadata - crawldb metadata and let the other plugins focus on the parsing and extraction of the values to index. This will make the addition of new fields simpler by relying on a stable common plugin instead of multiplying the code in various plugins. This plugin will replace index-static [NUTCH-940] and index-extra [NUTCH-422] and will serve as a basis for further improvements. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1264) Configurable indexing plugin (index-metadata)
[ https://issues.apache.org/jira/browse/NUTCH-1264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13201392#comment-13201392 ] Markus Jelsma commented on NUTCH-1264: -- Works fine! Configurable indexing plugin (index-metadata) -- Key: NUTCH-1264 URL: https://issues.apache.org/jira/browse/NUTCH-1264 Project: Nutch Issue Type: Improvement Components: indexer Affects Versions: 1.5 Reporter: Julien Nioche Attachments: NUTCH-1264-trunk-v2.patch, NUTCH-1264-trunk.patch We currently have several plugins already distributed or proposed which do very comparable things : - parse-meta [NUTCH-809] to generate metadata fields in parse-metadata and index them - headings [NUTCH-1005] to generate headings fields in parse-metadata and index them - index-extra [NUTCH-422] to index configurable fields - urlmeta [NUTCH-855] to propagate metadata from the seeds to the outlinks and index them - index-static [NUTCH-940] to generate configurable static fields All these plugins have in common that they allow to extract information from various sources and generate fields from them and are largely redundant. Instead this issue proposes to have a single plugin allowing to generate configurable fields from : - static values - parse metadata - content metadata - crawldb metadata and let the other plugins focus on the parsing and extraction of the values to index. This will make the addition of new fields simpler by relying on a stable common plugin instead of multiplying the code in various plugins. This plugin will replace index-extra [NUTCH-422] and will serve as a basis for further improvements. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1005) Index headings plugin
[ https://issues.apache.org/jira/browse/NUTCH-1005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13197870#comment-13197870 ] Markus Jelsma commented on NUTCH-1005: -- Hi! Don't you mean: {code} parse.getData().getParseMeta().set(headings[i], heading.trim()); {code} That still works well with the indexfilter when testing via indexchecker. Index headings plugin - Key: NUTCH-1005 URL: https://issues.apache.org/jira/browse/NUTCH-1005 Project: Nutch Issue Type: New Feature Components: indexer, parser Reporter: Markus Jelsma Assignee: Markus Jelsma Priority: Minor Fix For: 1.5 Attachments: HeadingsIndexingFilter.java, HeadingsParseFilter.java, NUTCH-1005-1.4-2.patch, NUTCH-1005-1.4-3.patch Very simple plugin for extracting and indexing a comma separated list of headings via the headings configuration directive. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1005) Index headings plugin
[ https://issues.apache.org/jira/browse/NUTCH-1005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13197886#comment-13197886 ] Markus Jelsma commented on NUTCH-1005: -- Yes i'll give it a shot this week. Your patch can index fields from content, parse and db metadata which replaces the indexing filter of this headings plugin. I assume i have to disable the indexing filter of this plugin but keep the parse filter since your patch does not do any parsing right? Index headings plugin - Key: NUTCH-1005 URL: https://issues.apache.org/jira/browse/NUTCH-1005 Project: Nutch Issue Type: New Feature Components: indexer, parser Reporter: Markus Jelsma Assignee: Markus Jelsma Priority: Minor Fix For: 1.5 Attachments: HeadingsIndexingFilter.java, HeadingsParseFilter.java, NUTCH-1005-1.4-2.patch, NUTCH-1005-1.4-3.patch, NUTCH-1005-1.5-4.patch Very simple plugin for extracting and indexing a comma separated list of headings via the headings configuration directive. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1005) Index headings plugin
[ https://issues.apache.org/jira/browse/NUTCH-1005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13197894#comment-13197894 ] Markus Jelsma commented on NUTCH-1005: -- index-meta comes to mind! It's exactly what it does right? I'll try the patch with the headings indexing filter disabled and with good results will provide a new patch without the indexing filter extension. Index headings plugin - Key: NUTCH-1005 URL: https://issues.apache.org/jira/browse/NUTCH-1005 Project: Nutch Issue Type: New Feature Components: indexer, parser Reporter: Markus Jelsma Assignee: Markus Jelsma Priority: Minor Fix For: 1.5 Attachments: HeadingsIndexingFilter.java, HeadingsParseFilter.java, NUTCH-1005-1.4-2.patch, NUTCH-1005-1.4-3.patch, NUTCH-1005-1.5-4.patch Very simple plugin for extracting and indexing a comma separated list of headings via the headings configuration directive. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1256) WebGraph to dump host + score
[ https://issues.apache.org/jira/browse/NUTCH-1256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13196830#comment-13196830 ] Markus Jelsma commented on NUTCH-1256: -- I'll commit this one today unless there are objections. WebGraph to dump host + score - Key: NUTCH-1256 URL: https://issues.apache.org/jira/browse/NUTCH-1256 Project: Nutch Issue Type: Improvement Affects Versions: 1.4 Reporter: Markus Jelsma Assignee: Markus Jelsma Fix For: 1.5 Attachments: NUTCH-1256-1.5-1.patch WebGraph's NodeDumper tool can dump url,score information but a host|domain,score output can also be put to good use. This is likely to require a new MapReduce job as the NodeDumper's anatomy is not suited to return max or summed scores. Code could also be merged with the tool. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1242) Allow disabling of URL Filters in ParseSegment
[ https://issues.apache.org/jira/browse/NUTCH-1242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13196962#comment-13196962 ] Markus Jelsma commented on NUTCH-1242: -- Yes, it should. Thanks! It's now changed to equalsIgnoreCase! Allow disabling of URL Filters in ParseSegment -- Key: NUTCH-1242 URL: https://issues.apache.org/jira/browse/NUTCH-1242 Project: Nutch Issue Type: Improvement Reporter: Edward Drapkin Assignee: Markus Jelsma Fix For: 1.5 Attachments: NUTCH-1242-1.5-1.patch, ParseSegment.patch, parseoutputformat.patch, trunk.patch Right now, the ParseSegment job does not allow you to disable URL filtration. For reasons that aren't worth explaining, I need to do this, so I enabled this behavior through the use of a boolean configuration value parse.filter.urls which defaults to true. I've attached a simple, preliminary patch that enables this behavior with that configuration option. I'm not sure if it should be made a command line option or not. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
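The NUTCH-1242 switch boils down to a single configuration read. A sketch of the wiring, assuming the property name from the issue; the surrounding code is illustrative, not the attached patch:
{code}
// Read the toggle once; defaulting to true preserves the old behaviour
// of always running URL filters during parsing.
boolean filterUrls = conf.getBoolean("parse.filter.urls", true);
if (filterUrls) {
  toUrl = filters.filter(toUrl); // URLFilters may reject the URL
}
if (toUrl == null) {
  continue; // skip filtered outlinks
}
{code}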
[jira] [Commented] (NUTCH-1256) WebGraph to dump host + score
[ https://issues.apache.org/jira/browse/NUTCH-1256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13196033#comment-13196033 ] Markus Jelsma commented on NUTCH-1256: -- comments? WebGraph to dump host + score - Key: NUTCH-1256 URL: https://issues.apache.org/jira/browse/NUTCH-1256 Project: Nutch Issue Type: Improvement Affects Versions: 1.4 Reporter: Markus Jelsma Assignee: Markus Jelsma Fix For: 1.5 Attachments: NUTCH-1256-1.5-1.patch WebGraph's NodeDumper tool can dump url,score information but a host|domain,score output can also be put to good use. This is likely to require a new MapReduce job as the NodeDumper's anatomy is not suited to return max or summed scores. Code could also be merged with the tool. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1259) TikaParser should not add Content-Type from HTTP Headers to Nutch Metadata
[ https://issues.apache.org/jira/browse/NUTCH-1259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13196034#comment-13196034 ] Markus Jelsma commented on NUTCH-1259: -- comments please. TikaParser should not add Content-Type from HTTP Headers to Nutch Metadata -- Key: NUTCH-1259 URL: https://issues.apache.org/jira/browse/NUTCH-1259 Project: Nutch Issue Type: Bug Components: parser Affects Versions: 1.4 Reporter: Markus Jelsma Assignee: Markus Jelsma Fix For: 1.5 The MIME-type detected by Tika's Detect() API is never added to a Parse's ContentMetaData or ParseMetaData. Because of this bad Content-Types will end up in the documents. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1258) MoreIndexingFilter should be able to read Content-Type from both parse metadata and content metadata
[ https://issues.apache.org/jira/browse/NUTCH-1258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13192992#comment-13192992 ] Markus Jelsma commented on NUTCH-1258: -- Comments? Tested and things work as expected; tests pass. I'll commit shortly unless there are objections. MoreIndexingFilter should be able to read Content-Type from both parse metadata and content metadata Key: NUTCH-1258 URL: https://issues.apache.org/jira/browse/NUTCH-1258 Project: Nutch Issue Type: Improvement Components: indexer Affects Versions: 1.4 Reporter: Markus Jelsma Assignee: Markus Jelsma Fix For: 1.5 Attachments: NUTCH-1258-1.5-1.patch The MoreIndexingFilter reads the Content-Type from parse metadata. However, this usually contains a lot of crap because web developers can set it to anything they like. The filter must be able to read the Content-Type field from content metadata as well because that contains the type detected by Tika's Detector. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1258) MoreIndexingFilter should be able to read Content-Type from both parse metadata and content metadata
[ https://issues.apache.org/jira/browse/NUTCH-1258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13193001#comment-13193001 ] Markus Jelsma commented on NUTCH-1258: -- That may be a good idea indeed but we need to extend it too. This patch fixes some issues with bad content-types but it seems the problem is bigger. The example URL [1] doesn't provide any Content-Type in ParseMeta and a bad Content-Type in ContentMeta, application/x-trash which is found in the HTTP resp. header. However, parserchecker (and indexchecker) both show contentType: text/html at the top but this value is not added to any metadata AFAIK. In this case only contentType = content.getContentType(); returns the desired Content-Type. Any idea how we can get a hold on that value when we have an instance of ParseData in the MoreIndexingFilter? [1]: http://kam.mff.cuni.cz/conferences/GraDR/ MoreIndexingFilter should be able to read Content-Type from both parse metadata and content metadata Key: NUTCH-1258 URL: https://issues.apache.org/jira/browse/NUTCH-1258 Project: Nutch Issue Type: Improvement Components: indexer Affects Versions: 1.4 Reporter: Markus Jelsma Assignee: Markus Jelsma Fix For: 1.5 Attachments: NUTCH-1258-1.5-1.patch The MoreIndexingFilter reads the Content-Type from parse metadata. However, this usually contains a lot of crap because web developers can set it to anything they like. The filter must be able to read the Content-Type field from content metadata as well because that contains the type detected by Tika's Detector. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
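The lookup order being discussed for NUTCH-1258, as a sketch (not the attached patch): prefer the Tika-detected type in the content metadata and fall back to the parse metadata otherwise.
{code}
// ParseData carries both metadata sets: the content metadata holds the
// type detected by Tika, the parse metadata whatever the server claimed.
String contentType = parseData.getContentMeta().get("Content-Type");
if (contentType == null || contentType.isEmpty()) {
  contentType = parseData.getParseMeta().get("Content-Type");
}
{code}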
[jira] [Commented] (NUTCH-1258) MoreIndexingFilter should be able to read Content-Type from both parse metadata and content metadata
[ https://issues.apache.org/jira/browse/NUTCH-1258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13193019#comment-13193019 ] Markus Jelsma commented on NUTCH-1258: -- Ah, the Content-Type detected by Tika is never added to ParseMeta in the first place! I've modified TikaParser with nutchMetadata.add("Content-Type", mimeType);. In cases where at first I had a bad Content-Type in ParseMeta (but a good one in ContentMeta) I now have good old text/html. The problem is with Content-Types already added to the metadata by the parser. In that case both the good and bad Content-Types are present in ParseMeta. Just as commented in the code, we now have a problem with multivalued fields.
{code}
// populate Nutch metadata with Tika metadata
String[] TikaMDNames = tikamd.names();
for (String tikaMDName : TikaMDNames) {
  if (tikaMDName.equalsIgnoreCase(Metadata.TITLE))
    continue;
  // TODO what if multivalued?
  nutchMetadata.add(tikaMDName, tikamd.get(tikaMDName));
}
{code}
This needs another issue opened but some comments are more than appreciated first. Thanks. MoreIndexingFilter should be able to read Content-Type from both parse metadata and content metadata Key: NUTCH-1258 URL: https://issues.apache.org/jira/browse/NUTCH-1258 Project: Nutch Issue Type: Improvement Components: indexer Affects Versions: 1.4 Reporter: Markus Jelsma Assignee: Markus Jelsma Fix For: 1.5 Attachments: NUTCH-1258-1.5-1.patch The MoreIndexingFilter reads the Content-Type from parse metadata. However, this usually contains a lot of crap because web developers can set it to anything they like. The filter must be able to read the Content-Type field from content metadata as well because that contains the type detected by Tika's Detector. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1259) TikaParser should not add Content-Type from HTTP Headers to Nutch Metadata
[ https://issues.apache.org/jira/browse/NUTCH-1259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13193030#comment-13193030 ] Markus Jelsma commented on NUTCH-1259: -- A solution would be to prevent the type from being added, just like what is already being done with the title field. Now a reliable Content-Type value is added to the ParseMetaData.
{code}
// populate Nutch metadata with Tika metadata
String[] TikaMDNames = tikamd.names();
for (String tikaMDName : TikaMDNames) {
  if (tikaMDName.equalsIgnoreCase(Metadata.TITLE))
    continue;
  // DO NOT ADD Content-Type FROM HTTP HEADERS, ONLY ADD THE DETECTED TYPE,
  // SEE https://issues.apache.org/jira/browse/NUTCH-1259
  if (tikaMDName.equalsIgnoreCase(Metadata.CONTENT_TYPE))
    continue;
  // TODO what if multivalued?
  nutchMetadata.add(tikaMDName, tikamd.get(tikaMDName));
}
// Only add the detected type
nutchMetadata.add("Content-Type", mimeType);
{code}
TikaParser should not add Content-Type from HTTP Headers to Nutch Metadata -- Key: NUTCH-1259 URL: https://issues.apache.org/jira/browse/NUTCH-1259 Project: Nutch Issue Type: Bug Components: parser Affects Versions: 1.4 Reporter: Markus Jelsma Assignee: Markus Jelsma Fix For: 1.5 The MIME-type detected by Tika's Detect() API is never added to a Parse's ContentMetaData or ParseMetaData. Because of this, bad Content-Types will end up in the documents. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1201) Allow for different FetcherThread impls
[ https://issues.apache.org/jira/browse/NUTCH-1201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13187884#comment-13187884 ] Markus Jelsma commented on NUTCH-1201: -- Hi Edward, I've already modified Fetcher to allow for different Fetcher impls via configuration that inherit from Fetcher itself. It works fine and I can override the methods I need. However, it may not be that elegant. There's no code to use other queue impls. I'll cook a patch tomorrow. Allow for different FetcherThread impls --- Key: NUTCH-1201 URL: https://issues.apache.org/jira/browse/NUTCH-1201 Project: Nutch Issue Type: New Feature Components: fetcher Reporter: Markus Jelsma Assignee: Markus Jelsma Fix For: 1.5 For certain cases we need to modify parts in FetcherThread and make it pluggable. This introduces a new config directive fetcher.impl that takes a FQCN and uses that setting in Fetcher.fetch to load a class to use for job.setMapRunnerClass(). This new class has to extend Fetcher and an inner class FetcherThread. This allows for overriding methods in FetcherThread but also methods in Fetcher itself if required. A follow-up on this issue would be to refactor parts of FetcherThread to make it easier to override small sections instead of copying the entire method body for a small change, which is now the case. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1251) Deletion of duplicates fails with org.apache.solr.client.solrj.SolrServerException
[ https://issues.apache.org/jira/browse/NUTCH-1251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13188095#comment-13188095 ] Markus Jelsma commented on NUTCH-1251: -- Can you provide a patch for trunk? Deletion of duplicates fails with org.apache.solr.client.solrj.SolrServerException -- Key: NUTCH-1251 URL: https://issues.apache.org/jira/browse/NUTCH-1251 Project: Nutch Issue Type: Bug Components: indexer Affects Versions: 1.4 Environment: Any crawl where the number of URLs in Solr exceeds 1024 (the default max number of clauses in a Lucene boolean query). Reporter: Arkadi Kosmynin Priority: Critical Fix For: 1.5 Deletion of duplicates fails. This happens because the get all query used to get the Solr index size is id:[* TO *], which is a range query. Lucene tries to expand it to a Boolean query and gets as many clauses as there are ids in the index. This is too many in a real situation and it throws an exception. To correct this problem, change the get all query (SOLR_GET_ALL_QUERY) to \*:\*, which is the standard Solr get all query. Indexing log extract:
{noformat}
java.io.IOException: org.apache.solr.client.solrj.SolrServerException: Error executing query
        at org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat.getRecordReader(SolrDeleteDuplicates.java:236)
        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:338)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
Caused by: org.apache.solr.client.solrj.SolrServerException: Error executing query
        at org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:95)
        at org.apache.solr.client.solrj.SolrServer.query(SolrServer.java:118)
        at org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat.getRecordReader(SolrDeleteDuplicates.java:234)
        ... 3 more
Caused by: org.apache.solr.common.SolrException: Internal Server Error
Internal Server Error
request: http://localhost:8081/arch/select?q=id:[* TO *]&fl=id,boost,tstamp,digest&start=0&rows=82938&wt=javabin&version=2
        at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:430)
        at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:244)
        at org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:89)
        ... 5 more
{noformat}
-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
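For comparison, the era-appropriate SolrJ call with the match-all query proposed in NUTCH-1251; this is an illustrative sketch (solr and numDocs are assumed variables):
{code}
// *:* is handled as a match-all query, so it never expands into one
// boolean clause per document the way the range query id:[* TO *] does.
SolrQuery query = new SolrQuery("*:*");
query.setFields("id", "boost", "tstamp", "digest");
query.setStart(0);
query.setRows(numDocs);
QueryResponse response = solr.query(query);
{code}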
[jira] [Commented] (NUTCH-1247) CrawlDatum.retries should be int
[ https://issues.apache.org/jira/browse/NUTCH-1247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13186177#comment-13186177 ] Markus Jelsma commented on NUTCH-1247: -- Lewis, we're seeing many URLs with a high retry value. When the value is greater than 127 it turns negative. This is in itself not a problem but it seems in my setup it will continue to increase. Andrzej, there may indeed be something wrong. Might this be related to NUTCH-1245 then? There seems to be something wrong with the following CrawlDBReducer code:
{code}
case CrawlDatum.STATUS_FETCH_RETRY:           // temporary failure
  if (oldSet) {
    result.setSignature(old.getSignature());  // use old signature
  }
  result = schedule.setPageRetrySchedule((Text) key, result, prevFetchTime,
      prevModifiedTime, fetch.getFetchTime());
  if (result.getRetriesSinceFetch() < retryMax) {
    result.setStatus(CrawlDatum.STATUS_DB_UNFETCHED);
  } else {
    result.setStatus(CrawlDatum.STATUS_DB_GONE);
  }
  break;
{code}
In setPageRetrySchedule() the number of retries is always incremented. This causes records with exceptions such as UnknownHostException to be refetched for each segment. This makes sense because the first segment in our cycle has many more exceptions than average. Do you follow? CrawlDatum.retries should be int Key: NUTCH-1247 URL: https://issues.apache.org/jira/browse/NUTCH-1247 Project: Nutch Issue Type: Bug Affects Versions: 1.4 Reporter: Markus Jelsma Fix For: 1.5 CrawlDatum.retries is a byte and goes bad with larger values. 12/01/12 18:35:22 INFO crawl.CrawlDbReader: retry -127: 1 12/01/12 18:35:22 INFO crawl.CrawlDbReader: retry -128: 1 -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
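The negative retry counts reported in NUTCH-1247 are plain signed-byte overflow; a two-line demonstration:
{code}
// Java bytes are signed 8-bit values, so incrementing past 127 wraps.
byte retries = 127;           // Byte.MAX_VALUE
retries++;                    // wraps around
System.out.println(retries);  // prints -128
{code}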
[jira] [Commented] (NUTCH-1247) CrawlDatum.retries should be int
[ https://issues.apache.org/jira/browse/NUTCH-1247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13186217#comment-13186217 ] Markus Jelsma commented on NUTCH-1247: -- Alright, then i think this must be related to NUTCH-1245. In that case the record is set to DB_GONE but generated anyway so this counter would continue to increase forever. CrawlDatum.retries should be int Key: NUTCH-1247 URL: https://issues.apache.org/jira/browse/NUTCH-1247 Project: Nutch Issue Type: Bug Affects Versions: 1.4 Reporter: Markus Jelsma Fix For: 1.5 CrawlDatum.retries is a byte and goes bad with larger values. 12/01/12 18:35:22 INFO crawl.CrawlDbReader: retry -127: 1 12/01/12 18:35:22 INFO crawl.CrawlDbReader: retry -128: 1 -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1246) Upgrade to Hadoop 1.0.0
[ https://issues.apache.org/jira/browse/NUTCH-1246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13185499#comment-13185499 ] Markus Jelsma commented on NUTCH-1246: -- This means the Jackson dep can be removed again, as it is now fixed in 1.0.0: HADOOP-7461 (fix to add the Jackson dependency to the Hadoop POM). Upgrade to Hadoop 1.0.0 --- Key: NUTCH-1246 URL: https://issues.apache.org/jira/browse/NUTCH-1246 Project: Nutch Issue Type: Improvement Affects Versions: nutchgora, 1.5 Reporter: Julien Nioche -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1177) Generator to select on retry interval
[ https://issues.apache.org/jira/browse/NUTCH-1177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13185560#comment-13185560 ] Markus Jelsma commented on NUTCH-1177: -- I'll commit this one if there are no objections. Generator to select on retry interval - Key: NUTCH-1177 URL: https://issues.apache.org/jira/browse/NUTCH-1177 Project: Nutch Issue Type: Improvement Components: generator Reporter: Markus Jelsma Assignee: Markus Jelsma Priority: Minor Fix For: 1.5 Attachments: NUTCH-1177-1.5-1.patch The generator already has a mechanism to select entries with a score larger than a specified threshold but should also have a means to select entries with a retry interval lower than specified by a configuration option. Such a feature is particularly useful when dealing with very large crawldbs where you still want a crawl to fetch rapidly changing URLs first. This issue should also add the missing generate.min.score configuration to nutch-default. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
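A sketch of the NUTCH-1177 selection idea, assuming a configuration name of generate.min.interval; the surrounding code is illustrative, not the attached patch:
{code}
// Skip entries whose retry interval is above the configured ceiling, so
// rapidly changing URLs are generated first from an oversized crawldb.
float minInterval = conf.getFloat("generate.min.interval", -1f);
if (minInterval > 0 && crawlDatum.getFetchInterval() > minInterval) {
  return; // entry not selected for this segment
}
{code}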
[jira] [Commented] (NUTCH-1247) CrawlDatum.retries should be int
[ https://issues.apache.org/jira/browse/NUTCH-1247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13185826#comment-13185826 ] Markus Jelsma commented on NUTCH-1247: -- Hints and thoughts are much appreciated, messing with CrawlDatum is pretty invasive. CrawlDatum.retries should be int Key: NUTCH-1247 URL: https://issues.apache.org/jira/browse/NUTCH-1247 Project: Nutch Issue Type: Bug Affects Versions: 1.4 Reporter: Markus Jelsma Fix For: 1.5 CrawlDatum.retries is a byte and goes bad with larger values. 12/01/12 18:35:22 INFO crawl.CrawlDbReader: retry -127: 1 12/01/12 18:35:22 INFO crawl.CrawlDbReader: retry -128: 1 -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1244) CrawlDBDumper to filter by regex
[ https://issues.apache.org/jira/browse/NUTCH-1244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13182530#comment-13182530 ] Markus Jelsma commented on NUTCH-1244: -- I'll commit shortly if there are no objections. CrawlDBDumper to filter by regex Key: NUTCH-1244 URL: https://issues.apache.org/jira/browse/NUTCH-1244 Project: Nutch Issue Type: Improvement Reporter: Markus Jelsma Assignee: Markus Jelsma Priority: Minor Fix For: 1.5 Attachments: NUTCH-1244-1.5-1.patch, NUTCH-1244-1.5-2.patch The CrawlDBDumper tool should be able to filter records by an optional regular expression. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1244) CrawlDBDumper to filter by regex
[ https://issues.apache.org/jira/browse/NUTCH-1244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13180442#comment-13180442 ] Markus Jelsma commented on NUTCH-1244: -- Almost; this doesn't allow for creating mini-crawldbs using that feature. Perhaps a -format crawldb option and setting a MapFileOutputFormat would do the trick. CrawlDBDumper to filter by regex Key: NUTCH-1244 URL: https://issues.apache.org/jira/browse/NUTCH-1244 Project: Nutch Issue Type: Improvement Reporter: Markus Jelsma Assignee: Markus Jelsma Priority: Minor Fix For: 1.5 Attachments: NUTCH-1244-1.5-1.patch The CrawlDBDumper tool should be able to filter records by an optional regular expression. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1245) URL gone with 404 after db.fetch.interval.max stays db_unfetched in CrawlDb and is generated over and over again
[ https://issues.apache.org/jira/browse/NUTCH-1245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13180515#comment-13180515 ] Markus Jelsma commented on NUTCH-1245: -- Thanks! This must be the same issue as NUTCH-578 but marked as related for now. Can you provide a patch? URL gone with 404 after db.fetch.interval.max stays db_unfetched in CrawlDb and is generated over and over again Key: NUTCH-1245 URL: https://issues.apache.org/jira/browse/NUTCH-1245 Project: Nutch Issue Type: Bug Affects Versions: 1.4, 1.5 Reporter: Sebastian Nagel A document gone with 404 after db.fetch.interval.max (90 days) has passed is fetched over and over again; although its fetch status is fetch_gone, its status in CrawlDb stays db_unfetched. Consequently, this document will be generated and fetched from now on in every cycle. To reproduce:
# create a CrawlDatum in CrawlDb whose retry interval hits db.fetch.interval.max (I manipulated the shouldFetch() in AbstractFetchSchedule to achieve this)
# now this URL is fetched again
# but when updating CrawlDb with the fetch_gone the CrawlDatum is reset to db_unfetched, the retry interval is fixed to 0.9 * db.fetch.interval.max (81 days)
# this does not change with every generate-fetch-update cycle, here for two segments:
{noformat}
/tmp/testcrawl/segments/20120105161430
SegmentReader: get 'http://localhost/page_gone'
Crawl Generate::
Status: 1 (db_unfetched)
Fetch time: Thu Jan 05 16:14:21 CET 2012
Modified time: Thu Jan 01 01:00:00 CET 1970
Retries since fetch: 0
Retry interval: 6998400 seconds (81 days)
Metadata: _ngt_: 1325776461784 _pst_: notfound(14), lastModified=0: http://localhost/page_gone

Crawl Fetch::
Status: 37 (fetch_gone)
Fetch time: Thu Jan 05 16:14:48 CET 2012
Modified time: Thu Jan 01 01:00:00 CET 1970
Retries since fetch: 0
Retry interval: 6998400 seconds (81 days)
Metadata: _ngt_: 1325776461784 _pst_: notfound(14), lastModified=0: http://localhost/page_gone

/tmp/testcrawl/segments/20120105161631
SegmentReader: get 'http://localhost/page_gone'
Crawl Generate::
Status: 1 (db_unfetched)
Fetch time: Thu Jan 05 16:16:23 CET 2012
Modified time: Thu Jan 01 01:00:00 CET 1970
Retries since fetch: 0
Retry interval: 6998400 seconds (81 days)
Metadata: _ngt_: 1325776583451 _pst_: notfound(14), lastModified=0: http://localhost/page_gone

Crawl Fetch::
Status: 37 (fetch_gone)
Fetch time: Thu Jan 05 16:20:05 CET 2012
Modified time: Thu Jan 01 01:00:00 CET 1970
Retries since fetch: 0
Retry interval: 6998400 seconds (81 days)
Metadata: _ngt_: 1325776583451 _pst_: notfound(14), lastModified=0: http://localhost/page_gone
{noformat}
As far as I can see it's caused by setPageGoneSchedule() in AbstractFetchSchedule. 
Some pseudo-code:
{code}
setPageGoneSchedule (called from update / CrawlDbReducer.reduce):
  datum.fetchInterval = 1.5 * datum.fetchInterval  // now 1.5 * 0.9 * maxInterval
  datum.fetchTime = fetchTime + datum.fetchInterval  // see NUTCH-516
  if (maxInterval < datum.fetchInterval)  // necessarily true
    forceRefetch()

forceRefetch:
  if (datum.fetchInterval > maxInterval)  // true because it's 1.35 * maxInterval
    datum.fetchInterval = 0.9 * maxInterval
  datum.status = db_unfetched

shouldFetch (called from generate / Generator.map):
  if ((datum.fetchTime - curTime) > maxInterval)
    // always true if the crawler is launched in short intervals
    // (lower than 0.35 * maxInterval)
    datum.fetchTime = curTime  // forces a refetch
{code}
After setPageGoneSchedule is called via update, the state is db_unfetched and the retry interval is 0.9 * db.fetch.interval.max (81 days). Although the fetch time in the CrawlDb is far in the future
{noformat}
% nutch readdb testcrawl/crawldb -url http://localhost/page_gone
URL: http://localhost/page_gone
Version: 7
Status: 1 (db_unfetched)
Fetch time: Sun May 06 05:20:05 CEST 2012
Modified time: Thu Jan 01 01:00:00 CET 1970
Retries since fetch: 0
Retry interval: 6998400 seconds (81 days)
Score: 1.0
Signature: null
Metadata: _pst_: notfound(14), lastModified=0: http://localhost/page_gone
{noformat}
the URL is generated again because (fetch time - current time) is larger than db.fetch.interval.max. The retry interval (datum.fetchInterval) oscillates between 0.9 and 1.35 times db.fetch.interval.max, and the fetch time is always close to current time + 1.35 * db.fetch.interval.max. It's possibly a side effect of NUTCH-516, and may be related to NUTCH-578 -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
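The oscillation described in NUTCH-1245, worked out with the default db.fetch.interval.max of 90 days:
{code}
// Numbers behind the pseudo-code above (units: days).
float max = 90f;                     // db.fetch.interval.max
float interval = 0.9f * max;         // 81, set by forceRefetch()
interval = 1.5f * interval;          // 121.5 = 1.35 * max, setPageGoneSchedule()
boolean refetch = interval > max;    // true, so forceRefetch() resets to 81 again
{code}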
[jira] [Commented] (NUTCH-1241) CrawlDBScanner should also be able to find records
[ https://issues.apache.org/jira/browse/NUTCH-1241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13179374#comment-13179374 ] Markus Jelsma commented on NUTCH-1241: -- Yes, of course, but it is not user friendly. You cannot search for product in http://host/product/123 in a user-friendly manner. Also, using a Matcher would slightly boost performance. CrawlDBScanner should also be able to find records -- Key: NUTCH-1241 URL: https://issues.apache.org/jira/browse/NUTCH-1241 Project: Nutch Issue Type: Improvement Reporter: Markus Jelsma Assignee: Markus Jelsma Priority: Minor Fix For: 1.5 The CrawlDBScanner cannot find partial matches because it uses String.matches(); instead, it should be able to use Matcher.find() to find partial matches. Right now the regex http will never match any records. It can then also reuse a compiled pattern. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
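The matches() versus find() difference in NUTCH-1241, in one example (hypothetical URL):
{code}
import java.util.regex.Pattern;

String url = "http://host/product/123";
// String.matches() anchors the pattern to the whole input:
System.out.println(url.matches("product"));  // false
// A compiled Pattern with Matcher.find() locates partial matches and
// can be reused across all records:
Pattern p = Pattern.compile("product");
System.out.println(p.matcher(url).find());   // true
{code}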
[jira] [Commented] (NUTCH-1241) CrawlDBScanner should also be able to find records
[ https://issues.apache.org/jira/browse/NUTCH-1241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13179386#comment-13179386 ] Markus Jelsma commented on NUTCH-1241: -- hmm yes, if we add a -regex option that goes with the -dump option in the reader we can also have csv output! However, due to NUTCH-1029 i cannot test it properly in a production environment. Care to have a look? CrawlDBScanner should also be able to find records -- Key: NUTCH-1241 URL: https://issues.apache.org/jira/browse/NUTCH-1241 Project: Nutch Issue Type: Improvement Reporter: Markus Jelsma Assignee: Markus Jelsma Priority: Minor Fix For: 1.5 The CrawlDBScanner cannot find partial matches because it uses String.match(); Instead, it should be able to use the Matcher.find() to find partial matches. Right now regex http will never match any records. It can then also reuse a compiled pattern. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1241) CrawlDBScanner should also be able to find records
[ https://issues.apache.org/jira/browse/NUTCH-1241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13179517#comment-13179517 ] Markus Jelsma commented on NUTCH-1241: -- Ugh, that's NUTCH-1084 instead. I don't need that ticket to test the regex dump because dumping the DB still works, reading a record doesn't. CrawlDBScanner should also be able to find records -- Key: NUTCH-1241 URL: https://issues.apache.org/jira/browse/NUTCH-1241 Project: Nutch Issue Type: Improvement Reporter: Markus Jelsma Assignee: Markus Jelsma Priority: Minor Fix For: 1.5 The CrawlDBScanner cannot find partial matches because it uses String.match(); Instead, it should be able to use the Matcher.find() to find partial matches. Right now regex http will never match any records. It can then also reuse a compiled pattern. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1239) Webgraph should remove deleted pages from segment input
[ https://issues.apache.org/jira/browse/NUTCH-1239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13178358#comment-13178358 ] Markus Jelsma commented on NUTCH-1239: -- I'll commit shortly unless there are objections. Thanks. Webgraph should remove deleted pages from segment input --- Key: NUTCH-1239 URL: https://issues.apache.org/jira/browse/NUTCH-1239 Project: Nutch Issue Type: Improvement Reporter: Markus Jelsma Assignee: Markus Jelsma Attachments: NUTCH-1239-1.5-1.patch Webgraph's outlink job is currently unable to remove links. It should expand its segment input and be able to remove nodes for pages that no longer exist. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1232) Remove host|site fields from index-basic
[ https://issues.apache.org/jira/browse/NUTCH-1232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13178373#comment-13178373 ] Markus Jelsma commented on NUTCH-1232: -- I'll remove the site field and commit unless there are objections. Users that have application software relying on that field can simply use a copyField to resolve the issue. Remove host|site fields from index-basic Key: NUTCH-1232 URL: https://issues.apache.org/jira/browse/NUTCH-1232 Project: Nutch Issue Type: Bug Reporter: Markus Jelsma Assignee: Markus Jelsma Priority: Minor Fix For: 1.5 Either field needs to be removed; it makes no sense to have two identical values for separate fields. I propose to get rid of the site field and leave the host field. This may be a breaking change for some installations, however. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1138) remove LogUtil from trunk and nutch gora
[ https://issues.apache.org/jira/browse/NUTCH-1138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13178374#comment-13178374 ] Markus Jelsma commented on NUTCH-1138: -- Lewis, isn't this issue resolved now? remove LogUtil from trunk and nutch gora Key: NUTCH-1138 URL: https://issues.apache.org/jira/browse/NUTCH-1138 Project: Nutch Issue Type: Improvement Affects Versions: 1.4, nutchgora Reporter: Lewis John McGibbney Assignee: Lewis John McGibbney Priority: Minor Fix For: nutchgora, 1.5 Attachments: Document1.txt, NUTCH-1138-trunk-20111023.patch This should move towards the removal of the LogUtil class from both codebases as per comments in NUTCH-1078. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1238) Fetcher throughput threshold must start before feeder finished
[ https://issues.apache.org/jira/browse/NUTCH-1238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13177116#comment-13177116 ] Markus Jelsma commented on NUTCH-1238: -- Tested and works. Unit tests also still pass. I'll commit shortly if there are no objections. Fetcher throughput threshold must start before feeder finished -- Key: NUTCH-1238 URL: https://issues.apache.org/jira/browse/NUTCH-1238 Project: Nutch Issue Type: Improvement Components: fetcher Affects Versions: 1.4 Reporter: Markus Jelsma Assignee: Markus Jelsma Fix For: 1.5 Attachments: NUTCH-1238-1.5-1.patch Right now the fetcher's minimum throughput threshold is activated only when the feeder has finished. However, for various reasons a running fetch can be slow. This issue must change the feature to start checking earlier, but not right after initialization. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1225) Migrate CrawlDBScanner to MapReduce API
[ https://issues.apache.org/jira/browse/NUTCH-1225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13176164#comment-13176164 ] Markus Jelsma commented on NUTCH-1225: -- Old mapred version restored per rev. 1224905. Migrate CrawlDBScanner to MapReduce API --- Key: NUTCH-1225 URL: https://issues.apache.org/jira/browse/NUTCH-1225 Project: Nutch Issue Type: Sub-task Reporter: Markus Jelsma Assignee: Markus Jelsma Fix For: 1.5 Attachments: NUTCH-1225-1.5-1.patch, NUTCH-1225-1.5-2.patch -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-961) Expose Tika's boilerpipe support
[ https://issues.apache.org/jira/browse/NUTCH-961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13176194#comment-13176194 ] Markus Jelsma commented on NUTCH-961: - Fixed already. See NUTCH-1233 for a patch! Expose Tika's boilerpipe support Key: NUTCH-961 URL: https://issues.apache.org/jira/browse/NUTCH-961 Project: Nutch Issue Type: New Feature Components: parser Reporter: Markus Jelsma Assignee: Markus Jelsma Fix For: 1.5 Attachments: BoilerpipeExtractorRepository.java, NUTCH-961-1.3-3.patch, NUTCH-961-1.3-tikaparser.patch, NUTCH-961-1.3-tikaparser1.patch, NUTCH-961-1.4-dombuilder-1.patch, NUTCH-961-1.5-1.patch, NUTCH-961v2.patch Tika 0.8 comes with the Boilerpipe content handler which can be used to extract boilerplate content from HTML pages. We should see how we can expose Boilerpipe in the Nutch configuration. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1222) Upgrade to new Hadoop 0.22.0
[ https://issues.apache.org/jira/browse/NUTCH-1222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13176166#comment-13176166 ] Markus Jelsma commented on NUTCH-1222: -- Reverted per rev. 1224906. Upgrade to new Hadoop 0.22.0 Key: NUTCH-1222 URL: https://issues.apache.org/jira/browse/NUTCH-1222 Project: Nutch Issue Type: Improvement Reporter: Markus Jelsma Assignee: Markus Jelsma Priority: Critical Fix For: 1.5 Attachments: NUTCH-1222-1.5-1.patch -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1235) Upgrade to new Hadoop 0.20.205.0
[ https://issues.apache.org/jira/browse/NUTCH-1235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13176181#comment-13176181 ] Markus Jelsma commented on NUTCH-1235: -- Forgot to add the Jackson ASL mapper as a dependency. The new Hadoop needs Avro, and Avro needs Jackson. For some reason it is not specified as a dependency in 0.20.205.0. Committed in rev. 1224912. Upgrade to new Hadoop 0.20.205.0 Key: NUTCH-1235 URL: https://issues.apache.org/jira/browse/NUTCH-1235 Project: Nutch Issue Type: Task Reporter: Markus Jelsma Assignee: Markus Jelsma Priority: Minor Fix For: 1.5 -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1230) MimeType utils broken with Tika 1.1
[ https://issues.apache.org/jira/browse/NUTCH-1230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13174021#comment-13174021 ] Markus Jelsma commented on NUTCH-1230: -- Seems the method became deprecated in 1.0 and is inaccessible in my 1.1-SNAPSHOT. I'll try an upgrade to 1.0. MimeType utils broken with Tika 1.1 --- Key: NUTCH-1230 URL: https://issues.apache.org/jira/browse/NUTCH-1230 Project: Nutch Issue Type: Bug Reporter: Markus Jelsma Assignee: Markus Jelsma Fix For: 1.5 We used Tika 1.0-SNAPSHOT in production and just switched to 1.1-SNAPSHOT. The new version triggers the following error:
{code}
2011-12-21 12:29:56,665 ERROR http.Http - java.lang.IllegalAccessError: tried to access method org.apache.tika.mime.MimeTypes.getMimeType([B)Lorg/apache/tika/mime/MimeType; from class org.apache.nutch.util.MimeUtil
2011-12-21 12:29:56,665 ERROR http.Http - at org.apache.nutch.util.MimeUtil.autoResolveContentType(MimeUtil.java:169)
2011-12-21 12:29:56,665 ERROR http.Http - at org.apache.nutch.protocol.Content.getContentType(Content.java:292)
2011-12-21 12:29:56,666 ERROR http.Http - at org.apache.nutch.protocol.Content.init(Content.java:88)
2011-12-21 12:29:56,666 ERROR http.Http - at org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:142)
2011-12-21 12:29:56,666 ERROR http.Http - at org.apache.nutch.parse.ParserChecker.run(ParserChecker.java:82)
2011-12-21 12:29:56,666 ERROR http.Http - at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:69)
2011-12-21 12:29:56,666 ERROR http.Http - at org.apache.nutch.parse.ParserChecker.main(ParserChecker.java:138)
{code}
-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1230) MimeType API deprecated and breaks with Tika 1.0
[ https://issues.apache.org/jira/browse/NUTCH-1230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13174071#comment-13174071 ] Markus Jelsma commented on NUTCH-1230: -- Actually, Tika now returns the octet-stream type for that data. Please advise! MimeType API deprecated and breaks with Tika 1.0 Key: NUTCH-1230 URL: https://issues.apache.org/jira/browse/NUTCH-1230 Project: Nutch Issue Type: Bug Reporter: Markus Jelsma Assignee: Markus Jelsma Priority: Blocker Fix For: 1.5 Attachments: NUTCH-1230-1.5-2.patch We used Tika 1.0-SNAPSHOT in production and just switched to 1.1-SNAPSHOT. The new version triggers the following error:
{code}
2011-12-21 12:29:56,665 ERROR http.Http - java.lang.IllegalAccessError: tried to access method org.apache.tika.mime.MimeTypes.getMimeType([B)Lorg/apache/tika/mime/MimeType; from class org.apache.nutch.util.MimeUtil
2011-12-21 12:29:56,665 ERROR http.Http - at org.apache.nutch.util.MimeUtil.autoResolveContentType(MimeUtil.java:169)
2011-12-21 12:29:56,665 ERROR http.Http - at org.apache.nutch.protocol.Content.getContentType(Content.java:292)
2011-12-21 12:29:56,666 ERROR http.Http - at org.apache.nutch.protocol.Content.<init>(Content.java:88)
2011-12-21 12:29:56,666 ERROR http.Http - at org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:142)
2011-12-21 12:29:56,666 ERROR http.Http - at org.apache.nutch.parse.ParserChecker.run(ParserChecker.java:82)
2011-12-21 12:29:56,666 ERROR http.Http - at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:69)
2011-12-21 12:29:56,666 ERROR http.Http - at org.apache.nutch.parse.ParserChecker.main(ParserChecker.java:138)
{code}
-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
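For context, a minimal sketch of the facade route that replaces the no-longer-accessible MimeTypes.getMimeType(byte[]); the wrapper class is illustrative only, and, as noted above, raw data without a resource-name hint may come back as application/octet-stream:
{code}
import org.apache.tika.Tika;

// Illustrative wrapper: detect a MIME type from raw bytes via the Tika facade.
// Unrecognized data falls back to application/octet-stream.
public class DetectSketch {
  public static String detect(byte[] data) {
    return new Tika().detect(data);
  }
}
{code}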
[jira] [Commented] (NUTCH-1225) Migrate CrawlDBScanner to MapReduce API
[ https://issues.apache.org/jira/browse/NUTCH-1225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13172209#comment-13172209 ] Markus Jelsma commented on NUTCH-1225: -- I'll commit shortly if there are no objections Migrate CrawlDBScanner to MapReduce API --- Key: NUTCH-1225 URL: https://issues.apache.org/jira/browse/NUTCH-1225 Project: Nutch Issue Type: Sub-task Reporter: Markus Jelsma Assignee: Markus Jelsma Fix For: 1.5 Attachments: NUTCH-1225-1.5-1.patch, NUTCH-1225-1.5-2.patch -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1222) Upgrade to newer Hadoop versions
[ https://issues.apache.org/jira/browse/NUTCH-1222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13172211#comment-13172211 ] Markus Jelsma commented on NUTCH-1222: -- If there are no objections I'll upgrade the Ivy deps to Hadoop 0.22.0. Upgrade to newer Hadoop versions Key: NUTCH-1222 URL: https://issues.apache.org/jira/browse/NUTCH-1222 Project: Nutch Issue Type: Improvement Reporter: Markus Jelsma Priority: Critical Fix For: 1.5 -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1184) Fetcher to parse and follow Nth degree outlinks
[ https://issues.apache.org/jira/browse/NUTCH-1184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13172350#comment-13172350 ] Markus Jelsma commented on NUTCH-1184: -- If there are no further objections I will commit this one tomorrow. Fetcher to parse and follow Nth degree outlinks --- Key: NUTCH-1184 URL: https://issues.apache.org/jira/browse/NUTCH-1184 Project: Nutch Issue Type: New Feature Components: fetcher Reporter: Markus Jelsma Assignee: Markus Jelsma Fix For: 1.5 Attachments: NUTCH-1184-1.5-1.patch, NUTCH-1184-1.5-2.patch, NUTCH-1184-1.5-3.patch, NUTCH-1184-1.5-4.patch, NUTCH-1184-1.5-5-ParseData.patch, NUTCH-1184-1.5-5.patch, NUTCH-1184-1.5-9-ParseOutputFormat.patch, NUTCH-1185-1.5-6.patch, NUTCH-1185-1.5-7.patch, NUTCH-1185-1.5-8.patch, NUTCH-1185-1.5-9.patch Fetcher improvements to parse and follow outlinks up to a specified depth. The number of outlinks to follow can be decreased by depth using a divisor. This patch introduces three new configuration directives:
{code}
<property>
  <name>fetcher.follow.outlinks.depth</name>
  <value>-1</value>
  <description>(EXPERT) When fetcher.parse is true and this value is greater than 0 the fetcher will extract outlinks and follow them until the desired depth is reached. A value of 1 means all generated pages are fetched and their first degree outlinks are fetched and parsed too. Be careful, this feature is in itself agnostic of the state of the CrawlDB and does not know about already fetched pages. A setting larger than 2 will most likely fetch home pages twice in the same fetch cycle. It is highly recommended to set db.ignore.external.links to true to restrict the outlink follower to URLs within the same domain. When disabled (false) the feature is likely to follow duplicates even when depth=1. A value of -1 or 0 disables this feature.</description>
</property>
<property>
  <name>fetcher.follow.outlinks.num.links</name>
  <value>4</value>
  <description>(EXPERT) The number of outlinks to follow when fetcher.follow.outlinks.depth is enabled. Be careful, this can multiply the total number of pages to fetch. This works with fetcher.follow.outlinks.depth.divisor; with default settings the number of followed outlinks at depth 1 is 8, not 4.</description>
</property>
<property>
  <name>fetcher.follow.outlinks.depth.divisor</name>
  <value>2</value>
  <description>(EXPERT) The divisor of fetcher.follow.outlinks.num.links per fetcher.follow.outlinks.depth. This decreases the number of outlinks to follow with increasing depth. The formula used is: outlinks = floor(divisor / depth * num.links). This prevents exponential growth of the fetch list.</description>
</property>
{code}
Please, do not use this unless you know what you're doing. This feature does not consider the state of the CrawlDB nor does it consider generator settings such as limiting the number of pages per (domain|host|ip) queue. It is not polite to use this feature with high settings as it can fetch many pages from the same domain including duplicates. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
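To make the divisor arithmetic concrete, a small self-contained sketch of the formula quoted in the description; the class is illustrative, not part of the Fetcher:
{code}
// outlinks = floor(divisor / depth * num.links)
public class OutlinkBudget {
  static int outlinksAt(int depth, int numLinks, int divisor) {
    return (int) Math.floor((double) divisor / depth * numLinks);
  }

  public static void main(String[] args) {
    // With the default settings (num.links=4, divisor=2):
    // depth 1 -> 8, depth 2 -> 4, depth 3 -> 2
    for (int depth = 1; depth <= 3; depth++) {
      System.out.println("depth " + depth + ": " + outlinksAt(depth, 4, 2));
    }
  }
}
{code}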
[jira] [Commented] (NUTCH-1225) Migrate CrawlDBScanner to MapReduce API
[ https://issues.apache.org/jira/browse/NUTCH-1225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13170313#comment-13170313 ] Markus Jelsma commented on NUTCH-1225: -- I removed the Hadoop deps from Ivy and manually added Hadoop 0.21 jars to the lib directory. Next, two other deps must be added to Ivy:
{code}
<!-- needed to compile webgraph -->
<dependency org="commons-cli" name="commons-cli" rev="20040117.00" conf="*->default" />
<!-- avro -->
<dependency org="org.apache.avro" name="avro" rev="1.6.1" conf="*->default" />
{code}
Migrate CrawlDBScanner to MapReduce API --- Key: NUTCH-1225 URL: https://issues.apache.org/jira/browse/NUTCH-1225 Project: Nutch Issue Type: Sub-task Reporter: Markus Jelsma Assignee: Markus Jelsma Fix For: 1.5 Attachments: NUTCH-1225-1.5-1.patch -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1047) Pluggable indexing backends
[ https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13163515#comment-13163515 ] Markus Jelsma commented on NUTCH-1047: -- Ah yes, it makes sense now! If you look at the patch for NUTCH-1139 you can see that the endpoint, Solr in this case, implements the delete method as called from NutchIndexAction. Another endpoint could simply ignore the delete and do nothing but write out WARC or Solr XML files. Pluggable indexing backends --- Key: NUTCH-1047 URL: https://issues.apache.org/jira/browse/NUTCH-1047 Project: Nutch Issue Type: New Feature Components: indexer Reporter: Julien Nioche Assignee: Julien Nioche Labels: indexing Fix For: 1.5 One possible feature would be to add a new endpoint for indexing-backends and make the indexing pluggable. At the moment we are hardwired to SOLR - which is OK - but as other resources like ElasticSearch are becoming more popular it would be better to handle this as plugins. Not sure about the name of the endpoint though: we already have indexing-plugins (which are about generating fields sent to the backends) and moreover the backends are not necessarily for indexing / searching but could be just an external storage, e.g. CouchDB. The term backend on its own would be confusing in 2.0 as this could be pertaining to the storage in GORA. 'indexing-backend' is the best name that came to my mind so far - please suggest better ones. We should come up with generic map/reduce jobs for indexing, deduplicating and cleaning and maybe add a Nutch extension point there so we can easily hook up indexing, cleaning and deduplicating for various backends. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
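A rough sketch of the endpoint idea discussed here; all names are hypothetical and this is not the actual NUTCH-1047/NUTCH-1139 API:
{code}
// Hypothetical contract: indexing backends receive writes and deletes.
public interface IndexBackend {
  void write(String url, String doc) throws Exception;
  void delete(String url) throws Exception;
}

// A Solr-like backend would forward the delete to the server; an archive-only
// backend (e.g. one writing WARC or Solr XML files) can simply do nothing.
class ArchivingBackend implements IndexBackend {
  public void write(String url, String doc) { /* append doc to the archive */ }
  public void delete(String url) { /* no-op: an archive ignores deletes */ }
}
{code}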
[jira] [Commented] (NUTCH-1047) Pluggable indexing backends
[ https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13162977#comment-13162977 ] Markus Jelsma commented on NUTCH-1047: -- Hi Julien, I'm not sure I get your point exactly, but if we don't generate WARC files we:
- don't have to think about the problem you state
- don't create an additional process between Nutch and a search engine
If you'd need WARC files, for some reason, I'd rather have an endpoint for them just like for ES and Solr instead of using WARC files as an intermediate format. Does your suggestion imply: segment+crawldb -> WARC files -> search engine? Pluggable indexing backends --- Key: NUTCH-1047 URL: https://issues.apache.org/jira/browse/NUTCH-1047 Project: Nutch Issue Type: New Feature Components: indexer Reporter: Julien Nioche Assignee: Julien Nioche Labels: indexing Fix For: 1.5 One possible feature would be to add a new endpoint for indexing-backends and make the indexing pluggable. At the moment we are hardwired to SOLR - which is OK - but as other resources like ElasticSearch are becoming more popular it would be better to handle this as plugins. Not sure about the name of the endpoint though: we already have indexing-plugins (which are about generating fields sent to the backends) and moreover the backends are not necessarily for indexing / searching but could be just an external storage, e.g. CouchDB. The term backend on its own would be confusing in 2.0 as this could be pertaining to the storage in GORA. 'indexing-backend' is the best name that came to my mind so far - please suggest better ones. We should come up with generic map/reduce jobs for indexing, deduplicating and cleaning and maybe add a Nutch extension point there so we can easily hook up indexing, cleaning and deduplicating for various backends. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1206) tika parser of nutch 1.3 is failing to process pdfs
[ https://issues.apache.org/jira/browse/NUTCH-1206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13161501#comment-13161501 ] Markus Jelsma commented on NUTCH-1206: --
{code}
fetching: https://issues.apache.org/jira/secure/attachment/12505323/direct.pdf
Can't fetch URL successfully
{code}
This is obviously not a parser problem, as the output tells you it's a fetcher problem. Also, can you fetch HTTPS URLs at all with the protocol plugin you use? tika parser of nutch 1.3 is failing to process pdfs -- Key: NUTCH-1206 URL: https://issues.apache.org/jira/browse/NUTCH-1206 Project: Nutch Issue Type: Bug Components: parser Affects Versions: 1.3 Environment: Solaris/Linux/Windows Reporter: dibyendu ghosh Assignee: Chris A. Mattmann Attachments: direct.pdf Please refer to this message: http://www.mail-archive.com/user%40nutch.apache.org/msg04315.html. The old parse-pdf parser seems to be able to parse old pdfs (checked with nutch 1.2) though it is not able to parse Acrobat 9.0 pdfs. Nutch 1.3 does not have the parse-pdf plugin and is not able to parse even older pdfs. My code (TestParse.java):
{code}
bash-2.00$ cat TestParse.java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.PrintStream;
import java.util.Iterator;
import java.util.Map;
import java.util.Map.Entry;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.nutch.metadata.Metadata;
import org.apache.nutch.parse.ParseResult;
import org.apache.nutch.parse.Parse;
import org.apache.nutch.parse.ParseStatus;
import org.apache.nutch.parse.ParseUtil;
import org.apache.nutch.parse.ParseData;
import org.apache.nutch.protocol.Content;
import org.apache.nutch.util.NutchConfiguration;

public class TestParse {
  private static Configuration conf = NutchConfiguration.create();

  public TestParse() { }

  public static void main(String[] args) {
    String filename = args[0];
    convert(filename);
  }

  public static String convert(String fileName) {
    String newName = "abc.html";
    try {
      System.out.println("Converting " + fileName + " to html.");
      if (convertToHtml(fileName, newName)) return newName;
    } catch (Exception e) {
      (new File(newName)).delete();
      System.out.println("General exception " + e.getMessage());
    }
    return null;
  }

  private static boolean convertToHtml(String fileName, String newName) throws Exception {
    // Read the file
    FileInputStream in = new FileInputStream(fileName);
    byte[] buf = new byte[in.available()];
    in.read(buf);
    in.close();
    // Parse the file
    Content content = new Content("file:" + fileName, "file:" + fileName, buf, "", new Metadata(), conf);
    ParseResult parseResult = new ParseUtil(conf).parse(content);
    parseResult.filter();
    if (parseResult.isEmpty()) {
      System.out.println("All parsing attempts failed");
      return false;
    }
    Iterator<Map.Entry<Text,Parse>> iterator = parseResult.iterator();
    if (iterator == null) {
      System.out.println("Cannot iterate over successful parse results");
      return false;
    }
    Parse parse = null;
    ParseData parseData = null;
    while (iterator.hasNext()) {
      parse = parseResult.get((Text)iterator.next().getKey());
      parseData = parse.getData();
      ParseStatus status = parseData.getStatus();
      // If Parse failed then bail
      if (!ParseStatus.STATUS_SUCCESS.equals(status)) {
        System.out.println("Could not parse " + fileName + ". " + status.getMessage());
        return false;
      }
    }
    // Start writing to newName
    FileOutputStream fout = new FileOutputStream(newName);
    PrintStream out = new PrintStream(fout, true, "UTF-8");
    // Start Document
    out.println("<html>");
    // Start Header
    out.println("<head>");
    // Write Title
    String title = parseData.getTitle();
    if (title != null && title.trim().length() > 0) {
      out.println("<title>" + parseData.getTitle() + "</title>");
    }
    // Write out Meta tags
    Metadata metaData = parseData.getContentMeta();
    String[] names = metaData.names();
    for (String name : names) {
      String[] subvalues = metaData.getValues(name);
      String values = null;
      for (String subvalue :
{code}
[jira] [Commented] (NUTCH-1206) tika parser of nutch 1.3 is failing to process pdfs
[ https://issues.apache.org/jira/browse/NUTCH-1206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13161521#comment-13161521 ] Markus Jelsma commented on NUTCH-1206: -- I see. Check your logs for something peculiar. I can fetch and parse this file with Nutch 1.4 with protocol-httpclient. tika parser of nutch 1.3 is failing to process pdfs -- Key: NUTCH-1206 URL: https://issues.apache.org/jira/browse/NUTCH-1206 Project: Nutch Issue Type: Bug Components: parser Affects Versions: 1.3 Environment: Solaris/Linux/Windows Reporter: dibyendu ghosh Assignee: Chris A. Mattmann Attachments: direct.pdf Please refer to this message: http://www.mail-archive.com/user%40nutch.apache.org/msg04315.html. The old parse-pdf parser seems to be able to parse old pdfs (checked with nutch 1.2) though it is not able to parse Acrobat 9.0 pdfs. Nutch 1.3 does not have the parse-pdf plugin and is not able to parse even older pdfs. My code (TestParse.java):
{code}
bash-2.00$ cat TestParse.java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.PrintStream;
import java.util.Iterator;
import java.util.Map;
import java.util.Map.Entry;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.nutch.metadata.Metadata;
import org.apache.nutch.parse.ParseResult;
import org.apache.nutch.parse.Parse;
import org.apache.nutch.parse.ParseStatus;
import org.apache.nutch.parse.ParseUtil;
import org.apache.nutch.parse.ParseData;
import org.apache.nutch.protocol.Content;
import org.apache.nutch.util.NutchConfiguration;

public class TestParse {
  private static Configuration conf = NutchConfiguration.create();

  public TestParse() { }

  public static void main(String[] args) {
    String filename = args[0];
    convert(filename);
  }

  public static String convert(String fileName) {
    String newName = "abc.html";
    try {
      System.out.println("Converting " + fileName + " to html.");
      if (convertToHtml(fileName, newName)) return newName;
    } catch (Exception e) {
      (new File(newName)).delete();
      System.out.println("General exception " + e.getMessage());
    }
    return null;
  }

  private static boolean convertToHtml(String fileName, String newName) throws Exception {
    // Read the file
    FileInputStream in = new FileInputStream(fileName);
    byte[] buf = new byte[in.available()];
    in.read(buf);
    in.close();
    // Parse the file
    Content content = new Content("file:" + fileName, "file:" + fileName, buf, "", new Metadata(), conf);
    ParseResult parseResult = new ParseUtil(conf).parse(content);
    parseResult.filter();
    if (parseResult.isEmpty()) {
      System.out.println("All parsing attempts failed");
      return false;
    }
    Iterator<Map.Entry<Text,Parse>> iterator = parseResult.iterator();
    if (iterator == null) {
      System.out.println("Cannot iterate over successful parse results");
      return false;
    }
    Parse parse = null;
    ParseData parseData = null;
    while (iterator.hasNext()) {
      parse = parseResult.get((Text)iterator.next().getKey());
      parseData = parse.getData();
      ParseStatus status = parseData.getStatus();
      // If Parse failed then bail
      if (!ParseStatus.STATUS_SUCCESS.equals(status)) {
        System.out.println("Could not parse " + fileName + ". " + status.getMessage());
        return false;
      }
    }
    // Start writing to newName
    FileOutputStream fout = new FileOutputStream(newName);
    PrintStream out = new PrintStream(fout, true, "UTF-8");
    // Start Document
    out.println("<html>");
    // Start Header
    out.println("<head>");
    // Write Title
    String title = parseData.getTitle();
    if (title != null && title.trim().length() > 0) {
      out.println("<title>" + parseData.getTitle() + "</title>");
    }
    // Write out Meta tags
    Metadata metaData = parseData.getContentMeta();
    String[] names = metaData.names();
    for (String name : names) {
      String[] subvalues = metaData.getValues(name);
      String values = null;
      for (String subvalue : subvalues) {
        values += subvalue;
      }
      if (values.length() > 0)
        out.printf("<meta
{code}
[jira] [Commented] (NUTCH-1206) tika parser of nutch 1.3 is failing to process pdfs
[ https://issues.apache.org/jira/browse/NUTCH-1206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13154180#comment-13154180 ] Markus Jelsma commented on NUTCH-1206: -- Have you tried the Nutch trunk or the most recent Tika as suggested? tika parser of nutch 1.3 is failing to process pdfs -- Key: NUTCH-1206 URL: https://issues.apache.org/jira/browse/NUTCH-1206 Project: Nutch Issue Type: Bug Components: parser Affects Versions: 1.3 Environment: Solaris/Linux/Windows Reporter: dibyendu ghosh Please refer to this message: http://www.mail-archive.com/user%40nutch.apache.org/msg04315.html. The old parse-pdf parser seems to be able to parse old pdfs (checked with nutch 1.2) though it is not able to parse Acrobat 9.0 pdfs. Nutch 1.3 does not have the parse-pdf plugin and is not able to parse even older pdfs. My code (TestParse.java):
{code}
bash-2.00$ cat TestParse.java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.PrintStream;
import java.util.Iterator;
import java.util.Map;
import java.util.Map.Entry;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.nutch.metadata.Metadata;
import org.apache.nutch.parse.ParseResult;
import org.apache.nutch.parse.Parse;
import org.apache.nutch.parse.ParseStatus;
import org.apache.nutch.parse.ParseUtil;
import org.apache.nutch.parse.ParseData;
import org.apache.nutch.protocol.Content;
import org.apache.nutch.util.NutchConfiguration;

public class TestParse {
  private static Configuration conf = NutchConfiguration.create();

  public TestParse() { }

  public static void main(String[] args) {
    String filename = args[0];
    convert(filename);
  }

  public static String convert(String fileName) {
    String newName = "abc.html";
    try {
      System.out.println("Converting " + fileName + " to html.");
      if (convertToHtml(fileName, newName)) return newName;
    } catch (Exception e) {
      (new File(newName)).delete();
      System.out.println("General exception " + e.getMessage());
    }
    return null;
  }

  private static boolean convertToHtml(String fileName, String newName) throws Exception {
    // Read the file
    FileInputStream in = new FileInputStream(fileName);
    byte[] buf = new byte[in.available()];
    in.read(buf);
    in.close();
    // Parse the file
    Content content = new Content("file:" + fileName, "file:" + fileName, buf, "", new Metadata(), conf);
    ParseResult parseResult = new ParseUtil(conf).parse(content);
    parseResult.filter();
    if (parseResult.isEmpty()) {
      System.out.println("All parsing attempts failed");
      return false;
    }
    Iterator<Map.Entry<Text,Parse>> iterator = parseResult.iterator();
    if (iterator == null) {
      System.out.println("Cannot iterate over successful parse results");
      return false;
    }
    Parse parse = null;
    ParseData parseData = null;
    while (iterator.hasNext()) {
      parse = parseResult.get((Text)iterator.next().getKey());
      parseData = parse.getData();
      ParseStatus status = parseData.getStatus();
      // If Parse failed then bail
      if (!ParseStatus.STATUS_SUCCESS.equals(status)) {
        System.out.println("Could not parse " + fileName + ". " + status.getMessage());
        return false;
      }
    }
    // Start writing to newName
    FileOutputStream fout = new FileOutputStream(newName);
    PrintStream out = new PrintStream(fout, true, "UTF-8");
    // Start Document
    out.println("<html>");
    // Start Header
    out.println("<head>");
    // Write Title
    String title = parseData.getTitle();
    if (title != null && title.trim().length() > 0) {
      out.println("<title>" + parseData.getTitle() + "</title>");
    }
    // Write out Meta tags
    Metadata metaData = parseData.getContentMeta();
    String[] names = metaData.names();
    for (String name : names) {
      String[] subvalues = metaData.getValues(name);
      String values = null;
      for (String subvalue : subvalues) {
        values += subvalue;
      }
      if (values.length() > 0)
        out.printf("<meta name=\"%s\" content=\"%s\"/>\n", name, values);
    }
    out.println("<meta http-equiv=\"Content-Type\"
{code}
[jira] [Commented] (NUTCH-1184) Fetcher to parse and follow Nth degree outlinks
[ https://issues.apache.org/jira/browse/NUTCH-1184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13150402#comment-13150402 ] Markus Jelsma commented on NUTCH-1184: -- Any comments? Objections? I'd like to push this in and mark the new config directives as expert. Fetcher to parse and follow Nth degree outlinks --- Key: NUTCH-1184 URL: https://issues.apache.org/jira/browse/NUTCH-1184 Project: Nutch Issue Type: New Feature Components: fetcher Reporter: Markus Jelsma Assignee: Markus Jelsma Fix For: 1.5 Attachments: NUTCH-1184-1.5-1.patch, NUTCH-1184-1.5-2.patch, NUTCH-1184-1.5-3.patch, NUTCH-1184-1.5-4.patch, NUTCH-1184-1.5-5-ParseData.patch, NUTCH-1184-1.5-5.patch Improvements to fetcher to follow Nth degree outlinks of fetched items:
- fetch
- parse
- normalize and filter outlinks
- create new FetchItem and inject in the queue
-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1202) Fetcher timebomb kills long waiting fetch jobs
[ https://issues.apache.org/jira/browse/NUTCH-1202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13149708#comment-13149708 ] Markus Jelsma commented on NUTCH-1202: -- I haven't looked at implementation details and cannot offer a suggestion (yet). Doing it in configure()/setup() is not pretty for the reason in the comments. On the other hand, the current implementation does not allow one to submit several jobs at once without risking a lot of records being hit by the time limit. Fetcher timebomb kills long waiting fetch jobs -- Key: NUTCH-1202 URL: https://issues.apache.org/jira/browse/NUTCH-1202 Project: Nutch Issue Type: Bug Components: fetcher Reporter: Markus Jelsma Fix For: 1.5 The timebomb feature kills off mappers of jobs that have been waiting too long in the job queue. The timebomb feature should start at mapper initialization instead, not in job init. Thoughts? -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
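A minimal sketch of the alternative under discussion: derive the deadline when the mapper initializes rather than at job init, so time spent waiting in the job queue no longer counts. Only the fetcher.timelimit.mins directive is Nutch's own; the class is illustrative:
{code}
import org.apache.hadoop.conf.Configuration;

// Illustrative: compute the timebomb deadline at mapper initialization time.
public class TimebombSketch {
  static long deadline(Configuration conf) {
    long mins = conf.getLong("fetcher.timelimit.mins", -1);
    return mins > 0 ? System.currentTimeMillis() + mins * 60L * 1000L : -1L;
  }
}
{code}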
[jira] [Commented] (NUTCH-1180) UpdateDB to backup previous CrawlDB
[ https://issues.apache.org/jira/browse/NUTCH-1180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13147641#comment-13147641 ] Markus Jelsma commented on NUTCH-1180: -- I'll send this in if there are no objections. UpdateDB to backup previous CrawlDB --- Key: NUTCH-1180 URL: https://issues.apache.org/jira/browse/NUTCH-1180 Project: Nutch Issue Type: Improvement Reporter: Markus Jelsma Assignee: Markus Jelsma Priority: Minor Fix For: 1.5 Attachments: NUTCH-1180-1.5.-1.patch Nutch currently replaces an existing CrawlDB with the new CrawlDB. By optionally keeping a previous version on HDFS users can easily revert in case of a mistake without relying on external backup mechanisms. This should be enabled by default. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1178) Incorrect CSV header CrawlDatumCsvOutputFormat
[ https://issues.apache.org/jira/browse/NUTCH-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13147642#comment-13147642 ] Markus Jelsma commented on NUTCH-1178: -- Objections? Incorrect CSV header CrawlDatumCsvOutputFormat -- Key: NUTCH-1178 URL: https://issues.apache.org/jira/browse/NUTCH-1178 Project: Nutch Issue Type: Bug Affects Versions: 1.3 Reporter: Markus Jelsma Assignee: Markus Jelsma Priority: Trivial Fix For: 1.5 Attachments: NUTCH-1178-1.5-1.patch, NUTCH-1178-1.5-2.patch The CSV header doesn't mention both retry interval fields (seconds + days). We should either add another field to the header or get rid of one retry interval field. I prefer the former as people may already rely on the current format. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1142) Normalization and filtering in WebGraph
[ https://issues.apache.org/jira/browse/NUTCH-1142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13147644#comment-13147644 ] Markus Jelsma commented on NUTCH-1142: -- I'll send this in today. Normalization and filtering in WebGraph --- Key: NUTCH-1142 URL: https://issues.apache.org/jira/browse/NUTCH-1142 Project: Nutch Issue Type: Improvement Reporter: Markus Jelsma Assignee: Markus Jelsma Fix For: 1.5 Attachments: NUTCH-1142-1.4.patch, NUTCH-1142-1.5-2.patch, NUTCH-1142-1.5-3.patch The WebGraph program performs URL normalization. Since normalization of outlinks is already performed during the parse it should become optional. There is also no URL filtering mechanism in the web graph program. When a CrawlDatum is removed from the CrawlDB by a URL filter it should be possible to remove it from the web graph as well. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1174) Outlinks are not properly normalized
[ https://issues.apache.org/jira/browse/NUTCH-1174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13147654#comment-13147654 ] Markus Jelsma commented on NUTCH-1174: -- Will commit if there are no objections. Outlinks are not properly normalized Key: NUTCH-1174 URL: https://issues.apache.org/jira/browse/NUTCH-1174 Project: Nutch Issue Type: Bug Components: parser Affects Versions: 1.3 Reporter: Markus Jelsma Assignee: Markus Jelsma Fix For: 1.5 Attachments: NUTCH-1174-1.5-1.patch In ParseOutputFormat, the toUrl is read from Outlink and is processed. This String object is filtered, normalized, etc., but the original Outlink object is actually added: the normalized URL in toUrl is not written back to the Outlink object. This issue adds a setUrl method to Outlink which is used in ParseOutputFormat to overwrite the unnormalized URL. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
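A condensed sketch of the fix described above: normalize and filter the outlink URL, then write it back with the new setter. This is not the literal ParseOutputFormat code:
{code}
import org.apache.nutch.net.URLFilters;
import org.apache.nutch.net.URLNormalizers;
import org.apache.nutch.parse.Outlink;

// The returned Outlink now carries the normalized URL; null means dropped.
public class OutlinkNormalizeSketch {
  static Outlink process(Outlink outlink, URLNormalizers normalizers, URLFilters filters)
      throws Exception {
    String toUrl = outlink.getToUrl();
    toUrl = normalizers.normalize(toUrl, URLNormalizers.SCOPE_OUTLINK);
    if (toUrl != null) toUrl = filters.filter(toUrl);
    if (toUrl == null) return null;  // removed by a normalizer or filter
    outlink.setUrl(toUrl);           // the setter this issue adds
    return outlink;
  }
}
{code}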
[jira] [Commented] (NUTCH-1061) Migrate MoreIndexingFilter from Apache ORO to java.util.regex
[ https://issues.apache.org/jira/browse/NUTCH-1061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13147655#comment-13147655 ] Markus Jelsma commented on NUTCH-1061: -- Any comments on this one? Migrate MoreIndexingFilter from Apache ORO to java.util.regex - Key: NUTCH-1061 URL: https://issues.apache.org/jira/browse/NUTCH-1061 Project: Nutch Issue Type: Improvement Reporter: Markus Jelsma Assignee: Markus Jelsma Fix For: 1.5 Attachments: NUTCH-1061-1.4-1.patch Here's a patch migrating the resetTitle method from Apache ORO to java.util.regex. There was no unit test for this method so I added it. The test passes with the old Apache ORO impl. and with the new j.u.regex impl. Please comment. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1139) Indexer to delete documents
[ https://issues.apache.org/jira/browse/NUTCH-1139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13147656#comment-13147656 ] Markus Jelsma commented on NUTCH-1139: -- Comments please? Indexer to delete documents --- Key: NUTCH-1139 URL: https://issues.apache.org/jira/browse/NUTCH-1139 Project: Nutch Issue Type: Improvement Components: indexer Reporter: Markus Jelsma Assignee: Markus Jelsma Fix For: 1.5 Attachments: NUTCH-1139-1.4-1.patch Add an option -delete to the solrindex command. With this feature enabled documents of the currently processing segment with status FETCH_GONE or FETCH_REDIR_PERM are deleted, a following SolrClean is not required anymore. This issue is a follow up of NUTCH-1052. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1180) UpdateDB to backup previous CrawlDB
[ https://issues.apache.org/jira/browse/NUTCH-1180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13147715#comment-13147715 ] Markus Jelsma commented on NUTCH-1180: -- Config directive:
{code}
<property>
  <name>db.preserve.backup</name>
  <value>true</value>
  <description>If true, updatedb will keep a backup of the previous CrawlDB version in the old directory. In case of disaster, one can rename old to current and restore the CrawlDB to its previous state.</description>
</property>
{code}
Fine? Wrong? UpdateDB to backup previous CrawlDB --- Key: NUTCH-1180 URL: https://issues.apache.org/jira/browse/NUTCH-1180 Project: Nutch Issue Type: Improvement Reporter: Markus Jelsma Assignee: Markus Jelsma Priority: Minor Fix For: 1.5 Attachments: NUTCH-1180-1.5.-1.patch Nutch currently replaces an existing CrawlDB with the new CrawlDB. By optionally keeping a previous version on HDFS users can easily revert in case of a mistake without relying on external backup mechanisms. This should be enabled by default. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
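And the disaster-recovery step the description alludes to, as a hedged shell sketch; the paths are assumptions and depend on where your CrawlDB lives:
{code}
# move the broken CrawlDB aside, then promote the preserved backup
hadoop fs -mv crawl/crawldb/current crawl/crawldb/broken
hadoop fs -mv crawl/crawldb/old crawl/crawldb/current
{code}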
[jira] [Commented] (NUTCH-1139) Indexer to delete documents
[ https://issues.apache.org/jira/browse/NUTCH-1139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13147726#comment-13147726 ] Markus Jelsma commented on NUTCH-1139: -- Yes, but does that also cover the indexer deleting PERM_REDIR? If so, then agreed. Indexer to delete documents --- Key: NUTCH-1139 URL: https://issues.apache.org/jira/browse/NUTCH-1139 Project: Nutch Issue Type: Improvement Components: indexer Reporter: Markus Jelsma Assignee: Markus Jelsma Fix For: 1.5 Attachments: NUTCH-1139-1.4-1.patch Add an option -delete to the solrindex command. With this feature enabled documents of the currently processing segment with status FETCH_GONE or FETCH_REDIR_PERM are deleted, a following SolrClean is not required anymore. This issue is a follow up of NUTCH-1052. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1186) FreeGenerator always normalizes
[ https://issues.apache.org/jira/browse/NUTCH-1186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13147069#comment-13147069 ] Markus Jelsma commented on NUTCH-1186: -- This is actually not the FreeGenerator but the URLPartitioner class doing the partition-scope normalizing. I'm not sure what would be good behaviour. The common generator is also affected and uses the partitioner when turning fetch lists into segments. Without scope, this means ALL selected URLs are normalized at least once, and twice when the normalizing is actually in use. Thoughts? FreeGenerator always normalizes --- Key: NUTCH-1186 URL: https://issues.apache.org/jira/browse/NUTCH-1186 Project: Nutch Issue Type: Bug Components: generator Affects Versions: 1.3 Reporter: Markus Jelsma Assignee: Markus Jelsma Priority: Minor Fix For: 1.5 The FreeGenerator does not honor the -normalize option; it always normalizes all URLs in the input directory. The -filter option is respected. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1199) unfetched URLs problem
[ https://issues.apache.org/jira/browse/NUTCH-1199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13146144#comment-13146144 ] Markus Jelsma commented on NUTCH-1199: -- And what exactly is the problem definition? unfetched URLs problem -- Key: NUTCH-1199 URL: https://issues.apache.org/jira/browse/NUTCH-1199 Project: Nutch Issue Type: Improvement Components: fetcher, generator Reporter: behnam nikbakht Priority: Critical Labels: db_unfetched, fetch, freegen, generate, unfetched, updatedb We wrote a script to fetch unfetched URLs:
{code}
#first dump from readdb to a text file, and extract unfetched urls to a text file:
bin/nutch readdb $crawldb -dump $SITE_DIR/tmp/dump_urls.txt -format csv
cat $SITE_DIR/tmp/dump_urls.txt/part-0 | grep db_unfetched > $SITE_DIR/tmp/dump_unf
unfetched_urls_file=$SITE_DIR/tmp/unfetched_urls/unfetched_urls.txt
cat $SITE_DIR/tmp/dump_unf | awk -F '' '{print $2}' > $unfetched_urls_file
unfetched_count=`cat $unfetched_urls_file | wc -l`
#next, we have a list of unfetched urls in unfetched_urls.txt; then we use command freegen to create segments for
#these urls. we can not use command generate because these urls were generated previously
if [[ $unfetched_count -lt $it_size ]]
then
  echo UNFETCHED $J , $it_size URLs from $unfetched_count generated
  ((J++))
  bin/nutch freegen $SITE_DIR/tmp/unfetched_urls/unfetched_urls.txt $crawlseg
  s2=`ls -d $crawlseg/2* | tail -1`
  bin/nutch fetch $s2
  bin/nutch parse $s2
  bin/nutch updatedb $crawldb $s2
  echo bin/nutch updatedb $crawldb $s2 >> $SITE_DIR/updatedblog.txt
  get_new_links
  exit
fi
# if number of urls are greater than it_size, then package them
ij=1
while read line
do
  let "ind = $ij / $it_size"
  mkdir $SITE_DIR/tmp/unfetched_urls/unfetched_urls$ind/
  echo $line >> $SITE_DIR/tmp/unfetched_urls/unfetched_urls$ind/unfetched_urls$ind.txt
  echo $ind
  ((ij++))
  let "completed = $ij % $it_size"
  if [[ $completed -eq 0 ]]
  then
    echo UNFETCHED $J , $it_size URLs from $unfetched_count generated
    ((J++))
    bin/nutch freegen $SITE_DIR/tmp/unfetched_urls/unfetched_urls$ind/unfetched_urls$ind.txt $crawlseg
    #finally fetch,parse and update new segment
    s2=`ls -d $crawlseg/2* | tail -1`
    bin/nutch fetch $s2
    bin/nutch parse $s2
    rm $crawldb/.locked
    bin/nutch updatedb $crawldb $s2
    echo bin/nutch updatedb $crawldb $s2 >> $SITE_DIR/updatedblog.txt
  fi
done < $unfetched_urls_file
{code}
-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira