[jira] [Commented] (NUTCH-1325) HostDB for Nutch
[ https://issues.apache.org/jira/browse/NUTCH-1325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13922724#comment-13922724 ] Tejas Patil commented on NUTCH-1325: It will take me a few weeks before I can work on this one. The reason: I recently left school and started working at a company, and there is some legal paperwork I have to finish before I can work on open source projects (even if it's during my free time). > HostDB for Nutch > > > Key: NUTCH-1325 > URL: https://issues.apache.org/jira/browse/NUTCH-1325 > Project: Nutch > Issue Type: New Feature >Reporter: Markus Jelsma >Assignee: Tejas Patil > Fix For: 1.8 > > Attachments: NUTCH-1325-1.6-1.patch, NUTCH-1325-trunk-v3.patch, > NUTCH-1325-trunk-v4.patch, NUTCH-1325.trunk.v2.path > > > A HostDB for Nutch and associated tools to create and read a database > containing information on hosts. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (NUTCH-1721) Upgrade to Crawler commons 0.3
[ https://issues.apache.org/jira/browse/NUTCH-1721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tejas Patil resolved NUTCH-1721. Resolution: Fixed Committed to trunk (rev 1566255) and 2.x (rev 1566257) > Upgrade to Crawler commons 0.3 > -- > > Key: NUTCH-1721 > URL: https://issues.apache.org/jira/browse/NUTCH-1721 > Project: Nutch > Issue Type: Improvement >Affects Versions: 1.7, 2.2, 2.2.1 >Reporter: Tejas Patil >Assignee: Tejas Patil > Fix For: 2.3, 1.8 > > Attachments: NUTCH-1721-2.x.patch, NUTCH-1721-trunk.patch > > -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (NUTCH-1721) Upgrade to Crawler commons 0.3
[ https://issues.apache.org/jira/browse/NUTCH-1721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13887784#comment-13887784 ] Tejas Patil commented on NUTCH-1721: Attached patches, all test cases are passing. > Upgrade to Crawler commons 0.3 > -- > > Key: NUTCH-1721 > URL: https://issues.apache.org/jira/browse/NUTCH-1721 > Project: Nutch > Issue Type: Improvement >Affects Versions: 1.7, 2.2, 2.2.1 >Reporter: Tejas Patil >Assignee: Tejas Patil > Fix For: 2.3, 1.8 > > Attachments: NUTCH-1721-2.x.patch, NUTCH-1721-trunk.patch > > -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Updated] (NUTCH-1721) Upgrade to Crawler commons 0.3
[ https://issues.apache.org/jira/browse/NUTCH-1721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tejas Patil updated NUTCH-1721: --- Attachment: NUTCH-1721-2.x.patch NUTCH-1721-trunk.patch > Upgrade to Crawler commons 0.3 > -- > > Key: NUTCH-1721 > URL: https://issues.apache.org/jira/browse/NUTCH-1721 > Project: Nutch > Issue Type: Improvement >Affects Versions: 1.7, 2.2, 2.2.1 >Reporter: Tejas Patil >Assignee: Tejas Patil > Fix For: 2.3, 1.8 > > Attachments: NUTCH-1721-2.x.patch, NUTCH-1721-trunk.patch > > -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Created] (NUTCH-1721) Upgrade to Crawler commons 0.3
Tejas Patil created NUTCH-1721: -- Summary: Upgrade to Crawler commons 0.3 Key: NUTCH-1721 URL: https://issues.apache.org/jira/browse/NUTCH-1721 Project: Nutch Issue Type: Improvement Affects Versions: 2.2.1, 2.2, 1.7 Reporter: Tejas Patil Assignee: Tejas Patil Fix For: 2.3, 1.8 -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch
[ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13887763#comment-13887763 ] Tejas Patil commented on NUTCH-1465: Re "filters and normalizers": +1. Re "fetch intervals" and "reducer overwriting": I have never encountered bogus sitemaps, but that was for an intranet crawl, and it would be better to take care of that in this jira. Here is what I conclude from the discussion so far: (1) _fetch interval_: For old entries, don't use the value from the sitemap. For new ones, use the value from the sitemap provided that db.fetch.schedule.adaptive.min_interval <= interval <= db.fetch.interval.max. (2) _score_: Never use the value from the sitemap. For new entries, use scoring filters. Keep the value of old entries as it is. (3) _modified time_: Always use the value from the sitemap, provided it is not a date in the future. Did I get it right? Re "score": I missed that the jar is old. I will file a jira to upgrade CC to v0.3 in Nutch. > Support sitemaps in Nutch > - > > Key: NUTCH-1465 > URL: https://issues.apache.org/jira/browse/NUTCH-1465 > Project: Nutch > Issue Type: New Feature > Components: parser >Reporter: Lewis John McGibbney >Assignee: Tejas Patil > Fix For: 1.8 > > Attachments: NUTCH-1465-sitemapinjector-trunk-v1.patch, > NUTCH-1465-trunk.v1.patch, NUTCH-1465-trunk.v2.patch, > NUTCH-1465-trunk.v3.patch, NUTCH-1465-trunk.v4.patch, > NUTCH-1465-trunk.v5.patch > > > I recently came across this rather stagnant codebase[0] which is ASL v2.0 > licensed and appears to have been used successfully to parse sitemaps as per > the discussion here[1]. > [0] http://sourceforge.net/projects/sitemap-parser/ > [1] > http://lucene.472066.n3.nabble.com/Support-for-Sitemap-Protocol-and-Canonical-URLs-td630060.html -- This message was sent by Atlassian JIRA (v6.1.5#6160)
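The three rules summarised in the comment above (fetch interval, score, modified time) can be sketched in plain Java. This is only an illustration of the rules as stated, not the code in the attached patches: the `Entry` class, the `merge` method and all parameter names are hypothetical, and the interval limits are passed in as arguments rather than read from the Nutch configuration.

```java
public final class SitemapMergeSketch {

    /** Hypothetical stand-in for a CrawlDb entry; this is not Nutch's CrawlDatum. */
    static final class Entry {
        final long fetchIntervalSec;
        final float score;
        final long modifiedTimeMs;

        Entry(long fetchIntervalSec, float score, long modifiedTimeMs) {
            this.fetchIntervalSec = fetchIntervalSec;
            this.score = score;
            this.modifiedTimeMs = modifiedTimeMs;
        }
    }

    /**
     * Merges sitemap values into an existing entry (old != null) or builds a new one.
     * sitemapIntervalSec and sitemapModifiedMs are null when the sitemap gave no value.
     */
    static Entry merge(Entry old, Long sitemapIntervalSec, Long sitemapModifiedMs,
                       float scoreFromFilters, long minIntervalSec, long maxIntervalSec,
                       long defaultIntervalSec, long nowMs) {
        // (1) fetch interval: old entries keep theirs; new entries take the
        // sitemap value only when it lies inside [min, max].
        long interval;
        if (old != null) {
            interval = old.fetchIntervalSec;
        } else if (sitemapIntervalSec != null
                && sitemapIntervalSec >= minIntervalSec
                && sitemapIntervalSec <= maxIntervalSec) {
            interval = sitemapIntervalSec;
        } else {
            interval = defaultIntervalSec;
        }

        // (2) score: never taken from the sitemap; new entries get the value
        // computed elsewhere (here simply passed in as scoreFromFilters).
        float score = (old != null) ? old.score : scoreFromFilters;

        // (3) modified time: take the sitemap value unless it lies in the future.
        long modified = (old != null) ? old.modifiedTimeMs : 0L;
        if (sitemapModifiedMs != null && sitemapModifiedMs <= nowMs) {
            modified = sitemapModifiedMs;
        }
        return new Entry(interval, score, modified);
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        // Example limits in seconds; the numbers are illustrative only.
        Entry fresh = merge(null, 3600L, now - 1000L, 1.0f, 60L, 7776000L, 2592000L, now);
        System.out.println(fresh.fetchIntervalSec + " " + fresh.score + " " + fresh.modifiedTimeMs);
    }
}
```

Passing the limits explicitly keeps the sketch independent of Hadoop and Nutch classes; real code would presumably read them from the job configuration.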
[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch
[ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13886677#comment-13886677 ] Tejas Patil commented on NUTCH-1465: Interesting comments [~wastl-nagel]. Re "filters and normalizers" : By default I have kept those ON, but they can be disabled by using "-noFilter" and "-noNormalize". Re "default content limits" and "fetch timeout": +1. Agree with you. Re "Processing sitemap indexes fails" : +1. Nice catch. Re "Fetch intervals of 1 second or 1 hour may cause troubles" : Currently, Injector allows users to provide a custom fetch interval with any value, e.g. 1 sec. It makes sense not to correct it, as the user wants Nutch to use that custom fetch interval. If we view sitemaps as a custom seed list given by a content owner, then it would make sense to follow the intervals. But as you said, sitemaps can be wrongly set or outdated, so the intervals might be incorrect. The question boils down to: We are blindly accepting the user's custom information in inject. Should we blindly assume that sitemaps are correct or not? I have no strong opinion about either side of the argument. (PS : Default 'db.fetch.schedule.adaptive.min_interval' is 1 min so would allow 1 hr as per db.fetch.schedule.adaptive.min_interval <= interval) Re "SitemapReducer overwriting" : >> _"If a sitemap does not specify one of score, modified time, or fetch >> interval this values is set to zero. "_ Nope. See [SiteMapURL.java|https://code.google.com/p/crawler-commons/source/browse/trunk/src/main/java/crawlercommons/sitemaps/SiteMapURL.java] (a) score : Crawler commons assigns a default score of 0.5 if none was provided in the sitemap. We can do this: If an old entry has a score other than 0.5, it can be preserved, else update it. For a new entry, use scoring plugins if the score equals 0.5, else preserve it. Limitation: It's not possible to distinguish whether a score of 0.5 came from the sitemap or is the default assigned when none was present.
(b) fetch interval : Crawler commons does NOT set fetch interval if there was none provided in sitemap. So we are sure that whatever value is used is coming from . Validation might be needed as per comments above. (c) modified time : Same as fetch interval, unless parsed from sitemap file, modified time is set to NULL. Only possible validation is to drop values greater than current time. > Support sitemaps in Nutch > - > > Key: NUTCH-1465 > URL: https://issues.apache.org/jira/browse/NUTCH-1465 > Project: Nutch > Issue Type: New Feature > Components: parser >Reporter: Lewis John McGibbney >Assignee: Tejas Patil > Fix For: 1.8 > > Attachments: NUTCH-1465-sitemapinjector-trunk-v1.patch, > NUTCH-1465-trunk.v1.patch, NUTCH-1465-trunk.v2.patch, > NUTCH-1465-trunk.v3.patch, NUTCH-1465-trunk.v4.patch, > NUTCH-1465-trunk.v5.patch > > > I recently came across this rather stagnant codebase[0] which is ASL v2.0 > licensed and appears to have been used successfully to parse sitemaps as per > the discussion here[1]. > [0] http://sourceforge.net/projects/sitemap-parser/ > [1] > http://lucene.472066.n3.nabble.com/Support-for-Sitemap-Protocol-and-Canonical-URLs-td630060.html -- This message was sent by Atlassian JIRA (v6.1.5#6160)
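Option (a) in the comment above can be read as a small decision function. The sketch below is one interpretation of that wording, not the behaviour of any attached patch; the class name, method name and parameters are hypothetical, and the only fact taken from the comment is that crawler-commons reports 0.5 when the sitemap gives no score.

```java
public final class ScoreChoiceSketch {

    /** Default that, per the comment, crawler-commons reports when the sitemap has no score. */
    static final float DEFAULT_SITEMAP_SCORE = 0.5f;

    /**
     * oldScore is null for a URL that is not yet in the CrawlDb.
     * sitemapScore is the value reported for the sitemap entry.
     * pluginScore is what the scoring plugins would assign (computed elsewhere).
     */
    static float chooseScore(Float oldScore, float sitemapScore, float pluginScore) {
        if (oldScore != null) {
            // Old entry: keep a score that differs from 0.5, otherwise update it.
            return oldScore.floatValue() != DEFAULT_SITEMAP_SCORE
                    ? oldScore.floatValue()
                    : sitemapScore;
        }
        // New entry: 0.5 is treated as "no score given", so defer to the plugins.
        return sitemapScore == DEFAULT_SITEMAP_SCORE ? pluginScore : sitemapScore;
    }

    public static void main(String[] args) {
        System.out.println(chooseScore(null, 0.5f, 1.0f));
        System.out.println(chooseScore(null, 0.8f, 1.0f));
        System.out.println(chooseScore(2.0f, 0.8f, 1.0f));
    }
}
```

The limitation named in the comment shows up directly: a sitemap that explicitly sets 0.5 is indistinguishable from one that sets nothing, so both take the plugin path.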
[jira] [Commented] (NUTCH-1718) update description of property http.robots.agent
[ https://issues.apache.org/jira/browse/NUTCH-1718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13885650#comment-13885650 ] Tejas Patil commented on NUTCH-1718: Hi [~someuser77], Yup. I am waiting for folks to comment if that addition is fine. If it is, then I would go ahead and update the description of this jira. > update description of property http.robots.agent > > > Key: NUTCH-1718 > URL: https://issues.apache.org/jira/browse/NUTCH-1718 > Project: Nutch > Issue Type: Bug > Components: fetcher >Affects Versions: 1.7, 2.2, 2.2.1 >Reporter: Sebastian Nagel >Priority: Trivial > Fix For: 2.3, 1.8 > > Attachments: NUTCH-1718-trunk.v1.patch > > > The description of property http.robots.agent in nutch-default.xml recommends > to add a '*' to the list of agent names. This will cause the same problem as > described in NUTCH-1715. The description should be updated. Also regarding > "order of precedence" which is dictated since NUTCH-1031 only by ordering of > user agents in robots.txt. > {code:xml} > <property> > <name>http.robots.agents</name> > <value>*</value> > <description>The agent strings we'll look for in robots.txt files, > comma-separated, in decreasing order of precedence. You should > put the value of http.agent.name as the first agent name, and keep the > default * at the end of the list. E.g.: BlurflDev,Blurfl,* > </description> > </property> > {code} -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Updated] (NUTCH-1718) update description of property http.robots.agent
[ https://issues.apache.org/jira/browse/NUTCH-1718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tejas Patil updated NUTCH-1718: --- Attachment: NUTCH-1718-trunk.v1.patch Thanks [~wastl-nagel] for bringing this up. I should have updated the documentation with NUTCH-1715 but lost track of it. In addition to updating the documentation, I am proposing this: instead of requiring users to put 'http.agent.name' as the first agent in 'http.robots.agents', make the program do that automatically. Users would then use 'http.robots.agents' only to specify additional agents apart from 'http.agent.name'. Here is a patch for that. > update description of property http.robots.agent > > > Key: NUTCH-1718 > URL: https://issues.apache.org/jira/browse/NUTCH-1718 > Project: Nutch > Issue Type: Bug > Components: fetcher >Affects Versions: 1.7, 2.2, 2.2.1 >Reporter: Sebastian Nagel >Priority: Trivial > Fix For: 2.3, 1.8 > > Attachments: NUTCH-1718-trunk.v1.patch > > > The description of property http.robots.agent in nutch-default.xml recommends > to add a '*' to the list of agent names. This will cause the same problem as > described in NUTCH-1715. The description should be updated. Also regarding > "order of precedence" which is dictated since NUTCH-1031 only by ordering of > user agents in robots.txt. > {code:xml} > <property> > <name>http.robots.agents</name> > <value>*</value> > <description>The agent strings we'll look for in robots.txt files, > comma-separated, in decreasing order of precedence. You should > put the value of http.agent.name as the first agent name, and keep the > default * at the end of the list. E.g.: BlurflDev,Blurfl,* > </description> > </property> > {code} -- This message was sent by Atlassian JIRA (v6.1.5#6160)
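The proposal in this update (put 'http.agent.name' first automatically and treat 'http.robots.agents' as additional names) could look roughly like the following. This is a sketch under assumptions, not the contents of NUTCH-1718-trunk.v1.patch: the class and method names are hypothetical, the two property values are passed in as plain strings, and dropping a literal '*' from the list is an assumption based on the problem described in NUTCH-1715.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public final class RobotAgentNames {

    /**
     * Builds the ordered list of agent names to match against robots.txt.
     * agentName is the value of http.agent.name; robotsAgents is the
     * comma-separated value of http.robots.agents (may be null or empty).
     */
    static List<String> effectiveAgents(String agentName, String robotsAgents) {
        // LinkedHashSet keeps insertion order and silently drops duplicates.
        Set<String> names = new LinkedHashSet<>();
        names.add(agentName.trim());
        if (robotsAgents != null) {
            for (String part : robotsAgents.split(",")) {
                String name = part.trim();
                // Assumption: skip blanks and the bare wildcard.
                if (!name.isEmpty() && !name.equals("*")) {
                    names.add(name);
                }
            }
        }
        return new ArrayList<>(names);
    }

    public static void main(String[] args) {
        System.out.println(effectiveAgents("Blurfl", "BlurflDev,Blurfl,*"));
    }
}
```

With this shape, a user who lists the main agent name again in 'http.robots.agents' does no harm, since the duplicate is ignored and the main name stays first.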
[jira] [Updated] (NUTCH-1465) Support sitemaps in Nutch
[ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tejas Patil updated NUTCH-1465: --- Attachment: NUTCH-1465-trunk.v5.patch Adding new patch 'v5' with below changes: 1. Added Apache license header as per review comment by [~wastl-nagel] 2. Added counters in log output as per review comment by [~wastl-nagel] 3. Implemented the change suggested by [~wastl-nagel] for 'isHost' and 'filterNormalize'. I could do more re-factoring and make it more clean. 4. Added a new parameter "-noStrict" to control the checking done by sitemap parser > Support sitemaps in Nutch > - > > Key: NUTCH-1465 > URL: https://issues.apache.org/jira/browse/NUTCH-1465 > Project: Nutch > Issue Type: New Feature > Components: parser >Reporter: Lewis John McGibbney >Assignee: Tejas Patil > Fix For: 1.8 > > Attachments: NUTCH-1465-sitemapinjector-trunk-v1.patch, > NUTCH-1465-trunk.v1.patch, NUTCH-1465-trunk.v2.patch, > NUTCH-1465-trunk.v3.patch, NUTCH-1465-trunk.v4.patch, > NUTCH-1465-trunk.v5.patch > > > I recently came across this rather stagnant codebase[0] which is ASL v2.0 > licensed and appears to have been used successfully to parse sitemaps as per > the discussion here[1]. > [0] http://sourceforge.net/projects/sitemap-parser/ > [1] > http://lucene.472066.n3.nabble.com/Support-for-Sitemap-Protocol-and-Canonical-URLs-td630060.html -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch
[ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13883204#comment-13883204 ] Tejas Patil commented on NUTCH-1465: Hi [~wastl-nagel], Thanks a lot for your comments. The first two were straightforward and I agree with those. Re "hacky way" : For hosts from the HostDb, we don't know which protocol they belong to. In the code I was checking if http:// is a match and, if that was a bad guess, trying https://. I didn't handle the ftp:// and file:/ schemes. By "hacky" I meant this approach of trial-and-error until a suitable match is found and we create a homepage url for the host. I have thought about your comment and will have a better (yet hacky) way in the coming patch. Re "concurrency": I had thought of this and had searched the internet for the internals of MultithreadedMapper. All I could find is that it has an internal thread pool and each input record is handed over to a thread in this pool to run map() over it. I wrote this code to check if thread safety was ensured in MultithreadedMapper:
{noformat}
private static class SitemapMapper extends Mapper {
  // Instance field written by foo(); a race would show up only if several threads shared this mapper.
  private String myurl = null;

  public void map(Text key, Writable value, Context context) throws IOException, InterruptedException {
    if (value instanceof Text) {
      String url = key.toString();
      // If another thread overwrote myurl in the meantime, foo() returns a different url.
      if (foo(url).compareTo(url) != 0) {
        LOG.warn("Race condition found !!!");
      }
    }
  }

  private String foo(String url) {
    myurl = url;
    // Delay threads with odd ids so that another thread gets a chance to overwrite myurl.
    if (Thread.currentThread().getId() % 2 == 1) {
      try {
        Thread.sleep(1);
      } catch (InterruptedException e) {
        LOG.warn(e.getMessage());
      }
    }
    return myurl;
  }
}
{noformat}
I ran it multiple times with threads set to 10, 100, 1000 and 2000 but never hit the race condition in the code. Is the code snippet above a good way to reveal any race condition in the code? It won't be a formal conclusion, more of an experimental one. How do I get a concrete conclusion whether MultithreadedMapper is thread safe or not?
> Support sitemaps in Nutch > - > > Key: NUTCH-1465 > URL: https://issues.apache.org/jira/browse/NUTCH-1465 > Project: Nutch > Issue Type: New Feature > Components: parser >Reporter: Lewis John McGibbney >Assignee: Tejas Patil > Fix For: 1.8 > > Attachments: NUTCH-1465-sitemapinjector-trunk-v1.patch, > NUTCH-1465-trunk.v1.patch, NUTCH-1465-trunk.v2.patch, > NUTCH-1465-trunk.v3.patch, NUTCH-1465-trunk.v4.patch > > > I recently came across this rather stagnant codebase[0] which is ASL v2.0 > licensed and appears to have been used successfully to parse sitemaps as per > the discussion here[1]. > [0] http://sourceforge.net/projects/sitemap-parser/ > [1] > http://lucene.472066.n3.nabble.com/Support-for-Sitemap-Protocol-and-Canonical-URLs-td630060.html -- This message was sent by Atlassian JIRA (v6.1.5#6160)
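On the question of whether the experiment above can reveal a race: a check like this only fires if several threads write to the same mapper object. To the best of our understanding (worth confirming against the Hadoop source for the version in use), MultithreadedMapper creates a separate instance of the wrapped mapper class for each of its threads, in which case an instance field such as `myurl` is never shared and the warning would not be expected to trigger. The plain-Java sketch below, with no Hadoop involved, contrasts the two situations; `Probe`, `countMismatches` and their parameters are hypothetical names introduced for illustration.

```java
import java.util.concurrent.atomic.AtomicInteger;

public final class RaceProbe {

    /** Mimics foo() from the comment: store the url in a field, maybe pause, read it back. */
    static final class Probe {
        private String myurl;

        String foo(String url, boolean pause) throws InterruptedException {
            myurl = url;
            if (pause) {
                Thread.sleep(1); // window in which another thread may overwrite myurl
            }
            return myurl;
        }
    }

    /**
     * Runs the probe from several threads and counts how often foo() returned
     * a url other than the one passed in. With shareInstance == false every
     * thread owns its Probe, so the count is always 0.
     */
    static int countMismatches(boolean shareInstance, int threads, int recordsPerThread)
            throws InterruptedException {
        final AtomicInteger mismatches = new AtomicInteger();
        final Probe shared = new Probe();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            final int id = t;
            final Probe probe = shareInstance ? shared : new Probe();
            workers[t] = new Thread(() -> {
                try {
                    for (int i = 0; i < recordsPerThread; i++) {
                        String url = "http://host" + id + "/page" + i;
                        // Odd-numbered workers pause, as in the original experiment.
                        if (!probe.foo(url, id % 2 == 1).equals(url)) {
                            mismatches.incrementAndGet();
                        }
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            w.join();
        }
        return mismatches.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("per-thread instances: " + countMismatches(false, 8, 50));
        System.out.println("shared instance:      " + countMismatches(true, 8, 50));
    }
}
```

The shared-instance count is usually well above zero but depends on scheduling, so it demonstrates a race without proving its absence elsewhere. An experiment of this kind can therefore show that a race exists, while a clean run is only weak evidence of thread safety.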
[jira] [Commented] (NUTCH-1084) ReadDB url throws exception
[ https://issues.apache.org/jira/browse/NUTCH-1084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13882771#comment-13882771 ] Tejas Patil commented on NUTCH-1084: The issue gets reproduced on current trunk. Attaching a test segment : https://issues.apache.org/jira/secure/attachment/12625275/20140126210858.tgz The workaround suggested by [~markus17] in comment above works correctly. > ReadDB url throws exception > --- > > Key: NUTCH-1084 > URL: https://issues.apache.org/jira/browse/NUTCH-1084 > Project: Nutch > Issue Type: Bug >Affects Versions: 1.3 >Reporter: Markus Jelsma >Assignee: Markus Jelsma > Fix For: 1.9 > > > Readdb -url suffers from two problems: > 1. it trips over the _SUCCESS file generated by newer Hadoop version > 2. throws can't find class: org.apache.nutch.protocol.ProtocolStatus (???) > The first problem can be remedied by not allowing the injector or updater to > write the _SUCCESS file. Until now that's the solution implemented for > similar issues. I've not been successful as to make the Hadoop readers simply > skip the file. > The second issue seems a bit strange and did not happen on a local check out. > I'm not yet sure whether this is a Hadoop issue or something being corrupt in > the CrawlDB. 
Here's the stack trace: > {code} > Exception in thread "main" java.io.IOException: can't find class: > org.apache.nutch.protocol.ProtocolStatus because > org.apache.nutch.protocol.ProtocolStatus > at > org.apache.hadoop.io.AbstractMapWritable.readFields(AbstractMapWritable.java:204) > at org.apache.hadoop.io.MapWritable.readFields(MapWritable.java:146) > at org.apache.nutch.crawl.CrawlDatum.readFields(CrawlDatum.java:278) > at > org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:1751) > at org.apache.hadoop.io.MapFile$Reader.get(MapFile.java:524) > at > org.apache.hadoop.mapred.MapFileOutputFormat.getEntry(MapFileOutputFormat.java:105) > at org.apache.nutch.crawl.CrawlDbReader.get(CrawlDbReader.java:383) > at > org.apache.nutch.crawl.CrawlDbReader.readUrl(CrawlDbReader.java:389) > at org.apache.nutch.crawl.CrawlDbReader.main(CrawlDbReader.java:514) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) > at java.lang.reflect.Method.invoke(Method.java:597) > at org.apache.hadoop.util.RunJar.main(RunJar.java:156) > {code} -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (NUTCH-1692) SegmentReader broken in distributed mode
[ https://issues.apache.org/jira/browse/NUTCH-1692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13882770#comment-13882770 ] Tejas Patil commented on NUTCH-1692: Hi [~markus17], I didn't know about NUTCH-1084 until now, and after going through it I totally agree that the exception I faced was due to that issue. With that workaround and the patch for this jira, the NPE issue seems fixed. +1 for commit. > SegmentReader broken in distributed mode > > > Key: NUTCH-1692 > URL: https://issues.apache.org/jira/browse/NUTCH-1692 > Project: Nutch > Issue Type: Bug >Affects Versions: 1.7 >Reporter: Markus Jelsma >Assignee: Markus Jelsma > Fix For: 1.8 > > Attachments: 20140126210858.tgz, NUTCH-1692-trunk.patch > > > SegmentReader -list option ignores the -no* options, causing the following > exception in distributed mode: > {code} > Exception in thread "main" java.lang.NullPointerException > at java.util.ComparableTimSort.sort(ComparableTimSort.java:146) > at java.util.Arrays.sort(Arrays.java:472) > at > org.apache.hadoop.mapred.SequenceFileOutputFormat.getReaders(SequenceFileOutputFormat.java:85) > at > org.apache.nutch.segment.SegmentReader.getStats(SegmentReader.java:463) > at org.apache.nutch.segment.SegmentReader.list(SegmentReader.java:441) > at org.apache.nutch.segment.SegmentReader.main(SegmentReader.java:587) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at org.apache.hadoop.util.RunJar.main(RunJar.java:160) > {code} -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Updated] (NUTCH-1465) Support sitemaps in Nutch
[ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tejas Patil updated NUTCH-1465: --- Attachment: NUTCH-1465-trunk.v4.patch Attaching v4 patch with the suggestions #1 and #2 from [~lewismc]. > Support sitemaps in Nutch > - > > Key: NUTCH-1465 > URL: https://issues.apache.org/jira/browse/NUTCH-1465 > Project: Nutch > Issue Type: New Feature > Components: parser >Reporter: Lewis John McGibbney >Assignee: Tejas Patil > Fix For: 1.8 > > Attachments: NUTCH-1465-sitemapinjector-trunk-v1.patch, > NUTCH-1465-trunk.v1.patch, NUTCH-1465-trunk.v2.patch, > NUTCH-1465-trunk.v3.patch, NUTCH-1465-trunk.v4.patch > > > I recently came across this rather stagnant codebase[0] which is ASL v2.0 > licensed and appears to have been used successfully to parse sitemaps as per > the discussion here[1]. > [0] http://sourceforge.net/projects/sitemap-parser/ > [1] > http://lucene.472066.n3.nabble.com/Support-for-Sitemap-Protocol-and-Canonical-URLs-td630060.html -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Updated] (NUTCH-1692) SegmentReader broken in distributed mode
[ https://issues.apache.org/jira/browse/NUTCH-1692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tejas Patil updated NUTCH-1692: --- Attachment: 20140126210858.tgz Attaching the test segment (20140126210858.tgz) > SegmentReader broken in distributed mode > > > Key: NUTCH-1692 > URL: https://issues.apache.org/jira/browse/NUTCH-1692 > Project: Nutch > Issue Type: Bug >Affects Versions: 1.7 >Reporter: Markus Jelsma >Assignee: Markus Jelsma > Fix For: 1.8 > > Attachments: 20140126210858.tgz, NUTCH-1692-trunk.patch > > > SegmentReader -list option ignores the -no* options, causing the following > exception in distributed mode: > {code} > Exception in thread "main" java.lang.NullPointerException > at java.util.ComparableTimSort.sort(ComparableTimSort.java:146) > at java.util.Arrays.sort(Arrays.java:472) > at > org.apache.hadoop.mapred.SequenceFileOutputFormat.getReaders(SequenceFileOutputFormat.java:85) > at > org.apache.nutch.segment.SegmentReader.getStats(SegmentReader.java:463) > at org.apache.nutch.segment.SegmentReader.list(SegmentReader.java:441) > at org.apache.nutch.segment.SegmentReader.main(SegmentReader.java:587) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at org.apache.hadoop.util.RunJar.main(RunJar.java:160) > {code} -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (NUTCH-1692) SegmentReader broken in distributed mode
[ https://issues.apache.org/jira/browse/NUTCH-1692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13882348#comment-13882348 ] Tejas Patil commented on NUTCH-1692: Hi [~markus17], I tried out the patch on the latest trunk checkout and it ran fine in local mode. In deploy mode, I encountered this: {noformat} $ bin/nutch readseg -list 20140126210858/ -nocontent -nogenerate 14/01/26 22:26:16 INFO util.NativeCodeLoader: Loaded the native-hadoop library 14/01/26 22:26:16 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library 14/01/26 22:26:16 INFO compress.CodecPool: Got brand-new decompressor 14/01/26 22:26:16 INFO compress.CodecPool: Got brand-new decompressor 14/01/26 22:26:16 INFO compress.CodecPool: Got brand-new decompressor 14/01/26 22:26:16 INFO compress.CodecPool: Got brand-new decompressor 14/01/26 22:26:16 INFO compress.CodecPool: Got brand-new decompressor Exception in thread "main" java.io.IOException: can't find class: org.apache.nutch.protocol.ProtocolStatus because org.apache.nutch.protocol.ProtocolStatus at org.apache.hadoop.io.AbstractMapWritable.readFields(AbstractMapWritable.java:204) at org.apache.hadoop.io.MapWritable.readFields(MapWritable.java:146) at org.apache.nutch.crawl.CrawlDatum.readFields(CrawlDatum.java:280) at org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:1813) at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1941) at org.apache.hadoop.io.MapFile$Reader.next(MapFile.java:517) at org.apache.nutch.segment.SegmentReader.getStats(SegmentReader.java:485) at org.apache.nutch.segment.SegmentReader.list(SegmentReader.java:441) at org.apache.nutch.segment.SegmentReader.main(SegmentReader.java:597) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at
java.lang.reflect.Method.invoke(Method.java:601) at org.apache.hadoop.util.RunJar.main(RunJar.java:156) {noformat} > SegmentReader broken in distributed mode > > > Key: NUTCH-1692 > URL: https://issues.apache.org/jira/browse/NUTCH-1692 > Project: Nutch > Issue Type: Bug >Affects Versions: 1.7 >Reporter: Markus Jelsma >Assignee: Markus Jelsma > Fix For: 1.8 > > Attachments: NUTCH-1692-trunk.patch > > > SegmentReader -list option ignores the -no* options, causing the following > exception in distributed mode: > {code} > Exception in thread "main" java.lang.NullPointerException > at java.util.ComparableTimSort.sort(ComparableTimSort.java:146) > at java.util.Arrays.sort(Arrays.java:472) > at > org.apache.hadoop.mapred.SequenceFileOutputFormat.getReaders(SequenceFileOutputFormat.java:85) > at > org.apache.nutch.segment.SegmentReader.getStats(SegmentReader.java:463) > at org.apache.nutch.segment.SegmentReader.list(SegmentReader.java:441) > at org.apache.nutch.segment.SegmentReader.main(SegmentReader.java:587) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at org.apache.hadoop.util.RunJar.main(RunJar.java:160) > {code} -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Resolved] (NUTCH-1715) RobotRulesParser adds additional '*' to the robots name
[ https://issues.apache.org/jira/browse/NUTCH-1715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tejas Patil resolved NUTCH-1715. Resolution: Fixed The change was verified over the nutch-user mailing list. Committed to trunk (revision 1561087) and 2.x (revision 1561088). > RobotRulesParser adds additional '*' to the robots name > --- > > Key: NUTCH-1715 > URL: https://issues.apache.org/jira/browse/NUTCH-1715 > Project: Nutch > Issue Type: Bug > Components: fetcher >Affects Versions: 1.7, 2.2.1 >Reporter: Tejas Patil >Assignee: Tejas Patil > Fix For: 2.3, 1.8 > > Attachments: NUTCH-1715.2.x.patch, NUTCH-1715.trunk.patch > > > In RobotRulesParser, when Nutch creates an agent string from multiple agents, > it combines agents from both 'http.agent.name' and 'http.robots.agents'. > Along with that it appends a wildcard (i.e. *) to the end. This is sent > to crawler commons while parsing the rules. The wildcard gets matched first > in the robots file with (User-agent: *) if that comes before any other matching > rule, thus resulting in an allowed url being robots denied. > This issue was reported by [~markus17]. The discussion over nutch-user is > here: > http://mail-archives.apache.org/mod_mbox/nutch-user/201401.mbox/%3CCAFKhtFzBRpVv4MULSxw8RDRR_wbivOt%3DnhFX-w621BR8q%2BxVDQ%40mail.gmail.com%3E -- This message was sent by Atlassian JIRA (v6.1.5#6160)
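The failure described in this issue can be illustrated with a deliberately simplified model of "first matching group wins". The sketch below is not crawler-commons' actual matching logic; `firstMatchingGroup` and the example agent name `mybot` are hypothetical, and robots.txt is reduced to an ordered list of User-agent group names.

```java
import java.util.Arrays;
import java.util.List;

public final class WildcardAgentSketch {

    /**
     * Walks the User-agent groups in robots.txt order and returns the first
     * group name equal (ignoring case) to any of the crawler's agent names,
     * or null when nothing matches.
     */
    static String firstMatchingGroup(List<String> robotsGroups, List<String> crawlerAgents) {
        for (String group : robotsGroups) {
            for (String agent : crawlerAgents) {
                if (group.equalsIgnoreCase(agent)) {
                    return group;
                }
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // robots.txt lists "User-agent: *" (say, Disallow: /) before "User-agent: mybot" (Allow).
        List<String> groups = Arrays.asList("*", "mybot");

        // With '*' appended to the agent names, the wildcard group is hit first.
        System.out.println(firstMatchingGroup(groups, Arrays.asList("mybot", "*")));

        // Without the appended '*', the crawler's own group is selected.
        System.out.println(firstMatchingGroup(groups, Arrays.asList("mybot")));
    }
}
```

In this model the appended wildcard makes the crawler claim the `User-agent: *` group even though a more specific group for it appears later, which is the denied-although-allowed outcome the issue describes.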
[jira] [Updated] (NUTCH-1715) RobotRulesParser adds additional '*' to the robots name
[ https://issues.apache.org/jira/browse/NUTCH-1715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tejas Patil updated NUTCH-1715: --- Attachment: NUTCH-1715.2.x.patch NUTCH-1715.trunk.patch > RobotRulesParser adds additional '*' to the robots name > --- > > Key: NUTCH-1715 > URL: https://issues.apache.org/jira/browse/NUTCH-1715 > Project: Nutch > Issue Type: Bug > Components: fetcher >Affects Versions: 1.7, 2.2.1 >Reporter: Tejas Patil >Assignee: Tejas Patil > Fix For: 2.3, 1.8 > > Attachments: NUTCH-1715.2.x.patch, NUTCH-1715.trunk.patch > > > In RobotRulesParser, when Nutch creates an agent string from multiple agents, > it combines agents from both 'http.agent.name' and 'http.robots.agents'. > Along with that it appends a wildcard (i.e. *) to the end. This is sent > to crawler commons while parsing the rules. The wildcard gets matched first > in the robots file with (User-agent: *) if that comes before any other matching > rule, thus resulting in an allowed url being robots denied. > This issue was reported by [~markus17]. The discussion over nutch-user is > here: > http://mail-archives.apache.org/mod_mbox/nutch-user/201401.mbox/%3CCAFKhtFzBRpVv4MULSxw8RDRR_wbivOt%3DnhFX-w621BR8q%2BxVDQ%40mail.gmail.com%3E -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Updated] (NUTCH-1715) RobotRulesParser adds additional '*' to the robots name
[ https://issues.apache.org/jira/browse/NUTCH-1715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tejas Patil updated NUTCH-1715: --- Description: In RobotRulesParser, when Nutch creates a agent string from multiple agents, it combines agents from both 'http.agent.name' and 'http.robots.agents'. Along with that it appends a wildcard (ie. *) to it in the end. This is sent to crawler commons while parsing the rules. The wildcard gets matched first in robots file with (User-agent: *) if that comes before any other matching rule thus resulting in a allowed url being robots denied. This issue was reported by [~markus17]. The discussion over nutch-user is here: http://mail-archives.apache.org/mod_mbox/nutch-user/201401.mbox/%3CCAFKhtFzBRpVv4MULSxw8RDRR_wbivOt%3DnhFX-w621BR8q%2BxVDQ%40mail.gmail.com%3E was: In RobotRulesParser, when Nutch creates a agent string from multiple agents, it combines agents from both 'http.agent.name' and 'http.robots.agents'. Along with that it appends a wildcard '*' to it in the end. This is sent to crawler commons while parsing the rules. The wildcard '*' added to the end gets matched with the first rule in robots file and thus results in the url being robots denied while the robots.txt actually allows them. This issue was reported by [~markus17]. The discussion over nutch-user is here: http://mail-archives.apache.org/mod_mbox/nutch-user/201401.mbox/%3CCAFKhtFzBRpVv4MULSxw8RDRR_wbivOt%3DnhFX-w621BR8q%2BxVDQ%40mail.gmail.com%3E > RobotRulesParser adds additional '*' to the robots name > --- > > Key: NUTCH-1715 > URL: https://issues.apache.org/jira/browse/NUTCH-1715 > Project: Nutch > Issue Type: Bug > Components: fetcher >Affects Versions: 1.7, 2.2.1 >Reporter: Tejas Patil >Assignee: Tejas Patil > Fix For: 2.3, 1.8 > > > In RobotRulesParser, when Nutch creates a agent string from multiple agents, > it combines agents from both 'http.agent.name' and 'http.robots.agents'. > Along with that it appends a wildcard (ie. 
*) to it in the end. This is sent > to crawler commons while parsing the rules. The wildcard gets matched first > in robots file with (User-agent: *) if that comes before any other matching > rule thus resulting in a allowed url being robots denied. > This issue was reported by [~markus17]. The discussion over nutch-user is > here: > http://mail-archives.apache.org/mod_mbox/nutch-user/201401.mbox/%3CCAFKhtFzBRpVv4MULSxw8RDRR_wbivOt%3DnhFX-w621BR8q%2BxVDQ%40mail.gmail.com%3E -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Resolved] (NUTCH-1716) RobotRulesParser adds extra '*' to the robots name
[ https://issues.apache.org/jira/browse/NUTCH-1716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tejas Patil resolved NUTCH-1716. Resolution: Duplicate Accidentally duplicated NUTCH-1715 > RobotRulesParser adds extra '*' to the robots name > -- > > Key: NUTCH-1716 > URL: https://issues.apache.org/jira/browse/NUTCH-1716 > Project: Nutch > Issue Type: Bug > Components: fetcher >Affects Versions: 1.7, 2.2.1 >Reporter: Tejas Patil >Assignee: Tejas Patil > Fix For: 2.3, 1.8 > > > In RobotRulesParser, when Nutch creates a agent string from multiple agents, > it combines agents from both 'http.agent.name' and 'http.robots.agents'. > Along with that it appends a wildcard (ie. *) to it in the end. This is sent > to crawler commons while parsing the rules. The wildcard gets matched first > in robots file with (User-agent: *) if that comes before any other matching > rule thus resulting in a allowed url being robots denied. > This bug was reported by @Markus Jelsma. The discussion over nutch-user can > be found here: > http://mail-archives.apache.org/mod_mbox/nutch-user/201401.mbox/%3CCAFKhtFzBRpVv4MULSxw8RDRR_wbivOt%3DnhFX-w621BR8q%2BxVDQ%40mail.gmail.com%3E > -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Updated] (NUTCH-1715) RobotRulesParser adds additional '*' to the robots name
[ https://issues.apache.org/jira/browse/NUTCH-1715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tejas Patil updated NUTCH-1715: --- Description: In RobotRulesParser, when Nutch creates a agent string from multiple agents, it combines agents from both 'http.agent.name' and 'http.robots.agents'. Along with that it appends a wildcard '*' to it in the end. This is sent to crawler commons while parsing the rules. The wildcard '*' added to the end gets matched with the first rule in robots file and thus results in the url being robots denied while the robots.txt actually allows them. This issue was reported by [~markus17]. The discussion over nutch-user is here: http://mail-archives.apache.org/mod_mbox/nutch-user/201401.mbox/%3CCAFKhtFzBRpVv4MULSxw8RDRR_wbivOt%3DnhFX-w621BR8q%2BxVDQ%40mail.gmail.com%3E was: In RobotRulesParser, when Nutch creates a agent string from multiple agents, it combines agents from both 'http.agent.name' and 'http.robots.agents'. Along with that it appends a wildcard (*) to it in the end. This is sent to crawler commons while parsing the rules. The wildcard (*) added to the end gets matched with the first rule in robots file and thus results in the url being robots denied while the robots.txt actually allows them. This issue was reported by [~markus17]. The discussion over nutch-user is here: http://mail-archives.apache.org/mod_mbox/nutch-user/201401.mbox/%3CCAFKhtFzBRpVv4MULSxw8RDRR_wbivOt%3DnhFX-w621BR8q%2BxVDQ%40mail.gmail.com%3E > RobotRulesParser adds additional '*' to the robots name > --- > > Key: NUTCH-1715 > URL: https://issues.apache.org/jira/browse/NUTCH-1715 > Project: Nutch > Issue Type: Bug > Components: fetcher >Affects Versions: 1.7, 2.2.1 >Reporter: Tejas Patil >Assignee: Tejas Patil > Fix For: 2.3, 1.8 > > > In RobotRulesParser, when Nutch creates a agent string from multiple agents, > it combines agents from both 'http.agent.name' and 'http.robots.agents'. 
> Along with that it appends a wildcard '*' to it in the end. This is sent to > crawler commons while parsing the rules. The wildcard '*' added to the end > gets matched with the first rule in robots file and thus results in the url > being robots denied while the robots.txt actually allows them. > This issue was reported by [~markus17]. The discussion over nutch-user is > here: > http://mail-archives.apache.org/mod_mbox/nutch-user/201401.mbox/%3CCAFKhtFzBRpVv4MULSxw8RDRR_wbivOt%3DnhFX-w621BR8q%2BxVDQ%40mail.gmail.com%3E -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Created] (NUTCH-1716) RobotRulesParser adds extra '*' to the robots name
Tejas Patil created NUTCH-1716: -- Summary: RobotRulesParser adds extra '*' to the robots name Key: NUTCH-1716 URL: https://issues.apache.org/jira/browse/NUTCH-1716 Project: Nutch Issue Type: Bug Components: fetcher Affects Versions: 2.2.1, 1.7 Reporter: Tejas Patil Assignee: Tejas Patil Fix For: 2.3, 1.8 In RobotRulesParser, when Nutch creates a agent string from multiple agents, it combines agents from both 'http.agent.name' and 'http.robots.agents'. Along with that it appends a wildcard (ie. *) to it in the end. This is sent to crawler commons while parsing the rules. The wildcard gets matched first in robots file with (User-agent: *) if that comes before any other matching rule thus resulting in a allowed url being robots denied. This bug was reported by @Markus Jelsma. The discussion over nutch-user can be found here: http://mail-archives.apache.org/mod_mbox/nutch-user/201401.mbox/%3CCAFKhtFzBRpVv4MULSxw8RDRR_wbivOt%3DnhFX-w621BR8q%2BxVDQ%40mail.gmail.com%3E -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Created] (NUTCH-1715) RobotRulesParser adds additional '*' to the robots name
Tejas Patil created NUTCH-1715: -- Summary: RobotRulesParser adds additional '*' to the robots name Key: NUTCH-1715 URL: https://issues.apache.org/jira/browse/NUTCH-1715 Project: Nutch Issue Type: Bug Components: fetcher Affects Versions: 2.2.1, 1.7 Reporter: Tejas Patil Assignee: Tejas Patil Fix For: 2.3, 1.8 In RobotRulesParser, when Nutch creates a agent string from multiple agents, it combines agents from both 'http.agent.name' and 'http.robots.agents'. Along with that it appends a wildcard (*) to it in the end. This is sent to crawler commons while parsing the rules. The wildcard (*) added to the end gets matched with the first rule in robots file and thus results in the url being robots denied while the robots.txt actually allows them. This issue was reported by [~markus17]. The discussion over nutch-user is here: http://mail-archives.apache.org/mod_mbox/nutch-user/201401.mbox/%3CCAFKhtFzBRpVv4MULSxw8RDRR_wbivOt%3DnhFX-w621BR8q%2BxVDQ%40mail.gmail.com%3E -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (NUTCH-1676) Add rudimentary SSL support to protocol-http
[ https://issues.apache.org/jira/browse/NUTCH-1676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13881143#comment-13881143 ] Tejas Patil commented on NUTCH-1676: Hi [~markus17], I tried out the patch with couple of https urls and it works correctly. Few comments on the patch: (1) In src/plugin/protocol-http/plugin.xml, the same stuff is repeated twice. Not sure if that was accidental or meant to be different {code:title=plugin.xml|borderStyle=solid} + + + + + + + {code} (2) In HttpBase.java: The values in this line go till column 2070 and might be painful while looking at the list. Is there any way to avoid it (maybe using a String array) ? {code:title=HttpBase.java|borderStyle=solid} conf.getStrings("http.tls.supported.cipher.suites", "TLS_ECDHE_ECDSA_WITH_AES_256_CBC {code} (3) The class description is empty after the deletion of author tag. Can you please fill that ? {code:title=HttpBase.java|borderStyle=solid} /** */ public abstract class HttpBase implements Protocol { {code} > Add rudimentary SSL support to protocol-http > > > Key: NUTCH-1676 > URL: https://issues.apache.org/jira/browse/NUTCH-1676 > Project: Nutch > Issue Type: Improvement > Components: protocol >Affects Versions: 1.7 >Reporter: Julien Nioche > Fix For: 1.8 > > Attachments: NUTCH-1676-2x.patch, NUTCH-1676.patch, NUTCH-1676.patch, > NUTCH-1676.patch, NUTCH-1676.patch > > > Adding https support to our http protocol would be a good thing even if it > does not handle the security. This would save us from having to use the > http-client plugin which is buggy in its current form. > Patch generated from > https://github.com/Aloisius/nutch/commit/d3e15a1db0eb323ccdcf5ad69a3d3a01ec65762c#commitcomment-4720772 > Needs testing... -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch
[ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13880295#comment-13880295 ] Tejas Patil commented on NUTCH-1465: Hi [~lewismc], +1 for the first two suggestions. For #3: I skimmed through the methods inside URLUtil.java and nothing came to my notice that I could use in the Sitemap code you pointed. Can you please confirm ? A big thanks mate for trying out the feature. Hopefully we get this into 1.8 release. Cheers !! > Support sitemaps in Nutch > - > > Key: NUTCH-1465 > URL: https://issues.apache.org/jira/browse/NUTCH-1465 > Project: Nutch > Issue Type: New Feature > Components: parser >Reporter: Lewis John McGibbney >Assignee: Tejas Patil > Fix For: 1.8 > > Attachments: NUTCH-1465-sitemapinjector-trunk-v1.patch, > NUTCH-1465-trunk.v1.patch, NUTCH-1465-trunk.v2.patch, > NUTCH-1465-trunk.v3.patch > > > I recently came across this rather stagnant codebase[0] which is ASL v2.0 > licensed and appears to have been used successfully to parse sitemaps as per > the discussion here[1]. > [0] http://sourceforge.net/projects/sitemap-parser/ > [1] > http://lucene.472066.n3.nabble.com/Support-for-Sitemap-Protocol-and-Canonical-URLs-td630060.html -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Updated] (NUTCH-1465) Support sitemaps in Nutch
[ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tejas Patil updated NUTCH-1465: --- Fix Version/s: (was: 1.9) 1.8 > Support sitemaps in Nutch > - > > Key: NUTCH-1465 > URL: https://issues.apache.org/jira/browse/NUTCH-1465 > Project: Nutch > Issue Type: New Feature > Components: parser >Reporter: Lewis John McGibbney >Assignee: Tejas Patil > Fix For: 1.8 > > Attachments: NUTCH-1465-sitemapinjector-trunk-v1.patch, > NUTCH-1465-trunk.v1.patch, NUTCH-1465-trunk.v2.patch, > NUTCH-1465-trunk.v3.patch > > > I recently came across this rather stagnant codebase[0] which is ASL v2.0 > licensed and appears to have been used successfully to parse sitemaps as per > the discussion here[1]. > [0] http://sourceforge.net/projects/sitemap-parser/ > [1] > http://lucene.472066.n3.nabble.com/Support-for-Sitemap-Protocol-and-Canonical-URLs-td630060.html -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (NUTCH-1712) Use MultipleInputs in Injector to make it a single mapreduce job
[ https://issues.apache.org/jira/browse/NUTCH-1712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13880288#comment-13880288 ] Tejas Patil commented on NUTCH-1712: The performance gains due to this patch won't be phenomenal for small seeds file w/o any metadata and large crawldb's. The only savings with this patch is in terms of saving time over :- 1. dumping the output of the first job (ie. datum objects for the seed urls) 2. reading this output as input for the next job 3. job launch and cleanup. > Use MultipleInputs in Injector to make it a single mapreduce job > > > Key: NUTCH-1712 > URL: https://issues.apache.org/jira/browse/NUTCH-1712 > Project: Nutch > Issue Type: Improvement > Components: injector >Affects Versions: 1.7 >Reporter: Tejas Patil >Assignee: Tejas Patil > Fix For: 1.8 > > Attachments: NUTCH-1712-trunk.v1.patch > > > Currently Injector creates two mapreduce jobs: > 1. sort job: get the urls from seeds file, emit CrawlDatum objects. > 2. merge job: read CrawlDatum objects from both crawldb and output of sort > job. Merge and emit final CrawlDatum objects. > Using MultipleInputs, we can read CrawlDatum objects from crawldb and urls > from seeds file simultaneously and perform inject in a single map-reduce job. > Also, here are additional things covered with this jira: > 1. Pushed filtering and normalization above metadata extraction so that the > unwanted records are ruled out quickly. > 2. Migrated to new mapreduce API > 3. Improved documentation > 4. New junits with better coverage > Relevant discussion over nutch-dev can be found here: > http://mail-archives.apache.org/mod_mbox/nutch-dev/201401.mbox/%3ccafkhtfyxo6wl7gyuv+a5y1pzntdcoqpz4jz_up_bkp9cje8...@mail.gmail.com%3E -- This message was sent by Atlassian JIRA (v6.1.5#6160)
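The single-job idea in the description can be sketched in plain Python standing in for MapReduce. This is a hypothetical illustration, not the patch itself: the function names, the tagged-value scheme and the merge policy (an existing crawldb entry wins over a newly injected one) are all assumptions made for the example.

```python
# Hypothetical sketch: plain Python standing in for one MapReduce job that
# reads two differently formatted inputs, each with its own mapper.
from collections import defaultdict


def map_seed(line):
    """Mapper for the seeds file: one url per line."""
    url = line.strip()
    if url:
        yield url, ("injected", {"status": "unfetched"})


def map_crawldb(record):
    """Mapper for existing crawldb records: (url, datum) pairs."""
    url, datum = record
    yield url, ("existing", datum)


def reduce_inject(url, tagged_values):
    """Assumed merge policy: keep the existing datum when there is one."""
    for tag, datum in tagged_values:
        if tag == "existing":
            return url, datum
    return url, tagged_values[0][1]


def inject(seed_lines, crawldb_records):
    """Run both mappers, group by url, then reduce once."""
    grouped = defaultdict(list)
    for line in seed_lines:
        for url, value in map_seed(line):
            grouped[url].append(value)
    for record in crawldb_records:
        for url, value in map_crawldb(record):
            grouped[url].append(value)
    return dict(reduce_inject(url, values) for url, values in grouped.items())


new_crawldb = inject(
    ["http://a.example/\n", "http://b.example/\n"],
    [("http://a.example/", {"status": "fetched"})],
)
```

Because both inputs feed the same shuffle, the intermediate output of the former sort job never has to be written and read back, which is where the savings listed in the comment come from.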
[jira] [Resolved] (NUTCH-1164) Write JUnit tests for protocol-http
[ https://issues.apache.org/jira/browse/NUTCH-1164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tejas Patil resolved NUTCH-1164. Resolution: Fixed The patch is better now and all tests pass. It needed a little modification: string equality cannot be checked with the equals sign (==), and some refactoring was needed. Committed to 2.x (rev 1560786). Thanks a lot for your contribution [~Sertac Turkel] !! > Write JUnit tests for protocol-http > --- > > Key: NUTCH-1164 > URL: https://issues.apache.org/jira/browse/NUTCH-1164 > Project: Nutch > Issue Type: Sub-task >Affects Versions: nutchgora >Reporter: Lewis John McGibbney > Labels: test > Fix For: 2.4 > > Attachments: NUTCH-1164.patch, > TEST-org.apache.nutch.protocol.http.TestProtocolHttp.txt > > > This issue should provide a single Junit test as part of an effort to provide > JUnit tests for all nutchgora plugins -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Updated] (NUTCH-1712) Use MultipleInputs in Injector to make it a single mapreduce job
[ https://issues.apache.org/jira/browse/NUTCH-1712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tejas Patil updated NUTCH-1712: --- Attachment: NUTCH-1712-trunk.v1.patch > Use MultipleInputs in Injector to make it a single mapreduce job > > > Key: NUTCH-1712 > URL: https://issues.apache.org/jira/browse/NUTCH-1712 > Project: Nutch > Issue Type: Improvement > Components: injector >Affects Versions: 1.7 >Reporter: Tejas Patil >Assignee: Tejas Patil > Fix For: 1.8 > > Attachments: NUTCH-1712-trunk.v1.patch > > > Currently Injector creates two mapreduce jobs: > 1. sort job: get the urls from seeds file, emit CrawlDatum objects. > 2. merge job: read CrawlDatum objects from both crawldb and output of sort > job. Merge and emit final CrawlDatum objects. > Using MultipleInputs, we can read CrawlDatum objects from crawldb and urls > from seeds file simultaneously and perform inject in a single map-reduce job. > Also, here are additional things covered with this jira: > 1. Pushed filtering and normalization above metadata extraction so that the > unwanted records are ruled out quickly. > 2. Migrated to new mapreduce API > 3. Improved documentation > 4. New junits with better coverage > Relevant discussion over nutch-dev can be found here: > http://mail-archives.apache.org/mod_mbox/nutch-dev/201401.mbox/%3ccafkhtfyxo6wl7gyuv+a5y1pzntdcoqpz4jz_up_bkp9cje8...@mail.gmail.com%3E -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Updated] (NUTCH-1712) Use MultipleInputs in Injector to make it a single mapreduce job
[ https://issues.apache.org/jira/browse/NUTCH-1712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tejas Patil updated NUTCH-1712: --- Description: Currently Injector creates two mapreduce jobs: 1. sort job: get the urls from seeds file, emit CrawlDatum objects. 2. merge job: read CrawlDatum objects from both crawldb and output of sort job. Merge and emit final CrawlDatum objects. Using MultipleInputs, we can read CrawlDatum objects from crawldb and urls from seeds file simultaneously and perform inject in a single map-reduce job. Also, here are additional things covered with this jira: 1. Pushed filtering and normalization above metadata extraction so that the unwanted records are ruled out quickly. 2. Migrated to new mapreduce API 3. Improved documentation 4. New junits with better coverage Relevant discussion over nutch-dev can be found here: http://mail-archives.apache.org/mod_mbox/nutch-dev/201401.mbox/%3ccafkhtfyxo6wl7gyuv+a5y1pzntdcoqpz4jz_up_bkp9cje8...@mail.gmail.com%3E was: Currently Injector creates two mapreduce jobs: 1. sort job: get the urls from seeds file, emit CrawlDatum objects. 2. merge job: read CrawlDatum objects from both crawldb and output of sort job. Merge and emit final CrawlDatum objects. Using MultipleInputs, we can read CrawlDatum objects from crawldb and urls from seeds file simultaneously and perform inject in a single map-reduce job. Also, there are few other things adressed in this patch: 1. Pushed filtering and normalization above metadata extraction so that the unwanted records are ruled out quickly. 2. Migrated to new mapreduce API 3. Improved documentation 4. 
New junits with better coverage Relevant discussion over nutch-dev can be found here: http://mail-archives.apache.org/mod_mbox/nutch-dev/201401.mbox/%3ccafkhtfyxo6wl7gyuv+a5y1pzntdcoqpz4jz_up_bkp9cje8...@mail.gmail.com%3E > Use MultipleInputs in Injector to make it a single mapreduce job > > > Key: NUTCH-1712 > URL: https://issues.apache.org/jira/browse/NUTCH-1712 > Project: Nutch > Issue Type: Improvement > Components: injector >Affects Versions: 1.7 >Reporter: Tejas Patil >Assignee: Tejas Patil > Fix For: 1.8 > > Attachments: NUTCH-1712-trunk.v1.patch > > > Currently Injector creates two mapreduce jobs: > 1. sort job: get the urls from seeds file, emit CrawlDatum objects. > 2. merge job: read CrawlDatum objects from both crawldb and output of sort > job. Merge and emit final CrawlDatum objects. > Using MultipleInputs, we can read CrawlDatum objects from crawldb and urls > from seeds file simultaneously and perform inject in a single map-reduce job. > Also, here are additional things covered with this jira: > 1. Pushed filtering and normalization above metadata extraction so that the > unwanted records are ruled out quickly. > 2. Migrated to new mapreduce API > 3. Improved documentation > 4. New junits with better coverage > Relevant discussion over nutch-dev can be found here: > http://mail-archives.apache.org/mod_mbox/nutch-dev/201401.mbox/%3ccafkhtfyxo6wl7gyuv+a5y1pzntdcoqpz4jz_up_bkp9cje8...@mail.gmail.com%3E -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Created] (NUTCH-1712) Use MultipleInputs in Injector to make it a single mapreduce job
Tejas Patil created NUTCH-1712: -- Summary: Use MultipleInputs in Injector to make it a single mapreduce job Key: NUTCH-1712 URL: https://issues.apache.org/jira/browse/NUTCH-1712 Project: Nutch Issue Type: Improvement Components: injector Affects Versions: 1.7 Reporter: Tejas Patil Assignee: Tejas Patil Fix For: 1.8 Currently Injector creates two mapreduce jobs: 1. sort job: get the urls from seeds file, emit CrawlDatum objects. 2. merge job: read CrawlDatum objects from both crawldb and output of sort job. Merge and emit final CrawlDatum objects. Using MultipleInputs, we can read CrawlDatum objects from crawldb and urls from seeds file simultaneously and perform inject in a single map-reduce job. Also, there are few other things adressed in this patch: 1. Pushed filtering and normalization above metadata extraction so that the unwanted records are ruled out quickly. 2. Migrated to new mapreduce API 3. Improved documentation 4. New junits with better coverage Relevant discussion over nutch-dev can be found here: http://mail-archives.apache.org/mod_mbox/nutch-dev/201401.mbox/%3ccafkhtfyxo6wl7gyuv+a5y1pzntdcoqpz4jz_up_bkp9cje8...@mail.gmail.com%3E -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Updated] (NUTCH-1465) Support sitemaps in Nutch
[ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tejas Patil updated NUTCH-1465: --- Attachment: NUTCH-1465-trunk.v3.patch Now that HostDb (NUTCH-1325) is in trunk, I have updated the patch (v3). It also includes: - job counters - more documentation - sitemap references in log4j.properties and the bin/nutch script. For usage, see https://wiki.apache.org/nutch/SitemapFeature > Support sitemaps in Nutch > - > > Key: NUTCH-1465 > URL: https://issues.apache.org/jira/browse/NUTCH-1465 > Project: Nutch > Issue Type: New Feature > Components: parser >Reporter: Lewis John McGibbney >Assignee: Tejas Patil > Fix For: 1.9 > > Attachments: NUTCH-1465-sitemapinjector-trunk-v1.patch, > NUTCH-1465-trunk.v1.patch, NUTCH-1465-trunk.v2.patch, > NUTCH-1465-trunk.v3.patch > > > I recently came across this rather stagnant codebase[0] which is ASL v2.0 > licensed and appears to have been used successfully to parse sitemaps as per > the discussion here[1]. > [0] http://sourceforge.net/projects/sitemap-parser/ > [1] > http://lucene.472066.n3.nabble.com/Support-for-Sitemap-Protocol-and-Canonical-URLs-td630060.html -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (NUTCH-1325) HostDB for Nutch
[ https://issues.apache.org/jira/browse/NUTCH-1325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13878623#comment-13878623 ] Tejas Patil commented on NUTCH-1325: Hi [~markus17], Thanks for the correction. This feature would not have existed without you in the first place. Apart from being a good addition to Nutch, HostDb has also helped in getting a simple design for the Sitemap feature (NUTCH-1465). Cheers !!! > HostDB for Nutch > > > Key: NUTCH-1325 > URL: https://issues.apache.org/jira/browse/NUTCH-1325 > Project: Nutch > Issue Type: New Feature >Reporter: Markus Jelsma >Assignee: Tejas Patil > Fix For: 1.8 > > Attachments: NUTCH-1325-1.6-1.patch, NUTCH-1325-trunk-v3.patch, > NUTCH-1325-trunk-v4.patch, NUTCH-1325.trunk.v2.path > > > A HostDB for Nutch and associated tools to create and read a database > containing information on hosts. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Updated] (NUTCH-1164) Write JUnit tests for protocol-http
[ https://issues.apache.org/jira/browse/NUTCH-1164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tejas Patil updated NUTCH-1164: --- Attachment: TEST-org.apache.nutch.protocol.http.TestProtocolHttp.txt Hi [~Sertac Turkel], I tried out your patch and encountered test case failure: {noformat} test: [echo] Testing plugin: protocol-http [junit] Running org.apache.nutch.protocol.http.TestProtocolHttp [junit] Tests run: 2, Failures: 1, Errors: 0, Time elapsed: 1.244 sec [junit] Test org.apache.nutch.protocol.http.TestProtocolHttp FAILED {noformat} I have attached the test case failure log for reference. > Write JUnit tests for protocol-http > --- > > Key: NUTCH-1164 > URL: https://issues.apache.org/jira/browse/NUTCH-1164 > Project: Nutch > Issue Type: Sub-task >Affects Versions: nutchgora >Reporter: Lewis John McGibbney > Labels: test > Fix For: 2.4 > > Attachments: NUTCH-1158.patch, > TEST-org.apache.nutch.protocol.http.TestProtocolHttp.txt > > > This issue should provide a single Junit test as part of an effort to provide > JUnit tests for all nutchgora plugins -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Resolved] (NUTCH-1325) HostDB for Nutch
[ https://issues.apache.org/jira/browse/NUTCH-1325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tejas Patil resolved NUTCH-1325. Resolution: Fixed Fix Version/s: (was: 1.9) 1.8 Thanks [~markus17] for the heads up :) I have committed the patch to trunk (rev 1560316). > HostDB for Nutch > > > Key: NUTCH-1325 > URL: https://issues.apache.org/jira/browse/NUTCH-1325 > Project: Nutch > Issue Type: New Feature >Reporter: Markus Jelsma >Assignee: Tejas Patil > Fix For: 1.8 > > Attachments: NUTCH-1325-1.6-1.patch, NUTCH-1325-trunk-v3.patch, > NUTCH-1325-trunk-v4.patch, NUTCH-1325.trunk.v2.path > > > A HostDB for Nutch and associated tools to create and read a database > containing information on hosts. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Assigned] (NUTCH-1325) HostDB for Nutch
[ https://issues.apache.org/jira/browse/NUTCH-1325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tejas Patil reassigned NUTCH-1325: -- Assignee: Tejas Patil (was: Markus Jelsma) > HostDB for Nutch > > > Key: NUTCH-1325 > URL: https://issues.apache.org/jira/browse/NUTCH-1325 > Project: Nutch > Issue Type: New Feature >Reporter: Markus Jelsma >Assignee: Tejas Patil > Fix For: 1.9 > > Attachments: NUTCH-1325-1.6-1.patch, NUTCH-1325-trunk-v3.patch, > NUTCH-1325-trunk-v4.patch, NUTCH-1325.trunk.v2.path > > > A HostDB for Nutch and associated tools to create and read a database > containing information on hosts. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Updated] (NUTCH-1465) Support sitemaps in Nutch
[ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tejas Patil updated NUTCH-1465: --- Attachment: NUTCH-1465-trunk.v2.patch Attaching NUTCH-1465-trunk.v2.patch, which implements *option (B)* _Have separate job for the sitemap stuff and merge its output into the crawldb_ +I have tied both the cases in this patch:+ 1. users with a targeted crawl who want to get sitemaps injected from a list of sitemap urls - the use case which [~wastl-nagel] had pointed out. 2. large open web crawls where users cannot afford to generate sitemap seeds for all the hosts and want nutch to inject sitemaps automatically. +To try out this patch:+ 1. Apply the patch for the HostDb feature (https://issues.apache.org/jira/secure/attachment/12624178/NUTCH-1325-trunk-v4.patch) 2. Apply this patch (NUTCH-1465-trunk.v2.patch) 3. (optional) Add this to conf/log4j.properties at line 11: {noformat} log4j.logger.org.apache.nutch.util.SitemapProcessor=INFO,cmdstdout {noformat} 4. Run using {noformat} bin/nutch org.apache.nutch.util.SitemapProcessor {noformat} I have started working on a *wiki page* describing this feature: https://wiki.apache.org/nutch/SitemapFeature Any suggestions and comments are welcome. > Support sitemaps in Nutch > - > > Key: NUTCH-1465 > URL: https://issues.apache.org/jira/browse/NUTCH-1465 > Project: Nutch > Issue Type: New Feature > Components: parser >Reporter: Lewis John McGibbney >Assignee: Tejas Patil > Fix For: 1.9 > > Attachments: NUTCH-1465-sitemapinjector-trunk-v1.patch, > NUTCH-1465-trunk.v1.patch, NUTCH-1465-trunk.v2.patch > > > I recently came across this rather stagnant codebase[0] which is ASL v2.0 > licensed and appears to have been used successfully to parse sitemaps as per > the discussion here[1]. 
> [0] http://sourceforge.net/projects/sitemap-parser/ > [1] > http://lucene.472066.n3.nabble.com/Support-for-Sitemap-Protocol-and-Canonical-URLs-td630060.html -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Updated] (NUTCH-1325) HostDB for Nutch
[ https://issues.apache.org/jira/browse/NUTCH-1325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tejas Patil updated NUTCH-1325: --- Attachment: NUTCH-1325-trunk-v4.patch > HostDB for Nutch > > > Key: NUTCH-1325 > URL: https://issues.apache.org/jira/browse/NUTCH-1325 > Project: Nutch > Issue Type: New Feature >Reporter: Markus Jelsma >Assignee: Markus Jelsma > Fix For: 1.9 > > Attachments: NUTCH-1325-1.6-1.patch, NUTCH-1325-trunk-v3.patch, > NUTCH-1325-trunk-v4.patch, NUTCH-1325.trunk.v2.path > > > A HostDB for Nutch and associated tools to create and read a database > containing information on hosts. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Updated] (NUTCH-1325) HostDB for Nutch
[ https://issues.apache.org/jira/browse/NUTCH-1325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tejas Patil updated NUTCH-1325: --- Attachment: (was: NUTCH-1325-trunk-v4.patch) > HostDB for Nutch > > > Key: NUTCH-1325 > URL: https://issues.apache.org/jira/browse/NUTCH-1325 > Project: Nutch > Issue Type: New Feature >Reporter: Markus Jelsma >Assignee: Markus Jelsma > Fix For: 1.9 > > Attachments: NUTCH-1325-1.6-1.patch, NUTCH-1325-trunk-v3.patch, > NUTCH-1325.trunk.v2.path > > > A HostDB for Nutch and associated tools to create and read a database > containing information on hosts. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Updated] (NUTCH-1325) HostDB for Nutch
[ https://issues.apache.org/jira/browse/NUTCH-1325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tejas Patil updated NUTCH-1325: --- Attachment: NUTCH-1325-trunk-v4.patch Attaching NUTCH-1325-trunk-v4.patch with the following changes: - Fixed filterNormalize() so that it no longer incorrectly prepends "http://" to normal urls. - Migrated HostDb to the new map-reduce API > HostDB for Nutch > > > Key: NUTCH-1325 > URL: https://issues.apache.org/jira/browse/NUTCH-1325 > Project: Nutch > Issue Type: New Feature >Reporter: Markus Jelsma >Assignee: Markus Jelsma > Fix For: 1.9 > > Attachments: NUTCH-1325-1.6-1.patch, NUTCH-1325-trunk-v3.patch, > NUTCH-1325-trunk-v4.patch, NUTCH-1325.trunk.v2.path > > > A HostDB for Nutch and associated tools to create and read a database > containing information on hosts. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (NUTCH-1630) How to achieve finishing fetch approximately at the same time for each queue (a.k.a adaptive queue size)
[ https://issues.apache.org/jira/browse/NUTCH-1630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13875981#comment-13875981 ] Tejas Patil commented on NUTCH-1630: Hi [~talat], I didn't know about NUTCH-1413. That is the perfect way of getting the average response time. For larger crawls which span several days, this would give a good approximation of the response time. With that, the points in the first two paragraphs of my earlier comment are resolved. For the third paragraph, as you have made it configurable, crawl owners would have to make this choice. The concept behind the patch is good and would be a value addition to Nutch. As [~jnioche] suggested, it would be super awesome if this could be a plugin or made less tangled with the Generate and Fetch code so that it doesn't accidentally introduce any bugs. > How to achieve finishing fetch approximately at the same time for each queue > (a.k.a adaptive queue size) > - > > Key: NUTCH-1630 > URL: https://issues.apache.org/jira/browse/NUTCH-1630 > Project: Nutch > Issue Type: Improvement >Affects Versions: 2.1, 2.2, 2.2.1 >Reporter: Talat UYARER > Labels: improvement > Fix For: 2.3 > > Attachments: NUTCH-1630.patch, NUTCH-1630v2.patch > > > Problem Definition: > When crawling, due to the disproportionate size of queues, fetching needs to wait > for a long time for long lasting queues when shorter ones are finished. That > means you may have to wait for a couple of days for some of the queues. > Normally we define max queue size with generate.max.count but that's a static > value. However the number of URLs to be fetched increases with each depth. > Defining the same length for all queues does not mean all queues will finish > around the same time. This problem has been addressed by some other users > before [1]. So we came up with a different approach to this issue. > Solution: > Nutch has three modes for creating fetch queues (byHost, byDomain, ByIp). 
Our > solution can be applicable to all three modes. > 1-Define a "fetch workload of current queue" (FW) value for each queue based > on the previous fetches of that queue. > We calculate this by: > FW=average response time of previous depth * number of urls in current > queue > 2- Calculate the harmonic mean [2] of all FW's to get the average workload of > current depth (AW) > 3- Get the length for a queue by dividing AW by previously known average > response time of that queue: > Queue Length=AW / average response time > Using this algorithm leads to a fetch phase where all queues finish up around > the same time. > As soon as possible I will send my patch. Do you have any comments ? > [1] http://osdir.com/ml/dev.nutch.apache.org/2011-11/msg00101.html > [2] In our opinion, harmonic mean is best in our case because our data has a > few points that are much higher than the rest. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
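The three steps in the issue description can be worked through numerically. The following is a minimal sketch of those formulas only, not the attached patch: the function name, the input shape (queue id mapped to average response time and url count), the sample hosts and the rounding are assumptions made for illustration.

```python
# Minimal sketch of the FW / AW / queue-length formulas from the description.
# Names, input shape and rounding are assumptions, not taken from the patch.
from statistics import harmonic_mean


def adaptive_queue_lengths(queues):
    """queues maps queue id -> (average response time in seconds, url count)."""
    # Step 1: FW = average response time * number of urls in the queue.
    workloads = {q: rt * n for q, (rt, n) in queues.items()}
    # Step 2: AW = harmonic mean of all FW values.
    average_workload = harmonic_mean(list(workloads.values()))
    # Step 3: queue length = AW / average response time of that queue.
    return {q: round(average_workload / rt) for q, (rt, _) in queues.items()}


queues = {
    "fast.example": (0.5, 1000),  # FW = 500
    "slow.example": (2.0, 1000),  # FW = 2000
}
lengths = adaptive_queue_lengths(queues)
```

Here AW is 2 / (1/500 + 1/2000) = 800, so the fast queue gets 1600 urls and the slow one 400. Both then need about 800 seconds of response time, which is the "finish around the same time" property the description aims for. The sketch does not cap a queue at the number of urls actually available.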
[jira] [Commented] (NUTCH-1630) How to achieve finishing fetch approximately at the same time for each queue (a.k.a adaptive queue size)
[ https://issues.apache.org/jira/browse/NUTCH-1630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13875947#comment-13875947 ] Tejas Patil commented on NUTCH-1630: Hi [~talat], So from the 2nd depth onwards, you would ping the host in the generate phase and get the response time. For large scale crawl setups, the Generator itself might run for a few hours, and at the time when you ping the host it might be loaded or there might be network traffic. When the actual fetch phase runs, the response time might be different depending upon the load on the server. As I mentioned in my earlier comment, I thought you were doing a cumulative sum of response timings for several urls of a host and then getting an average from it... which would give better response time numbers. This would be harder to code in the existing codebase and might look ugly as the fetcher needs to pass on this information to the generator. +A broader concern for crawls which run for days+ Server response timings themselves change as the local time changes. For example, during day time (say 8:00 - 11:00 am) there might be a decent number of requests from users to the server as compared to night time (say 1:00 - 4:00 am) when there are very few users requesting the servers. Pinging the server at a single point in the 24 hour day would not give a good approximation of the response time for long running crawls. +Effect on crawlspace of slow servers+ If a server is genuinely slow (say due to low end hardware), then it would always have a slower response time as compared to other servers. Effectively, we would end up having a smaller fetch queue for that host, thus creating a huge backlog of its urls which would end up sitting in the crawldb, left ungenerated cycle after cycle. I would take your side on this: try to fetch as much as we can. But some crawl owners might be unhappy with this. 
-- This message was sent by Atlassian JIRA (v6.1.5#6160)
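The cumulative averaging mentioned in the comment above (sum the response timings of several urls of a host, then divide by the count) could look roughly like the following. This is a minimal sketch under stated assumptions: HostResponseStats and its methods are hypothetical names, and it says nothing about how the fetcher would hand these numbers to the generator in Nutch.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of Nutch: keeps a running sum and count of
// response times per host so that an average can be read at any time.
final class HostResponseStats {

    // host -> { sum of response times in milliseconds, number of samples }
    private final Map<String, long[]> totals = new HashMap<>();

    /** Adds one observed response time (in milliseconds) for the host. */
    void record(String host, long millis) {
        long[] t = totals.computeIfAbsent(host, h -> new long[2]);
        t[0] += millis;
        t[1]++;
    }

    /** Average response time for the host, or NaN if nothing was recorded. */
    double average(String host) {
        long[] t = totals.get(host);
        return (t == null || t[1] == 0) ? Double.NaN : (double) t[0] / t[1];
    }
}
```

Averaging over many fetches spread across a crawl smooths out the time-of-day effect described above, which a single ping during generate cannot do.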
[jira] [Commented] (NUTCH-1697) SegmentMerger to implement Tool
[ https://issues.apache.org/jira/browse/NUTCH-1697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13875910#comment-13875910 ] Tejas Patil commented on NUTCH-1697: Hi [~markus17], Correct me if I am wrong: Hadoop properties should be passed as *-D property=value* (note the space after -D). The form you were using, i.e. *-Dproperty=value*, is meant for JVM system properties and won't be picked up by Tool. > SegmentMerger to implement Tool > --- > > Key: NUTCH-1697 > URL: https://issues.apache.org/jira/browse/NUTCH-1697 > Project: Nutch > Issue Type: Improvement >Reporter: Markus Jelsma >Assignee: Markus Jelsma >Priority: Trivial > Fix For: 1.8 > > Attachments: NUTCH-1697-trunk.patch > > -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (NUTCH-1630) How to achieve finishing fetch approximately at the same time for each queue (a.k.a adaptive queue size)
[ https://issues.apache.org/jira/browse/NUTCH-1630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13875902#comment-13875902 ] Tejas Patil commented on NUTCH-1630: Hi [~icebergx5], How do you obtain the average response time of the previous depth? I was hoping that it would be somewhere in the fetch phase, where you somehow store the response timings for each host and then pass that information on to the generate phase. > How to achieve finishing fetch approximately at the same time for each queue > (a.k.a adaptive queue size) > - > > Key: NUTCH-1630 > URL: https://issues.apache.org/jira/browse/NUTCH-1630 > Project: Nutch > Issue Type: Improvement >Affects Versions: 2.1, 2.2, 2.2.1 >Reporter: Talat UYARER > Labels: improvement > Fix For: 2.3 > > Attachments: NUTCH-1630.patch, NUTCH-1630v2.patch > > > Problem Definition: > When crawling, due to unproportional size of queues; fetching needs to wait > for a long time for long lasting queues when shorter ones are finished. That > means you may have to wait for a couple of days for some of queues. > Normally we define max queue size with generate.max.count but that's a static > value. However number of URLs to be fetched increases with each depth. > Defining same length for all queues does not mean all queues will finish > around the same time. This problem has been addressed by some other users > before [1]. So we came up with a different approach to this issue. > Solution: > Nutch has three mods for creating fetch queues (byHost, byDomain, ByIp). Our > solution can be applicable to all three mods. > 1-Define a "fetch workload of current queue" (FW) value for each queue based > on the previous fetches of that queue. 
> We calculate this by: > FW=average response time of previous depth * number of urls in current > queue > 2- Calculate the harmonic mean [2] of all FW's to get the average workload of > current depth (AW) > 3- Get the length for a queue by dividing AW by previously known average > response time of that queue: > Queue Length=AW / average response time > Using this algorithm leads to a fetch phase where all queues finish up around > the same time. > As soon as possible I will send my patch. Do you have any comments ? > [1] http://osdir.com/ml/dev.nutch.apache.org/2011-11/msg00101.html > [2] In our opinion; harmonic mean is best in our case because our data has a > few points that are much higher than the rest. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (NUTCH-1680) CrawldbReader to dump minRetry value
[ https://issues.apache.org/jira/browse/NUTCH-1680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13875687#comment-13875687 ] Tejas Patil commented on NUTCH-1680: +1 > CrawldbReader to dump minRetry value > > > Key: NUTCH-1680 > URL: https://issues.apache.org/jira/browse/NUTCH-1680 > Project: Nutch > Issue Type: Improvement > Components: crawldb >Reporter: Markus Jelsma >Assignee: Markus Jelsma >Priority: Trivial > Fix For: 1.8 > > Attachments: NUTCH-1680-trunk.patch > > > CrawlDBReader should be able to dump records based on minimum retry value. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (NUTCH-1325) HostDB for Nutch
[ https://issues.apache.org/jira/browse/NUTCH-1325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13862240#comment-13862240 ] Tejas Patil commented on NUTCH-1325: Could anyone please look at the patch and let us know if there are any flaws or improvements that must be addressed ? > HostDB for Nutch > > > Key: NUTCH-1325 > URL: https://issues.apache.org/jira/browse/NUTCH-1325 > Project: Nutch > Issue Type: New Feature >Reporter: Markus Jelsma >Assignee: Markus Jelsma > Fix For: 1.9 > > Attachments: NUTCH-1325-1.6-1.patch, NUTCH-1325-trunk-v3.patch, > NUTCH-1325.trunk.v2.path > > > A HostDB for Nutch and associated tools to create and read a database > containing information on hosts. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch
[ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13862237#comment-13862237 ] Tejas Patil commented on NUTCH-1465: Hi [~wastl-nagel], Yes. I think that it should be there too. I will be working on the patch this weekend and will post an update. Thanks for your inputs and suggestions so far; they were super helpful in chalking out the right specs for this feature. > Support sitemaps in Nutch > - > > Key: NUTCH-1465 > URL: https://issues.apache.org/jira/browse/NUTCH-1465 > Project: Nutch > Issue Type: New Feature > Components: parser >Reporter: Lewis John McGibbney >Assignee: Tejas Patil > Fix For: 1.9 > > Attachments: NUTCH-1465-sitemapinjector-trunk-v1.patch, > NUTCH-1465-trunk.v1.patch > > > I recently came across this rather stagnant codebase[0] which is ASL v2.0 > licensed and appears to have been used successfully to parse sitemaps as per > the discussion here[1]. > [0] http://sourceforge.net/projects/sitemap-parser/ > [1] > http://lucene.472066.n3.nabble.com/Support-for-Sitemap-Protocol-and-Canonical-URLs-td630060.html -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (NUTCH-356) Plugin repository cache can lead to memory leak
[ https://issues.apache.org/jira/browse/NUTCH-356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13861217#comment-13861217 ] Tejas Patil commented on NUTCH-356: --- +1 for commit. > Plugin repository cache can lead to memory leak > --- > > Key: NUTCH-356 > URL: https://issues.apache.org/jira/browse/NUTCH-356 > Project: Nutch > Issue Type: Bug >Affects Versions: 0.8 >Reporter: Enrico Triolo > Fix For: 2.3, 1.8 > > Attachments: ASF.LICENSE.NOT.GRANTED--NutchTest.java, > ASF.LICENSE.NOT.GRANTED--patch.txt, NUTCH-356-trunk.patch, cache_classes.patch > > > While I was trying to solve a problem I reported a while ago (see Nutch-314), > I found out that actually the problem was related to the plugin cache used in > class PluginRepository.java. > As I said in Nutch-314, I think I somehow 'force' the way nutch is meant to > work, since I need to frequently submit new urls and append their contents to > the index; I don't (and I can't) have an urls.txt file with all urls I'm > going to fetch, but I recreate it each time a new url is submitted. > Thus, I think in the majority of times you won't have problems using nutch > as-is, since the problem I found occours only if nutch is used in a way > similar to the one I use. > To simplify your test I'm attaching a class that performs something similar > to what I need. It fetches and index some sample urls; to avoid webmasters > complaints I left the sample urls list empty, so you should modify the source > code and add some urls. > Then you only have to run it and watch your memory consumption with top. In > my experience I get an OutOfMemoryException after a couple of minutes, but it > clearly depends on your heap settings and on the plugins you are using (I'm > using > 'protocol-file|protocol-http|parse-(rss|html|msword|pdf|text)|language-identifier|index-(basic|more)|query-(basic|more|site|url)|urlfilter-regex|summary-basic|scoring-opic'). 
> The problem is bound to the PluginRepository 'singleton' instance, since it > never get released. It seems that some class maintains a reference to it and > this class is never released since it is cached somewhere in the > configuration. > So I modified the PluginRepository's 'get' method so that it never uses the > cache and always returns a new instance (you can find the patch in > attachment). This way the memory consumption is always stable and I get no > OOM anymore. > Clearly this is not the solution, since I guess there are many performance > issues involved, but for the moment it works. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (NUTCH-1454) parsing chm failed
[ https://issues.apache.org/jira/browse/NUTCH-1454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13860803#comment-13860803 ] Tejas Patil commented on NUTCH-1454: TIKA-1122 is fixed and I have verified that 'parsechecker' works fine with the same. Upgrading to Tika 1.5 (yet to be released) should fix this for Nutch. > parsing chm failed > -- > > Key: NUTCH-1454 > URL: https://issues.apache.org/jira/browse/NUTCH-1454 > Project: Nutch > Issue Type: Bug > Components: parser >Affects Versions: 1.5.1 >Reporter: Sebastian Nagel >Priority: Minor > Fix For: 1.9 > > > (reported by Jan Riewe, see > http://lucene.472066.n3.nabble.com/CHM-Files-and-Tika-td3999735.html) > Nutch fails to parse chm files with > {quote} > ERROR tika.TikaParser - Can't retrieve Tika parser for mime-type > application/vnd.ms-htmlhelp > {quote} > Tested with chm test files from Tika: > {code} > % bin/nutch parsechecker > file:/.../tika/trunk/tika-parsers/src/test/resources/test-documents/testChm.chm > {code} > Tika parses this document (but does not extract any content). -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (NUTCH-1691) DomainBlacklist url filter does not allow -D filter file override
[ https://issues.apache.org/jira/browse/NUTCH-1691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13860678#comment-13860678 ] Tejas Patil commented on NUTCH-1691: Hi [~markus17], It's a good solution. +1 from me. I would like to know how you are invoking the plugin. I tried to use "bin/nutch plugin urlfilter-domainblacklist" but that didn't work as the plugin doesn't have a main(). > DomainBlacklist url filter does not allow -D filter file override > - > > Key: NUTCH-1691 > URL: https://issues.apache.org/jira/browse/NUTCH-1691 > Project: Nutch > Issue Type: Bug >Affects Versions: 1.7 >Reporter: Markus Jelsma >Assignee: Markus Jelsma > Fix For: 1.8, 2.4 > > Attachments: NUTCH-1691-trunk.patch > > > This filter does not accept -Durlfilter.domainblacklist.file= overrides. The > plugin's file attribute is always used. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Closed] (NUTCH-1670) set same crawldb directory in mergedb parameter
[ https://issues.apache.org/jira/browse/NUTCH-1670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tejas Patil closed NUTCH-1670. -- Resolution: Fixed Committed the patch by [~amuseme] to trunk (rev 1554883). > set same crawldb directory in mergedb parameter > --- > > Key: NUTCH-1670 > URL: https://issues.apache.org/jira/browse/NUTCH-1670 > Project: Nutch > Issue Type: Bug > Components: crawldb >Affects Versions: 1.7 >Reporter: lufeng >Assignee: lufeng >Priority: Minor > Labels: PatchAvailable > Fix For: 1.8 > > Attachments: NUTCH-1670.patch > > > when merge two crawldb using the same crawldb directory in bin/nutch merge > paramater, it will throw data not found exception. > bin/nutch mergedb crawldb_t1 crawldb_t1 crawldb_2 > bin/nutch generate crawldb_t1 segment -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Updated] (NUTCH-1080) Type safe members , arguments for better readability
[ https://issues.apache.org/jira/browse/NUTCH-1080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tejas Patil updated NUTCH-1080: --- Fix Version/s: 1.8 > Type safe members , arguments for better readability > - > > Key: NUTCH-1080 > URL: https://issues.apache.org/jira/browse/NUTCH-1080 > Project: Nutch > Issue Type: Improvement > Components: fetcher >Reporter: Karthik K > Fix For: 2.3, 1.8 > > Attachments: NUTCH-1080-tejasp-trunk-v2.patch, NUTCH-1080.patch, > NUTCH-rel_14-1080.patch > > > Enable generics for some of the API, for better type safety and readability, > in the process. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (NUTCH-1080) Type safe members , arguments for better readability
[ https://issues.apache.org/jira/browse/NUTCH-1080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13860643#comment-13860643 ] Tejas Patil commented on NUTCH-1080: Committed to trunk (rev 1554881). Will port the same to 2.x > Type safe members , arguments for better readability > - > > Key: NUTCH-1080 > URL: https://issues.apache.org/jira/browse/NUTCH-1080 > Project: Nutch > Issue Type: Improvement > Components: fetcher >Reporter: Karthik K > Fix For: 2.3, 1.8 > > Attachments: NUTCH-1080-tejasp-trunk-v2.patch, NUTCH-1080.patch, > NUTCH-rel_14-1080.patch > > > Enable generics for some of the API, for better type safety and readability, > in the process. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Assigned] (NUTCH-1080) Type safe members , arguments for better readability
[ https://issues.apache.org/jira/browse/NUTCH-1080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tejas Patil reassigned NUTCH-1080: -- Assignee: Tejas Patil > Type safe members , arguments for better readability > - > > Key: NUTCH-1080 > URL: https://issues.apache.org/jira/browse/NUTCH-1080 > Project: Nutch > Issue Type: Improvement > Components: fetcher >Reporter: Karthik K >Assignee: Tejas Patil > Fix For: 2.3, 1.8 > > Attachments: NUTCH-1080-tejasp-trunk-v2.patch, NUTCH-1080.patch, > NUTCH-rel_14-1080.patch > > > Enable generics for some of the API, for better type safety and readability, > in the process. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Updated] (NUTCH-1080) Type safe members , arguments for better readability
[ https://issues.apache.org/jira/browse/NUTCH-1080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tejas Patil updated NUTCH-1080: --- Attachment: NUTCH-1080-tejasp-trunk-v2.patch Attaching a patch for trunk. Uploaded the same over review board: https://reviews.apache.org/r/16563/ Comments are welcome !!! > Type safe members , arguments for better readability > - > > Key: NUTCH-1080 > URL: https://issues.apache.org/jira/browse/NUTCH-1080 > Project: Nutch > Issue Type: Improvement > Components: fetcher >Reporter: Karthik K > Fix For: 2.3 > > Attachments: NUTCH-1080-tejasp-trunk-v2.patch, NUTCH-1080.patch, > NUTCH-rel_14-1080.patch > > > Enable generics for some of the API, for better type safety and readability, > in the process. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Updated] (NUTCH-1325) HostDB for Nutch
[ https://issues.apache.org/jira/browse/NUTCH-1325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tejas Patil updated NUTCH-1325: --- Attachment: NUTCH-1325-trunk-v3.patch A final patch (NUTCH-1325-trunk-v3.patch) to complete this feature. Uploaded the patch over review board too: https://reviews.apache.org/r/16555/ Comments are welcome !!! > HostDB for Nutch > > > Key: NUTCH-1325 > URL: https://issues.apache.org/jira/browse/NUTCH-1325 > Project: Nutch > Issue Type: New Feature >Reporter: Markus Jelsma >Assignee: Markus Jelsma > Fix For: 1.9 > > Attachments: NUTCH-1325-1.6-1.patch, NUTCH-1325-trunk-v3.patch, > NUTCH-1325.trunk.v2.path > > > A HostDB for Nutch and associated tools to create and read a database > containing information on hosts. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (NUTCH-1670) set same crawldb directory in mergedb parameter
[ https://issues.apache.org/jira/browse/NUTCH-1670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13859987#comment-13859987 ] Tejas Patil commented on NUTCH-1670: Hi [~amuseme.lu], The patch looks good to me. +1 from me for commit. > set same crawldb directory in mergedb parameter > --- > > Key: NUTCH-1670 > URL: https://issues.apache.org/jira/browse/NUTCH-1670 > Project: Nutch > Issue Type: Bug > Components: crawldb >Affects Versions: 1.7 >Reporter: lufeng >Assignee: lufeng >Priority: Minor > Labels: PatchAvailable > Fix For: 1.8 > > Attachments: NUTCH-1670.patch > > > when merge two crawldb using the same crawldb directory in bin/nutch merge > paramater, it will throw data not found exception. > bin/nutch mergedb crawldb_t1 crawldb_t1 crawldb_2 > bin/nutch generate crawldb_t1 segment -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (NUTCH-1687) Pick queue in Round Robin
[ https://issues.apache.org/jira/browse/NUTCH-1687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13859358#comment-13859358 ] Tejas Patil commented on NUTCH-1687: Created a review request: https://reviews.apache.org/r/16535/ > Pick queue in Round Robin > - > > Key: NUTCH-1687 > URL: https://issues.apache.org/jira/browse/NUTCH-1687 > Project: Nutch > Issue Type: Improvement > Components: fetcher >Reporter: Tien Nguyen Manh >Priority: Minor > Fix For: 2.3, 1.8 > > Attachments: NUTCH-1687.patch, NUTCH-1687.tejasp.v1.patch > > > Currently we chose queue to pick url from start of queues list, so queue at > the start of list have more change to be pick first, that can cause problem > of long tail queue, which only few queue available at the end which have many > urls. > public synchronized FetchItem getFetchItem() { > final Iterator> it = > queues.entrySet().iterator(); ==> always reset to find queue from > start > while (it.hasNext()) { > > I think it is better to pick queue in round robin, that can make reduce time > to find the available queue and make all queue was picked in round robin and > if we use TopN during generator there are no long tail queue at the end. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Updated] (NUTCH-1687) Pick queue in Round Robin
[ https://issues.apache.org/jira/browse/NUTCH-1687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tejas Patil updated NUTCH-1687: --- Attachment: NUTCH-1687.tejasp.v1.patch I feel that there is no need for creating a separate class for Circular linked list and maintaining the circular list along with the original map. Uploading "NUTCH-1687.tejasp.v1.patch" : Uses [LinkedHashMap|http://docs.oracle.com/javase/7/docs/api/java/util/LinkedHashMap.html] along with a [Guava cyclic iterator|http://docs.guava-libraries.googlecode.com/git/javadoc/com/google/common/collect/Iterables.html#cycle(java.lang.Iterable)] to iterate the map of queues in a circular fashion. With that no separate list needs to be maintained. Comments are welcome. > Pick queue in Round Robin > - > > Key: NUTCH-1687 > URL: https://issues.apache.org/jira/browse/NUTCH-1687 > Project: Nutch > Issue Type: Improvement > Components: fetcher >Reporter: Tien Nguyen Manh >Priority: Minor > Fix For: 2.3, 1.8 > > Attachments: NUTCH-1687.patch, NUTCH-1687.tejasp.v1.patch > > > Currently we chose queue to pick url from start of queues list, so queue at > the start of list have more change to be pick first, that can cause problem > of long tail queue, which only few queue available at the end which have many > urls. > public synchronized FetchItem getFetchItem() { > final Iterator> it = > queues.entrySet().iterator(); ==> always reset to find queue from > start > while (it.hasNext()) { > > I think it is better to pick queue in round robin, that can make reduce time > to find the available queue and make all queue was picked in round robin and > if we use TopN during generator there are no long tail queue at the end. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
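The idea in the comment above, iterating an insertion-ordered map of queues in a circular fashion instead of always restarting from the first entry, can be sketched with the JDK alone. This is not the code from NUTCH-1687.tejasp.v1.patch: RoundRobinPicker is a hypothetical class, it restarts a LinkedHashMap iterator by hand rather than using Guava's Iterables.cycle, and it assumes the set of queues is fixed after construction (adding or removing entries while iterating would trip the map's fail-fast iterator and needs extra care in a real fetcher).

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.NoSuchElementException;

// Hypothetical helper, not part of Nutch: hands out map entries in
// round-robin order without keeping a separate circular list.
final class RoundRobinPicker<K, V> {

    private final Map<K, V> queues;
    private Iterator<Map.Entry<K, V>> cursor;

    RoundRobinPicker(Map<K, V> initial) {
        // LinkedHashMap keeps a stable (insertion) order across iterations.
        this.queues = new LinkedHashMap<>(initial);
        this.cursor = queues.entrySet().iterator();
    }

    /** Returns the next entry, wrapping around to the first after the last. */
    synchronized Map.Entry<K, V> next() {
        if (queues.isEmpty()) {
            throw new NoSuchElementException("no queues");
        }
        if (!cursor.hasNext()) {
            cursor = queues.entrySet().iterator(); // wrap around
        }
        return cursor.next();
    }
}
```

Because the cursor survives between calls, each call continues where the previous one stopped, so queues at the start of the map are not favoured over those at the end.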
[jira] [Commented] (NUTCH-1687) Pick queue in Round Robin
[ https://issues.apache.org/jira/browse/NUTCH-1687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13859275#comment-13859275 ] Tejas Patil commented on NUTCH-1687: This is a good point by [~tiennm]. Although this might not give a significant performance improvement, it would fairly distribute requests across all fetch queues. Some comments wrt the patch: 1. Do you really need to make the methods of the CircularLinkedList class thread safe? The methods in "FetchItemQueues" which interact with the CircularLinkedList (i.e. getFetchItemQueue and getFetchItem) are all synchronized. So it's ensured that only one thread accesses the list at a time. 2. Why is 'id' needed in FetchItemQueue? > Pick queue in Round Robin > - > > Key: NUTCH-1687 > URL: https://issues.apache.org/jira/browse/NUTCH-1687 > Project: Nutch > Issue Type: Improvement > Components: fetcher >Reporter: Tien Nguyen Manh >Priority: Minor > Fix For: 2.3, 1.8 > > Attachments: NUTCH-1687.patch > > > Currently we chose queue to pick url from start of queues list, so queue at > the start of list have more change to be pick first, that can cause problem > of long tail queue, which only few queue available at the end which have many > urls. > public synchronized FetchItem getFetchItem() { > final Iterator> it = > queues.entrySet().iterator(); ==> always reset to find queue from > start > while (it.hasNext()) { > > I think it is better to pick queue in round robin, that can make reduce time > to find the available queue and make all queue was picked in round robin and > if we use TopN during generator there are no long tail queue at the end. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Updated] (NUTCH-1687) Pick queue in Round Robin
[ https://issues.apache.org/jira/browse/NUTCH-1687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tejas Patil updated NUTCH-1687: --- Fix Version/s: 1.8 > Pick queue in Round Robin > - > > Key: NUTCH-1687 > URL: https://issues.apache.org/jira/browse/NUTCH-1687 > Project: Nutch > Issue Type: Improvement > Components: fetcher >Reporter: Tien Nguyen Manh >Priority: Minor > Fix For: 2.3, 1.8 > > Attachments: NUTCH-1687.patch > > > Currently we chose queue to pick url from start of queues list, so queue at > the start of list have more change to be pick first, that can cause problem > of long tail queue, which only few queue available at the end which have many > urls. > public synchronized FetchItem getFetchItem() { > final Iterator> it = > queues.entrySet().iterator(); ==> always reset to find queue from > start > while (it.hasNext()) { > > I think it is better to pick queue in round robin, that can make reduce time > to find the available queue and make all queue was picked in round robin and > if we use TopN during generator there are no long tail queue at the end. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (NUTCH-1689) Improve CrawlDb stats
[ https://issues.apache.org/jira/browse/NUTCH-1689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13855416#comment-13855416 ] Tejas Patil commented on NUTCH-1689: Some concerns: 1. While you are removing fields from the output, there can be people relying on the existing output (grepping or awking to get required fields). It isn't wise to simply remove all the fields directly. Keep things backward compatible. 2. You can make the command configurable so that users get to select which fields they want in the output. 3. While submitting a patch, commenting out the older code is not the best way. Remove those lines instead of commenting them out. > Improve CrawlDb stats > - > > Key: NUTCH-1689 > URL: https://issues.apache.org/jira/browse/NUTCH-1689 > Project: Nutch > Issue Type: Improvement >Reporter: Nguyen Manh Tien >Priority: Minor > Fix For: 2.3 > > Attachments: NUTCH-1689.patch > > > Crawldb stats now is slow due to it load all fields from store, I change to > load only necessary fields. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch
[ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13848723#comment-13848723 ] Tejas Patil commented on NUTCH-1465: Hi [~wastl-nagel], Nice share. The only grudge I have with that approach is that users will have to pick up sitemap urls for hosts *manually* and feed them to the sitemap injector. It would fit well where users are performing targeted crawling. For a large scale, open web crawl use case: (i) the number of initial hosts can be large : one time burden for users (ii) the crawler discovers new hosts with time : constant pain for users to look out for the newly discovered hosts and then get sitemaps from robots.txt manually. With HostDB from NUTCH-1325 and B, users won't suffer here. > do we really need an extra DB? I should have been clearer in my explanation. "sitemapDB" is some temporary location where all crawl datums of sitemap entries would be written. This can be deleted after the merge with the main crawlDB. Quite analogous to what the inject operation does. > NUTCH-1622 would enable solution A: outlinks now can hold extra info. I didn't know that. Still I would go in favor of B as it is clean and A would involve messing around with the existing codebase at several places. > Support sitemaps in Nutch > - > > Key: NUTCH-1465 > URL: https://issues.apache.org/jira/browse/NUTCH-1465 > Project: Nutch > Issue Type: New Feature > Components: parser >Reporter: Lewis John McGibbney >Assignee: Tejas Patil > Fix For: 1.9 > > Attachments: NUTCH-1465-sitemapinjector-trunk-v1.patch, > NUTCH-1465-trunk.v1.patch > > > I recently came across this rather stagnant codebase[0] which is ASL v2.0 > licensed and appears to have been used successfully to parse sitemaps as per > the discussion here[1]. 
> [0] http://sourceforge.net/projects/sitemap-parser/ > [1] > http://lucene.472066.n3.nabble.com/Support-for-Sitemap-Protocol-and-Canonical-URLs-td630060.html -- This message was sent by Atlassian JIRA (v6.1.4#6159)
[jira] [Comment Edited] (NUTCH-1465) Support sitemaps in Nutch
[ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13848723#comment-13848723 ] Tejas Patil edited comment on NUTCH-1465 at 12/16/13 12:09 AM: --- Hi [~wastl-nagel], Nice share. The only grudge I have with that approach is that users will have to pick up sitemap urls for hosts *manually* and feed them to the sitemap injector. It would fit well where users are performing targeted crawling. For a large scale, open web crawl use case: i) the number of initial hosts can be large : one time burden for users ii) the crawler discovers new hosts with time : constant pain for users to look out for the newly discovered hosts and then get sitemaps from robots.txt manually. With HostDB from NUTCH-1325 and B, users won't suffer here. > do we really need an extra DB? I should have been clearer in my explanation. "sitemapDB" is some temporary location where all crawl datums of sitemap entries would be written. This can be deleted after the merge with the main crawlDB. Quite analogous to what the inject operation does. > NUTCH-1622 would enable solution A: outlinks now can hold extra info. I didn't know that. Still I would go in favor of B as it is clean and A would involve messing around with the existing codebase at several places. was (Author: tejasp): Hi [~wastl-nagel], Nice share. The only grudge I have with that approach is that users will have to pick up sitemap urls for hosts *manually* and feed to the sitemap injector. It would fit well where users are performing targeted crawling. For a large scale, open web crawl use case: (i) the number of initial hosts can be large : one time burden for users (ii) crawler discovers new hosts with time : constant pain for users to look out for the new hosts discovered and then get sitemaps from robots.txt manually. With HostDB from NUTCH-1325 and B, users won't suffer here. > do we really need an extra DB? I should have been clear with the explanation. 
"sitemapDB" is some temporary location where all crawl datums of sitemap entries would be written. This can be deleted after merge with the main crawlDB. Quite analogous to what inject operation does. > NUTCH-1622 would enable solution A: outlinks now can hold extra info. I didn't knew that. Still I would go in favor of B as it is clean and A would involve messing around with existing codebase at several places. > Support sitemaps in Nutch > - > > Key: NUTCH-1465 > URL: https://issues.apache.org/jira/browse/NUTCH-1465 > Project: Nutch > Issue Type: New Feature > Components: parser >Reporter: Lewis John McGibbney >Assignee: Tejas Patil > Fix For: 1.9 > > Attachments: NUTCH-1465-sitemapinjector-trunk-v1.patch, > NUTCH-1465-trunk.v1.patch > > > I recently came across this rather stagnant codebase[0] which is ASL v2.0 > licensed and appears to have been used successfully to parse sitemaps as per > the discussion here[1]. > [0] http://sourceforge.net/projects/sitemap-parser/ > [1] > http://lucene.472066.n3.nabble.com/Support-for-Sitemap-Protocol-and-Canonical-URLs-td630060.html -- This message was sent by Atlassian JIRA (v6.1.4#6159)
[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch
[ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13848561#comment-13848561 ]

Tejas Patil commented on NUTCH-1465:

I revisited this Jira after a long time and gave some thought to how this can be done cleanly. There are two ways of implementing it:

*(A) Do the sitemap stuff in the fetch phase of the nutch cycle.*
This was my original approach, which the (in-progress) patch addresses. It would involve tweaking core nutch classes at several locations.
Pros:
- Sitemaps are nothing but normal pages with several outlinks, so they fit well in the 'fetch' cycle.
Cons:
- Sitemaps can be very large. Fetching them needs large size and time limits. The fetch code must have a special case that recognizes a sitemap url and uses custom limits => leads to a hacky coding style.
- The Outlink class cannot hold the extra information contained in sitemaps (like lastmod, changefreq). It would have to be modified to hold this information too. This would be specific to sitemaps, yet we would end up making all outlinks hold this info. We could create a special type of outlink to take care of this.

*(B) Have a separate job for the sitemap stuff and merge its output into the crawldb.*
i. The user populates a list of hosts (or uses HostDB from NUTCH-1325). Now we have all the hosts to be processed.
ii. Run a map-reduce job: for each host,
- get the robots page and extract the sitemap urls,
- get the xml content of these sitemap pages,
- create crawl datums with the required info and write them to a sitemapDB
iii. Use the CrawlDbMerger utility to merge the sitemapDB and the crawldb
Pros:
- Cleaner code.
- Users control when to perform sitemap extraction. This is better than (A), wherein sitemap urls sit in the crawldb and get fetched along with normal pages (thus eating up fetch time in every fetch phase). We can have a sitemap_frequency used inside the crawl script so that users can say that sitemap processing should run after 'x' nutch cycles.
Cons:
- Additional map-reduce jobs are needed. I think this is reasonable: running the sitemap job 1-5 times a month on a production-level crawl would work out well.

I am inclined towards implementing (B)

> Support sitemaps in Nutch
> -------------------------
>
> Key: NUTCH-1465
> URL: https://issues.apache.org/jira/browse/NUTCH-1465
> Project: Nutch
> Issue Type: New Feature
> Components: parser
> Reporter: Lewis John McGibbney
> Assignee: Tejas Patil
> Fix For: 1.9
>
> Attachments: NUTCH-1465-trunk.v1.patch
>
>
> I recently came across this rather stagnant codebase[0] which is ASL v2.0 licensed and appears to have been used successfully to parse sitemaps as per the discussion here[1].
> [0] http://sourceforge.net/projects/sitemap-parser/
> [1] http://lucene.472066.n3.nabble.com/Support-for-Sitemap-Protocol-and-Canonical-URLs-td630060.html
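Step ii of approach (B) begins by pulling the sitemap urls out of each host's robots.txt. A minimal Python sketch of just that parsing step follows. It assumes the robots.txt body has already been fetched, the function name `extract_sitemap_urls` is hypothetical, and it is an illustration rather than code from the attached patches.

```python
def extract_sitemap_urls(robots_txt):
    """Return the URLs named by 'Sitemap:' lines in a robots.txt body."""
    urls = []
    for line in robots_txt.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        # The field name is matched case-insensitively. Split on the first
        # colon only, because the URL itself contains colons.
        name, sep, value = line.partition(":")
        if sep and name.strip().lower() == "sitemap" and value.strip():
            urls.append(value.strip())
    return urls


# Hypothetical robots.txt content used only for this example.
robots = """User-agent: *
Disallow: /private/
Sitemap: http://www.example.org/sitemap.xml
sitemap: http://www.example.org/news-sitemap.xml  # second one
"""
print(extract_sitemap_urls(robots))
```

In a real job each returned url would then be fetched and parsed, and one crawl datum per sitemap entry written to the sitemapDB. Note that this sketch treats "#" as a comment marker, so a sitemap url containing a fragment would be truncated.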
[jira] [Commented] (NUTCH-1325) HostDB for Nutch
[ https://issues.apache.org/jira/browse/NUTCH-1325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13848517#comment-13848517 ] Tejas Patil commented on NUTCH-1325: Hi [~markus17], I stopped by this Jira (after a long time !!!) with an intention of getting it to a stage where we could have it inside trunk. You had replied to my two concerns. For (1): {noformat}host_a.example.org, host_b.example.org ==> example.org{noformat} This might *NOT* be a good idea. (a) The websites for say "cs.uci.edu" and "bio.uci.edu" might be hosted independently. It can be argued to consider them as different hosts. (b) I am not sure about the standards, but if something like "uci.cs.edu" is valid (subdomain is suffix of domain) then there would be a problem when we resolve "uci.cs.edu" and "ucla.cs.edu" to "cs.edu". For (2): "I use the HTTP:// scheme but not all hosts may allow that scheme. We have a modified domain filter that optionally takes a scheme so we can force HTTPS for specific domains. Those domains are filtered out because HTTP is not allowed." Do you have any suggestion to work this out ? > HostDB for Nutch > > > Key: NUTCH-1325 > URL: https://issues.apache.org/jira/browse/NUTCH-1325 > Project: Nutch > Issue Type: New Feature >Reporter: Markus Jelsma >Assignee: Markus Jelsma > Fix For: 1.9 > > Attachments: NUTCH-1325-1.6-1.patch, NUTCH-1325.trunk.v2.path > > > A HostDB for Nutch and associated tools to create and read a database > containing information on hosts. -- This message was sent by Atlassian JIRA (v6.1.4#6159)
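Concern (1) can be made concrete with a small Python sketch. It assumes a deliberately naive rule that keeps only the last two labels of a host name; that rule is an assumption for illustration and not necessarily what the HostDB patch does, and `naive_domain` is a hypothetical helper.

```python
def naive_domain(host):
    """Collapse a host name to its last two labels (deliberately naive)."""
    labels = host.lower().rstrip(".").split(".")
    return ".".join(labels[-2:])


# Sibling hosts of one site collapse as intended ...
print(naive_domain("host_a.example.org"), naive_domain("host_b.example.org"))
# ... but independently hosted departments collapse too, which is case (a).
print(naive_domain("cs.uci.edu"), naive_domain("bio.uci.edu"))
# Case (b): unrelated institutions would end up under the same key.
print(naive_domain("uci.cs.edu"), naive_domain("ucla.cs.edu"))
```

Under this rule "cs.uci.edu" and "bio.uci.edu" share one HostDB entry even if they are run independently, which is the behaviour questioned above.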
[jira] [Commented] (NUTCH-1577) Add target for creating eclipse project
[ https://issues.apache.org/jira/browse/NUTCH-1577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13848493#comment-13848493 ] Tejas Patil commented on NUTCH-1577: Some checkin(s) in the past few months led to one jar (solr-solrj-3.4.0.jar) being required in the eclipse classpath and to 'ant eclipse' not building the project smoothly. Fixed the same. Committed at revision 1550987. > Add target for creating eclipse project > --- > > Key: NUTCH-1577 > URL: https://issues.apache.org/jira/browse/NUTCH-1577 > Project: Nutch > Issue Type: Improvement >Affects Versions: 1.6, 2.1 >Reporter: Tejas Patil >Assignee: Tejas Patil >Priority: Minor > Labels: build, eclipse > Fix For: 1.7, 2.2 > > Attachments: NUTCH-1577.2.x.patch, NUTCH-1577.trunk.patch > > > Currently, loading Nutch source code in Eclipse as a project is cumbersome > and involves lot of manual steps as given over > [wiki|http://wiki.apache.org/nutch/RunNutchInEclipse]. It would be great to > automate this. Adding a ant target to do that would remove burden off from > developers. -- This message was sent by Atlassian JIRA (v6.1.4#6159)
[jira] [Commented] (NUTCH-1325) HostDB for Nutch
[ https://issues.apache.org/jira/browse/NUTCH-1325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13736459#comment-13736459 ] Tejas Patil commented on NUTCH-1325: Hi [~markus17], > think i've got a slightly newer version of the tools but don't know what > actually changed in the past year. I'll try to diff and upload it. Could you kindly upload the newer version ? > HostDB for Nutch > > > Key: NUTCH-1325 > URL: https://issues.apache.org/jira/browse/NUTCH-1325 > Project: Nutch > Issue Type: New Feature >Reporter: Markus Jelsma >Assignee: Markus Jelsma > Fix For: 1.9 > > Attachments: NUTCH-1325-1.6-1.patch, NUTCH-1325.trunk.v2.path > > > A HostDB for Nutch and associated tools to create and read a database > containing information on hosts. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1599) Obtain consensus on new description of Nutch
[ https://issues.apache.org/jira/browse/NUTCH-1599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13699115#comment-13699115 ]

Tejas Patil commented on NUTCH-1599:

I agree with Julien: Nutch should be described as a web-crawler. Markus took it to the next level by adding more technicality :)
So "Highly extensible and scalable web crawler software" it is !!

> Obtain consensus on new description of Nutch
>
>
> Key: NUTCH-1599
> URL: https://issues.apache.org/jira/browse/NUTCH-1599
> Project: Nutch
> Issue Type: Improvement
> Components: documentation
> Reporter: Lewis John McGibbney
> Assignee: Lewis John McGibbney
> Fix For: 2.3, 1.8
>
>
> As we seem to be sustaining pushes and maintenance (touch wood) of two branches, I think it is about time we agreed on a more accurate description of what Nutch actually is.
> We currently have (taken directly from our site)
> {code:xml}
> Apache Nutch is an open source web-search software project. Stemming from Apache Lucene, it now builds on Apache Solr adding web-specifics, such as a crawler, a link-graph database and parsing support handled by Apache Tika for HTML and an array of other document formats.
> Nutch can run on a single machine, but gains a lot of its strength from running in a Hadoop cluster
> The system can be enhanced (eg other document formats can be parsed) using a highly flexible, easily extensible and thoroughly maintained plugin infrastructure.
> {code}
> I suggest/propose something along the lines of
> {code:xml}
> Apache Nutch is an open source web-search software project. Stemming from Apache Lucene, the community now develops and maintains two branches:
> * 1.x; description of 1.x here
> * 2.x; description of 2.x here
> Both branches add web-specifics, such as a crawler, a link-graph database and parsing support handled by Apache Tika for HTML and an array of other document formats.
> Nutch can run on a single machine, but gains a lot of its strength from running in a Hadoop cluster
> The system can be enhanced (eg other document formats can be parsed) using a highly flexible, easily extensible and thoroughly maintained plugin infrastructure.
> {code}
> Any thoughts?
[jira] [Commented] (NUTCH-1602) improve the readability of metadata in readdb dump normal
[ https://issues.apache.org/jira/browse/NUTCH-1602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13699096#comment-13699096 ] Tejas Patil commented on NUTCH-1602: Hi Lufeng, +1 from me too. One minor suggestion: You could add space in between "=" and ";" to make it even better. > improve the readability of metadata in readdb dump normal > -- > > Key: NUTCH-1602 > URL: https://issues.apache.org/jira/browse/NUTCH-1602 > Project: Nutch > Issue Type: Improvement > Components: crawldb >Affects Versions: 1.7 >Reporter: lufeng >Assignee: lufeng >Priority: Minor > Fix For: 1.8 > > Attachments: NUTCH-1602.patch > > > the dumped metadata format is not readable. > {code:xml} > $bin/nutch readdb crawldb/ -dump dir > http://www.baidu.com/ Version: 7 > Status: 3 (db_gone) > Fetch time: Sat Aug 17 22:35:37 CST 2013 > Modified time: Thu Jan 01 08:00:00 CST 1970 > Retries since fetch: 0 > Retry interval: 3888000 seconds (45 days) > Score: 1.0 > Signature: null > Metadata: m1: v22m3: v3m2: v2m5: v5m4: m4_pst_: robots_denied(18), > lastModified=0m6: v6 > {code} > so I improve the Metadata format to this > {code:xml} > Metadata: m1=v22;m3=v3;m2=v2;m5=v5;m4=m4;_pst_=robots_denied(18), > lastModified=0;m6=v6; > {code} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
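The format change discussed above can be sketched in Python. This is an illustration only, not the code in NUTCH-1602.patch: `format_metadata` and its `spaced` flag are hypothetical names, and the spaced variant is just one reading of the suggestion to add spaces around "=" and after ";".

```python
def format_metadata(metadata, spaced=False):
    """Render metadata pairs as 'key=value;', optionally spaced out."""
    pair_sep = " = " if spaced else "="
    end = "; " if spaced else ";"
    rendered = "".join(f"{key}{pair_sep}{value}{end}"
                       for key, value in metadata.items())
    # Remove the trailing space left behind by the spaced variant.
    return rendered.rstrip()


# A few of the pairs from the dump above, in insertion order.
meta = {"m1": "v22", "m3": "v3", "_pst_": "robots_denied(18), lastModified=0"}
print("Metadata: " + format_metadata(meta))
print("Metadata: " + format_metadata(meta, spaced=True))
```

Either variant puts an explicit delimiter between pairs, which is what makes the dump readable compared with the run-together "m1: v22m3: v3..." output.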
[jira] [Commented] (NUTCH-1327) QueryStringNormalizer
[ https://issues.apache.org/jira/browse/NUTCH-1327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13696840#comment-13696840 ] Tejas Patil commented on NUTCH-1327: Hi Markus, 1. The patch when applied as is didn't compile the plugin. I had to add entries into src/plugin/build.xml to get it compiled. 2. Can you kindly add some javadoc comments in QuerystringURLNormalizer class so that people can quickly get an idea about what this plugin would do ? > QueryStringNormalizer > - > > Key: NUTCH-1327 > URL: https://issues.apache.org/jira/browse/NUTCH-1327 > Project: Nutch > Issue Type: New Feature >Reporter: Markus Jelsma >Assignee: Markus Jelsma > Fix For: 1.9 > > Attachments: NUTCH-1327-1.8-1.patch > > > A normalizer for dealing with query strings. Sorting query strings is helpful > in preventing duplicates for some (bad) websites. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
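For readers unfamiliar with the idea in the issue description, sorting query strings so that equivalent urls compare equal can be sketched with the Python standard library. This is not the plugin's Java code, and `sort_query_string` is a hypothetical name used only for this example.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def sort_query_string(url):
    """Return the url with its query parameters sorted by name, then value."""
    parts = urlsplit(url)
    # keep_blank_values preserves parameters such as "?flag=".
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    query = urlencode(sorted(pairs))
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, query, parts.fragment))


# Two spellings of the same page normalize to one url.
print(sort_query_string("http://example.org/page?b=2&a=1"))
print(sort_query_string("http://example.org/page?a=1&b=2"))
```

One caveat of this sketch: `parse_qsl` decodes and `urlencode` re-encodes the parameters, so percent-encoding may also be normalized as a side effect.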
[jira] [Commented] (NUTCH-1126) JUnit test for urlfilter-prefix
[ https://issues.apache.org/jira/browse/NUTCH-1126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13692069#comment-13692069 ] Tejas Patil commented on NUTCH-1126: Thanks Talat and Cihad :) One small thing: @author tags should not be used in Apache projects - see http://mail-archives.apache.org/mod_mbox/www-community/200402.mbox/%3c403a144a.5040...@apache.org%3E Please remove those while submitting patches. > JUnit test for urlfilter-prefix > --- > > Key: NUTCH-1126 > URL: https://issues.apache.org/jira/browse/NUTCH-1126 > Project: Nutch > Issue Type: Sub-task > Components: build >Affects Versions: 1.4 >Reporter: Lewis John McGibbney >Assignee: Markus Jelsma >Priority: Minor > Fix For: 2.3, 1.8 > > Attachments: test_case_for_urlfilter-prefix.patch > > > This issue is part of the larger attempt to provide a Junit test case for > every Nutch plugin. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1578) Upgrade to Hadoop 1.2.0
[ https://issues.apache.org/jira/browse/NUTCH-1578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13672868#comment-13672868 ] Tejas Patil commented on NUTCH-1578: +1. We should go for this. > Upgrade to Hadoop 1.2.0 > --- > > Key: NUTCH-1578 > URL: https://issues.apache.org/jira/browse/NUTCH-1578 > Project: Nutch > Issue Type: Task >Reporter: Markus Jelsma >Assignee: Markus Jelsma >Priority: Trivial > Fix For: 1.7, 2.3 > > > Hadoop 1.2.0 finally has the ability to run mappers in parallel when running > in local mode. In trunk at least the generator seems to run slightly faster. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1577) Add target for creating eclipse project
[ https://issues.apache.org/jira/browse/NUTCH-1577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13672650#comment-13672650 ] Tejas Patil commented on NUTCH-1577: Hi [~wastl-nagel], +1 for the suggestion. I have sorted the plugin packages now. Committed to trunk (r1488768) and 2.x (r1488770). > Add target for creating eclipse project > --- > > Key: NUTCH-1577 > URL: https://issues.apache.org/jira/browse/NUTCH-1577 > Project: Nutch > Issue Type: Improvement >Affects Versions: 1.6, 2.1 >Reporter: Tejas Patil >Assignee: Tejas Patil >Priority: Minor > Labels: build, eclipse > Fix For: 1.7, 2.2 > > Attachments: NUTCH-1577.2.x.patch, NUTCH-1577.trunk.patch > > > Currently, loading Nutch source code in Eclipse as a project is cumbersome > and involves lot of manual steps as given over > [wiki|http://wiki.apache.org/nutch/RunNutchInEclipse]. It would be great to > automate this. Adding a ant target to do that would remove burden off from > developers. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Resolved] (NUTCH-1577) Add target for creating eclipse project
[ https://issues.apache.org/jira/browse/NUTCH-1577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tejas Patil resolved NUTCH-1577. Resolution: Fixed Updated the documentation page [RunNutchInEclipse|http://wiki.apache.org/nutch/RunNutchInEclipse] to reflect the new steps. > Add target for creating eclipse project > --- > > Key: NUTCH-1577 > URL: https://issues.apache.org/jira/browse/NUTCH-1577 > Project: Nutch > Issue Type: Improvement >Affects Versions: 1.6, 2.1 >Reporter: Tejas Patil >Assignee: Tejas Patil >Priority: Minor > Labels: build, eclipse > Fix For: 1.7, 2.2 > > Attachments: NUTCH-1577.2.x.patch, NUTCH-1577.trunk.patch > > > Currently, loading Nutch source code in Eclipse as a project is cumbersome > and involves lot of manual steps as given over > [wiki|http://wiki.apache.org/nutch/RunNutchInEclipse]. It would be great to > automate this. Adding a ant target to do that would remove burden off from > developers. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1577) Add target for creating eclipse project
[ https://issues.apache.org/jira/browse/NUTCH-1577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13671823#comment-13671823 ] Tejas Patil commented on NUTCH-1577: Committed to trunk at rev1488396. My next task is to update the wiki page with the new steps and then close this jira. > Add target for creating eclipse project > --- > > Key: NUTCH-1577 > URL: https://issues.apache.org/jira/browse/NUTCH-1577 > Project: Nutch > Issue Type: Improvement >Affects Versions: 1.6, 2.1 >Reporter: Tejas Patil >Assignee: Tejas Patil >Priority: Minor > Labels: build, eclipse > Fix For: 1.7, 2.2 > > Attachments: NUTCH-1577.2.x.patch, NUTCH-1577.trunk.patch > > > Currently, loading Nutch source code in Eclipse as a project is cumbersome > and involves lot of manual steps as given over > [wiki|http://wiki.apache.org/nutch/RunNutchInEclipse]. It would be great to > automate this. Adding a ant target to do that would remove burden off from > developers. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-1577) Add target for creating eclipse project
[ https://issues.apache.org/jira/browse/NUTCH-1577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tejas Patil updated NUTCH-1577: --- Attachment: NUTCH-1577.2.x.patch Patch for 2.x > Add target for creating eclipse project > --- > > Key: NUTCH-1577 > URL: https://issues.apache.org/jira/browse/NUTCH-1577 > Project: Nutch > Issue Type: Improvement >Affects Versions: 1.6, 2.1 >Reporter: Tejas Patil >Assignee: Tejas Patil >Priority: Minor > Labels: build, eclipse > Fix For: 1.7, 2.2 > > Attachments: NUTCH-1577.2.x.patch, NUTCH-1577.trunk.patch > > > Currently, loading Nutch source code in Eclipse as a project is cumbersome > and involves lot of manual steps as given over > [wiki|http://wiki.apache.org/nutch/RunNutchInEclipse]. It would be great to > automate this. Adding a ant target to do that would remove burden off from > developers. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Comment Edited] (NUTCH-1577) Add target for creating eclipse project
[ https://issues.apache.org/jira/browse/NUTCH-1577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13671305#comment-13671305 ] Tejas Patil edited comment on NUTCH-1577 at 5/31/13 10:22 AM: -- Here is a patch for trunk. How to use it: * on a SVN checkout of trunk, apply the patch * run "ant eclipse" * In eclipse: "File" -> "Import" -> "Existing projects into workspace". Give the path of the trunk directory. Initially it would show some errors (red dots) but those will go away after it builds the workspace. was (Author: tejasp): Here is a patch for trunk. How to use it: * on a SVN checkout of trunk, apply the patch * run "ant eclipse" * In eclipse: "File" -> "Import" -> "Existing projects into workspace". Give the path of the trunk directory. Initially it would show some errors (red dots) but those will go away after it auto-compiles the newly imported project. > Add target for creating eclipse project > --- > > Key: NUTCH-1577 > URL: https://issues.apache.org/jira/browse/NUTCH-1577 > Project: Nutch > Issue Type: Improvement >Affects Versions: 1.6, 2.1 >Reporter: Tejas Patil >Assignee: Tejas Patil >Priority: Minor > Labels: build, eclipse > Fix For: 1.7, 2.2 > > Attachments: NUTCH-1577.trunk.patch > > > Currently, loading Nutch source code in Eclipse as a project is cumbersome > and involves lot of manual steps as given over > [wiki|http://wiki.apache.org/nutch/RunNutchInEclipse]. It would be great to > automate this. Adding a ant target to do that would remove burden off from > developers. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-1577) Add target for creating eclipse project
[ https://issues.apache.org/jira/browse/NUTCH-1577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tejas Patil updated NUTCH-1577: --- Attachment: NUTCH-1577.trunk.patch Here is a patch for trunk. How to use it: * on a SVN checkout of trunk, apply the patch * run "ant eclipse" * In eclipse: "File" -> "Import" -> "Existing projects into workspace". Give the patch of the trunk directory. Initially it would show some errors (red dots) but those will go away after it auto-compiles the newly imported project. > Add target for creating eclipse project > --- > > Key: NUTCH-1577 > URL: https://issues.apache.org/jira/browse/NUTCH-1577 > Project: Nutch > Issue Type: Improvement >Affects Versions: 1.6, 2.1 >Reporter: Tejas Patil >Assignee: Tejas Patil >Priority: Minor > Labels: build, eclipse > Fix For: 1.7, 2.2 > > Attachments: NUTCH-1577.trunk.patch > > > Currently, loading Nutch source code in Eclipse as a project is cumbersome > and involves lot of manual steps as given over > [wiki|http://wiki.apache.org/nutch/RunNutchInEclipse]. It would be great to > automate this. Adding a ant target to do that would remove burden off from > developers. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Comment Edited] (NUTCH-1577) Add target for creating eclipse project
[ https://issues.apache.org/jira/browse/NUTCH-1577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13671305#comment-13671305 ] Tejas Patil edited comment on NUTCH-1577 at 5/31/13 10:19 AM: -- Here is a patch for trunk. How to use it: * on a SVN checkout of trunk, apply the patch * run "ant eclipse" * In eclipse: "File" -> "Import" -> "Existing projects into workspace". Give the path of the trunk directory. Initially it would show some errors (red dots) but those will go away after it auto-compiles the newly imported project. was (Author: tejasp): Here is a patch for trunk. How to use it: * on a SVN checkout of trunk, apply the patch * run "ant eclipse" * In eclipse: "File" -> "Import" -> "Existing projects into workspace". Give the patch of the trunk directory. Initially it would show some errors (red dots) but those will go away after it auto-compiles the newly imported project. > Add target for creating eclipse project > --- > > Key: NUTCH-1577 > URL: https://issues.apache.org/jira/browse/NUTCH-1577 > Project: Nutch > Issue Type: Improvement >Affects Versions: 1.6, 2.1 >Reporter: Tejas Patil >Assignee: Tejas Patil >Priority: Minor > Labels: build, eclipse > Fix For: 1.7, 2.2 > > Attachments: NUTCH-1577.trunk.patch > > > Currently, loading Nutch source code in Eclipse as a project is cumbersome > and involves lot of manual steps as given over > [wiki|http://wiki.apache.org/nutch/RunNutchInEclipse]. It would be great to > automate this. Adding a ant target to do that would remove burden off from > developers. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (NUTCH-1577) Add target for creating eclipse project
Tejas Patil created NUTCH-1577: -- Summary: Add target for creating eclipse project Key: NUTCH-1577 URL: https://issues.apache.org/jira/browse/NUTCH-1577 Project: Nutch Issue Type: Improvement Affects Versions: 2.1, 1.6 Reporter: Tejas Patil Assignee: Tejas Patil Priority: Minor Fix For: 1.7, 2.2 Currently, loading Nutch source code in Eclipse as a project is cumbersome and involves lot of manual steps as given over [wiki|http://wiki.apache.org/nutch/RunNutchInEclipse]. It would be great to automate this. Adding a ant target to do that would remove burden off from developers. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1563) FetchSchedule#getFields is never used by GeneraterJob
[ https://issues.apache.org/jira/browse/NUTCH-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13665267#comment-13665267 ] Tejas Patil commented on NUTCH-1563: You pushed it at the right place [~amuseme] :) If there is nothing left to be done for this Jira, please close it off. > FetchSchedule#getFields is never used by GeneraterJob > - > > Key: NUTCH-1563 > URL: https://issues.apache.org/jira/browse/NUTCH-1563 > Project: Nutch > Issue Type: Bug > Components: generator >Affects Versions: 2.1 >Reporter: lufeng >Assignee: lufeng >Priority: Minor > Fix For: 2.2 > > Attachments: NUTCH-1563.patch > > > The getFields method in FetchSchedule is never used, so if a user extends > FetchSchedule and wants to get some fields of WebPage, it always returns > null. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1563) FetchSchedule#getFields is never used by GeneraterJob
[ https://issues.apache.org/jira/browse/NUTCH-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13664408#comment-13664408 ] Tejas Patil commented on NUTCH-1563: I think this is relevant only to 2.x, and [~amuseme.lu] has pushed the patch to svn. Any work left here ? > FetchSchedule#getFields is never used by GeneraterJob > - > > Key: NUTCH-1563 > URL: https://issues.apache.org/jira/browse/NUTCH-1563 > Project: Nutch > Issue Type: Bug > Components: generator >Affects Versions: 2.1 >Reporter: lufeng >Assignee: lufeng >Priority: Minor > Fix For: 2.3 > > Attachments: NUTCH-1563.patch > > > The getFields method in FetchSchedule is never used, so if a user extends > FetchSchedule and wants to get some fields of WebPage, it always returns > null. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Resolved] (NUTCH-1249) Resolve all issues flagged up by adding javac -Xlint arguement
[ https://issues.apache.org/jira/browse/NUTCH-1249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tejas Patil resolved NUTCH-1249. Resolution: Fixed Fix Version/s: 2.2 Assignee: Tejas Patil (was: Lewis John McGibbney) Ported the patch for trunk to 2.x. All the tests are passing (verified on Java 1.7.0_10 and 1.6.0_38). Committed to svn at rev 1485125. > Resolve all issues flagged up by adding javac -Xlint arguement > -- > > Key: NUTCH-1249 > URL: https://issues.apache.org/jira/browse/NUTCH-1249 > Project: Nutch > Issue Type: Improvement > Components: build >Affects Versions: nutchgora >Reporter: Lewis John McGibbney >Assignee: Tejas Patil >Priority: Minor > Fix For: 1.7, 2.2 > > Attachments: NUTCH-1249.trunk.patch > > > There are a heap of issues flagged up by NUTCH-1237, I think over time it > would be great to get these addressed and resolved. > What is interesting is that adding the same arguements to > /src/plugin/plugin-build.xml actually breaks my build as tests begin to fail. > Some of this stuff is documented in the link below > http://docs.oracle.com/javase/1.5.0/docs/tooldocs/windows/javac.html#options -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Resolved] (NUTCH-1275) Fix [unchecked] javac warnings
[ https://issues.apache.org/jira/browse/NUTCH-1275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tejas Patil resolved NUTCH-1275. Resolution: Fixed Fix Version/s: 2.2 Got resolved with NUTCH-1249 > Fix [unchecked] javac warnings > -- > > Key: NUTCH-1275 > URL: https://issues.apache.org/jira/browse/NUTCH-1275 > Project: Nutch > Issue Type: Sub-task > Components: build >Affects Versions: nutchgora, 1.5 >Reporter: Lewis John McGibbney >Assignee: Tejas Patil >Priority: Minor > Fix For: 1.7, 2.2 > > > We can simply suppress these warnings using > {code} > SuppressWarnings [unchecked] > {code} > However if there is a another method for resolving these warnings then they > should be implemented if deemed beneficial to code quality. > Some resources > http://java.sun.com/docs/books/jls/third_edition/html/conversions.html#190772 -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1275) Fix [unchecked] javac warnings
[ https://issues.apache.org/jira/browse/NUTCH-1275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13663792#comment-13663792 ] Tejas Patil commented on NUTCH-1275: Hi [~lewismc], I am working on a patch for 2.x. > Fix [unchecked] javac warnings > -- > > Key: NUTCH-1275 > URL: https://issues.apache.org/jira/browse/NUTCH-1275 > Project: Nutch > Issue Type: Sub-task > Components: build >Affects Versions: nutchgora, 1.5 >Reporter: Lewis John McGibbney >Assignee: Tejas Patil >Priority: Minor > Fix For: 1.7 > > > We can simply suppress these warnings using > {code} > SuppressWarnings [unchecked] > {code} > However if there is a another method for resolving these warnings then they > should be implemented if deemed beneficial to code quality. > Some resources > http://java.sun.com/docs/books/jls/third_edition/html/conversions.html#190772 -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Assigned] (NUTCH-1275) Fix [unchecked] javac warnings
[ https://issues.apache.org/jira/browse/NUTCH-1275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tejas Patil reassigned NUTCH-1275: -- Assignee: Tejas Patil > Fix [unchecked] javac warnings > -- > > Key: NUTCH-1275 > URL: https://issues.apache.org/jira/browse/NUTCH-1275 > Project: Nutch > Issue Type: Sub-task > Components: build >Affects Versions: nutchgora, 1.5 >Reporter: Lewis John McGibbney >Assignee: Tejas Patil >Priority: Minor > Fix For: 1.7 > > > We can simply suppress these warnings using > {code} > @SuppressWarnings("unchecked") > {code} > However, if there is another method for resolving these warnings, then it > should be implemented if deemed beneficial to code quality. > Some resources > http://java.sun.com/docs/books/jls/third_edition/html/conversions.html#190772
[jira] [Commented] (NUTCH-1569) Upgrade 2.x to Gora 0.3
[ https://issues.apache.org/jira/browse/NUTCH-1569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13662571#comment-13662571 ] Tejas Patil commented on NUTCH-1569: If using some other backend would be overkill, then let's stick to MemStore. +1 from me too. > Upgrade 2.x to Gora 0.3 > --- > > Key: NUTCH-1569 > URL: https://issues.apache.org/jira/browse/NUTCH-1569 > Project: Nutch > Issue Type: Improvement > Components: build, storage >Affects Versions: 2.2 >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney > Fix For: 2.2 > > Attachments: NUTCH-1569.patch, NUTCH-1569.v2.patch > > > We just released the Maven artifacts and I would like to upgrade before we > push the RC for 2.2 :) > Patch coming up
[jira] [Resolved] (NUTCH-1513) Support Robots.txt for Ftp urls
[ https://issues.apache.org/jira/browse/NUTCH-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tejas Patil resolved NUTCH-1513. Resolution: Fixed Committed to trunk (rev 1484638) and 2.x (rev 1484637) > Support Robots.txt for Ftp urls > --- > > Key: NUTCH-1513 > URL: https://issues.apache.org/jira/browse/NUTCH-1513 > Project: Nutch > Issue Type: Improvement >Affects Versions: 1.7, 2.2 >Reporter: Tejas Patil >Assignee: Tejas Patil >Priority: Minor > Labels: robots.txt > Fix For: 2.3, 1.8 > > Attachments: NUTCH-1513.2.x.v2.patch, NUTCH-1513.trunk.patch, > NUTCH-1513.trunk.v2.patch > > > As per [0], an FTP website can have a robots.txt like [1]. In the Nutch code, > the Ftp plugin does not parse the robots file and accepts all URLs. > In "_src/plugin/protocol-ftp/src/java/org/apache/nutch/protocol/ftp/Ftp.java_" > {noformat} public RobotRules getRobotRules(Text url, CrawlDatum datum) { > return EmptyRobotRules.RULES; > }{noformat} > It's not clear if this was part of the design or if it's a bug. > [0] : > https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt > [1] : ftp://example.com/robots.txt
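To make the gap described in NUTCH-1513 concrete, here is a rough, self-contained sketch of what honouring robots.txt for an FTP URL involves: locating the file at the host root, reading the Disallow rules that apply to all agents, and checking a path against them. This is not the committed patch and not the Nutch or crawler-commons API; the class and method names are hypothetical, the parsing is deliberately simplified (Allow lines, wildcards, and agent-specific groups are ignored), and a real crawler would fetch the file over FTP and delegate to a full robots.txt parser.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper for illustration only; not part of Nutch.
public class FtpRobotsSketch {

  // Builds the robots.txt location at the root of the URL's host,
  // e.g. ftp://example.com/robots.txt.
  public static String robotsUrlFor(String url) {
    URI uri = URI.create(url);
    return uri.getScheme() + "://" + uri.getRawAuthority() + "/robots.txt";
  }

  // Collects Disallow path prefixes from groups that name "User-agent: *".
  public static List<String> disallowedPrefixes(String robotsTxt) {
    List<String> prefixes = new ArrayList<>();
    boolean inWildcardGroup = false;
    boolean previousWasAgent = false;
    for (String rawLine : robotsTxt.split("\\r?\\n")) {
      String line = rawLine;
      int hash = line.indexOf('#');
      if (hash >= 0) line = line.substring(0, hash); // drop trailing comments
      line = line.trim();
      int colon = line.indexOf(':');
      if (colon < 0) continue; // blank or malformed line
      String field = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
      String value = line.substring(colon + 1).trim();
      if (field.equals("user-agent")) {
        if (!previousWasAgent) inWildcardGroup = false; // a new group starts here
        if (value.equals("*")) inWildcardGroup = true;
        previousWasAgent = true;
      } else {
        previousWasAgent = false;
        if (field.equals("disallow") && inWildcardGroup && !value.isEmpty()) {
          prefixes.add(value);
        }
      }
    }
    return prefixes;
  }

  // Returns false when the URL's path starts with any disallowed prefix.
  public static boolean isAllowed(String url, List<String> disallowed) {
    String path = URI.create(url).getRawPath();
    if (path == null || path.isEmpty()) path = "/";
    for (String prefix : disallowed) {
      if (path.startsWith(prefix)) return false;
    }
    return true;
  }

  public static void main(String[] args) {
    List<String> rules = disallowedPrefixes("User-agent: *\nDisallow: /private/\n");
    System.out.println(robotsUrlFor("ftp://example.com/pub/file.txt"));
    System.out.println(isAllowed("ftp://example.com/private/x", rules));
  }
}
```

By contrast, returning an empty rule set, as the quoted getRobotRules does, corresponds to calling isAllowed with an empty list: every URL passes.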
[jira] [Commented] (NUTCH-1275) Fix [unchecked] javac warnings
[ https://issues.apache.org/jira/browse/NUTCH-1275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13662554#comment-13662554 ] Tejas Patil commented on NUTCH-1275: Committed to trunk @ revision 1484634. For the patch, see NUTCH-1249. > Fix [unchecked] javac warnings > -- > > Key: NUTCH-1275 > URL: https://issues.apache.org/jira/browse/NUTCH-1275 > Project: Nutch > Issue Type: Sub-task > Components: build >Affects Versions: nutchgora, 1.5 >Reporter: Lewis John McGibbney >Priority: Minor > Fix For: 1.7 > > > We can simply suppress these warnings using > {code} > @SuppressWarnings("unchecked") > {code} > However, if there is another method for resolving these warnings, then it > should be implemented if deemed beneficial to code quality. > Some resources > http://java.sun.com/docs/books/jls/third_edition/html/conversions.html#190772
[jira] [Commented] (NUTCH-1249) Resolve all issues flagged up by adding javac -Xlint arguement
[ https://issues.apache.org/jira/browse/NUTCH-1249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13662553#comment-13662553 ] Tejas Patil commented on NUTCH-1249: Committed to trunk @ revision 1484634 > Resolve all issues flagged up by adding javac -Xlint arguement > -- > > Key: NUTCH-1249 > URL: https://issues.apache.org/jira/browse/NUTCH-1249 > Project: Nutch > Issue Type: Improvement > Components: build >Affects Versions: nutchgora >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney >Priority: Minor > Fix For: 1.7 > > Attachments: NUTCH-1249.trunk.patch > > > There are a heap of issues flagged up by NUTCH-1237; I think over time it > would be great to get these addressed and resolved. > What is interesting is that adding the same arguments to > /src/plugin/plugin-build.xml actually breaks my build as tests begin to fail. > Some of this stuff is documented in the link below > http://docs.oracle.com/javase/1.5.0/docs/tooldocs/windows/javac.html#options
[jira] [Commented] (NUTCH-1569) Upgrade 2.x to Gora 0.3
[ https://issues.apache.org/jira/browse/NUTCH-1569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13662540#comment-13662540 ] Tejas Patil commented on NUTCH-1569: Hey Lewis, I took a fresh checkout of 2.x and applied that patch. I am using HBase for storage. About disabling the JUnit tests: Is there anything else apart from 'MemStore' that can be used? If not, then what we are currently doing should be fine. > Upgrade 2.x to Gora 0.3 > --- > > Key: NUTCH-1569 > URL: https://issues.apache.org/jira/browse/NUTCH-1569 > Project: Nutch > Issue Type: Improvement > Components: build, storage >Affects Versions: 2.2 >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney > Fix For: 2.2 > > Attachments: NUTCH-1569.patch, NUTCH-1569.v2.patch > > > We just released the Maven artifacts and I would like to upgrade before we > push the RC for 2.2 :) > Patch coming up
[jira] [Resolved] (NUTCH-1053) Parsing of RSS feeds fails
[ https://issues.apache.org/jira/browse/NUTCH-1053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tejas Patil resolved NUTCH-1053. Resolution: Fixed Committed to trunk (rev 1484628) and 2.x (rev 1484627). NOTE: Currently the feeds parser is not supported (and hence disabled) in 2.x. > Parsing of RSS feeds fails > --- > > Key: NUTCH-1053 > URL: https://issues.apache.org/jira/browse/NUTCH-1053 > Project: Nutch > Issue Type: Bug > Components: parser >Affects Versions: 1.4 >Reporter: Julien Nioche >Assignee: Julien Nioche > Fix For: 1.7 > > Attachments: nutch-1053.patch, NUTCH-1053.trunk.patch, seed.txt > > > See discussion on > http://lucene.472066.n3.nabble.com/RSS-feed-parsing-on-Nutch-1-3-td3166487.html > Have been able to reproduce the problem and will look into it
[jira] [Commented] (NUTCH-1569) Upgrade 2.x to Gora 0.3
[ https://issues.apache.org/jira/browse/NUTCH-1569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13662497#comment-13662497 ] Tejas Patil commented on NUTCH-1569: I have been running 2.x with this patch for the past few hours and so far have not found anything breaking. The crawl is running smoothly. Will keep you posted. > Upgrade 2.x to Gora 0.3 > --- > > Key: NUTCH-1569 > URL: https://issues.apache.org/jira/browse/NUTCH-1569 > Project: Nutch > Issue Type: Improvement > Components: build, storage >Affects Versions: 2.2 >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney > Fix For: 2.2 > > Attachments: NUTCH-1569.patch, NUTCH-1569.v2.patch > > > We just released the Maven artifacts and I would like to upgrade before we > push the RC for 2.2 :) > Patch coming up
[jira] [Commented] (NUTCH-1545) capture batchId and remove references to segments in 2.x crawl script.
[ https://issues.apache.org/jira/browse/NUTCH-1545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13662230#comment-13662230 ] Tejas Patil commented on NUTCH-1545: +1 for commit. > capture batchId and remove references to segments in 2.x crawl script. > -- > > Key: NUTCH-1545 > URL: https://issues.apache.org/jira/browse/NUTCH-1545 > Project: Nutch > Issue Type: Task >Affects Versions: 2.1 >Reporter: Lewis John McGibbney >Assignee: lufeng >Priority: Minor > Fix For: 2.2 > > Attachments: NUTCH-1545.patch, NUTCH-1545-v2.patch > > > The concept of segment is replaced by batchId in 2.x > I'm currently getting rid of segments references in 2.x > This issue was flagged up and separate from NUTCH-1532 which I am working on.
[jira] [Commented] (NUTCH-1573) Upgrade to most recent JUnit 4.x to improve test flexibility
[ https://issues.apache.org/jira/browse/NUTCH-1573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13661663#comment-13661663 ] Tejas Patil commented on NUTCH-1573: Oh... just saw your comment that you have committed it > Upgrade to most recent JUnit 4.x to improve test flexibility > > > Key: NUTCH-1573 > URL: https://issues.apache.org/jira/browse/NUTCH-1573 > Project: Nutch > Issue Type: Improvement > Components: build, test >Affects Versions: 1.6, 2.1 >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney > Fix For: 1.7, 2.2 > > Attachments: NUTCH-1573.2.x.v1.patch, NUTCH-1573.2.x.v2.patch > > > I wanted to try using the @Ignore functionality within JUnit, however I don't > think it is available in the current JUnit version we use in Nutch. We should > upgrade.
[jira] [Commented] (NUTCH-1573) Upgrade to most recent JUnit 4.x to improve test flexibility
[ https://issues.apache.org/jira/browse/NUTCH-1573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13661662#comment-13661662 ] Tejas Patil commented on NUTCH-1573: [~lewismc] great!! If only there were no homework, my life would be awesome and I could work on ASF projects whenever I wanted :( Anyway, I will verify the patch on my system and update you soon. Let's get this change into the repo today!! > Upgrade to most recent JUnit 4.x to improve test flexibility > > > Key: NUTCH-1573 > URL: https://issues.apache.org/jira/browse/NUTCH-1573 > Project: Nutch > Issue Type: Improvement > Components: build, test >Affects Versions: 1.6, 2.1 >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney > Fix For: 1.7, 2.2 > > Attachments: NUTCH-1573.2.x.v1.patch, NUTCH-1573.2.x.v2.patch > > > I wanted to try using the @Ignore functionality within JUnit, however I don't > think it is available in the current JUnit version we use in Nutch. We should > upgrade.
[jira] [Commented] (NUTCH-1573) Upgrade to most recent JUnit 4.x to improve test flexibility
[ https://issues.apache.org/jira/browse/NUTCH-1573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13661450#comment-13661450 ] Tejas Patil commented on NUTCH-1573: Hi Lewis, Quick question: Besides modifying the Ivy dependency (and then adding the @Ignore annotation for NUTCH-1569), is there anything else that needs to be done? > Upgrade to most recent JUnit 4.x to improve test flexibility > > > Key: NUTCH-1573 > URL: https://issues.apache.org/jira/browse/NUTCH-1573 > Project: Nutch > Issue Type: Improvement > Components: build, test >Affects Versions: 1.6, 2.1 >Reporter: Lewis John McGibbney > Fix For: 1.7, 2.2 > > > I wanted to try using the @Ignore functionality within JUnit, however I don't > think it is available in the current JUnit version we use in Nutch. We should > upgrade.
[jira] [Comment Edited] (NUTCH-1566) bin/nutch to allow whitespace in paths
[ https://issues.apache.org/jira/browse/NUTCH-1566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13661264#comment-13661264 ] Tejas Patil edited comment on NUTCH-1566 at 5/18/13 4:05 AM: - Hi Seb, I tried the patch on a Windows machine with Cygwin and it worked (I have not run all possible scenarios exhaustively, just tried a few). One minor suggestion: With the current patch, I see this error message (on the Cygwin console) while running Nutch in local mode: {noformat}cygpath: can't convert empty path{noformat} I figured out the responsible place (line 115) in the nutch script: {noformat}NUTCH_JOB=`cygpath -p -w "$NUTCH_JOB"`{noformat} As the NUTCH_JOB value is empty while running in local mode, it gave that error message. The if case for adjusting NUTCH_JOB at lines 113-116 in the [nutch script|http://svn.apache.org/viewvc/nutch/trunk/src/bin/nutch?view=markup] could be moved into the block just above it to address that. What do you say?
> bin/nutch to allow whitespace in paths > -- > > Key: NUTCH-1566 > URL: https://issues.apache.org/jira/browse/NUTCH-1566 > Project: Nutch > Issue Type: Bug >Affects Versions: 1.6, 2.1 >Reporter: Sebastian Nagel >Priority: Minor > Fix For: 1.7, 2.3 > > Attachments: NUTCH-1566-trunk.patch > > > bin/nutch and bin/crawl choke if a path contains white space, e.g., if > JAVA_HOME is "{{C:\Program Files\jdk}}". If you don't have the permission to > change the path it is impossible to run Nutch. This has been reported > frequently > ([1|http://stackoverflow.com/questions/9345629/nutch-cygwin-how-to-set-java-home], > > [2|http://lucene.472066.n3.nabble.com/Problem-running-Nutch-on-Win-7-Cygwin-td3487163.html], > and > [3|http://nutchinstall.blogspot.de/2007/07/setting-up-cygwin-and-nutch.html]), > see also NUTCH-19.
[jira] [Commented] (NUTCH-1566) bin/nutch to allow whitespace in paths
[ https://issues.apache.org/jira/browse/NUTCH-1566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13661264#comment-13661264 ] Tejas Patil commented on NUTCH-1566: Hi Seb, I tried the patch on a Windows machine with Cygwin and it worked (I have not run all possible scenarios exhaustively, just tried a few). One minor suggestion: With the current patch, I see this {noformat}cygpath: can't convert empty path{noformat} I figured out the responsible place (line 115) in the nutch script: {noformat}NUTCH_JOB=`cygpath -p -w "$NUTCH_JOB"`{noformat} As the NUTCH_JOB value is empty while running in local mode, it gave that error message. The if case for adjusting NUTCH_JOB at lines 113-116 in the [nutch script|http://svn.apache.org/viewvc/nutch/trunk/src/bin/nutch?view=markup] could be moved into the block just above it to address that. What do you say? > bin/nutch to allow whitespace in paths > -- > > Key: NUTCH-1566 > URL: https://issues.apache.org/jira/browse/NUTCH-1566 > Project: Nutch > Issue Type: Bug >Affects Versions: 1.6, 2.1 >Reporter: Sebastian Nagel >Priority: Minor > Fix For: 1.7, 2.3 > > Attachments: NUTCH-1566-trunk.patch > > > bin/nutch and bin/crawl choke if a path contains white space, e.g., if > JAVA_HOME is "{{C:\Program Files\jdk}}". If you don't have the permission to > change the path it is impossible to run Nutch. This has been reported > frequently > ([1|http://stackoverflow.com/questions/9345629/nutch-cygwin-how-to-set-java-home], > > [2|http://lucene.472066.n3.nabble.com/Problem-running-Nutch-on-Win-7-Cygwin-td3487163.html], > and > [3|http://nutchinstall.blogspot.de/2007/07/setting-up-cygwin-and-nutch.html]), > see also NUTCH-19.