[jira] [Commented] (NUTCH-1934) Refactor Fetcher in trunk
[ https://issues.apache.org/jira/browse/NUTCH-1934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14503663#comment-14503663 ]

Lewis John McGibbney commented on NUTCH-1934:
---------------------------------------------

Anyone able to take this for a spin, or even to verify whether it still applies against trunk? It is a non-trivial patch, but one which makes the Fetcher much easier for us all to work with if we get the refactoring right. Thanks

> Refactor Fetcher in trunk
> -------------------------
>
>                 Key: NUTCH-1934
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1934
>             Project: Nutch
>          Issue Type: Improvement
>          Components: fetcher
>    Affects Versions: 1.10
>            Reporter: Lewis John McGibbney
>            Assignee: Lewis John McGibbney
>              Labels: memex
>             Fix For: 1.11
>         Attachments: NUTCH-1934-trunkv2.patch, NUTCH-1934.patch
>
> Put simply, [Fetcher|https://github.com/apache/nutch/blob/trunk/src/java/org/apache/nutch/fetcher/Fetcher.java] is too big. This is kinda strange, as this file is unique in its size (I think) among the classes within Nutch. The others are reasonably well modularized and split into constituent classes which make sense.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Updated] (NUTCH-1990) Use URI.normalise() in BasicURLNormalizer
[ https://issues.apache.org/jira/browse/NUTCH-1990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Nagel updated NUTCH-1990:
-----------------------------------
    Attachment: NUTCH-1990-v1.patch

Uuuh, a lot of garbage :( I've also run the test after adding a main() method to BasicURLNormalizer:
* Found another bug in the current version: http://107jamz.com/registration/?referer=http://107jamz.com loses the double slash in the query part. That's because the slash and dot segment normalization is currently run on the part returned by url.getFile(); it should be run only on the part returned by getPath(). This is fixed by the new version.
* The trial is 50% slower on Julien's test set. That's expected, because only a small fraction of the URLs contain paths with dot segments or double slashes.
* After adding a check to avoid needless work, it's as fast as before (maybe slightly faster): 0:49.78 (before), 1:03.11 (trial), 0:45.49 (patch v1).

> Use URI.normalise() in BasicURLNormalizer
> -----------------------------------------
>
>                 Key: NUTCH-1990
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1990
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 1.9
>            Reporter: Julien Nioche
>            Assignee: Julien Nioche
>         Attachments: NUTCH-1990-trial1.patch, NUTCH-1990-v1.patch
>
> One of the things that [BasicURLNormalizer|https://github.com/apache/nutch/blob/trunk/src/plugin/urlnormalizer-basic/src/java/org/apache/nutch/net/urlnormalizer/basic/BasicURLNormalizer.java] does is remove unnecessary dot segments in the path. Instead of implementing the logic ourselves with some antiquated regex library, we should simply use [http://docs.oracle.com/javase/7/docs/api/java/net/URI.html#normalize()], which does the same and is probably more efficient.
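The getFile()/getPath() distinction and the behaviour of URI.normalize() can be checked with plain JDK calls. A minimal sketch (the 107jamz.com URL is the one from the comment above; the class name and the example.com URL are illustrative):

```java
import java.net.URI;
import java.net.URL;

public class NormalizeDemo {
    public static void main(String[] args) throws Exception {
        // URI.normalize() removes dot segments ("." and "..") from the
        // path only; the query string is left untouched.
        URI dotty = new URI("http://example.com/a/./b/../c");
        System.out.println(dotty.normalize()); // http://example.com/a/c

        // URL.getFile() returns path + query, URL.getPath() only the path.
        // Running slash/dot cleanup on getFile() would also mangle a query
        // like ?referer=http://107jamz.com, which is the reported bug.
        URL u = new URL("http://107jamz.com/registration/?referer=http://107jamz.com");
        System.out.println(u.getPath()); // /registration/
        System.out.println(u.getFile()); // /registration/?referer=http://107jamz.com
    }
}
```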
[jira] [Commented] (NUTCH-1934) Refactor Fetcher in trunk
[ https://issues.apache.org/jira/browse/NUTCH-1934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14503758#comment-14503758 ]

Lewis John McGibbney commented on NUTCH-1934:
---------------------------------------------

Thanks [~mjoyce], this is a big help in determining whether this applies against trunk. If it is ripe for testing and eval, then hopefully more people can chime in before too many patches make it into trunk Fetcher and I need to rebase again.
[jira] [Commented] (NUTCH-1934) Refactor Fetcher in trunk
[ https://issues.apache.org/jira/browse/NUTCH-1934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14503795#comment-14503795 ]

Chris A. Mattmann commented on NUTCH-1934:
------------------------------------------

+1 to commit if it applies cleanly and tests pass.
[jira] [Commented] (NUTCH-1934) Refactor Fetcher in trunk
[ https://issues.apache.org/jira/browse/NUTCH-1934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14503866#comment-14503866 ]

Lewis John McGibbney commented on NUTCH-1934:
---------------------------------------------

This patch really needs to be tested thoroughly. It's a major refactoring of a 1000-line Java file which we all know as trunk Fetcher. Although no existing functionality has changed... I believe I've now implemented some method calls as static, so we need to make sure this is OK.

--
*Lewis*
[jira] [Commented] (NUTCH-1934) Refactor Fetcher in trunk
[ https://issues.apache.org/jira/browse/NUTCH-1934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14503904#comment-14503904 ]

Jorge Luis Betancourt Gonzalez commented on NUTCH-1934:
-------------------------------------------------------

+1 to [~chrismattmann]'s comment. If the tests pass without any problem, I think we can commit and do some more testing; the basic test that covers the monolithic fetcher right now is a great starting point, and of course take it for a spin :) I plan on taking some time to prepare a midsize crawl before/after the commit if it helps.
[jira] [Commented] (NUTCH-1934) Refactor Fetcher in trunk
[ https://issues.apache.org/jira/browse/NUTCH-1934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14503746#comment-14503746 ]

Michael Joyce commented on NUTCH-1934:
--------------------------------------

Hey [~lewismc], the patch applied cleanly to trunk for me, and a simple crawl over one site worked just fine. Couldn't run the tests, unfortunately, since I seem to have some config problem locally, but hopefully that's a start at least.
[jira] [Commented] (NUTCH-1934) Refactor Fetcher in trunk
[ https://issues.apache.org/jira/browse/NUTCH-1934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14504006#comment-14504006 ]

Lewis John McGibbney commented on NUTCH-1934:
---------------------------------------------

+1 on that sentiment. Will commit tomorrow to allow EU folks to wake up.

--
*Lewis*
[jira] [Commented] (NUTCH-1934) Refactor Fetcher in trunk
[ https://issues.apache.org/jira/browse/NUTCH-1934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14503727#comment-14503727 ]

Michael Joyce commented on NUTCH-1934:
--------------------------------------

One sec Lewis and I'll take a quick scope.
[jira] [Commented] (NUTCH-1934) Refactor Fetcher in trunk
[ https://issues.apache.org/jira/browse/NUTCH-1934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14503882#comment-14503882 ]

Chris A. Mattmann commented on NUTCH-1934:
------------------------------------------

Well, my point on this is: you can keep this as a patch and spend the effort to keep a 1000-line Java file up to date with trunk, or you can risk that you broke something in trunk but make fixing that 10x easier by having it committed. Your call :)
[jira] [Resolved] (NUTCH-1987) Make bin/crawl indexer agnostic
[ https://issues.apache.org/jira/browse/NUTCH-1987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann resolved NUTCH-1987.
--------------------------------------
    Resolution: Fixed

Thanks [~jo...@apache.org], appreciate it! Thanks Seb for the review!

{noformat}
[chipotle:~/tmp/nutch-1.10-trunk] mattmann% svn commit -m "Fix for NUTCH-1987 - Make bin/crawl indexer agnostic contributed by Michael Joyce mltjo...@gmail.com this closes #18."
Sending        CHANGES.txt
Sending        conf/nutch-default.xml
Sending        src/bin/crawl
Transmitting file data ...
Committed revision 1675022.
[chipotle:~/tmp/nutch-1.10-trunk] mattmann%
{noformat}

> Make bin/crawl indexer agnostic
> -------------------------------
>
>                 Key: NUTCH-1987
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1987
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 1.9
>            Reporter: Michael Joyce
>            Assignee: Chris A. Mattmann
>              Labels: memex
>             Fix For: 1.10
>
> The crawl script makes it a bit challenging to use an indexer that isn't Solr. For instance, when I want to use the indexer-elastic plugin, I still need to call the crawl script with a fake Solr URL, otherwise it will skip the indexing step altogether.
> {code}
> bin/crawl urls/ crawl/ "http://fakeurl.com:9200" 1
> {code}
> It would be nice to keep configuration for the Solr indexer in the conf files (mirroring the Elasticsearch indexer conf and others) and to make the indexing parameter simply toggle whether indexing does or doesn't occur, instead of also trying to configure the indexer at the same time.
[jira] [Commented] (NUTCH-1987) Make bin/crawl indexer agnostic
[ https://issues.apache.org/jira/browse/NUTCH-1987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14504195#comment-14504195 ]

ASF GitHub Bot commented on NUTCH-1987:
---------------------------------------

Github user asfgit closed the pull request at:

    https://github.com/apache/nutch/pull/18
[jira] [Commented] (NUTCH-1987) Make bin/crawl indexer agnostic
[ https://issues.apache.org/jira/browse/NUTCH-1987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14504190#comment-14504190 ]

Chris A. Mattmann commented on NUTCH-1987:
------------------------------------------

Thanks Mike, this looks good to me. I'll commit this shortly; thanks for resolving Seb's comments.
[jira] [Commented] (NUTCH-1987) Make bin/crawl indexer agnostic
[ https://issues.apache.org/jira/browse/NUTCH-1987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14504282#comment-14504282 ]

Hudson commented on NUTCH-1987:
-------------------------------

SUCCESS: Integrated in Nutch-trunk #3074 (See [https://builds.apache.org/job/Nutch-trunk/3074/])
Fix for NUTCH-1987 - Make bin/crawl indexer agnostic contributed by Michael Joyce mltjo...@gmail.com this closes #18. (mattmann: http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1675022)
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/conf/nutch-default.xml
* /nutch/trunk/src/bin/crawl
[GitHub] nutch pull request: NUTCH-1987 - Make bin/crawl indexer agnostic
Github user asfgit closed the pull request at: https://github.com/apache/nutch/pull/18 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[jira] [Assigned] (NUTCH-1987) Make bin/crawl indexer agnostic
[ https://issues.apache.org/jira/browse/NUTCH-1987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann reassigned NUTCH-1987:
----------------------------------------
    Assignee: Chris A. Mattmann
[jira] [Work started] (NUTCH-1987) Make bin/crawl indexer agnostic
[ https://issues.apache.org/jira/browse/NUTCH-1987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Work on NUTCH-1987 started by Chris A. Mattmann.
[jira] [Commented] (NUTCH-1990) Use URI.normalise() in BasicURLNormalizer
[ https://issues.apache.org/jira/browse/NUTCH-1990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14503072#comment-14503072 ]

Julien Nioche commented on NUTCH-1990:
--------------------------------------

Thanks [~wastl-nagel]! I have extracted 3332418 URLs from a random segment of CommonCrawl (CC-MAIN-20150226074059-0-ip-10-28-5-156.ec2.internal.warc.gz) and parsed it with JSoup. The URLs are meant to be absolute but contain a lot of garbage, so it is as real-life as can be. I tested the impact of your patch by injecting these URLs. We get the same number of URLs post-normalisation, and it seems to take the same amount of time:

{code}
Injector: Total number of urls rejected by filters: 886704
Injector: Total number of urls after normalization: 2445715
Injector: Total new urls injected: 2445715
Injector: finished at 2015-04-20 16:31:30, elapsed: 00:00:59
{code}

Note that the figures above were obtained by removing the patterns for the regex-based normalisation, as well as commenting out

{code}
# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/
{code}

in regex-urlfilter.txt, as these operations take most of the time. The processing time when leaving these files in their default form is 08:23, which confirms that even if the code modified by your patch were a bit slower (which is not the case), it would be irrelevant compared to the overall time spent normalizing and filtering. See the related discussion in Storm-Crawler [https://github.com/DigitalPebble/storm-crawler/issues/120]. Later on we might want to have some basic normalization code in Crawler-Commons, in which case Nutch could leverage it, but for now I think this patch should be committed. The list of URLs used for these tests can be downloaded from [https://drive.google.com/open?id=0B4ebzXTbUoiAY0hXNjUtdnJGN3M&authuser=0], just in case someone wants to reproduce the steps.
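The repeated-segment rule quoted above can be exercised directly with java.util.regex. A small sketch (the leading '-' in the filter file only marks the rule as an exclusion in Nutch's filter syntax, so it is dropped here; the URLs are made up):

```java
import java.util.regex.Pattern;

public class RepeatedSegmentFilter {
    public static void main(String[] args) {
        // A slash-delimited segment that occurs 3+ times, interleaved with
        // other segments, usually signals a crawl loop.
        Pattern loop = Pattern.compile(".*(/[^/]+)/[^/]+\\1/[^/]+\\1/.*");
        System.out.println(loop.matcher("http://ex.com/a/b/a/c/a/d").matches()); // true
        System.out.println(loop.matcher("http://ex.com/a/b/c/d").matches());     // false
    }
}
```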
[jira] [Commented] (NUTCH-1987) Make bin/crawl indexer agnostic
[ https://issues.apache.org/jira/browse/NUTCH-1987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14503446#comment-14503446 ]

Michael Joyce commented on NUTCH-1987:
--------------------------------------

Hi folks, the PR has been updated with the requested changes. If you have any questions or think anything else needs changing, let me know.
[jira] [Commented] (NUTCH-1989) Handling invalid URLs in CommonCrawlDataDumper
[ https://issues.apache.org/jira/browse/NUTCH-1989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14503486#comment-14503486 ]

Lewis John McGibbney commented on NUTCH-1989:
---------------------------------------------

Hi [~gostep]

bq. The tool logs a warning message if an invalid URL is detected. I am just wondering if we can perform a specific action if invalid URLs occur. We could skip invalid URLs but I notice that also the following URLs are detected as invalid:

So basically, although we filter out the clearly invalid URLs, we also seem to filter out valid URLs. We need to work towards a better solution.

bq. I would be very pleased to get your feedback on action to perform when invalid URLs are detected, avoiding to drop off data and break the naming schema if -epochFilename option is used.

A number of issues here; let's take them in the following order:
* Action to perform when invalid URLs are detected: try the same as we do in Generator, https://github.com/apache/nutch/blob/trunk/src/java/org/apache/nutch/crawl/Generator.java#L276-L280, e.g. just use a counter and log them as invalid.
* Avoiding dropping data and breaking the naming schema when the -epochFilename option is used: some of the above URLs really are not valid, e.g. http://arstechnica.com/gaming/2015/04/mortal-kombat-x-charges-players-for-easy-fatalities/\/\/ars.to\/1aPaqvW (note the two \/\/ towards the end of the URL).

> Handling invalid URLs in CommonCrawlDataDumper
> ----------------------------------------------
>
>                 Key: NUTCH-1989
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1989
>             Project: Nutch
>          Issue Type: Improvement
>          Components: tool
>            Reporter: Giuseppe Totaro
>            Assignee: Chris A. Mattmann
>            Priority: Minor
>              Labels: memex
>             Fix For: 1.10
>         Attachments: NUTCH-1989.patch
>
> Hi all, running the {{CommonCrawlDataDumper}} tool ({{bin/nutch commoncrawldump}}) with the new options (as described in [NUTCH-1975|https://issues.apache.org/jira/browse/NUTCH-1975]), I noticed there are some problems if an invalid URL is detected.
> For example, the following URLs (that I found in crawled data) break the naming schema provided by the {{-epochFilename}} command-line option:
> * http://www/
> * http:/
> More in detail, using the {{-epochFilename}} option, extracted files are organized in a reversed-DNS tree based on the FQDN of the webpage, followed by a SHA1 hash of the complete URL. When the tool encounters URLs such as the above, it is not able to build the reversed-DNS tree. You can find attached a simple patch for detecting invalid URLs. The patch uses the [Apache Commons Validator|http://commons.apache.org/proper/commons-validator/] API to detect invalid URLs:
> {code}
> UrlValidator urlValidator = new UrlValidator();
> if (!urlValidator.isValid(url)) {
>   LOG.warn("Not valid URL detected: " + url);
> }
> {code}
> The tool logs a warning message if an invalid URL is detected. I am just wondering if we can perform a specific action if invalid URLs occur. We could skip invalid URLs, but I notice that also the following URLs are detected as invalid:
> {noformat}
> 2015-04-15 13:49:40,386 WARN tools.CommonCrawlDataDumper - Not valid URL detected: http://www.reddit.com/r/agora/comments/22ezoa/how_to_buy_drugs_on_agora_hur_man_köper_droger_på/
> 2015-04-15 13:49:41,603 WARN tools.CommonCrawlDataDumper - Not valid URL detected: http://www/
> 2015-04-15 13:49:41,632 WARN tools.CommonCrawlDataDumper - Not valid URL detected: http:/
> 2015-04-15 13:49:44,601 WARN tools.CommonCrawlDataDumper - Not valid URL detected: http://allthingsvice.com/2012/05/30/the-great-420-scam/\/\/allthingsvice.com\/2012\/05\/30\/the-great-420-scam\/
> 2015-04-15 13:50:34,821 WARN tools.CommonCrawlDataDumper - Not valid URL detected: http://www.reddit.com/r/agora/comments/22ezoa/how_to_buy_drugs_on_agora_hur_man_köper_droger_på/
> 2015-04-15 13:50:35,847 WARN tools.CommonCrawlDataDumper - Not valid URL detected: http://www/
> 2015-04-15 13:50:35,866 WARN tools.CommonCrawlDataDumper - Not valid URL detected: http:/
> 2015-04-15 13:50:38,605 WARN tools.CommonCrawlDataDumper - Not valid URL detected: http://allthingsvice.com/2012/05/30/the-great-420-scam/\/\/allthingsvice.com\/2012\/05\/30\/the-great-420-scam\/
> 2015-04-15 13:51:20,013 WARN tools.CommonCrawlDataDumper - Not valid URL detected: http://antilop.cc/sr/users/nomad bloodbath
> 2015-04-15 13:51:20,499 WARN tools.CommonCrawlDataDumper - Not valid URL detected: http://arstechnica.com/gaming/2015/04/mortal-kombat-x-charges-players-for-easy-fatalities/\/\/ars.to\/1aPaqvW
> 2015-04-15 13:51:20,500 WARN tools.CommonCrawlDataDumper - Not valid URL detected: http://arstechnica.com/gaming/2015/04/mortal-kombat-x-charges-players-for-easy-fatalities/\/\/arstechnica.com
> 2015-04-15 13:51:20,500 WARN
> {noformat}
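The reversed-DNS-plus-SHA1 naming scheme the description refers to can be sketched with plain JDK calls (the helper names and example URLs are hypothetical; the real implementation lives in CommonCrawlDataDumper):

```java
import java.net.URL;
import java.security.MessageDigest;

public class EpochFilenameSketch {
    // Reverse a hostname into a path, e.g. "www.example.com" -> "com/example/www".
    static String reverseDns(String host) {
        String[] parts = host.split("\\.");
        StringBuilder sb = new StringBuilder();
        for (int i = parts.length - 1; i >= 0; i--) {
            if (sb.length() > 0) sb.append('/');
            sb.append(parts[i]);
        }
        return sb.toString();
    }

    // SHA-1 hex digest of the complete URL.
    static String sha1Hex(String s) throws Exception {
        byte[] d = MessageDigest.getInstance("SHA-1").digest(s.getBytes("UTF-8"));
        StringBuilder sb = new StringBuilder();
        for (byte b : d) sb.append(String.format("%02x", b));
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        String url = "http://www.example.com/page";
        String host = new URL(url).getHost();                 // "www.example.com"
        System.out.println(reverseDns(host) + "/" + sha1Hex(url));

        // A degenerate URL like "http://www/" yields the single-segment
        // host "www", so no meaningful reversed-DNS tree can be built.
        System.out.println(reverseDns(new URL("http://www/").getHost())); // www
    }
}
```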
[jira] [Commented] (NUTCH-1989) Handling invalid URLs in CommonCrawlDataDumper
[ https://issues.apache.org/jira/browse/NUTCH-1989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14503500#comment-14503500 ]

Giuseppe Totaro commented on NUTCH-1989:
----------------------------------------

Hi [~lewismc], thanks a lot for supporting me on this work. In the patch, when an invalid URL is detected it is not filtered out; the tool just generates a warning log message. I totally agree with you that we need to work towards a better solution. Thanks a lot for your great suggestion. I will add a counter for invalid URLs and will post an update about that. Thanks, Giuseppe
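The Generator-style count-and-skip approach Lewis suggests could look like the sketch below. This uses java.net.URI parsing as a stand-in validity check (not the commons-validator UrlValidator the patch uses, which is stricter: UrlValidator also rejects "http://www/", while URI happily parses it); class and variable names are illustrative:

```java
import java.net.URI;
import java.net.URISyntaxException;

public class InvalidUrlCounter {
    // Stand-in validity check: the URL must parse as a URI and have a host.
    static boolean parses(String url) {
        try {
            return new URI(url).getHost() != null;
        } catch (URISyntaxException e) {
            return false; // spaces, backslashes, etc. fail here
        }
    }

    public static void main(String[] args) {
        String[] urls = {
            "http://www.reddit.com/r/agora/",
            "http://www/",                               // parses, host "www"
            "http:/",                                    // parses, but no host
            "http://antilop.cc/sr/users/nomad bloodbath" // space: syntax error
        };
        int invalid = 0;
        for (String u : urls) {
            if (!parses(u)) invalid++; // count and skip, as in Generator
        }
        System.out.println("Invalid URLs skipped: " + invalid); // Invalid URLs skipped: 2
    }
}
```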