[jira] [Commented] (NUTCH-1934) Refactor Fetcher in trunk

2015-04-20 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14503663#comment-14503663
 ] 

Lewis John McGibbney commented on NUTCH-1934:
-

Is anyone able to take this for a spin, or even just verify that it still 
applies against trunk? It is a non-trivial patch, but one which makes the 
Fetcher much easier for us all to work with if we get the refactoring correct. 
Thanks

 Refactor Fetcher in trunk
 -

 Key: NUTCH-1934
 URL: https://issues.apache.org/jira/browse/NUTCH-1934
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
Affects Versions: 1.10
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
  Labels: memex
 Fix For: 1.11

 Attachments: NUTCH-1934-trunkv2.patch, NUTCH-1934.patch


 Put simply, 
 [Fetcher|https://github.com/apache/nutch/blob/trunk/src/java/org/apache/nutch/fetcher/Fetcher.java]
  is too big.
 This is kinda strange, as the size of this file is (I think) unlike that of any 
 other class within Nutch. The others are reasonably well modularized and 
 split into constituent classes which make sense.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (NUTCH-1990) Use URI.normalise() in BasicURLNormalizer

2015-04-20 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-1990:
---
Attachment: NUTCH-1990-v1.patch

Uuuh, a lot of garbage :(  I've also run the test after giving 
BasicURLNormalizer a main() method:
* found another bug in the current version: 
http://107jamz.com/registration/?referer=http://107jamz.com loses the double 
slash in the query part. That's because the slash and dot segment 
normalization is currently run on the part returned by url.getFile(). It should 
be run only on the part returned by getPath(). But that's fixed by the new version.
* the trial is 50% slower using Julien's test set. But that's expected: the 
trial normalizes every URL, although only a small fraction of the URLs contain 
paths with dot segments or double slashes.
* but after a check is added to avoid needless work, it's as fast as previously 
(maybe even slightly faster): 0:49.78 (before), 1:03.11 (trial), 0:45.49 (patch v1)
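
To illustrate the getFile() vs. getPath() point and the "avoid needless work" check, here is a minimal sketch. It is not the code from the attached patch; the class name and the needsPathNormalization helper are made up for illustration, and the check is deliberately conservative (it may say "yes" for paths that turn out to need no change).

```java
import java.net.MalformedURLException;
import java.net.URL;

class PathVsFileDemo {

  /**
   * Hypothetical cheap pre-check: only paths containing a double slash or a
   * segment starting with a dot can be affected by slash/dot-segment cleanup.
   */
  static boolean needsPathNormalization(String path) {
    return path.contains("//") || path.contains("/.");
  }

  public static void main(String[] args) throws MalformedURLException {
    URL url = new URL("http://107jamz.com/registration/?referer=http://107jamz.com");
    // getFile() is the path plus the query, so it includes the "//" of the referer value.
    System.out.println(url.getFile());
    // getPath() stops before the '?', so the query part is never touched.
    System.out.println(url.getPath());
    // The path itself has nothing to normalize, so the expensive step can be skipped.
    System.out.println(needsPathNormalization(url.getPath()));
  }
}
```

Running the slash and dot-segment cleanup only when such a check says so would match the observation above that most URLs do not need it.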


 Use URI.normalise() in BasicURLNormalizer
 -

 Key: NUTCH-1990
 URL: https://issues.apache.org/jira/browse/NUTCH-1990
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.9
Reporter: Julien Nioche
Assignee: Julien Nioche
 Attachments: NUTCH-1990-trial1.patch, NUTCH-1990-v1.patch


 One of the things that 
 [BasicURLNormalizer|https://github.com/apache/nutch/blob/trunk/src/plugin/urlnormalizer-basic/src/java/org/apache/nutch/net/urlnormalizer/basic/BasicURLNormalizer.java]
  does is to remove unnecessary dot segments in the path.
 Instead of implementing the logic ourselves with some antiquated regex 
 library, we should simply use 
 [http://docs.oracle.com/javase/7/docs/api/java/net/URI.html#normalize()] 
 which does the same and is probably more efficient.
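
 For reference, a minimal sketch of what java.net.URI.normalize() does with dot 
 segments. The class and helper names are made up for illustration, and the 
 input is assumed to be a well-formed URI already; real crawl data would need 
 error handling around URI.create().

```java
import java.net.URI;

class UriNormalizeDemo {

  /** Removes "." and ".." segments from the path of a well-formed URI string. */
  static String normalizeDotSegments(String uri) {
    // URI.create throws IllegalArgumentException if the string is not a valid URI.
    return URI.create(uri).normalize().toString();
  }

  public static void main(String[] args) {
    // Dot segments in the path are resolved.
    System.out.println(normalizeDotSegments("http://example.com/a/./b/../c/index.html"));
    // Only the path is normalized; the query is left as it is.
    System.out.println(normalizeDotSegments("http://example.com/a/b/?q=x/../y"));
  }
}
```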





[jira] [Commented] (NUTCH-1934) Refactor Fetcher in trunk

2015-04-20 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14503758#comment-14503758
 ] 

Lewis John McGibbney commented on NUTCH-1934:
-

Thanks [~mjoyce], this is a big help in determining whether this applies 
against trunk. 
If it is ripe for testing and eval then hopefully more people can chime in 
before too many patches make it into trunk Fetcher and I need to rebase again.



[jira] [Commented] (NUTCH-1934) Refactor Fetcher in trunk

2015-04-20 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14503795#comment-14503795
 ] 

Chris A. Mattmann commented on NUTCH-1934:
--

+1 to commit if it applies cleanly and tests pass.



[jira] [Commented] (NUTCH-1934) Refactor Fetcher in trunk

2015-04-20 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14503866#comment-14503866
 ] 

Lewis John McGibbney commented on NUTCH-1934:
-

This patch really needs to be tested thoroughly.
It's a major refactoring of a 1000-line Java file which we all know as
trunk Fetcher.
Although no existing functionality has changed, I believe I've now
made some methods static, so we need to make sure this is OK.




-- 
*Lewis*




[jira] [Commented] (NUTCH-1934) Refactor Fetcher in trunk

2015-04-20 Thread Jorge Luis Betancourt Gonzalez (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14503904#comment-14503904
 ] 

Jorge Luis Betancourt Gonzalez commented on NUTCH-1934:
---

+1 to [~chrismattmann]'s comment. 

If the tests pass without any problem I think we can commit and do some more 
testing. The basic test that covers the monolithic fetcher right now is a great 
starting point, and of course we should take it for a spin :) I plan on taking 
some time to prepare a midsize crawl before/after the commit if it helps.



[jira] [Commented] (NUTCH-1934) Refactor Fetcher in trunk

2015-04-20 Thread Michael Joyce (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14503746#comment-14503746
 ] 

Michael Joyce commented on NUTCH-1934:
--

Hey [~lewismc], 

Patch applied cleanly to trunk for me and a simple crawl over one site worked 
just fine. Couldn't run the tests unfortunately since I seem to have some config 
problem locally, but hopefully that's a start at least.



[jira] [Commented] (NUTCH-1934) Refactor Fetcher in trunk

2015-04-20 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14504006#comment-14504006
 ] 

Lewis John McGibbney commented on NUTCH-1934:
-

+1 on that sentiment
Will commit tomorrow to allow EU folks to wake up


-- 
*Lewis*




[jira] [Commented] (NUTCH-1934) Refactor Fetcher in trunk

2015-04-20 Thread Michael Joyce (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14503727#comment-14503727
 ] 

Michael Joyce commented on NUTCH-1934:
--

One sec Lewis and I'll take a quick scope.



[jira] [Commented] (NUTCH-1934) Refactor Fetcher in trunk

2015-04-20 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14503882#comment-14503882
 ] 

Chris A. Mattmann commented on NUTCH-1934:
--

well my point on this is: you can keep this as a patch and spend the effort to 
keep a 1000-line Java file up to date with trunk, or you can risk having broken 
something in trunk but make the fixes 10x easier by having it committed. 
Your call :)



[jira] [Resolved] (NUTCH-1987) Make bin/crawl indexer agnostic

2015-04-20 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann resolved NUTCH-1987.
--
Resolution: Fixed

Thanks [~jo...@apache.org] Appreciate it! Thanks Seb for the review!

{noformat}
[chipotle:~/tmp/nutch-1.10-trunk] mattmann% svn commit -m "Fix for NUTCH-1987 - 
Make bin/crawl indexer agnostic contributed by Michael Joyce 
mltjo...@gmail.com this closes #18."
Sending        CHANGES.txt
Sending        conf/nutch-default.xml
Sending        src/bin/crawl
Transmitting file data ...
Committed revision 1675022.
[chipotle:~/tmp/nutch-1.10-trunk] mattmann% 
{noformat}


 Make bin/crawl indexer agnostic
 ---

 Key: NUTCH-1987
 URL: https://issues.apache.org/jira/browse/NUTCH-1987
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.9
Reporter: Michael Joyce
Assignee: Chris A. Mattmann
  Labels: memex
 Fix For: 1.10


 The crawl script makes it a bit challenging to use an indexer that isn't 
 Solr. For instance, when I want to use the indexer-elastic plugin I still 
 need to call the crawl script with a fake Solr URL, otherwise it will skip 
 the indexing step altogether.
 {code}
 bin/crawl urls/ crawl/ http://fakeurl.com:9200 1
 {code}
 It would be nice to keep configuration for the Solr indexer in the conf files 
 (to mirror the elastic search indexer conf and others) and to make the 
 indexing parameter simply toggle whether indexing does or doesn't occur 
 instead of also trying to configure the indexer at the same time.





[jira] [Commented] (NUTCH-1987) Make bin/crawl indexer agnostic

2015-04-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14504195#comment-14504195
 ] 

ASF GitHub Bot commented on NUTCH-1987:
---

Github user asfgit closed the pull request at:

https://github.com/apache/nutch/pull/18




[jira] [Commented] (NUTCH-1987) Make bin/crawl indexer agnostic

2015-04-20 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14504190#comment-14504190
 ] 

Chris A. Mattmann commented on NUTCH-1987:
--

Thanks Mike, this looks good to me. I'll commit this shortly thanks for 
resolving Seb's comments.



[jira] [Commented] (NUTCH-1987) Make bin/crawl indexer agnostic

2015-04-20 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14504282#comment-14504282
 ] 

Hudson commented on NUTCH-1987:
---

SUCCESS: Integrated in Nutch-trunk #3074 (See 
[https://builds.apache.org/job/Nutch-trunk/3074/])
Fix for NUTCH-1987 - Make bin/crawl indexer agnostic contributed by Michael 
Joyce mltjo...@gmail.com this closes #18. (mattmann: 
http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1675022)
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/conf/nutch-default.xml
* /nutch/trunk/src/bin/crawl




[GitHub] nutch pull request: NUTCH-1987 - Make bin/crawl indexer agnostic

2015-04-20 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/nutch/pull/18


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[jira] [Assigned] (NUTCH-1987) Make bin/crawl indexer agnostic

2015-04-20 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann reassigned NUTCH-1987:


Assignee: Chris A. Mattmann



[jira] [Work started] (NUTCH-1987) Make bin/crawl indexer agnostic

2015-04-20 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on NUTCH-1987 started by Chris A. Mattmann.



[jira] [Commented] (NUTCH-1990) Use URI.normalise() in BasicURLNormalizer

2015-04-20 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14503072#comment-14503072
 ] 

Julien Nioche commented on NUTCH-1990:
--

Thanks [~wastl-nagel]! 

I have extracted 3332418 URLs from a random segment of CommonCrawl 
(CC-MAIN-20150226074059-0-ip-10-28-5-156.ec2.internal.warc.gz). I parsed it 
with JSoup; the URLs are meant to be absolute but contain a lot of garbage, so 
it is as real-life as can be.

I tested the impact of your patch by injecting these URLs. We are getting the 
same number of URLs post-normalisation and it seems to take the same amount of 
time.

{code}
Injector: Total number of urls rejected by filters: 886704
Injector: Total number of urls after normalization: 2445715
Injector: Total new urls injected: 2445715
Injector: finished at 2015-04-20 16:31:30, elapsed: 00:00:59
{code}

Note that the figures above were obtained by removing the patterns for the 
regex-based normalisation as well as commenting out

{code}
# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/
{code}

in regex-urlfilter.txt as these operations take most of the time. The 
processing time when leaving these files in their default form is 08:23, which 
confirms that even if the code modified by your patch was a bit slower (which 
is not the case) it would be irrelevant compared to the overall time spent 
normalizing and filtering.
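
To see what the rule quoted above matches, here is a small sketch using plain java.util.regex. It is only an illustration: the class and method names are made up, and it assumes the filter applies the expression with find() semantics, which I have not checked against the urlfilter-regex plugin here.

```java
import java.util.regex.Pattern;

class RepeatedSegmentDemo {

  // Same expression as the rule quoted above, without the leading '-' that
  // marks it as an exclusion rule in regex-urlfilter.txt.
  static final Pattern REPEATED_SEGMENT =
      Pattern.compile(".*(/[^/]+)/[^/]+\\1/[^/]+\\1/");

  /** True if the URL contains a segment that repeats three times, each time one segment apart. */
  static boolean looksLikeLoop(String url) {
    return REPEATED_SEGMENT.matcher(url).find();
  }

  public static void main(String[] args) {
    // "/foo" occurs three times with one other segment in between each time.
    System.out.println(looksLikeLoop("http://example.com/foo/x/foo/y/foo/"));
    // No segment repeats here.
    System.out.println(looksLikeLoop("http://example.com/a/b/c/d/e/"));
  }
}
```

The combination of a leading .* and backreferences means the regex engine may have to try many starting points per URL, which is a plausible reason for this rule being expensive on millions of URLs.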

See the related discussion in Storm-Crawler 
[https://github.com/DigitalPebble/storm-crawler/issues/120].

Later on we might want to have some basic normalization code in 
Crawler-Commons, in which case Nutch could leverage it but for now I think this 
patch should be committed.

The list of URLs used for these tests can be downloaded from 
[https://drive.google.com/open?id=0B4ebzXTbUoiAY0hXNjUtdnJGN3M&authuser=0], 
just in case someone wants to reproduce the steps. 









[jira] [Commented] (NUTCH-1987) Make bin/crawl indexer agnostic

2015-04-20 Thread Michael Joyce (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14503446#comment-14503446
 ] 

Michael Joyce commented on NUTCH-1987:
--

Hi folks, PR has been updated with the requested changes. If you have any 
questions or think anything else needs changing let me know.



[jira] [Commented] (NUTCH-1989) Handling invalid URLs in CommonCrawlDataDumper

2015-04-20 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14503486#comment-14503486
 ] 

Lewis John McGibbney commented on NUTCH-1989:
-

Hi [~gostep]
bq. The tool logs a warning message if an invalid URL is detected. I am just 
wondering if we can perform a specific action if invalid URLs occur. We could 
skip invalid URLs but I notice that also the following URLs are detected as 
invalid:
So basically, although we filter out the clearly invalid URLs, we also seem to 
filter out valid URLs. We need to work towards a better solution.
bq. I would be very pleased to get your feedback on action to perform when 
invalid URLs are detected, avoiding to drop off data and break the naming 
schema if -epochFilename option is used.
There are a number of issues here; let's take them in the following order:
 * action to perform when invalid URLs are detected - try the same as we do in 
generator 
https://github.com/apache/nutch/blob/trunk/src/java/org/apache/nutch/crawl/Generator.java#L276-L280
 e.g. just use a counter and log them as invalid
 * avoiding to drop off data and break the naming schema if -epochFilename 
option is used - some of the above URLs are not valid e.g. 
http://arstechnica.com/gaming/2015/04/mortal-kombat-x-charges-players-for-easy-fatalities/\/\/ars.to\/1aPaqvW
 note the \/\/ towards the end of the URL.
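
A rough sketch of the "count and log" idea follows. Everything in it is an assumption made for illustration: the class and method names are invented, the validity check uses java.net.URI plus a dotted-host test as a stand-in for the Commons Validator call in the patch, and a plain long field stands in for whatever counter mechanism the tool ends up using.

```java
import java.net.URI;
import java.net.URISyntaxException;

class InvalidUrlCounter {
  private long invalid;

  /** Stand-in validity check: must parse as a URI and carry a dotted host name. */
  static boolean isUsable(String url) {
    try {
      String host = new URI(url).getHost();
      return host != null && host.contains(".");
    } catch (URISyntaxException e) {
      // Spaces, backslashes and similar characters end up here.
      return false;
    }
  }

  /** Returns true if the URL can be processed; otherwise counts it and logs a warning. */
  boolean accept(String url) {
    if (isUsable(url)) {
      return true;
    }
    invalid++;
    System.err.println("WARN Skipping invalid URL: " + url);
    return false;
  }

  long invalidCount() {
    return invalid;
  }
}
```

With this check, http://www/ and http:/ are counted as invalid, as are the URLs containing \/\/ or a space, while an ordinary URL such as http://example.com/page is accepted. The total from invalidCount() could then be reported once at the end of the dump.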


 Handling invalid URLs in CommonCrawlDataDumper
 --

 Key: NUTCH-1989
 URL: https://issues.apache.org/jira/browse/NUTCH-1989
 Project: Nutch
  Issue Type: Improvement
  Components: tool
Reporter: Giuseppe Totaro
Assignee: Chris A. Mattmann
Priority: Minor
  Labels: memex
 Fix For: 1.10

 Attachments: NUTCH-1989.patch


 Hi all,
 running the {{CommonCrawlDataDumper}} tool ({{bin/nutch commoncrawldump}}) 
 with the new options (as described in 
 [NUTCH-1975|https://issues.apache.org/jira/browse/NUTCH-1975]) I noticed 
 there are some problems if an invalid URL is detected.
 For example, the following URLs (that I found in crawled data) break the 
 naming schema provided by using {{-epochFilename}} command-line option:
 * http://www/
 * http:/
 In more detail, when the {{-epochFilename}} option is used, the extracted files 
 are organized in a reversed-DNS tree based on the FQDN of the webpage, followed 
 by a SHA1 hash of the complete URL. When the tool detects URLs like the ones 
 above, it is not able to build the reversed-DNS tree.
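
 The naming scheme described above can be sketched roughly as follows. This is 
 not the code of CommonCrawlDataDumper; the class, the method names and the 
 exact path layout are assumptions made for illustration only.

```java
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class ReversedDnsPathDemo {

  /** "www.example.com" becomes "com/example/www". */
  static String reverseHost(String host) {
    String[] labels = host.split("\\.");
    StringBuilder sb = new StringBuilder();
    for (int i = labels.length - 1; i >= 0; i--) {
      sb.append(labels[i]);
      if (i > 0) {
        sb.append('/');
      }
    }
    return sb.toString();
  }

  /** Lower-case hex SHA-1 of the UTF-8 bytes of the input. */
  static String sha1Hex(String s) {
    try {
      byte[] digest = MessageDigest.getInstance("SHA-1").digest(s.getBytes(StandardCharsets.UTF_8));
      StringBuilder sb = new StringBuilder();
      for (byte b : digest) {
        sb.append(String.format("%02x", b));
      }
      return sb.toString();
    } catch (NoSuchAlgorithmException e) {
      throw new IllegalStateException(e); // SHA-1 is a standard JDK algorithm
    }
  }

  /** Hypothetical layout: reversed host directories, then the SHA-1 of the full URL. */
  static String outputPath(String url) {
    String host = URI.create(url).getHost();
    if (host == null) {
      throw new IllegalArgumentException("No host in URL: " + url);
    }
    return reverseHost(host) + "/" + sha1Hex(url);
  }
}
```

 In this sketch a URL such as http:/ has no host at all, so there is nothing to 
 reverse and outputPath() fails, which mirrors the problem reported here.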
 Attached is a simple patch for detecting invalid URLs. The 
 patch uses the [Apache Commons 
 Validator|http://commons.apache.org/proper/commons-validator/] APIs to detect 
 invalid URLs:
 {code}
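 // UrlValidator comes from Apache Commons Validator,
 // e.g. org.apache.commons.validator.routines.UrlValidator
 UrlValidator urlValidator = new UrlValidator();
 if (!urlValidator.isValid(url)) {
   LOG.warn("Not valid URL detected: " + url);
 }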
 {code}
 The tool logs a warning message if an invalid URL is detected. I am just 
 wondering if we can perform a specific action if invalid URLs occur. We could 
 skip invalid URLs but I notice that also the following URLs are detected as 
 invalid:
 {noformat}
 2015-04-15 13:49:40,386 WARN  tools.CommonCrawlDataDumper - Not valid URL 
 detected: 
 http://www.reddit.com/r/agora/comments/22ezoa/how_to_buy_drugs_on_agora_hur_man_köper_droger_på/
 2015-04-15 13:49:41,603 WARN  tools.CommonCrawlDataDumper - Not valid URL 
 detected: http://www/
 2015-04-15 13:49:41,632 WARN  tools.CommonCrawlDataDumper - Not valid URL 
 detected: http:/
 2015-04-15 13:49:44,601 WARN  tools.CommonCrawlDataDumper - Not valid URL 
 detected: 
 http://allthingsvice.com/2012/05/30/the-great-420-scam/\/\/allthingsvice.com\/2012\/05\/30\/the-great-420-scam\/
 2015-04-15 13:50:34,821 WARN  tools.CommonCrawlDataDumper - Not valid URL 
 detected: 
 http://www.reddit.com/r/agora/comments/22ezoa/how_to_buy_drugs_on_agora_hur_man_köper_droger_på/
 2015-04-15 13:50:35,847 WARN  tools.CommonCrawlDataDumper - Not valid URL 
 detected: http://www/
 2015-04-15 13:50:35,866 WARN  tools.CommonCrawlDataDumper - Not valid URL 
 detected: http:/
 2015-04-15 13:50:38,605 WARN  tools.CommonCrawlDataDumper - Not valid URL 
 detected: 
 http://allthingsvice.com/2012/05/30/the-great-420-scam/\/\/allthingsvice.com\/2012\/05\/30\/the-great-420-scam\/
 2015-04-15 13:51:20,013 WARN  tools.CommonCrawlDataDumper - Not valid URL 
 detected: http://antilop.cc/sr/users/nomad bloodbath
 2015-04-15 13:51:20,499 WARN  tools.CommonCrawlDataDumper - Not valid URL 
 detected: 
 http://arstechnica.com/gaming/2015/04/mortal-kombat-x-charges-players-for-easy-fatalities/\/\/ars.to\/1aPaqvW
 2015-04-15 13:51:20,500 WARN  tools.CommonCrawlDataDumper - Not valid URL 
 detected: 
 http://arstechnica.com/gaming/2015/04/mortal-kombat-x-charges-players-for-easy-fatalities/\/\/arstechnica.com
 2015-04-15 13:51:20,500 WARN  

[jira] [Commented] (NUTCH-1989) Handling invalid URLs in CommonCrawlDataDumper

2015-04-20 Thread Giuseppe Totaro (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14503500#comment-14503500
 ] 

Giuseppe Totaro commented on NUTCH-1989:


Hi [~lewismc], thanks a lot for supporting me on this work.
In the patch, an invalid URL is not filtered out when it is detected; the tool 
only generates a warning log message. I totally agree with you that we need to 
work towards a better solution.
Thanks a lot for your great suggestion. I will add a counter for invalid URLs 
and post an update on that.
Thanks,
Giuseppe

 Handling invalid URLs in CommonCrawlDataDumper
 --

 Key: NUTCH-1989
 URL: https://issues.apache.org/jira/browse/NUTCH-1989
 Project: Nutch
  Issue Type: Improvement
  Components: tool
Reporter: Giuseppe Totaro
Assignee: Chris A. Mattmann
Priority: Minor
  Labels: memex
 Fix For: 1.10

 Attachments: NUTCH-1989.patch


 Hi all,
 running the {{CommonCrawlDataDumper}} tool ({{bin/nutch commoncrawldump}}) 
 with the new options (as described in 
 [NUTCH-1975|https://issues.apache.org/jira/browse/NUTCH-1975]) I noticed 
 there are some problems if an invalid URL is detected.
 For example, the following URLs (which I found in crawled data) break the 
 naming scheme provided by the {{-epochFilename}} command-line option:
 * http://www/
 * http:/
 In more detail, with the {{-epochFilename}} option, extracted files are 
 organized in a reversed-DNS tree based on the FQDN of the webpage, followed 
 by a SHA1 hash of the complete URL. When the tool detects URLs such as those 
 above, it is not able to build the reversed-DNS tree.
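 The naming scheme described above can be sketched roughly as follows. This is 
 only an illustration, not the code used by {{CommonCrawlDataDumper}}: the class 
 {{EpochNameSketch}}, its method names, the "/" separator and the use of 
 {{java.net.URI}} to extract the host are all assumptions made for the example.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of Nutch.
class EpochNameSketch {

  // Reverse the labels of a host name: "www.example.com" -> "com/example/www".
  static String reverseHost(String host) {
    String[] labels = host.split("\\.");
    StringBuilder sb = new StringBuilder();
    for (int i = labels.length - 1; i >= 0; i--) {
      sb.append(labels[i]);
      if (i > 0) {
        sb.append('/');
      }
    }
    return sb.toString();
  }

  // Hex-encoded SHA-1 digest of the complete URL string.
  static String sha1Hex(String url) {
    try {
      MessageDigest md = MessageDigest.getInstance("SHA-1");
      byte[] digest = md.digest(url.getBytes(StandardCharsets.UTF_8));
      StringBuilder sb = new StringBuilder();
      for (byte b : digest) {
        sb.append(String.format("%02x", b));
      }
      return sb.toString();
    } catch (NoSuchAlgorithmException e) {
      // SHA-1 is required to be available on every Java platform.
      throw new IllegalStateException(e);
    }
  }

  // Returns "<reversed host>/<sha1>", or null when no host can be
  // extracted and therefore no reversed-DNS tree can be built.
  static String pathFor(String url) {
    String host;
    try {
      host = new URI(url).getHost();
    } catch (URISyntaxException e) {
      return null;
    }
    if (host == null) {
      return null;
    }
    return reverseHost(host) + "/" + sha1Hex(url);
  }
}
```

 In this sketch, {{pathFor("http:/")}} returns null because the URL has no host 
 component, which is one way the tree construction can fail.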
 Attached is a simple patch for detecting invalid URLs. The patch uses the 
 [Apache Commons 
 Validator|http://commons.apache.org/proper/commons-validator/] APIs to detect 
 invalid URLs:
 {code}
 UrlValidator urlValidator = new UrlValidator();
 if (!urlValidator.isValid(url)) {
   LOG.warn("Not valid URL detected: " + url);
 }
 {code}
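 One possible way to go beyond the warning is to skip invalid URLs and count 
 them, so that a total can be reported at the end of the run. The sketch below 
 is only an assumption about how that could look: {{InvalidUrlCounterSketch}} 
 and its methods are hypothetical, and {{java.net.URI}} is used as a 
 dependency-free stand-in for {{UrlValidator}}, so its verdicts will not match 
 those of Commons Validator for every URL.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper, not part of the attached patch.
class InvalidUrlCounterSketch {

  private long invalidCount = 0;

  // Stand-in check (assumption): the URL parses and has a host component.
  static boolean looksValid(String url) {
    try {
      return new URI(url).getHost() != null;
    } catch (URISyntaxException e) {
      return false;
    }
  }

  // Returns true when the URL should be processed; otherwise counts it
  // as invalid and returns false so the caller can skip it.
  boolean accept(String url) {
    if (looksValid(url)) {
      return true;
    }
    invalidCount++;
    return false;
  }

  long getInvalidCount() {
    return invalidCount;
  }
}
```

 A caller would invoke {{accept(url)}} for each record and log 
 {{getInvalidCount()}} once after the dump completes.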
 The tool logs a warning message if an invalid URL is detected. I am wondering 
 whether we should perform a specific action when invalid URLs occur. We could 
 skip invalid URLs, but I noticed that the following URLs are also detected as 
 invalid:
 {noformat}
 2015-04-15 13:49:40,386 WARN  tools.CommonCrawlDataDumper - Not valid URL 
 detected: 
 http://www.reddit.com/r/agora/comments/22ezoa/how_to_buy_drugs_on_agora_hur_man_köper_droger_på/
 2015-04-15 13:49:41,603 WARN  tools.CommonCrawlDataDumper - Not valid URL 
 detected: http://www/
 2015-04-15 13:49:41,632 WARN  tools.CommonCrawlDataDumper - Not valid URL 
 detected: http:/
 2015-04-15 13:49:44,601 WARN  tools.CommonCrawlDataDumper - Not valid URL 
 detected: 
 http://allthingsvice.com/2012/05/30/the-great-420-scam/\/\/allthingsvice.com\/2012\/05\/30\/the-great-420-scam\/
 2015-04-15 13:50:34,821 WARN  tools.CommonCrawlDataDumper - Not valid URL 
 detected: 
 http://www.reddit.com/r/agora/comments/22ezoa/how_to_buy_drugs_on_agora_hur_man_köper_droger_på/
 2015-04-15 13:50:35,847 WARN  tools.CommonCrawlDataDumper - Not valid URL 
 detected: http://www/
 2015-04-15 13:50:35,866 WARN  tools.CommonCrawlDataDumper - Not valid URL 
 detected: http:/
 2015-04-15 13:50:38,605 WARN  tools.CommonCrawlDataDumper - Not valid URL 
 detected: 
 http://allthingsvice.com/2012/05/30/the-great-420-scam/\/\/allthingsvice.com\/2012\/05\/30\/the-great-420-scam\/
 2015-04-15 13:51:20,013 WARN  tools.CommonCrawlDataDumper - Not valid URL 
 detected: http://antilop.cc/sr/users/nomad bloodbath
 2015-04-15 13:51:20,499 WARN  tools.CommonCrawlDataDumper - Not valid URL 
 detected: 
 http://arstechnica.com/gaming/2015/04/mortal-kombat-x-charges-players-for-easy-fatalities/\/\/ars.to\/1aPaqvW
 2015-04-15 13:51:20,500 WARN  tools.CommonCrawlDataDumper - Not valid URL 
 detected: 
 http://arstechnica.com/gaming/2015/04/mortal-kombat-x-charges-players-for-easy-fatalities/\/\/arstechnica.com
 2015-04-15 13:51:20,500 WARN  tools.CommonCrawlDataDumper - Not valid URL 
 detected: 
 http://arstechnica.com/gaming/2015/04/mortal-kombat-x-charges-players-for-easy-fatalities/\/\/arstechnica.com\/gaming\/2015\/04\/mortal-kombat-x-charges-players-for-easy-fatalities\/
 2015-04-15 13:51:20,500 WARN  tools.CommonCrawlDataDumper - Not valid URL 
 detected: 
 http://arstechnica.com/gaming/2015/04/mortal-kombat-x-charges-players-for-easy-fatalities/\/\/cdn.arstechnica.net\/wp-content\/themes\/arstechnica\/assets
 2015-04-15 13:51:20,500 WARN  tools.CommonCrawlDataDumper - Not valid URL 
 detected: 
 http://arstechnica.com/gaming/2015/04/mortal-kombat-x-charges-players-for-easy-fatalities/\/civis
 2015-04-15 13:51:20,588 WARN  tools.CommonCrawlDataDumper - Not valid URL 
 detected: