[jira] [Commented] (NUTCH-1247) CrawlDatum.retries should be int

2012-01-17 Thread Sebastian Nagel (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13187716#comment-13187716
 ] 

Sebastian Nagel commented on NUTCH-1247:


Interestingly, I also found a couple of URLs with unreasonably high retry 
counters in the data where NUTCH-1245 was first observed (it was Nutch 1.2). 
* all these URLs failed with some exception (invalid URI or HTTP 403), not 
with 404 (not found) or robots denied.
  Markus, do the URLs which overflow the retry counter in your Db also belong 
to this class?
* in the segments the status of these URLs is fetch_retry (in crawl_fetch):
  In Fetcher.java the case ProtocolStatus.EXCEPTION inside the switch statement 
in FetcherThread.run() falls through to the default branch, where the result is 
collected with STATUS_FETCH_RETRY.

CrawlDbReducer calls FetchSchedule.forceRefetch() only for the cases 
STATUS_FETCH_NOT_MODIFIED or STATUS_FETCH_GONE (here via setPageGoneSchedule). 
The branch STATUS_FETCH_RETRY does not reset the retry counter. Generator never 
calls forceRefetch() nor does it reset the retry counter.

If this analysis is correct there are two possible patches:
* A (CrawlDbReducer): call setPageGoneSchedule for the case STATUS_FETCH_RETRY
* B (Generator): reset the retry counter to zero when a db_gone URL is 
generated again
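If option B is chosen, the change could be as small as clearing the counter 
whenever a db_gone entry is selected again. A minimal standalone sketch of the 
idea (the Datum stub and status constant below are stand-ins for 
org.apache.nutch.crawl.CrawlDatum, not the real API):

```java
// Illustrative-only sketch of option B: when the Generator re-selects a
// db_gone URL, reset its retry counter so it cannot keep growing (and, with
// the byte-typed field, eventually overflow). The Datum class below is a
// stub mimicking just the two CrawlDatum fields involved.
public class RetryResetSketch {
    static final byte STATUS_DB_GONE = 3; // stand-in for CrawlDatum.STATUS_DB_GONE

    static class Datum {
        byte status;
        byte retries;
    }

    // What the Generator could do before emitting a datum for fetching.
    static void resetRetriesIfGone(Datum d) {
        if (d.status == STATUS_DB_GONE) {
            d.retries = 0;
        }
    }

    public static void main(String[] args) {
        Datum d = new Datum();
        d.status = STATUS_DB_GONE;
        d.retries = 120; // close to overflowing the byte counter
        resetRetriesIfGone(d);
        System.out.println(d.retries); // prints 0
    }
}
```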



 CrawlDatum.retries should be int
 

 Key: NUTCH-1247
 URL: https://issues.apache.org/jira/browse/NUTCH-1247
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.4
Reporter: Markus Jelsma
 Fix For: 1.5


 CrawlDatum.retries is a byte and goes bad with larger values.
 12/01/12 18:35:22 INFO crawl.CrawlDbReader: retry -127: 1
 12/01/12 18:35:22 INFO crawl.CrawlDbReader: retry -128: 1
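The negative values in the log are classic two's-complement wraparound: a Java 
byte tops out at 127 and the next increment lands on -128. A tiny 
demonstration:

```java
// Why a byte-typed retry counter shows -128 / -127 in CrawlDbReader output:
// Java's byte is signed and wraps at 127.
public class ByteWrap {
    public static void main(String[] args) {
        byte retries = 126;
        retries++;                   // 127, the largest value a byte can hold
        retries++;                   // wraps around to -128
        System.out.println(retries); // prints -128
        retries++;                   // and keeps counting up from there
        System.out.println(retries); // prints -127
    }
}
```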

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (NUTCH-1247) CrawlDatum.retries should be int

2012-01-17 Thread Sebastian Nagel (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-1247:
---

Attachment: NUTCH-1247.patch_B
NUTCH-1247.patch_A

 CrawlDatum.retries should be int
 

 Key: NUTCH-1247
 URL: https://issues.apache.org/jira/browse/NUTCH-1247
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.4
Reporter: Markus Jelsma
 Fix For: 1.5

 Attachments: NUTCH-1247.patch_A, NUTCH-1247.patch_B


 CrawlDatum.retries is a byte and goes bad with larger values.
 12/01/12 18:35:22 INFO crawl.CrawlDbReader: retry -127: 1
 12/01/12 18:35:22 INFO crawl.CrawlDbReader: retry -128: 1





[jira] [Created] (NUTCH-1250) parse-html does not parse links with empty anchor

2012-01-17 Thread Andreas Janning (Created) (JIRA)
parse-html does not parse links with empty anchor
-

 Key: NUTCH-1250
 URL: https://issues.apache.org/jira/browse/NUTCH-1250
 Project: Nutch
  Issue Type: Bug
  Components: parser
Affects Versions: 1.4
Reporter: Andreas Janning


The parse-html plugin does not generate an outlink if the link has no anchor 
text. For example the following HTML code does not create an outlink:
{code:html} 
  <a href="example.com"></a>
{code}

The JUnit test TestDOMContentUtils tries to test this, but does not actually 
cover the case, since there is a comment inside the <a> tag.
{code:title=TestDOMContentUtils.java|borderStyle=solid}
new String("<html><head><title> title </title>"
+ "</head><body>"
+ "<a href=\"g\"><!--no anchor--></a>"
+ "<a href=\"g1\"> <!--whitespace--> </a>"
+ "<a href=\"g2\"> <img src=test.gif alt='bla bla'> </a>"
+ "</body></html>"), 
{code}

When you remove the comment, the test fails.

{code:title=TestDOMContentUtils.java Test fails|borderStyle=solid}
new String("<html><head><title> title </title>"
+ "</head><body>"
+ "<a href=\"g\"></a>" // no anchor
+ "<a href=\"g1\"> <!--whitespace--> </a>"
+ "<a href=\"g2\"> <img src=test.gif alt='bla bla'> </a>"
+ "</body></html>"), 
{code}





[jira] [Updated] (NUTCH-1242) Allow disabling of URL Filters in ParseSegment

2012-01-17 Thread Edward Drapkin (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Edward Drapkin updated NUTCH-1242:
--

Attachment: (was: ParseSegment.patch)

 Allow disabling of URL Filters in ParseSegment
 --

 Key: NUTCH-1242
 URL: https://issues.apache.org/jira/browse/NUTCH-1242
 Project: Nutch
  Issue Type: Improvement
Reporter: Edward Drapkin
 Fix For: 1.5

 Attachments: ParseSegment.patch, parseoutputformat.patch


 Right now, the ParseSegment job does not allow you to disable URL filtration. 
  For reasons that aren't worth explaining, I need to do this, so I enabled 
 this behavior through the use of a boolean configuration value 
 parse.filter.urls which defaults to true.
 I've attached a simple, preliminary patch that enables this behavior with 
 that configuration option.  I'm not sure if it should be made a command line 
 option or not.
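As a sketch, the knob described above would presumably be toggled in 
nutch-site.xml like any other boolean setting. The property name 
parse.filter.urls and its default of true are taken from the description; the 
exact description wording is illustrative:

```xml
<!-- Illustrative nutch-site.xml fragment: disable URL filtering in
     ParseSegment via the boolean property described above. -->
<property>
  <name>parse.filter.urls</name>
  <value>false</value>
  <description>If false, ParseSegment does not run URL filters.</description>
</property>
```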





[jira] [Updated] (NUTCH-1242) Allow disabling of URL Filters in ParseSegment

2012-01-17 Thread Edward Drapkin (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Edward Drapkin updated NUTCH-1242:
--

Attachment: ParseSegment.patch

Updated patch to add a message to the usage description.

 Allow disabling of URL Filters in ParseSegment
 --

 Key: NUTCH-1242
 URL: https://issues.apache.org/jira/browse/NUTCH-1242
 Project: Nutch
  Issue Type: Improvement
Reporter: Edward Drapkin
 Fix For: 1.5

 Attachments: ParseSegment.patch, parseoutputformat.patch


 Right now, the ParseSegment job does not allow you to disable URL filtration. 
  For reasons that aren't worth explaining, I need to do this, so I enabled 
 this behavior through the use of a boolean configuration value 
 parse.filter.urls which defaults to true.
 I've attached a simple, preliminary patch that enables this behavior with 
 that configuration option.  I'm not sure if it should be made a command line 
 option or not.





[jira] [Assigned] (NUTCH-1242) Allow disabling of URL Filters in ParseSegment

2012-01-17 Thread Markus Jelsma (Assigned) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma reassigned NUTCH-1242:


Assignee: Markus Jelsma

 Allow disabling of URL Filters in ParseSegment
 --

 Key: NUTCH-1242
 URL: https://issues.apache.org/jira/browse/NUTCH-1242
 Project: Nutch
  Issue Type: Improvement
Reporter: Edward Drapkin
Assignee: Markus Jelsma
 Fix For: 1.5

 Attachments: ParseSegment.patch, parseoutputformat.patch


 Right now, the ParseSegment job does not allow you to disable URL filtration. 
  For reasons that aren't worth explaining, I need to do this, so I enabled 
 this behavior through the use of a boolean configuration value 
 parse.filter.urls which defaults to true.
 I've attached a simple, preliminary patch that enables this behavior with 
 that configuration option.  I'm not sure if it should be made a command line 
 option or not.





[jira] [Commented] (NUTCH-1201) Allow for different FetcherThread impls

2012-01-17 Thread Edward Drapkin (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13187874#comment-13187874
 ] 

Edward Drapkin commented on NUTCH-1201:
---

Does this still need to be done?  It seems pretty easy and I'll volunteer to do 
it if it needs to be done.

I was thinking of breaking all of Fetcher apart into more easily 
compartmentalized and pluggable units for my own benefit (as right now it's an 
enormous class that's extremely daunting and hard to change).  If this issue 
still needs work, I think I can break Fetcher apart to allow for pluggable 
fetchers, item queues, queue feeders and fetcher threads, but I don't want to 
invest time into reinventing a wheel that you've already invented (and may not 
have updated JIRA about).

Let me know!

 Allow for different FetcherThread impls
 ---

 Key: NUTCH-1201
 URL: https://issues.apache.org/jira/browse/NUTCH-1201
 Project: Nutch
  Issue Type: New Feature
  Components: fetcher
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 1.5


 For certain cases we need to modify parts in FetcherThread and make it 
 pluggable. This introduces a new config directive fetcher.impl that takes an 
 FQCN; Fetcher.fetch uses that setting to load a class to use for 
 job.setMapRunnerClass(). This new class has to extend Fetcher and an inner 
 class FetcherThread. This allows for overriding methods in FetcherThread but 
 also methods in Fetcher itself if required.
 A follow up on this issue would be to refactor parts of FetcherThread to make 
 it easier to override small sections instead of copying the entire method 
 body for a small change, which is now the case.
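The fetcher.impl mechanism described above boils down to resolving a 
fully-qualified class name and checking it is a Fetcher subclass before 
passing it to job.setMapRunnerClass(). A hedged, self-contained sketch of just 
the resolution step (JDK classes stand in for Fetcher here; this is not the 
patch's actual code):

```java
// Sketch of FQCN-based resolution as a fetcher.impl-style setting might do
// it: load the named class reflectively and verify it extends the required
// base class before handing it to the job. JDK classes are used as stand-ins.
public class ImplLoader {
    static Class<?> resolve(String fqcn, Class<?> mustExtend) {
        final Class<?> clazz;
        try {
            clazz = Class.forName(fqcn);
        } catch (ClassNotFoundException e) {
            throw new IllegalArgumentException("no such class: " + fqcn, e);
        }
        if (!mustExtend.isAssignableFrom(clazz)) {
            throw new IllegalArgumentException(
                fqcn + " does not extend " + mustExtend.getName());
        }
        return clazz;
    }

    public static void main(String[] args) {
        // ArrayList extends AbstractList, so this resolves successfully.
        Class<?> c = resolve("java.util.ArrayList", java.util.AbstractList.class);
        System.out.println(c.getSimpleName()); // prints ArrayList
    }
}
```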





[jira] [Commented] (NUTCH-1201) Allow for different FetcherThread impls

2012-01-17 Thread Markus Jelsma (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13187884#comment-13187884
 ] 

Markus Jelsma commented on NUTCH-1201:
--

Hi Edward,

I've already modified Fetcher to allow for different Fetcher impls via 
configuration that inherit from Fetcher itself. It works fine and I can 
override the methods I need. However, it may not be that elegant. There's no 
code to use other queue impls. I'll cook up a patch tomorrow.

 Allow for different FetcherThread impls
 ---

 Key: NUTCH-1201
 URL: https://issues.apache.org/jira/browse/NUTCH-1201
 Project: Nutch
  Issue Type: New Feature
  Components: fetcher
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 1.5


 For certain cases we need to modify parts in FetcherThread and make it 
 pluggable. This introduces a new config directive fetcher.impl that takes an 
 FQCN; Fetcher.fetch uses that setting to load a class to use for 
 job.setMapRunnerClass(). This new class has to extend Fetcher and an inner 
 class FetcherThread. This allows for overriding methods in FetcherThread but 
 also methods in Fetcher itself if required.
 A follow up on this issue would be to refactor parts of FetcherThread to make 
 it easier to override small sections instead of copying the entire method 
 body for a small change, which is now the case.





I want to volunteer some time

2012-01-17 Thread Eddie Drapkin

Hello all,

I've got a bunch of spare time coming up in the next several 
weeks/months and would like to volunteer to help the project out.  I'm 
already extremely familiar with the internals of Nutch, as I've been 
hacking at it for our internal use here (at Wolfram Research) for the 
last ~1.5 years or so.  While there's probably a fair amount of code 
that I haven't read, I've at least visited and read some of all of the 
areas of Nutch's core and most of the plugins.


I think I should put that knowledge to good use and contribute back 
(I've already sent some patches in, but nothing major or really even 
that significant), but I'm not sure what needs to be done or where my 
time would be best spent.  I just subscribed to this list, so if there's 
a thread discussing priorities that's current and whatnot, can someone 
point me to it in the archives?  Barring that, can someone point me in 
the direction where I should be looking to contribute?  My best guess is 
to just start attacking JIRA tickets...


Thanks,
Eddie


[jira] [Commented] (NUTCH-1201) Allow for different FetcherThread impls

2012-01-17 Thread Edward Drapkin (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13187904#comment-13187904
 ] 

Edward Drapkin commented on NUTCH-1201:
---

I was thinking more of an approach of breaking Fetcher into these components:

interface Fetcher
class FetcherImpl

interface FetcherThread (extends Thread)
class FetcherThreadImpl

interface FetchItemQueue
class FetchItemQueueImpl

interface FetchQueueFeeder
class FetchQueueFeederImpl

Where all of the *Impl classes would be the current implementations of the 
classes.  I may be over-engineering here (I'm pretty prone to do that), but I 
think this would open up the potential to heavily profile and optimize 
fetching under various scenarios, as I have a sneaking suspicion there's a lot 
more lock contention and thread spinning during fetching than is entirely 
necessary.  It may be beneficial to offer several implementations out of the 
box for various scenarios: single-threaded fetchers, lightweight queues for 
short lists and/or small numbers of fetcher threads, heavyweight queues for 
large workloads, etc.
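The decomposition proposed above (queue, feeder, consumer, plus glue) can be 
sketched in miniature. Every name below mirrors the list in the comment, but 
none of it is Nutch's actual API:

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Toy model of the proposed split: a FetchItemQueue, a FetchQueueFeeder that
// fills it, and a "fetcher" that glues them together and drains the queue.
// All interfaces and classes are illustrative stand-ins, not Nutch code.
public class FetcherSketch {
    interface FetchItemQueue {
        void add(String url);
        String next();
        boolean isEmpty();
    }

    static class InMemoryQueue implements FetchItemQueue {
        private final Queue<String> q = new ArrayDeque<>();
        public void add(String url) { q.add(url); }
        public String next() { return q.poll(); }
        public boolean isEmpty() { return q.isEmpty(); }
    }

    interface FetchQueueFeeder {
        void feed(FetchItemQueue queue);
    }

    // The glue role envisioned for Fetcher: wire feeder and queue together,
    // then consume. Returns how many items were "fetched".
    static int run(FetchQueueFeeder feeder, FetchItemQueue queue) {
        feeder.feed(queue);
        int fetched = 0;
        while (!queue.isEmpty()) {
            queue.next(); // a real FetcherThread would fetch the URL here
            fetched++;
        }
        return fetched;
    }

    public static void main(String[] args) {
        int n = run(q -> { q.add("http://a/"); q.add("http://b/"); },
                    new InMemoryQueue());
        System.out.println(n); // prints 2
    }
}
```

Swapping in a different InMemoryQueue (say, a priority queue keyed by host) 
would then touch only one component, which is the point of the split.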

 Allow for different FetcherThread impls
 ---

 Key: NUTCH-1201
 URL: https://issues.apache.org/jira/browse/NUTCH-1201
 Project: Nutch
  Issue Type: New Feature
  Components: fetcher
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 1.5


 For certain cases we need to modify parts in FetcherThread and make it 
 pluggable. This introduces a new config directive fetcher.impl that takes an 
 FQCN; Fetcher.fetch uses that setting to load a class to use for 
 job.setMapRunnerClass(). This new class has to extend Fetcher and an inner 
 class FetcherThread. This allows for overriding methods in FetcherThread but 
 also methods in Fetcher itself if required.
 A follow up on this issue would be to refactor parts of FetcherThread to make 
 it easier to override small sections instead of copying the entire method 
 body for a small change, which is now the case.





Re: I want to volunteer some time

2012-01-17 Thread Markus Jelsma
Hi!

Excellent! You may want to check the list of issues for 1.5. There are several 
issues being worked on from time to time and a number of open issues and even 
a few hairy problems. Contribution as patch or comment on any issue is always 
appreciated. You can also create issues to solve problems yourself as you did 
with the parser filters issue.

Anything is welcome!

Cheers,

 Hello all,
 
 I've got a bunch of spare time coming up in the next several
 weeks/months and would like to volunteer to help the project out.  I'm
 already extremely familiar with the internals of Nutch, as I've been
 hacking at it for our internal use here (at Wolfram Research) for the
 last ~1.5 years or so.  While there's probably a fair amount of code
 that I haven't read, I've at least visited and read some of all of the
 areas of Nutch's core and most of the plugins.
 
 I think I should put that knowledge to good use and contribute back
 (I've already sent some patches in, but nothing major or really even
 that significant), but I'm not sure what needs to be done or where my
 time would be best spent.  I just subscribed to this list, so if there's
 a thread discussing priorities that's current and whatnot, can someone
 point me to it in the archives?  Barring that, can someone point me in
 the direction where I should be looking to contribute?  My best guess is
 to just start attacking JIRA tickets...
 
 Thanks,
 Eddie


Re: I want to volunteer some time

2012-01-17 Thread Julien Nioche
Hi Eddie,

Great to hear that! Just to add to what Markus said there are also quite a
few tasks to do on the NutchGora branch if that's something you'd be
interested in. Or outside the tasks on JIRA, there is always a fair bit to
do on the Wiki e.g. how to run in distributed mode etc...

Just out of curiosity, could you tell us a bit about what you've been using
Nutch for at Wolfram Research?

Thanks for volunteering

Julien

On 17 January 2012 19:15, Markus Jelsma markus.jel...@openindex.io wrote:

 Hi!

 Excellent! You may want to check the list of issues for 1.5. There are
 several
 issues being worked on from time to time and a number of open issues and
 even
 a few hairy problems. Contribution as patch or comment on any issue is
 always
 appreciated. You can also create issues to solve problems yourself as you
 did
 with the parser filters issue.

 Anything is welcome!

 Cheers,

  Hello all,
 
  I've got a bunch of spare time coming up in the next several
  weeks/months and would like to volunteer to help the project out.  I'm
  already extremely familiar with the internals of Nutch, as I've been
  hacking at it for our internal use here (at Wolfram Research) for the
  last ~1.5 years or so.  While there's probably a fair amount of code
  that I haven't read, I've at least visited and read some of all of the
  areas of Nutch's core and most of the plugins.
 
  I think I should put that knowledge to good use and contribute back
  (I've already sent some patches in, but nothing major or really even
  that significant), but I'm not sure what needs to be done or where my
  time would be best spent.  I just subscribed to this list, so if there's
  a thread discussing priorities that's current and whatnot, can someone
  point me to it in the archives?  Barring that, can someone point me in
  the direction where I should be looking to contribute?  My best guess is
  to just start attacking JIRA tickets...
 
  Thanks,
  Eddie




-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com


[jira] [Commented] (NUTCH-1201) Allow for different FetcherThread impls

2012-01-17 Thread Andrzej Bialecki (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13187927#comment-13187927
 ] 

Andrzej Bialecki  commented on NUTCH-1201:
--

I agree that there are situations where you might want a custom fetcher (e.g. 
depth-first crawling), and it would be good to come up with some more specific 
API than just MapRunner.

I'm not convinced yet that providing interfaces (or rather abstract classes) 
for the existing plumbing in Fetcher is a good idea - let's figure out first 
whether this code is reusable at all for some other fetching strategies, 
because if it's not then providing custom queue impls may offer little value, 
and perhaps customization should be implemented on a different level.

Re. thread spinning - I haven't yet seen an unequivocal case proving that 
crawl contention is caused by the thread mgmt in Fetcher. Usually, on closer 
look, the bottleneck turned out to lie elsewhere (network I/O, remote 
throttling, DNS lookups, politeness rules, etc.).

 Allow for different FetcherThread impls
 ---

 Key: NUTCH-1201
 URL: https://issues.apache.org/jira/browse/NUTCH-1201
 Project: Nutch
  Issue Type: New Feature
  Components: fetcher
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 1.5


 For certain cases we need to modify parts in FetcherThread and make it 
 pluggable. This introduces a new config directive fetcher.impl that takes an 
 FQCN; Fetcher.fetch uses that setting to load a class to use for 
 job.setMapRunnerClass(). This new class has to extend Fetcher and an inner 
 class FetcherThread. This allows for overriding methods in FetcherThread but 
 also methods in Fetcher itself if required.
 A follow up on this issue would be to refactor parts of FetcherThread to make 
 it easier to override small sections instead of copying the entire method 
 body for a small change, which is now the case.





[jira] [Commented] (NUTCH-1201) Allow for different FetcherThread impls

2012-01-17 Thread Edward Drapkin (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13187950#comment-13187950
 ] 

Edward Drapkin commented on NUTCH-1201:
---

You bring up a good point, and I was making a pretty blatant assumption that 
the code is in fact reusable for these other cases.

I think at the highest level, fetching will always basically be a 
producer-consumer task, which implies that there will always be these 
components: some queue, something to feed the queue, something to consume from 
the queue, and something to pull it all together into the hadoop job.  If 
there's a better way of architecting the code necessary to run a fetching 
process, it's not something I've seen.  The interfaces that I suggest reflect 
this (and use the same names currently being used) and the default 
implementations would be the existing code, so as to not break BC.

I do think, though, that Fetcher itself ought to be able to be overridden and 
customized (hence providing an interface to it), although we should focus on 
making that something that no one wants to do, so it doesn't even need to be 
discouraged.  I envision a situation in which Fetcher basically just serves as 
glue that holds the other three components together, so any logic that needs 
to change would be changed in one of those components.

We may wind up in a situation where the only benefit to providing custom queue 
behavior is in conjunction with providing custom queue feeder + queue consumer 
behavior... as a matter of fact, I'd fully expect this to frequently be the 
case.  Perhaps a better overall approach here might be to break Fetching into a 
high-level Nutch abstraction, then provide several fetching plugins that can be 
dropped into place depending on the situation, similar to the way that the 
protocol plugins behave.  The fetcher already runs threads outside of the 
hadoop framework, so a generic fetcher job that just invoked a fetching plugin 
wouldn't have to be a regression of any sort.  

The more I think about it, the more I think that this may be the right solution 
to a modular fetching system: Nutch (eventually) shipping with 
fetch-depthfirst, fetch-unthreaded, fetch-default, and any other scenario that 
may arise would allow for support for several cases right out of the box.  
This approach would probably be the most difficult in terms of man hours and 
testing (but hey, I'm volunteering, right?), but I think it's probably the 
best way to provide modular fetcher functionality.

If we decide to break the fetcher into a plugin, then the fetcher only has to 
conform to a relatively simple interface.  I'd think that we would provide an 
abstract class that implements that interface and holds together the other 
sub-components mentioned above, as a starting point for the various fetcher 
plugins, but I don't think we would have to require that it be used.  We 
could, similarly, offer abstract class default implementations of the various 
sub-components as well, but nothing would force or require them to be used in 
any capacity.

 Allow for different FetcherThread impls
 ---

 Key: NUTCH-1201
 URL: https://issues.apache.org/jira/browse/NUTCH-1201
 Project: Nutch
  Issue Type: New Feature
  Components: fetcher
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 1.5


 For certain cases we need to modify parts in FetcherThread and make it 
 pluggable. This introduces a new config directive fetcher.impl that takes an 
 FQCN; Fetcher.fetch uses that setting to load a class to use for 
 job.setMapRunnerClass(). This new class has to extend Fetcher and an inner 
 class FetcherThread. This allows for overriding methods in FetcherThread but 
 also methods in Fetcher itself if required.
 A follow up on this issue would be to refactor parts of FetcherThread to make 
 it easier to override small sections instead of copying the entire method 
 body for a small change, which is now the case.





[jira] [Updated] (NUTCH-1251) Deletion of duplicates fails with org.apache.solr.client.solrj.SolrServerException

2012-01-17 Thread Arkadi Kosmynin (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arkadi Kosmynin updated NUTCH-1251:
---

Description: 
Deletion of duplicates fails. This happens because the "get all" query used to 
get the Solr index size is id:[* TO *], which is a range query. Lucene tries 
to expand it to a Boolean query and gets as many clauses as there are ids in 
the index. This is too many in a real situation, and it throws an exception. 

To correct this problem, change the "get all" query (SOLR_GET_ALL_QUERY) to 
\*:\*, which is the standard Solr match-all query.
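The fix amounts to swapping one query string constant. A standalone 
illustration (the class and field names below are stand-ins for the constant 
in SolrDeleteDuplicates, not the actual patch):

```java
// The proposed one-line fix, shown as a pair of constants. The names below
// are stand-ins; in Nutch the constant lives in SolrDeleteDuplicates.
public class GetAllQuery {
    // Before: a range query that Lucene rewrites into one clause per distinct
    // id, exceeding BooleanQuery's default 1024-clause limit on real indexes.
    static final String OLD_SOLR_GET_ALL_QUERY = "id:[* TO *]";

    // After: Solr's standard match-all query, which rewrites to a single
    // MatchAllDocsQuery regardless of index size.
    static final String SOLR_GET_ALL_QUERY = "*:*";

    public static void main(String[] args) {
        System.out.println(SOLR_GET_ALL_QUERY); // prints *:*
    }
}
```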

Indexing log extract:

java.io.IOException: org.apache.solr.client.solrj.SolrServerException: Error 
executing query
at 
org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat.getRecordReader(SolrDeleteDuplicates.java:236)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:338)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
at 
org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
Caused by: org.apache.solr.client.solrj.SolrServerException: Error executing 
query
at 
org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:95)
at org.apache.solr.client.solrj.SolrServer.query(SolrServer.java:118)
at 
org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat.getRecordReader(SolrDeleteDuplicates.java:234)
... 3 more
Caused by: org.apache.solr.common.SolrException: Internal Server Error

Internal Server Error

request: http://localhost:8081/arch/select?q=id:[* TO 
*]&fl=id,boost,tstamp,digest&start=0&rows=82938&wt=javabin&version=2
at 
org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:430)
at 
org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:244)
at 
org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:89)
... 5 more



  was:
Deletion of duplicates fails. This happens because the get all query used to 
get Solr index size is id:[* TO *], which is a range query. Lucene is trying 
to expand it to a Boolean query and gets as many clauses as there are ids in 
the index. This is too many in a real situation and it throws an exception. 

To correct this problem, change the get all query (SOLR_GET_ALL_QUERY) to 
*:*, which is the standard Solr get all query.

Indexing log extract:

java.io.IOException: org.apache.solr.client.solrj.SolrServerException: Error 
executing query
at 
org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat.getRecordReader(SolrDeleteDuplicates.java:236)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:338)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
at 
org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
Caused by: org.apache.solr.client.solrj.SolrServerException: Error executing 
query
at 
org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:95)
at org.apache.solr.client.solrj.SolrServer.query(SolrServer.java:118)
at 
org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat.getRecordReader(SolrDeleteDuplicates.java:234)
... 3 more
Caused by: org.apache.solr.common.SolrException: Internal Server Error

Internal Server Error

request: http://localhost:8081/arch/select?q=id:[* TO 
*]&fl=id,boost,tstamp,digest&start=0&rows=82938&wt=javabin&version=2
at 
org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:430)
at 
org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:244)
at 
org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:89)
... 5 more




 Deletion of duplicates fails with 
 org.apache.solr.client.solrj.SolrServerException
 --

 Key: NUTCH-1251
 URL: https://issues.apache.org/jira/browse/NUTCH-1251
 Project: Nutch
  Issue Type: Bug
  Components: indexer
Affects Versions: 1.4
 Environment: Any crawl where the number of URLs in Solr exceeds 1024 
 (the default max number of clauses in a Lucene boolean query).  
Reporter: Arkadi Kosmynin
Priority: Critical

 Deletion of duplicates fails. This happens because the get all query used 
 to get Solr index size is id:[* TO *], which is a range query. Lucene is 
 trying to expand it to a Boolean query and gets as many clauses as there are 
 ids in the index. This is too many in a real situation and it throws an 
 exception. 
 To correct this problem, change the get all query (SOLR_GET_ALL_QUERY) to 
 \*:\*, which is the standard Solr get all query.
 Indexing log extract:
 

[jira] [Updated] (NUTCH-1251) Deletion of duplicates fails with org.apache.solr.client.solrj.SolrServerException

2012-01-17 Thread Markus Jelsma (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1251:
-

Fix Version/s: 1.5

 Deletion of duplicates fails with 
 org.apache.solr.client.solrj.SolrServerException
 --

 Key: NUTCH-1251
 URL: https://issues.apache.org/jira/browse/NUTCH-1251
 Project: Nutch
  Issue Type: Bug
  Components: indexer
Affects Versions: 1.4
 Environment: Any crawl where the number of URLs in Solr exceeds 1024 
 (the default max number of clauses in a Lucene boolean query).  
Reporter: Arkadi Kosmynin
Priority: Critical
 Fix For: 1.5


 Deletion of duplicates fails. This happens because the get-all query used 
 to get the Solr index size is id:[* TO *], a range query. Lucene tries to 
 expand it into a Boolean query with as many clauses as there are ids in the 
 index. In a real index this is far too many, and an exception is thrown. 
 To correct this problem, change the get-all query (SOLR_GET_ALL_QUERY) to 
 *:*, the standard Solr match-all query.
 Indexing log extract:
 java.io.IOException: org.apache.solr.client.solrj.SolrServerException: Error 
 executing query
   at 
 org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat.getRecordReader(SolrDeleteDuplicates.java:236)
   at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:338)
   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
   at 
 org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
 Caused by: org.apache.solr.client.solrj.SolrServerException: Error executing 
 query
   at 
 org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:95)
   at org.apache.solr.client.solrj.SolrServer.query(SolrServer.java:118)
   at 
 org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat.getRecordReader(SolrDeleteDuplicates.java:234)
   ... 3 more
 Caused by: org.apache.solr.common.SolrException: Internal Server Error
 Internal Server Error
 request: http://localhost:8081/arch/select?q=id:[* TO 
 *]&fl=id,boost,tstamp,digest&start=0&rows=82938&wt=javabin&version=2
   at 
 org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:430)
   at 
 org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:244)
   at 
 org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:89)
   ... 5 more

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (NUTCH-1251) Deletion of duplicates fails with org.apache.solr.client.solrj.SolrServerException

2012-01-17 Thread Markus Jelsma (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13188095#comment-13188095
 ] 

Markus Jelsma commented on NUTCH-1251:
--

Can you provide a patch for trunk?





[jira] [Commented] (NUTCH-1251) Deletion of duplicates fails with org.apache.solr.client.solrj.SolrServerException

2012-01-17 Thread Arkadi Kosmynin (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13188115#comment-13188115
 ] 

Arkadi Kosmynin commented on NUTCH-1251:


It is a one-line change: file 
org.apache.nutch.indexer.solr.SolrDeleteDuplicates.java, line 90.





[Nutch Wiki] Trivial Update of AdminGroup by LewisJohnMcgibbney

2012-01-17 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The AdminGroup page has been changed by LewisJohnMcgibbney:
http://wiki.apache.org/nutch/AdminGroup?action=diff&rev1=6&rev2=7

   * JulienNioche
   * MarkusJelsma
   * ElisabethAdler
+  * EdwardDrapkin
  


Re: I want to volunteer some time

2012-01-17 Thread Lewis John Mcgibbney
Hi Eddie,

I've added you to the AdminGroup for our wiki, you will be able to edit
whichever areas you are interested in, or which you think can/should be
improved.

Your introduction sounds really interesting and, as Markus & Julien have said,
there are a lot of issues which merit some input; it's great that you are
able to contribute. Just a quick side-note: as Julien said, we also maintain
a Nutchgora branch, which has some unique characteristics which you might
find interesting.

Best for now

Lewis

On Tue, Jan 17, 2012 at 9:31 PM, Eddie Drapkin edwa...@wolfram.com wrote:

  Alrighty!

 I checked out the JIRA and sort of attacked an issue I think I can
 contribute to... I'll look and try to find more as well.

 I can certainly write documentation if that's a need (when isn't it?),
 just someone point me at the areas that need better documentation and I'll
 do what I can.  You mentioned distributed mode, which is something I
 actually can't really document because it's not something we use - our
 crawler exists as a single intranet server and probably will for the
 foreseeable future.  Do I need any special account privileges to edit wiki
 pages (username is EdwardDrapkin)?

 We use Nutch here to crawl our various intranet sites to build Lucene
 indexes for a few search applications that we have (search.wolfram.com,
 mathworld, etc.).  I've written a rather hefty plugin for it to accommodate
 some of the custom functionality we need (I'd guess it's ~20,000 lines of
 code).  We have our search broken down by our sites (e.g.
 reference.wolfram.com is one index and mathworld is another), which are
 crawled separately, so a lot of our custom functionality is written in
 light of that, particularly scoring.  Because it's custom code for a single
 purpose, a lot of the code is also there to curate the data going into the
 index (custom parsers for a particular site to remove navigation elements,
 for instance).  The most (only, really) interesting thing that I've done
 with it is tracking wiki changes outside of the primary crawl database (I
 keep my own database of page modification times) and creating custom fetch
 lists, so that our wiki can be crawled nightly, as it's rather massive and
 hosted on a shared machine that can't support an intensive crawl every
 night.  I've also re-created the lucene index plugin as part of our plugin,
 as we don't use Solr, but our own search application.

 I'm working now on creating a comprehensive link-graph of all links for a
 particular crawl configuration, while still only crawling the correct URLs,
 so that we can experiment with using various page scoring algorithms.  This
 is why I wanted to not filter the links in the parse stage, so now I can
 have a crawldb with entries from anywhere on the internet while still only
 crawling a particular subdomain.

 I'm not sure what the standard use case is for Nutch, but I think we're
 probably a bit outside of it, but only a bit.

 Thanks,
 Eddie




 On 1/17/2012 1:22 PM, Julien Nioche wrote:

 Hi Eddie,

 Great to hear that! Just to add to what Markus said there are also quite a
 few tasks to do on the NutchGora branch if that's something you'd be
 interested in. Or outside the tasks on JIRA, there is always a fair bit to
 do on the Wiki e.g. how to run in distributed mode etc...

 Just out of curiosity, could you tell us a bit about what you've been
 using Nutch for at Wolfram Research?

 Thanks for volunteering

 Julien

 On 17 January 2012 19:15, Markus Jelsma markus.jel...@openindex.io wrote:

 Hi!

 Excellent! You may want to check the list of issues for 1.5. There are
 several
 issues being worked on from time to time and a number of open issues and
 even
 a few hairy problems. Contribution as patch or comment on any issue is
 always
 appreciated. You can also create issues to solve problems yourself as you
 did
 with the parser filters issue.

 Anything is welcome!

 Cheers,

  Hello all,
 
  I've got a bunch of spare time coming up in the next several
  weeks/months and would like to volunteer to help the project out.  I'm
  already extremely familiar with the internals of Nutch, as I've been
  hacking at it for our internal use here (at Wolfram Research) for the
  last ~1.5 years or so.  While there's probably a fair amount of code
  that I haven't read, I've at least visited and read some of all of the
  areas of Nutch's core and most of the plugins.
 
  I think I should put that knowledge to good use and contribute back
  (I've already sent some patches in, but nothing major or really even
  that significant), but I'm not sure what needs to be done or where my
  time would be best spent.  I just subscribed to this list, so if there's
  a thread discussing priorities that's current and whatnot, can someone
  point me to it in the archives?  Barring that, can someone point me in
  the direction where I should be looking to contribute?  My best guess is
  to just start attacking JIRA tickets...
 
  Thanks,
  Eddie