Re: I want to volunteer some time

2012-01-18 Thread Julien Nioche
Hi Eddie,

> I've also re-created the lucene index plugin as part of our plugin, as we
> don't use Solr, but our own search application.

One task you could be interested in is to make the indexing backends
pluggable. See https://issues.apache.org/jira/browse/NUTCH-1047  for
details. This would probably involve refactoring all the indexing related
code. Quite a bit of exploring to do but I think this would be both
interesting and useful.
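
To make the idea of a pluggable indexing backend concrete, here is a minimal sketch, written in Python for brevity rather than in Nutch's Java. Every name in it (`IndexWriter`, `InMemoryWriter`, `BACKENDS`, `index_documents`, the `backend` config key) is hypothetical and is not taken from NUTCH-1047 or from the Nutch codebase. It only illustrates the general shape: the indexing job talks to an abstract writer, and the concrete backend is chosen by configuration.

```python
from abc import ABC, abstractmethod


class IndexWriter(ABC):
    """Hypothetical backend interface; not Nutch's actual API."""

    @abstractmethod
    def open(self, config: dict) -> None:
        """Prepare the backend (connect, create directories, ...)."""

    @abstractmethod
    def write(self, doc: dict) -> None:
        """Add one document to the index."""

    @abstractmethod
    def close(self) -> None:
        """Flush and release resources."""


class InMemoryWriter(IndexWriter):
    """Toy backend that keeps documents in a list, standing in for Solr, Lucene, etc."""

    def __init__(self):
        self.docs = []
        self.is_open = False

    def open(self, config):
        self.is_open = True

    def write(self, doc):
        if not self.is_open:
            raise RuntimeError("writer is not open")
        self.docs.append(doc)

    def close(self):
        self.is_open = False


# Registry mapping a configured backend name to a writer class (assumed design).
BACKENDS = {"memory": InMemoryWriter}


def index_documents(docs, config):
    """Send every document to whichever backend the config names."""
    writer = BACKENDS[config["backend"]]()
    writer.open(config)
    try:
        for doc in docs:
            writer.write(doc)
    finally:
        writer.close()
    return writer
```

With a layout like this, supporting a custom search application would mean registering one more writer class, leaving the indexing job itself unchanged.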

Regarding the tutorial on distributed mode: it makes sense to run Nutch in
pseudo-distributed mode even if you have only one machine available. You
can then see the progress of your crawl using the Hadoop task tracker,
check the counters for the jobs, etc.

Julien

-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com


Re: I want to volunteer some time

2012-01-17 Thread Lewis John Mcgibbney
Hi Eddie,

I've added you to the AdminGroup for our wiki, you will be able to edit
whichever areas you are interested in, or which you think can/should be
improved.

Your introduction sounds really interesting and, as Markus & Julien have said,
there are a lot of issues which merit some input; it's great that you are
able to contribute. Just a quick side note: as Julien said, we also maintain
a Nutchgora branch, which has some unique characteristics which you might
find interesting.

Best for now

Lewis

On Tue, Jan 17, 2012 at 9:31 PM, Eddie Drapkin wrote:

>  Alrighty!
>
> I checked out the JIRA and sort of attacked an issue I think I can
> contribute to... I'll look and try to find more as well.
>
> I can certainly write documentation if that's a need (when isn't it?),
> just someone point me at the areas that need better documentation and I'll
> do what I can.  You mentioned distributed mode, which is something I
> actually can't really document because it's not something we use - our
> crawler exists as a single intranet server and probably will for the
> foreseeable future.  Do I need any special account privileges to edit wiki
> pages (username is EdwardDrapkin)?
>
> We use Nutch here to crawl our various intranet sites to build Lucene
> indexes for a few search applications that we have (search.wolfram.com,
> mathworld, etc.).  I've written a rather hefty plugin for it to accommodate
> some of the custom functionality we need (I'd guess it's ~20,000 lines of
> code).  We have our search broken down by our sites (e.g.
> reference.wolfram.com is one index and mathworld is another), which are
> crawled separately, so a lot of our custom functionality is written in
> light of that, particularly scoring.  Because it's custom code for a single
> purpose, a lot of the code is also there to curate the data going into the
> index (custom parsers for a particular site to remove navigation elements,
> for instance).  The most (only, really) interesting thing that I've done
> with it is tracking wiki changes outside of the primary crawl database (I
> keep my own database of page modification times) and creating custom fetch
> lists, so that our wiki can be crawled nightly, as it's rather massive and
> hosted on a shared machine that can't support an intensive crawl every
> night.  I've also re-created the lucene index plugin as part of our plugin,
> as we don't use Solr, but our own search application.
>
> I'm working now on creating a comprehensive link-graph of all links for a
> particular crawl configuration, while still only crawling the correct URLs,
> so that we can experiment with using various page scoring algorithms.  This
> is why I wanted to not filter the links in the parse stage, so now I can
> have a crawldb with entries from anywhere on the internet while still only
> crawling a particular subdomain.
>
> I'm not sure what the standard use case is for Nutch, but I think we're
> probably a bit outside of it, but only a bit.
>
> Thanks,
> Eddie
>
>
>
>
> On 1/17/2012 1:22 PM, Julien Nioche wrote:
>
> Hi Eddie,
>
> Great to hear that! Just to add to what Markus said there are also quite a
> few tasks to do on the NutchGora branch if that's something you'd be
> interested in. Or outside the tasks on JIRA, there is always a fair bit to
> do on the Wiki e.g. how to run in distributed mode etc...
>
> Just out of curiosity, could you tell us a bit about what you've been
> using Nutch for at Wolfram Research?
>
> Thanks for volunteering
>
> Julien
>
> On 17 January 2012 19:15, Markus Jelsma wrote:
>
>> Hi!
>>
>> Excellent! You may want to check the list of issues for 1.5. There are
>> several issues being worked on from time to time and a number of open
>> issues and even a few hairy problems. Contribution as patch or comment on
>> any issue is always appreciated. You can also create issues to solve
>> problems yourself as you did with the parser filters issue.
>>
>> Anything is welcome!
>>
>> Cheers,
>>
>> > Hello all,
>> >
>> > I've got a bunch of spare time coming up in the next several
>> > weeks/months and would like to volunteer to help the project out.  I'm
>> > already extremely familiar with the internals of Nutch, as I've been
>> > hacking at it for our internal use here (at Wolfram Research) for the
>> > last ~1.5 years or so.  While there's probably a fair amount of code
>> > that I haven't read, I've at least visited and read some of all of the
>> > areas of Nutch's core and most of the plugins.
>> >
>> > I think I should put that knowledge to good use and contribute back
>> > (I've already sent some patches in, but nothing major or really even
>> > that significant), but I'm not sure what needs to be done or where my
>> > time would be best spent.  I just subscribed to this list, so if there's
>> > a thread discussing priorities that's current and whatnot, can someone
>> > point me to it in the archives?  Barring that, can someone point me in
>> > the direction where I should be lookin

Re: I want to volunteer some time

2012-01-17 Thread Eddie Drapkin

Alrighty!

I checked out the JIRA and sort of attacked an issue I think I can 
contribute to... I'll look and try to find more as well.


I can certainly write documentation if that's a need (when isn't it?), 
just someone point me at the areas that need better documentation and 
I'll do what I can.  You mentioned distributed mode, which is something 
I actually can't really document because it's not something we use - our 
crawler exists as a single intranet server and probably will for the 
foreseeable future.  Do I need any special account privileges to edit 
wiki pages (username is EdwardDrapkin)?


We use Nutch here to crawl our various intranet sites to build Lucene 
indexes for a few search applications that we have (search.wolfram.com, 
mathworld, etc.).  I've written a rather hefty plugin for it to 
accommodate some of the custom functionality we need (I'd guess it's 
~20,000 lines of code).  We have our search broken down by our sites 
(e.g. reference.wolfram.com is one index and mathworld is another), 
which are crawled separately, so a lot of our custom functionality is 
written in light of that, particularly scoring.  Because it's custom 
code for a single purpose, a lot of the code is also there to curate the 
data going into the index (custom parsers for a particular site to 
remove navigation elements, for instance).  The most (only, really) 
interesting thing that I've done with it is tracking wiki changes 
outside of the primary crawl database (I keep my own database of page 
modification times) and creating custom fetch lists, so that our wiki 
can be crawled nightly, as it's rather massive and hosted on a shared 
machine that can't support an intensive crawl every night.  I've also 
re-created the lucene index plugin as part of our plugin, as we don't 
use Solr, but our own search application.
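
The custom fetch list idea described above can be sketched roughly as follows. This is an illustration only, in Python, and not the plugin's actual code: the function name `build_fetch_list` and the two dictionaries (URL to last-crawl time, URL to last-modification time) are assumptions made for the example.

```python
def build_fetch_list(last_crawled, last_modified):
    """Return the URLs that were modified after their last crawl, or never crawled.

    Both arguments map URL -> timestamp (any comparable value, e.g. epoch seconds).
    """
    due = []
    for url, modified in last_modified.items():
        crawled = last_crawled.get(url)
        # Fetch pages we have never seen, and pages edited since the last fetch.
        if crawled is None or modified > crawled:
            due.append(url)
    return sorted(due)
```

Only the pages in the returned list would be fetched on a given night, which keeps the load on a shared host proportional to the number of edits and not to the size of the wiki.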


I'm working now on creating a comprehensive link-graph of all links for 
a particular crawl configuration, while still only crawling the correct 
URLs, so that we can experiment with using various page scoring 
algorithms.  This is why I wanted to not filter the links in the parse 
stage, so now I can have a crawldb with entries from anywhere on the 
internet while still only crawling a particular subdomain.
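
A rough sketch of that separation, recording every outlink in the graph while only queueing in-scope URLs for fetching, might look like this. It is written in Python as an illustration; `in_scope`, `record_links`, and the example host names are hypothetical and do not reflect how Nutch's parse stage or crawldb actually store links.

```python
from urllib.parse import urlparse


def in_scope(url, allowed_host):
    """True if the URL's host is the allowed host or one of its subdomains."""
    host = urlparse(url).hostname or ""
    return host == allowed_host or host.endswith("." + allowed_host)


def record_links(graph, frontier, source, outlinks, allowed_host):
    """Add every outlink to the link graph, but queue only in-scope ones for crawling."""
    edges = graph.setdefault(source, set())
    for target in outlinks:
        edges.add(target)  # the graph keeps links to anywhere on the internet
        if in_scope(target, allowed_host):
            frontier.add(target)  # only in-scope URLs are ever fetched
```

Scoring algorithms can then run over the full `graph`, while the crawler consumes only `frontier`.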


I'm not sure what the standard use case is for Nutch, but I think we're 
probably a bit outside of it, but only a bit.


Thanks,
Eddie



On 1/17/2012 1:22 PM, Julien Nioche wrote:

Hi Eddie,

Great to hear that! Just to add to what Markus said there are also 
quite a few tasks to do on the NutchGora branch if that's something 
you'd be interested in. Or outside the tasks on JIRA, there is always 
a fair bit to do on the Wiki e.g. how to run in distributed mode etc...


Just out of curiosity, could you tell us a bit about what you've been 
using Nutch for at Wolfram Research?


Thanks for volunteering

Julien

On 17 January 2012 19:15, Markus Jelsma wrote:


Hi!

Excellent! You may want to check the list of issues for 1.5. There are
several issues being worked on from time to time and a number of open
issues and even a few hairy problems. Contribution as patch or comment
on any issue is always appreciated. You can also create issues to solve
problems yourself as you did with the parser filters issue.

Anything is welcome!

Cheers,

> Hello all,
>
> I've got a bunch of spare time coming up in the next several
> weeks/months and would like to volunteer to help the project out.  I'm
> already extremely familiar with the internals of Nutch, as I've been
> hacking at it for our internal use here (at Wolfram Research) for the
> last ~1.5 years or so.  While there's probably a fair amount of code
> that I haven't read, I've at least visited and read some of all of the
> areas of Nutch's core and most of the plugins.
>
> I think I should put that knowledge to good use and contribute back
> (I've already sent some patches in, but nothing major or really even
> that significant), but I'm not sure what needs to be done or where my
> time would be best spent.  I just subscribed to this list, so if there's
> a thread discussing priorities that's current and whatnot, can someone
> point me to it in the archives?  Barring that, can someone point me in
> the direction where I should be looking to contribute?  My best guess is
> to just start attacking JIRA tickets...
>
> Thanks,
> Eddie








Re: I want to volunteer some time

2012-01-17 Thread Julien Nioche
Hi Eddie,

Great to hear that! Just to add to what Markus said there are also quite a
few tasks to do on the NutchGora branch if that's something you'd be
interested in. Or outside the tasks on JIRA, there is always a fair bit to
do on the Wiki e.g. how to run in distributed mode etc...

Just out of curiosity, could you tell us a bit about what you've been using
Nutch for at Wolfram Research?

Thanks for volunteering

Julien

On 17 January 2012 19:15, Markus Jelsma wrote:

> Hi!
>
> Excellent! You may want to check the list of issues for 1.5. There are
> several issues being worked on from time to time and a number of open
> issues and even a few hairy problems. Contribution as patch or comment on
> any issue is always appreciated. You can also create issues to solve
> problems yourself as you did with the parser filters issue.
>
> Anything is welcome!
>
> Cheers,
>
> > Hello all,
> >
> > I've got a bunch of spare time coming up in the next several
> > weeks/months and would like to volunteer to help the project out.  I'm
> > already extremely familiar with the internals of Nutch, as I've been
> > hacking at it for our internal use here (at Wolfram Research) for the
> > last ~1.5 years or so.  While there's probably a fair amount of code
> > that I haven't read, I've at least visited and read some of all of the
> > areas of Nutch's core and most of the plugins.
> >
> > I think I should put that knowledge to good use and contribute back
> > (I've already sent some patches in, but nothing major or really even
> > that significant), but I'm not sure what needs to be done or where my
> > time would be best spent.  I just subscribed to this list, so if there's
> > a thread discussing priorities that's current and whatnot, can someone
> > point me to it in the archives?  Barring that, can someone point me in
> > the direction where I should be looking to contribute?  My best guess is
> > to just start attacking JIRA tickets...
> >
> > Thanks,
> > Eddie
>





Re: I want to volunteer some time

2012-01-17 Thread Markus Jelsma
Hi!

Excellent! You may want to check the list of issues for 1.5. There are several 
issues being worked on from time to time and a number of open issues and even 
a few hairy problems. Contribution as patch or comment on any issue is always 
appreciated. You can also create issues to solve problems yourself as you did 
with the parser filters issue.

Anything is welcome!

Cheers,

> Hello all,
> 
> I've got a bunch of spare time coming up in the next several
> weeks/months and would like to volunteer to help the project out.  I'm
> already extremely familiar with the internals of Nutch, as I've been
> hacking at it for our internal use here (at Wolfram Research) for the
> last ~1.5 years or so.  While there's probably a fair amount of code
> that I haven't read, I've at least visited and read some of all of the
> areas of Nutch's core and most of the plugins.
> 
> I think I should put that knowledge to good use and contribute back
> (I've already sent some patches in, but nothing major or really even
> that significant), but I'm not sure what needs to be done or where my
> time would be best spent.  I just subscribed to this list, so if there's
> a thread discussing priorities that's current and whatnot, can someone
> point me to it in the archives?  Barring that, can someone point me in
> the direction where I should be looking to contribute?  My best guess is
> to just start attacking JIRA tickets...
> 
> Thanks,
> Eddie