Re: Nutch 2.0 roadmap

2010-04-08 Thread Doğacan Güney
On Thu, Apr 8, 2010 at 21:11, MilleBii  wrote:
> Not sure what you mean by a pig script, but I'd like to be able to make a
> multi-criteria selection of URLs for fetching...

I mean a query language like

http://hadoop.apache.org/pig/

if we expose data correctly, then you should be able to generate on any criteria
that you want.
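
To make this concrete, here is a rough Java sketch of the kind of per-URL
selection hook such a script could drive, covering the multi-criteria selection
asked about below. The WebRow and GenerateSelector types and the "label" mark
are hypothetical, not existing Nutch or nutchbase API:

/** Hypothetical, minimal view of a stored URL row (not real Nutch/nutchbase API). */
interface WebRow {
  long getFetchTime();          // 0 if never fetched
  String getMark(String name);  // e.g. a user-supplied topic label, or null
}

/** Hypothetical per-URL selection hook a scripted generator could call. */
interface GenerateSelector {
  boolean select(String url, WebRow row, long now);
}

/** Example: the kind of multi-criteria selection discussed in this thread. */
class TopicAwareSelector implements GenerateSelector {
  private static final long WEEK  = 7L  * 24 * 60 * 60 * 1000;
  private static final long MONTH = 30L * 24 * 60 * 60 * 1000;

  public boolean select(String url, WebRow row, long now) {
    if (row.getFetchTime() == 0) return true;                // frontier: never fetched yet
    boolean topicA = "topicA".equals(row.getMark("label"));  // label set at parse time
    long age = now - row.getFetchTime();
    return topicA ? age > WEEK : age > MONTH;                // different re-crawl frequencies
  }
}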

> The scoring method forces a kind of one-dimensional approach
> which is not really easy to deal with.
>
> The regex filters are good, but they assume you want to select URLs based on
> data that is in the URL itself... pretty limited, in fact.
>
> I basically would like to do 'content'-based crawling. Say, for
> example, that I'm interested in "topic A".
> I'd like to label URLs that match "topic A" (user-supplied logic).
> Later on I would want to crawl "topic A" URLs at a certain frequency
> and non-labeled URLs, for exploration, in a different way.
>
> This looks hard to do right now.
>
> 2010/4/8, Doğacan Güney :
>> Hi,
>>
>> On Wed, Apr 7, 2010 at 21:19, MilleBii  wrote:
>>> Just a question:
>>> Will the new HBase implementation allow more sophisticated crawling
>>> strategies than the current score-based one?
>>>
>>> To give a few examples of what I'd like to do:
>>> Define different crawling frequencies for different sets of URLs, say
>>> weekly for some URLs, monthly or more for others.
>>>
>>> Select URLs to re-crawl based on attributes previously extracted. Just
>>> one example: recrawl URLs that contained a certain keyword (or set of keywords).
>>>
>>> Select URLs that have not yet been crawled, i.e. those at the frontier of
>>> the crawl.
>>>
>>
>> At some point, it would be nice to change the generator so that it is only a
>> handful of methods plus a Pig (or something else) script. We would provide
>> most of the functions you may need during generation (accessing various data),
>> but the actual generation would be a Pig process. This way, anyone can easily
>> change generation any way they want (even split it into more than two jobs if
>> they want more complex schemes).
>>
>>>
>>>
>>>
>>> 2010/4/7, Doğacan Güney :
 Hey everyone,

 On Tue, Apr 6, 2010 at 20:23, Andrzej Bialecki  wrote:
> On 2010-04-06 15:43, Julien Nioche wrote:
>> Hi guys,
>>
>> I gather that we'll jump straight to  2.0 after 1.1 and that 2.0 will
>> be
>> based on what is currently referred to as NutchBase. Shall we create a
>> branch for 2.0 in the Nutch SVN repository and have a label accordingly
>> for
>> JIRA so that we can file issues / feature requests on 2.0? Do you think
>> that
>> the current NutchBase could be used as a basis for the 2.0 branch?
>
> I'm not sure what is the status of the nutchbase - it's missed a lot of
> fixes and changes in trunk since it's been last touched ...
>

 I know... But I still intend to finish it, I just need to schedule
 some time for it.

 My vote would be to go with nutchbase.

>>
>> Talking about features, what else would we add apart from :
>>
>> * support for HBase : via ORM or not (see
>> NUTCH-808
>> )
>
> This IMHO is promising, this could open the doors to small-to-medium
> installations that are currently too cumbersome to handle.
>

 Yeah, there is already a simple ORM within nutchbase that is
 avro-based and should
 be generic enough to also support MySQL, cassandra and berkeleydb. But
 any good ORM will
 be a very good addition.

>> * plugin cleanup : Tika only for parsing - get rid of everything else?
>
> Basically, yes - keep only stuff like HtmlParseFilters (probably with a
> different API) so that we can post-process the DOM created in Tika from
> whatever original format.
>
> Also, the goal of the crawler-commons project is to provide APIs and
> implementations of stuff that is needed for every open source crawler
> project, like: robots handling, url filtering and url normalization, URL
> state management, perhaps deduplication. We should coordinate our
> efforts, and share code freely so that other projects (bixo, heritrix,
> droids) may contribute to this shared pool of functionality, much like
> Tika does for the common need of parsing complex formats.
>
>> * remove index / search and delegate to SOLR
>
> +1 - we may still keep a thin abstract layer to allow other
> indexing/search backends, but the current mess of indexing/query filters
> and competing indexing frameworks (lucene, fields, solr) should go away.
> We should go directly from DOM to a NutchDocument, and stop there.
>

 Agreed. I would like to add support for katta and other indexing
 backends at some point but
 NutchDocument should be our canonical representation. The rest should
 be up to indexing backends.

> Regarding search - currently the search API is too low-level, with the
> custom text and query analysis chains. This needlessly introduces the
> (in)famous Nutch Query classes and Nutch query syntax limitations. We
> should get rid of it and simply leave this part of the processing to the
> search backend. Probably we will use the SolrCloud branch that supports
> sharding and global IDF.

Re: Nutch 2.0 roadmap

2010-04-08 Thread MilleBii
Not sure what you mean by a pig script, but I'd like to be able to make a
multi-criteria selection of URLs for fetching...
The scoring method forces a kind of one-dimensional approach
which is not really easy to deal with.

The regex filters are good, but they assume you want to select URLs based on
data that is in the URL itself... pretty limited, in fact.

I basically would like to do 'content'-based crawling. Say, for
example, that I'm interested in "topic A".
I'd like to label URLs that match "topic A" (user-supplied logic).
Later on I would want to crawl "topic A" URLs at a certain frequency
and non-labeled URLs, for exploration, in a different way.

This looks hard to do right now.
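
A minimal sketch of how such labeling could plug in at parse time; the
TopicLabeler interface is hypothetical (the real Nutch parse-filter API
differs), and the keyword matching is only a toy example:

/** Hypothetical parse-time hook (not the real HtmlParseFilter signature). */
interface TopicLabeler {
  /** Return a label such as "topicA", or null if the page does not match. */
  String label(String url, String extractedText);
}

/** Toy example: mark pages whose extracted text mentions any configured keyword. */
class KeywordTopicLabeler implements TopicLabeler {
  private final java.util.Set<String> keywords;
  private final String topic;

  KeywordTopicLabeler(String topic, java.util.Set<String> keywords) {
    this.topic = topic;
    this.keywords = keywords;
  }

  public String label(String url, String extractedText) {
    String text = extractedText.toLowerCase(java.util.Locale.ROOT);
    for (String kw : keywords) {
      if (text.contains(kw.toLowerCase(java.util.Locale.ROOT))) {
        return topic;  // stored in page metadata, selectable at generate time
      }
    }
    return null;       // unlabeled: treated as "exploration" URLs later on
  }
}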

2010/4/8, Doğacan Güney :
> Hi,
>
> On Wed, Apr 7, 2010 at 21:19, MilleBii  wrote:
>> Just a question:
>> Will the new HBase implementation allow more sophisticated crawling
>> strategies than the current score-based one?
>>
>> To give a few examples of what I'd like to do:
>> Define different crawling frequencies for different sets of URLs, say
>> weekly for some URLs, monthly or more for others.
>>
>> Select URLs to re-crawl based on attributes previously extracted. Just
>> one example: recrawl URLs that contained a certain keyword (or set of keywords).
>>
>> Select URLs that have not yet been crawled, i.e. those at the frontier of
>> the crawl.
>>
>
> At some point, it would be nice to change the generator so that it is only a
> handful of methods plus a Pig (or something else) script. We would provide
> most of the functions you may need during generation (accessing various data),
> but the actual generation would be a Pig process. This way, anyone can easily
> change generation any way they want (even split it into more than two jobs if
> they want more complex schemes).
>
>>
>>
>>
>> 2010/4/7, Doğacan Güney :
>>> Hey everyone,
>>>
>>> On Tue, Apr 6, 2010 at 20:23, Andrzej Bialecki  wrote:
 On 2010-04-06 15:43, Julien Nioche wrote:
> Hi guys,
>
> I gather that we'll jump straight to  2.0 after 1.1 and that 2.0 will
> be
> based on what is currently referred to as NutchBase. Shall we create a
> branch for 2.0 in the Nutch SVN repository and have a label accordingly
> for
> JIRA so that we can file issues / feature requests on 2.0? Do you think
> that
> the current NutchBase could be used as a basis for the 2.0 branch?

 I'm not sure what is the status of the nutchbase - it's missed a lot of
 fixes and changes in trunk since it's been last touched ...

>>>
>>> I know... But I still intend to finish it, I just need to schedule
>>> some time for it.
>>>
>>> My vote would be to go with nutchbase.
>>>
>
> Talking about features, what else would we add apart from :
>
> * support for HBase : via ORM or not (see
> NUTCH-808
> )

 This IMHO is promising, this could open the doors to small-to-medium
 installations that are currently too cumbersome to handle.

>>>
>>> Yeah, there is already a simple ORM within nutchbase that is
>>> avro-based and should
>>> be generic enough to also support MySQL, cassandra and berkeleydb. But
>>> any good ORM will
>>> be a very good addition.
>>>
> * plugin cleanup : Tika only for parsing - get rid of everything else?

 Basically, yes - keep only stuff like HtmlParseFilters (probably with a
 different API) so that we can post-process the DOM created in Tika from
 whatever original format.

 Also, the goal of the crawler-commons project is to provide APIs and
 implementations of stuff that is needed for every open source crawler
 project, like: robots handling, url filtering and url normalization, URL
 state management, perhaps deduplication. We should coordinate our
 efforts, and share code freely so that other projects (bixo, heritrix,
 droids) may contribute to this shared pool of functionality, much like
 Tika does for the common need of parsing complex formats.

> * remove index / search and delegate to SOLR

 +1 - we may still keep a thin abstract layer to allow other
 indexing/search backends, but the current mess of indexing/query filters
 and competing indexing frameworks (lucene, fields, solr) should go away.
 We should go directly from DOM to a NutchDocument, and stop there.

>>>
>>> Agreed. I would like to add support for katta and other indexing
>>> backends at some point but
>>> NutchDocument should be our canonical representation. The rest should
>>> be up to indexing backends.
>>>
 Regarding search - currently the search API is too low-level, with the
 custom text and query analysis chains. This needlessly introduces the
 (in)famous Nutch Query classes and Nutch query syntax limitations. We
 should get rid of it and simply leave this part of the processing to the
 search backend. Probably we will use the SolrCloud branch that supports
 sharding and global IDF.

> 

Re: Nutch 2.0 roadmap

2010-04-08 Thread Doğacan Güney
Hi,

On Wed, Apr 7, 2010 at 21:19, MilleBii  wrote:
> Just a question:
> Will the new HBase implementation allow more sophisticated crawling
> strategies than the current score-based one?
>
> To give a few examples of what I'd like to do:
> Define different crawling frequencies for different sets of URLs, say
> weekly for some URLs, monthly or more for others.
>
> Select URLs to re-crawl based on attributes previously extracted. Just
> one example: recrawl URLs that contained a certain keyword (or set of keywords).
>
> Select URLs that have not yet been crawled, i.e. those at the frontier of
> the crawl.
>

At some point, it would be nice to change the generator so that it is only a
handful of methods plus a Pig (or something else) script. We would provide
most of the functions you may need during generation (accessing various data),
but the actual generation would be a Pig process. This way, anyone can easily
change generation any way they want (even split it into more than two jobs if
they want more complex schemes).

>
>
>
> 2010/4/7, Doğacan Güney :
>> Hey everyone,
>>
>> On Tue, Apr 6, 2010 at 20:23, Andrzej Bialecki  wrote:
>>> On 2010-04-06 15:43, Julien Nioche wrote:
 Hi guys,

 I gather that we'll jump straight to  2.0 after 1.1 and that 2.0 will be
 based on what is currently referred to as NutchBase. Shall we create a
 branch for 2.0 in the Nutch SVN repository and have a label accordingly
 for
 JIRA so that we can file issues / feature requests on 2.0? Do you think
 that
 the current NutchBase could be used as a basis for the 2.0 branch?
>>>
>>> I'm not sure what is the status of the nutchbase - it's missed a lot of
>>> fixes and changes in trunk since it's been last touched ...
>>>
>>
>> I know... But I still intend to finish it, I just need to schedule
>> some time for it.
>>
>> My vote would be to go with nutchbase.
>>

 Talking about features, what else would we add apart from :

 * support for HBase : via ORM or not (see
 NUTCH-808
 )
>>>
>>> This IMHO is promising, this could open the doors to small-to-medium
>>> installations that are currently too cumbersome to handle.
>>>
>>
>> Yeah, there is already a simple ORM within nutchbase that is
>> avro-based and should
>> be generic enough to also support MySQL, cassandra and berkeleydb. But
>> any good ORM will
>> be a very good addition.
>>
 * plugin cleanup : Tika only for parsing - get rid of everything else?
>>>
>>> Basically, yes - keep only stuff like HtmlParseFilters (probably with a
>>> different API) so that we can post-process the DOM created in Tika from
>>> whatever original format.
>>>
>>> Also, the goal of the crawler-commons project is to provide APIs and
>>> implementations of stuff that is needed for every open source crawler
>>> project, like: robots handling, url filtering and url normalization, URL
>>> state management, perhaps deduplication. We should coordinate our
>>> efforts, and share code freely so that other projects (bixo, heritrix,
>>> droids) may contribute to this shared pool of functionality, much like
>>> Tika does for the common need of parsing complex formats.
>>>
 * remove index / search and delegate to SOLR
>>>
>>> +1 - we may still keep a thin abstract layer to allow other
>>> indexing/search backends, but the current mess of indexing/query filters
>>> and competing indexing frameworks (lucene, fields, solr) should go away.
>>> We should go directly from DOM to a NutchDocument, and stop there.
>>>
>>
>> Agreed. I would like to add support for katta and other indexing
>> backends at some point but
>> NutchDocument should be our canonical representation. The rest should
>> be up to indexing backends.
>>
>>> Regarding search - currently the search API is too low-level, with the
>>> custom text and query analysis chains. This needlessly introduces the
>>> (in)famous Nutch Query classes and Nutch query syntax limitations. We
>>> should get rid of it and simply leave this part of the processing to the
>>> search backend. Probably we will use the SolrCloud branch that supports
>>> sharding and global IDF.
>>>
 * new functionalities e.g. sitemap support, canonical tag etc...
>>>
>>> Plus a better handling of redirects, detecting duplicated sites,
>>> detection of spam cliques, tools to manage the webgraph, etc.
>>>

 I suppose that http://wiki.apache.org/nutch/Nutch2Architecture needs an
 update?
>>>
>>> Definitely. :)
>>>
>>> --
>>> Best regards,
>>> Andrzej Bialecki     <><
>>>  ___. ___ ___ ___ _ _   __
>>> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
>>> ___|||__||  \|  ||  |  Embedded Unix, System Integration
>>> http://www.sigram.com  Contact: info at sigram dot com
>>>
>>>
>>
>>
>>
>> --
>> Doğacan Güney
>>
>
>
> --
> -MilleBii-
>



-- 
Doğacan Güney


Re: Nutch 2.0 roadmap

2010-04-08 Thread Doğacan Güney
On Wed, Apr 7, 2010 at 20:32, Andrzej Bialecki  wrote:
> On 2010-04-07 18:54, Doğacan Güney wrote:
>> Hey everyone,
>>
>> On Tue, Apr 6, 2010 at 20:23, Andrzej Bialecki  wrote:
>>> On 2010-04-06 15:43, Julien Nioche wrote:
 Hi guys,

 I gather that we'll jump straight to  2.0 after 1.1 and that 2.0 will be
 based on what is currently referred to as NutchBase. Shall we create a
 branch for 2.0 in the Nutch SVN repository and have a label accordingly for
 JIRA so that we can file issues / feature requests on 2.0? Do you think 
 that
 the current NutchBase could be used as a basis for the 2.0 branch?
>>>
>>> I'm not sure what is the status of the nutchbase - it's missed a lot of
>>> fixes and changes in trunk since it's been last touched ...
>>>
>>
>> I know... But I still intend to finish it, I just need to schedule
>> some time for it.
>>
>> My vote would be to go with nutchbase.
>
> Hmm .. this puzzles me, do you think we should port changes from 1.1 to
> nutchbase? I thought we should do it the other way around, i.e. merge
> nutchbase bits to trunk.
>

Hmm, I am a bit out of touch with the latest changes, but I know that
the differences between trunk and nutchbase are unfortunately rather large
right now. If merging nutchbase back into trunk would be easier, then sure,
let's do that.

>
 * support for HBase : via ORM or not (see
 NUTCH-808
 )
>>>
>>> This IMHO is promising, this could open the doors to small-to-medium
>>> installations that are currently too cumbersome to handle.
>>>
>>
>> Yeah, there is already a simple ORM within nutchbase that is
>> avro-based and should
>> be generic enough to also support MySQL, cassandra and berkeleydb. But
>> any good ORM will
>> be a very good addition.
>
> Again, the advantage of DataNucleus is that we don't have to handcraft
> all the mid- to low-level mappings, just the mid-level ones (JOQL or
> whatever) - the cost of maintenance is lower, and the number of backends
> that are supported out of the box is larger. Of course, this is just
> IMHO - we won't know for sure until we try to use both your custom ORM
> and DataNucleus...

I am obviously a bit biased here, but I have no strong feelings really.
DataNucleus is an excellent project. What I like about the Avro-based approach
is the essentially free MapReduce support we get and the fact that supporting
another language is easy. So we can expose partial HBase data through a server,
and a Python client can easily read/write to it, thanks to Avro. That being
said, I am all for DataNucleus or something else.
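
For illustration, a minimal sketch using Avro's generic API; the WebPage schema
below is made up for this example and is not the actual nutchbase schema (it
also assumes an Avro version that provides Schema.Parser):

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;

public class WebPageRecordDemo {
  // Illustrative schema only; the real nutchbase schema is richer.
  private static final String SCHEMA_JSON =
      "{\"type\":\"record\",\"name\":\"WebPage\",\"fields\":["
    + "{\"name\":\"url\",\"type\":\"string\"},"
    + "{\"name\":\"fetchTime\",\"type\":\"long\"},"
    + "{\"name\":\"score\",\"type\":\"float\"}]}";

  public static void main(String[] args) {
    Schema schema = new Schema.Parser().parse(SCHEMA_JSON);
    GenericRecord page = new GenericData.Record(schema);
    page.put("url", "http://example.com/");
    page.put("fetchTime", System.currentTimeMillis());
    page.put("score", 1.0f);
    // Because the record is described by a language-neutral schema, the same
    // data can be serialized and read back from Python, etc., over an Avro server.
    System.out.println(page);
  }
}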

>
> --
> Best regards,
> Andrzej Bialecki     <><
>  ___. ___ ___ ___ _ _   __
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>
>



-- 
Doğacan Güney


Re: Nutch 2.0 roadmap

2010-04-07 Thread MilleBii
Just a question:
Will the new HBase implementation allow more sophisticated crawling
strategies than the current score-based one?

To give a few examples of what I'd like to do:
Define different crawling frequencies for different sets of URLs, say
weekly for some URLs, monthly or more for others.

Select URLs to re-crawl based on attributes previously extracted. Just
one example: recrawl URLs that contained a certain keyword (or set of keywords).

Select URLs that have not yet been crawled, i.e. those at the frontier of
the crawl.




2010/4/7, Doğacan Güney :
> Hey everyone,
>
> On Tue, Apr 6, 2010 at 20:23, Andrzej Bialecki  wrote:
>> On 2010-04-06 15:43, Julien Nioche wrote:
>>> Hi guys,
>>>
>>> I gather that we'll jump straight to  2.0 after 1.1 and that 2.0 will be
>>> based on what is currently referred to as NutchBase. Shall we create a
>>> branch for 2.0 in the Nutch SVN repository and have a label accordingly
>>> for
>>> JIRA so that we can file issues / feature requests on 2.0? Do you think
>>> that
>>> the current NutchBase could be used as a basis for the 2.0 branch?
>>
>> I'm not sure what is the status of the nutchbase - it's missed a lot of
>> fixes and changes in trunk since it's been last touched ...
>>
>
> I know... But I still intend to finish it, I just need to schedule
> some time for it.
>
> My vote would be to go with nutchbase.
>
>>>
>>> Talking about features, what else would we add apart from :
>>>
>>> * support for HBase : via ORM or not (see
>>> NUTCH-808
>>> )
>>
>> This IMHO is promising, this could open the doors to small-to-medium
>> installations that are currently too cumbersome to handle.
>>
>
> Yeah, there is already a simple ORM within nutchbase that is
> avro-based and should
> be generic enough to also support MySQL, cassandra and berkeleydb. But
> any good ORM will
> be a very good addition.
>
>>> * plugin cleanup : Tika only for parsing - get rid of everything else?
>>
>> Basically, yes - keep only stuff like HtmlParseFilters (probably with a
>> different API) so that we can post-process the DOM created in Tika from
>> whatever original format.
>>
>> Also, the goal of the crawler-commons project is to provide APIs and
>> implementations of stuff that is needed for every open source crawler
>> project, like: robots handling, url filtering and url normalization, URL
>> state management, perhaps deduplication. We should coordinate our
>> efforts, and share code freely so that other projects (bixo, heritrix,
>> droids) may contribute to this shared pool of functionality, much like
>> Tika does for the common need of parsing complex formats.
>>
>>> * remove index / search and delegate to SOLR
>>
>> +1 - we may still keep a thin abstract layer to allow other
>> indexing/search backends, but the current mess of indexing/query filters
>> and competing indexing frameworks (lucene, fields, solr) should go away.
>> We should go directly from DOM to a NutchDocument, and stop there.
>>
>
> Agreed. I would like to add support for katta and other indexing
> backends at some point but
> NutchDocument should be our canonical representation. The rest should
> be up to indexing backends.
>
>> Regarding search - currently the search API is too low-level, with the
>> custom text and query analysis chains. This needlessly introduces the
>> (in)famous Nutch Query classes and Nutch query syntax limitations. We
>> should get rid of it and simply leave this part of the processing to the
>> search backend. Probably we will use the SolrCloud branch that supports
>> sharding and global IDF.
>>
>>> * new functionalities e.g. sitemap support, canonical tag etc...
>>
>> Plus a better handling of redirects, detecting duplicated sites,
>> detection of spam cliques, tools to manage the webgraph, etc.
>>
>>>
>>> I suppose that http://wiki.apache.org/nutch/Nutch2Architecture needs an
>>> update?
>>
>> Definitely. :)
>>
>> --
>> Best regards,
>> Andrzej Bialecki     <><
>>  ___. ___ ___ ___ _ _   __
>> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
>> ___|||__||  \|  ||  |  Embedded Unix, System Integration
>> http://www.sigram.com  Contact: info at sigram dot com
>>
>>
>
>
>
> --
> Doğacan Güney
>


-- 
-MilleBii-


Re: Nutch 2.0 roadmap

2010-04-07 Thread Andrzej Bialecki
On 2010-04-07 19:24, Enis Söztutar wrote:

>>> Also, the goal of the crawler-commons project is to provide APIs and
>>> implementations of stuff that is needed for every open source crawler
>>> project, like: robots handling, url filtering and url normalization, URL
>>> state management, perhaps deduplication. We should coordinate our
>>> efforts, and share code freely so that other projects (bixo, heritrix,
>>> droids) may contribute to this shared pool of functionality, much like
>>> Tika does for the common need of parsing complex formats.
>>>
>>>  
> 
> So, it seems that at some point, we need to bite the bullet, and
> refactor plugins, dropping backwards compatibility.

Right, that was my point - now is the time to break it, with the
cut-over to 2.0, leaving the 1.1 branch in good shape to serve well
enough in the interim period.


-- 
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Nutch 2.0 roadmap

2010-04-07 Thread Andrzej Bialecki
On 2010-04-07 18:54, Doğacan Güney wrote:
> Hey everyone,
> 
> On Tue, Apr 6, 2010 at 20:23, Andrzej Bialecki  wrote:
>> On 2010-04-06 15:43, Julien Nioche wrote:
>>> Hi guys,
>>>
>>> I gather that we'll jump straight to  2.0 after 1.1 and that 2.0 will be
>>> based on what is currently referred to as NutchBase. Shall we create a
>>> branch for 2.0 in the Nutch SVN repository and have a label accordingly for
>>> JIRA so that we can file issues / feature requests on 2.0? Do you think that
>>> the current NutchBase could be used as a basis for the 2.0 branch?
>>
>> I'm not sure what is the status of the nutchbase - it's missed a lot of
>> fixes and changes in trunk since it's been last touched ...
>>
> 
> I know... But I still intend to finish it, I just need to schedule
> some time for it.
> 
> My vote would be to go with nutchbase.

Hmm .. this puzzles me, do you think we should port changes from 1.1 to
nutchbase? I thought we should do it the other way around, i.e. merge
nutchbase bits to trunk.


>>> * support for HBase : via ORM or not (see
>>> NUTCH-808
>>> )
>>
>> This IMHO is promising, this could open the doors to small-to-medium
>> installations that are currently too cumbersome to handle.
>>
> 
> Yeah, there is already a simple ORM within nutchbase that is
> avro-based and should
> be generic enough to also support MySQL, cassandra and berkeleydb. But
> any good ORM will
> be a very good addition.

Again, the advantage of DataNucleus is that we don't have to handcraft
all the mid- to low-level mappings, just the mid-level ones (JDOQL or
whatever) - the cost of maintenance is lower, and the number of backends
that are supported out of the box is larger. Of course, this is just
IMHO - we won't know for sure until we try to use both your custom ORM
and DataNucleus...
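
For comparison, a made-up example of the kind of mid-level JDO mapping one
would write for DataNucleus, leaving the low-level datastore plumbing to the
ORM; the WebPage class and its fields are purely illustrative. Queries against
such a class would then be written in JDOQL rather than as hand-coded scans.

import javax.jdo.annotations.PersistenceCapable;
import javax.jdo.annotations.Persistent;
import javax.jdo.annotations.PrimaryKey;

// Hypothetical mid-level mapping: annotate once, let DataNucleus generate the
// low-level plumbing for whichever datastore backend is configured.
@PersistenceCapable
public class WebPage {
  @PrimaryKey
  private String url;

  @Persistent
  private long fetchTime;

  @Persistent
  private float score;

  protected WebPage() { }  // no-arg constructor for the JDO enhancer

  public WebPage(String url) { this.url = url; }

  public String getUrl() { return url; }
  public long getFetchTime() { return fetchTime; }
  public void setFetchTime(long fetchTime) { this.fetchTime = fetchTime; }
  public float getScore() { return score; }
  public void setScore(float score) { this.score = score; }
}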

-- 
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Nutch 2.0 roadmap

2010-04-07 Thread Enis Söztutar
Forgot to say that at Hadoop it is the convention that big issues, like the
ones under discussion, come with a design document, so that a solid design is
agreed upon for the work. We can apply the same pattern at Nutch.


On 04/07/2010 07:54 PM, Doğacan Güney wrote:

Hey everyone,

On Tue, Apr 6, 2010 at 20:23, Andrzej Bialecki  wrote:
   

On 2010-04-06 15:43, Julien Nioche wrote:
 

Hi guys,

I gather that we'll jump straight to  2.0 after 1.1 and that 2.0 will be
based on what is currently referred to as NutchBase. Shall we create a
branch for 2.0 in the Nutch SVN repository and have a label accordingly for
JIRA so that we can file issues / feature requests on 2.0? Do you think that
the current NutchBase could be used as a basis for the 2.0 branch?
   

I'm not sure what is the status of the nutchbase - it's missed a lot of
fixes and changes in trunk since it's been last touched ...

 

I know... But I still intend to finish it, I just need to schedule
some time for it.

My vote would be to go with nutchbase.

   

Talking about features, what else would we add apart from :

* support for HBase : via ORM or not (see
NUTCH-808
)
   

This IMHO is promising, this could open the doors to small-to-medium
installations that are currently too cumbersome to handle.

 

Yeah, there is already a simple ORM within nutchbase that is
avro-based and should
be generic enough to also support MySQL, cassandra and berkeleydb. But
any good ORM will
be a very good addition.

   

* plugin cleanup : Tika only for parsing - get rid of everything else?
   

Basically, yes - keep only stuff like HtmlParseFilters (probably with a
different API) so that we can post-process the DOM created in Tika from
whatever original format.

Also, the goal of the crawler-commons project is to provide APIs and
implementations of stuff that is needed for every open source crawler
project, like: robots handling, url filtering and url normalization, URL
state management, perhaps deduplication. We should coordinate our
efforts, and share code freely so that other projects (bixo, heritrix,
droids) may contribute to this shared pool of functionality, much like
Tika does for the common need of parsing complex formats.

 

* remove index / search and delegate to SOLR
   

+1 - we may still keep a thin abstract layer to allow other
indexing/search backends, but the current mess of indexing/query filters
and competing indexing frameworks (lucene, fields, solr) should go away.
We should go directly from DOM to a NutchDocument, and stop there.

 

Agreed. I would like to add support for katta and other indexing
backends at some point but
NutchDocument should be our canonical representation. The rest should
be up to indexing backends.

   

Regarding search - currently the search API is too low-level, with the
custom text and query analysis chains. This needlessly introduces the
(in)famous Nutch Query classes and Nutch query syntax limitations. We
should get rid of it and simply leave this part of the processing to the
search backend. Probably we will use the SolrCloud branch that supports
sharding and global IDF.

 

* new functionalities e.g. sitemap support, canonical tag etc...
   

Plus a better handling of redirects, detecting duplicated sites,
detection of spam cliques, tools to manage the webgraph, etc.

 

I suppose that http://wiki.apache.org/nutch/Nutch2Architecture needs an
update?
   

Definitely. :)

--
Best regards,
Andrzej Bialecki<><
  ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


 



   




Re: Nutch 2.0 roadmap

2010-04-07 Thread Enis Söztutar

Hi,

On 04/07/2010 07:54 PM, Doğacan Güney wrote:

Hey everyone,

On Tue, Apr 6, 2010 at 20:23, Andrzej Bialecki  wrote:
   

On 2010-04-06 15:43, Julien Nioche wrote:
 

Hi guys,

I gather that we'll jump straight to  2.0 after 1.1 and that 2.0 will be
based on what is currently referred to as NutchBase. Shall we create a
branch for 2.0 in the Nutch SVN repository and have a label accordingly for
JIRA so that we can file issues / feature requests on 2.0? Do you think that
the current NutchBase could be used as a basis for the 2.0 branch?
   

I'm not sure what is the status of the nutchbase - it's missed a lot of
fixes and changes in trunk since it's been last touched ...

 

I know... But I still intend to finish it, I just need to schedule
some time for it.

My vote would be to go with nutchbase.
   


A suggestion would be to continue with trunk until nutchbase is stable.
Once it is, we can merge the nutchbase branch into trunk (after the 1.1
split), at which point trunk becomes nutchbase plus the other merged issues.
Then, when the time comes, we can fork branch-2.0 and release when the
blockers are done. I strongly advise against having both a trunk and a 2.0
branch for development.


   

Talking about features, what else would we add apart from :

* support for HBase : via ORM or not (see
NUTCH-808
)
   

This IMHO is promising, this could open the doors to small-to-medium
installations that are currently too cumbersome to handle.

 

Yeah, there is already a simple ORM within nutchbase that is
avro-based and should
be generic enough to also support MySQL, cassandra and berkeleydb. But
any good ORM will
be a very good addition.
   
The current ORM code is merged with the nutchbase code, but I think the sooner
we split it the better, since development will be much clearer and simpler this
way. I have opened NUTCH-808 to explore the alternatives, but we might as well
continue with the current implementation. I intend to share my findings in a
couple of days.


   

* plugin cleanup : Tika only for parsing - get rid of everything else?
   

Basically, yes - keep only stuff like HtmlParseFilters (probably with a
different API) so that we can post-process the DOM created in Tika from
whatever original format.

Also, the goal of the crawler-commons project is to provide APIs and
implementations of stuff that is needed for every open source crawler
project, like: robots handling, url filtering and url normalization, URL
state management, perhaps deduplication. We should coordinate our
efforts, and share code freely so that other projects (bixo, heritrix,
droids) may contribute to this shared pool of functionality, much like
Tika does for the common need of parsing complex formats.

 


So, it seems that at some point, we need to bite the bullet, and 
refactor plugins, dropping backwards compatibility.



* remove index / search and delegate to SOLR
   

+1 - we may still keep a thin abstract layer to allow other
indexing/search backends, but the current mess of indexing/query filters
and competing indexing frameworks (lucene, fields, solr) should go away.
We should go directly from DOM to a NutchDocument, and stop there.

 

Agreed. I would like to add support for katta and other indexing
backends at some point but
NutchDocument should be our canonical representation. The rest should
be up to indexing backends.

   

Regarding search - currently the search API is too low-level, with the
custom text and query analysis chains. This needlessly introduces the
(in)famous Nutch Query classes and Nutch query syntax limitations. We
should get rid of it and simply leave this part of the processing to the
search backend. Probably we will use the SolrCloud branch that supports
sharding and global IDF.

 

* new functionalities e.g. sitemap support, canonical tag etc...
   

Plus a better handling of redirects, detecting duplicated sites,
detection of spam cliques, tools to manage the webgraph, etc.

 

I suppose that http://wiki.apache.org/nutch/Nutch2Architecture needs an
update?
   

Definitely. :)

--
Best regards,
Andrzej Bialecki<><
  ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


 



   




Re: Nutch 2.0 roadmap

2010-04-07 Thread Doğacan Güney
Hey everyone,

On Tue, Apr 6, 2010 at 20:23, Andrzej Bialecki  wrote:
> On 2010-04-06 15:43, Julien Nioche wrote:
>> Hi guys,
>>
>> I gather that we'll jump straight to  2.0 after 1.1 and that 2.0 will be
>> based on what is currently referred to as NutchBase. Shall we create a
>> branch for 2.0 in the Nutch SVN repository and have a label accordingly for
>> JIRA so that we can file issues / feature requests on 2.0? Do you think that
>> the current NutchBase could be used as a basis for the 2.0 branch?
>
> I'm not sure what is the status of the nutchbase - it's missed a lot of
> fixes and changes in trunk since it's been last touched ...
>

I know... But I still intend to finish it, I just need to schedule
some time for it.

My vote would be to go with nutchbase.

>>
>> Talking about features, what else would we add apart from :
>>
>> * support for HBase : via ORM or not (see
>> NUTCH-808
>> )
>
> This IMHO is promising, this could open the doors to small-to-medium
> installations that are currently too cumbersome to handle.
>

Yeah, there is already a simple ORM within nutchbase that is Avro-based and
should be generic enough to also support MySQL, Cassandra and BerkeleyDB. But
any good ORM will be a very good addition.

>> * plugin cleanup : Tika only for parsing - get rid of everything else?
>
> Basically, yes - keep only stuff like HtmlParseFilters (probably with a
> different API) so that we can post-process the DOM created in Tika from
> whatever original format.
>
> Also, the goal of the crawler-commons project is to provide APIs and
> implementations of stuff that is needed for every open source crawler
> project, like: robots handling, url filtering and url normalization, URL
> state management, perhaps deduplication. We should coordinate our
> efforts, and share code freely so that other projects (bixo, heritrix,
> droids) may contribute to this shared pool of functionality, much like
> Tika does for the common need of parsing complex formats.
>
>> * remove index / search and delegate to SOLR
>
> +1 - we may still keep a thin abstract layer to allow other
> indexing/search backends, but the current mess of indexing/query filters
> and competing indexing frameworks (lucene, fields, solr) should go away.
> We should go directly from DOM to a NutchDocument, and stop there.
>

Agreed. I would like to add support for Katta and other indexing backends at
some point, but NutchDocument should be our canonical representation. The rest
should be up to indexing backends.
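
A rough sketch of what such a thin backend contract could look like; both
interfaces below are hypothetical (NutchDoc merely stands in for the real
NutchDocument):

import java.util.Collection;
import java.util.Map;

/** Stand-in for Nutch's document abstraction: just named fields with values. */
interface NutchDoc {
  Map<String, Collection<String>> getFields();
}

/** Hypothetical thin indexing-backend contract (Solr, Katta, ... would implement it). */
interface IndexBackend {
  void open(Map<String, String> config) throws Exception;  // e.g. server URL
  void write(NutchDoc doc) throws Exception;                // push one canonical document
  void delete(String key) throws Exception;                 // remove by unique key (URL)
  void commit() throws Exception;
  void close() throws Exception;
}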

> Regarding search - currently the search API is too low-level, with the
> custom text and query analysis chains. This needlessly introduces the
> (in)famous Nutch Query classes and Nutch query syntax limitations. We
> should get rid of it and simply leave this part of the processing to the
> search backend. Probably we will use the SolrCloud branch that supports
> sharding and global IDF.
>
>> * new functionalities e.g. sitemap support, canonical tag etc...
>
> Plus a better handling of redirects, detecting duplicated sites,
> detection of spam cliques, tools to manage the webgraph, etc.
>
>>
>> I suppose that http://wiki.apache.org/nutch/Nutch2Architecture needs an
>> update?
>
> Definitely. :)
>
> --
> Best regards,
> Andrzej Bialecki     <><
>  ___. ___ ___ ___ _ _   __
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>
>



-- 
Doğacan Güney


Re: Nutch 2.0 roadmap

2010-04-07 Thread Julien Nioche
Hi,

I'm not sure what is the status of the nutchbase - it's missed a lot of
> fixes and changes in trunk since it's been last touched ...
>

yes, maybe we should start the 2.0 branch from 1.1 instead
Dogacan - what do you think?

BTW I see there is now a 2.0 label under JIRA, thanks to whoever added it


> Also, the goal of the crawler-commons project is to provide APIs and
> implementations of stuff that is needed for every open source crawler
> project, like: robots handling, url filtering and url normalization, URL
> state management, perhaps deduplication. We should coordinate our
> efforts, and share code freely so that other projects (bixo, heritrix,
> droids) may contribute to this shared pool of functionality, much like
> Tika does for the common need of parsing complex formats.
>

definitely

 +1 - we may still keep a thin abstract layer to allow other
> indexing/search backends, but the current mess of indexing/query filters
> and competing indexing frameworks (lucene, fields, solr) should go away.
> We should go directly from DOM to a NutchDocument, and stop there.
>


I think that separating the parsing filters from the indexing filters can
have its merits, e.g. combining the metadata generated by two or more different
parsing filters into a single field in the NutchDocument, keeping only a
subset of the available information, etc...
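
A minimal sketch of the kind of indexing-side step described above; the class
and the map-based types are hypothetical, not the real Nutch IndexingFilter API:

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical indexing-side step: fold metadata produced by several parse
 *  filters into a single document field, keeping only what we want to index. */
class MetadataMergingIndexer {
  private final List<String> sourceKeys;  // keys written by different parse filters
  private final String targetField;       // single document field to populate

  MetadataMergingIndexer(List<String> sourceKeys, String targetField) {
    this.sourceKeys = sourceKeys;
    this.targetField = targetField;
  }

  /** parseMetadata: key/value pairs collected at parse time; fieldsOut: document fields. */
  void index(Map<String, String> parseMetadata, Map<String, List<String>> fieldsOut) {
    List<String> merged = new ArrayList<String>();
    for (String key : sourceKeys) {
      String value = parseMetadata.get(key);
      if (value != null && !value.isEmpty()) merged.add(value);  // keep only non-empty values
    }
    if (!merged.isEmpty()) fieldsOut.put(targetField, merged);
  }
}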


> >
> > I suppose that http://wiki.apache.org/nutch/Nutch2Architecture needs an
> > update?
>

Have created a new page to serve as a support for discussion :
http://wiki.apache.org/nutch/Nutch2Roadmap

julien
-- 
DigitalPebble Ltd
http://www.digitalpebble.com


Re: Nutch 2.0 roadmap

2010-04-06 Thread Andrzej Bialecki
On 2010-04-06 15:43, Julien Nioche wrote:
> Hi guys,
> 
> I gather that we'll jump straight to  2.0 after 1.1 and that 2.0 will be
> based on what is currently referred to as NutchBase. Shall we create a
> branch for 2.0 in the Nutch SVN repository and have a label accordingly for
> JIRA so that we can file issues / feature requests on 2.0? Do you think that
> the current NutchBase could be used as a basis for the 2.0 branch?

I'm not sure what is the status of the nutchbase - it's missed a lot of
fixes and changes in trunk since it's been last touched ...

> 
> Talking about features, what else would we add apart from :
> 
> * support for HBase : via ORM or not (see
> NUTCH-808
> )

This IMHO is promising, this could open the doors to small-to-medium
installations that are currently too cumbersome to handle.

> * plugin cleanup : Tika only for parsing - get rid of everything else?

Basically, yes - keep only stuff like HtmlParseFilters (probably with a
different API) so that we can post-process the DOM created in Tika from
whatever original format.
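
A rough sketch of what such a post-processing hook could look like, using plain
org.w3c.dom; the DomPostProcessor interface is hypothetical (the real
HtmlParseFilter API differs), and the canonical-link example ties in with the
new functionality discussed later in the thread:

import org.w3c.dom.DocumentFragment;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

/** Hypothetical post-processing hook over the DOM produced by Tika,
 *  independent of the original document format. */
interface DomPostProcessor {
  /** Return some piece of extracted metadata (here: the canonical link), or null. */
  String process(String url, DocumentFragment doc);
}

/** Toy example: find <link rel="canonical" href="..."> anywhere in the parsed DOM. */
class CanonicalLinkExtractor implements DomPostProcessor {
  public String process(String url, DocumentFragment doc) {
    return findCanonical(doc);
  }

  private String findCanonical(Node node) {
    if (node instanceof Element && "link".equalsIgnoreCase(node.getNodeName())) {
      Element e = (Element) node;
      if ("canonical".equalsIgnoreCase(e.getAttribute("rel"))) {
        return e.getAttribute("href");
      }
    }
    NodeList children = node.getChildNodes();
    for (int i = 0; i < children.getLength(); i++) {
      String found = findCanonical(children.item(i));  // depth-first walk
      if (found != null) return found;
    }
    return null;
  }
}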

Also, the goal of the crawler-commons project is to provide APIs and
implementations of stuff that is needed for every open source crawler
project, like: robots handling, url filtering and url normalization, URL
state management, perhaps deduplication. We should coordinate our
efforts, and share code freely so that other projects (bixo, heritrix,
droids) may contribute to this shared pool of functionality, much like
Tika does for the common need of parsing complex formats.

> * remove index / search and delegate to SOLR

+1 - we may still keep a thin abstract layer to allow other
indexing/search backends, but the current mess of indexing/query filters
and competing indexing frameworks (lucene, fields, solr) should go away.
We should go directly from DOM to a NutchDocument, and stop there.

Regarding search - currently the search API is too low-level, with the
custom text and query analysis chains. This needlessly introduces the
(in)famous Nutch Query classes and Nutch query syntax limitations. We
should get rid of it and simply leave this part of the processing to the
search backend. Probably we will use the SolrCloud branch that supports
sharding and global IDF.
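
For illustration, delegating a query to Solr via SolrJ could be as simple as
the sketch below; it assumes the SolrJ client of that era (CommonsHttpSolrServer,
later renamed), and the URL and field names are placeholders:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class SolrSearchDemo {
  public static void main(String[] args) throws Exception {
    // Placeholder Solr URL and field names.
    CommonsHttpSolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");

    SolrQuery query = new SolrQuery("nutch roadmap");  // user query passed straight to Solr
    query.setRows(10);
    query.setFields("url", "title");

    QueryResponse response = solr.query(query);
    for (SolrDocument doc : response.getResults()) {
      System.out.println(doc.getFieldValue("url") + " : " + doc.getFieldValue("title"));
    }
  }
}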

> * new functionalities e.g. sitemap support, canonical tag etc...

Plus a better handling of redirects, detecting duplicated sites,
detection of spam cliques, tools to manage the webgraph, etc.

> 
> I suppose that http://wiki.apache.org/nutch/Nutch2Architecture needs an
> update?

Definitely. :)

-- 
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Nutch Topical / Focused Crawl

2009-10-06 Thread MyD

I just found an interesting thesis which explains how to turn / modify Nutch
into a focused / topical crawler. This thesis helped me a lot. Maybe useful
to others...

http://wing.comp.nus.edu.sg/publications/theses/2009/markusHaenseThesis.pdf



MyD wrote:
> 
> Hi @ all,
> 
> I'd like to turn Nutch into a focused / topical crawler. I started to
> analyze the code and think that I found the right piece of code. I just
> wanted to know if I am on the right track. I think the right place to
> implement a decision on whether to fetch further is in the output method of
> the Fetcher class, every time we call the collect method of the
> OutputCollector object.
> 
> private ParseStatus output(Text key, CrawlDatum datum, Content content,
> ProtocolStatus pstatus, int status) {
> ...
> output.collect(...);
> ...
> }
> 
> Would you mind letting me know the best way to turn this decision into
> a plugin? I was thinking of going a similar way to the scoring filters.
> Thanks in advance.
> 
> Cheers,
> MyD
> 
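
The hook described above could be pulled out into a scoring-filter-style
extension point; a minimal hypothetical sketch follows (the FocusFilter
interface is not existing Nutch API, and the keyword test is only a toy
example):

/** Hypothetical extension point, analogous in spirit to the scoring filters:
 *  decide, per fetched page, whether its outlinks are worth following. */
interface FocusFilter {
  /** Return true if outlinks of this page should be kept for further crawling. */
  boolean followOutlinks(String url, String parsedText);
}

/** Toy topical filter: follow outlinks only if the page looks on-topic. */
class KeywordFocusFilter implements FocusFilter {
  private final String[] topicKeywords;

  KeywordFocusFilter(String... topicKeywords) { this.topicKeywords = topicKeywords; }

  public boolean followOutlinks(String url, String parsedText) {
    String text = parsedText == null ? "" : parsedText.toLowerCase(java.util.Locale.ROOT);
    for (String kw : topicKeywords) {
      if (text.contains(kw.toLowerCase(java.util.Locale.ROOT))) return true;  // on-topic
    }
    return false;  // off-topic: do not expand the frontier from here
  }
}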

-- 
View this message in context: 
http://www.nabble.com/Nutch-Topical---Focused-Crawl-tp22765848p25764131.html
Sent from the Nutch - Dev mailing list archive at Nabble.com.



Re: Nutch Performance Improvements

2009-08-25 Thread Ken Krugler


On Aug 25, 2009, at 9:50am, Fuad Efendi wrote:

I forgot to add that for “Allow Redirects” to work properly we also need
cookie handling in HttpClient... Most “stateful” websites generate links
inside HTML with session tokens if they detect that the client does not
support cookies; but if HttpClient supports cookies, we are forced to allow
redirects (although the new version of HttpClient supports a per-host cookie
cache?!); to be verified...


HttpClient 4.0 provides per-user/thread context, which includes  
cookies. I don't know of any per-host cookie support, just per-host  
routing.
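
A minimal sketch of that per-thread context approach, based on HttpClient 4.0's
local-context cookie handling; the URL is a placeholder:

import org.apache.http.HttpResponse;
import org.apache.http.client.CookieStore;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.client.protocol.ClientContext;
import org.apache.http.impl.client.BasicCookieStore;
import org.apache.http.impl.client.DefaultHttpClient;
import org.apache.http.protocol.BasicHttpContext;
import org.apache.http.protocol.HttpContext;
import org.apache.http.util.EntityUtils;

public class PerThreadCookiesDemo {
  public static void main(String[] args) throws Exception {
    DefaultHttpClient client = new DefaultHttpClient();

    // One cookie store and one execution context per crawler thread keeps
    // session state (and any redirect-driven cookies) isolated per thread.
    CookieStore cookieStore = new BasicCookieStore();
    HttpContext localContext = new BasicHttpContext();
    localContext.setAttribute(ClientContext.COOKIE_STORE, cookieStore);

    HttpGet get = new HttpGet("http://www.example.com/");  // placeholder URL
    HttpResponse response = client.execute(get, localContext);
    EntityUtils.toString(response.getEntity());            // read body, release connection

    System.out.println("Cookies received: " + cookieStore.getCookies());
    client.getConnectionManager().shutdown();
  }
}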


-- Ken



From: Fuad Efendi [mailto:f...@efendi.ca]
Sent: August-25-09 12:42 PM
To: nutch-dev@lucene.apache.org
Subject: Nutch Performance Improvements

Hello,


Few years ago I noticed some performance bottlenecks of Nutch;  
checking source code now... the same...



1.   RegexURLNormalizer and similar plugins
It’s singleton, and main method is synchronized. Would be better to  
have per-thread instance, non-synchronized; but how to make it  
plugin then?



2.   “Allow Redirects” for HttpClient
By allowing redirects we can avoid HttpSession related tokens in  
final URLs
(may be it’s not acceptable for general crawl, but would be nice to  
have such configuration option)




Fuad Efendi
==
http://www.linkedin.com/in/liferay
http://www.tokenizer.org
http://www.casaGURU.com
==




--
Ken Krugler
TransPac Software, Inc.

+1 530-210-6378



RE: Nutch Performance Improvements

2009-08-25 Thread Fuad Efendi
I forgot to add that for "Allow Redirects" to work properly we also need cookie
handling in HttpClient... Most "stateful" websites generate links inside HTML
with session tokens if they detect that the client does not support cookies;
but if HttpClient supports cookies, we are forced to allow redirects (although
the new version of HttpClient supports a per-host cookie cache?!); to be
verified...

 

From: Fuad Efendi [mailto:f...@efendi.ca] 
Sent: August-25-09 12:42 PM
To: nutch-dev@lucene.apache.org
Subject: Nutch Performance Improvements

 

Hello,

 

 

A few years ago I noticed some performance bottlenecks in Nutch; checking the
source code now... they are still the same...

 

 

1.   RegexURLNormalizer and similar plugins

It's a singleton, and its main method is synchronized. It would be better to
have per-thread, non-synchronized instances (see the ThreadLocal sketch after
point 2 below); but how do we make it a plugin then?

 

 

2.   "Allow Redirects" for HttpClient

By allowing redirects we can avoid HttpSession related tokens in final URLs

(maybe it's not acceptable for a general crawl, but it would be nice to have
such a configuration option)
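
Regarding point 1, a rough sketch of the per-thread idea; the UrlNormalizer
interface below is a stand-in, not the real Nutch normalizer API:

import java.net.MalformedURLException;

/** Stand-in for a URL normalizer; the real Nutch interface differs. */
interface UrlNormalizer {
  String normalize(String url) throws MalformedURLException;
}

/** Give every fetcher thread its own (unsynchronized) normalizer instance,
 *  instead of funnelling all threads through one synchronized singleton. */
class PerThreadNormalizers {
  interface NormalizerFactory {
    UrlNormalizer create();
  }

  private final ThreadLocal<UrlNormalizer> local;

  PerThreadNormalizers(final NormalizerFactory factory) {
    this.local = new ThreadLocal<UrlNormalizer>() {
      @Override protected UrlNormalizer initialValue() {
        return factory.create();  // e.g. compile the regex rules once per thread
      }
    };
  }

  String normalize(String url) throws MalformedURLException {
    return local.get().normalize(url);  // no lock contention between threads
  }
}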

 

 

 

Fuad Efendi

==

http://www.linkedin.com/in/liferay

http://www.tokenizer.org

http://www.casaGURU.com

==

 

 



Re: Nutch dev. plans

2009-07-29 Thread Andrzej Bialecki

Kirby Bohling wrote:

2009/7/29 Doğacan Güney :

Hey guys,

Kirby, thanks for all the insightful information! I couldn't comment
much as most of
the stuff went right over my head :) (don't know much about OSGI).



A bit of a progress report to make sure I'm heading in the proper direction.

I'm learning I know a lot about Eclipse RCP, which hid lots of details
about OSGi from me.  No public code yet, I'm hoping that happens this
weekend.

Got Felix downloaded and started an embedded OSGi environment
successfully.  I chose Felix because it's Apache licensed.  I'm not
clear if the CPL/EPL is acceptable to the ASF for inclusion and
distribution.


I need to check this - it probably is ok to include it and distribute 
it, with a notice.



 Sounds like Equinox is more full featured.  I'll
probably integrate with both just for sanity checking portability.


http://s3.amazonaws.com/neilbartlett.name/osgibook_preview_20090110.pdf

This book indicates that Equinox (and Eclipse) historically preferred 
extensions over services, so the toolchain available in Equinox is more 
robust for building extensions. Unless I got something mixed up ;)


My understanding is that in Nutch we want primarily services, not 
extensions. Although I'm not sure about the pre-/post-processing plugins 
such as query filters and indexing filters, as well as library-only 
plugins like lib-xml.



My quick research indicates that Hadoop isn't OSGi application
friendly, but can host an embedded OSGi environment.  I bogged down
attempting to integrate the bnd tool to run inside of Ant.  I think I
have that resolved so I can just wrap third party jar's in the ant
scripts.  So hopefully I can make more meaningful progress soon.

The current mental architecture I have is to make all the libraries in
./lib/*.jar end up in the top level classloader outside the OSGi
environment (I forget the technical OSGi name, I think it is the
System Classloader).


+1.


 Then, turn the Nutch Core into a single bundle.


+1.


Turn each current plugin into an OSGi bundle.  Each plugin registers a
"service" which is a factory capable of creating whatever is currently
inside of the plugin.xml as an "implementation" attribute.  Modify the
core to use the factories inside of the extension point code.  I think
that is the minimally invasive way to get to OSGi.


I'm not sure about this part. There may be many plugins that implement 
the same service. E.g. both protocol-http and protocol-httpclient 
implement HTTP protocol service. Some plugins implement many services 
(e.g. creativecommons implements parsing, indexing and query
components).
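
One way to handle that in plain OSGi is to register the same service interface
with distinguishing properties and let consumers select with a filter; in the
sketch below the Protocol interface and the property names are hypothetical:

import java.util.Hashtable;
import org.osgi.framework.BundleActivator;
import org.osgi.framework.BundleContext;

/** Hypothetical service contract shared by protocol bundles. */
interface Protocol {
  byte[] fetch(String url) throws Exception;
}

/** Activator for a hypothetical protocol-httpclient bundle: several bundles can
 *  register the same interface, distinguished by service properties. */
public class HttpClientProtocolActivator implements BundleActivator {
  public void start(BundleContext context) {
    Hashtable<String, Object> props = new Hashtable<String, Object>();
    props.put("nutch.protocol.scheme", "http");             // hypothetical property
    props.put("nutch.protocol.impl", "protocol-httpclient");
    context.registerService(Protocol.class.getName(), new Protocol() {
      public byte[] fetch(String url) throws Exception {
        throw new UnsupportedOperationException("stub");    // real fetch logic omitted
      }
    }, props);
    // Consumers can then pick one of the competing registrations with an
    // LDAP-style filter, e.g. context.getServiceReferences(
    //   Protocol.class.getName(), "(nutch.protocol.scheme=http)")
  }

  public void stop(BundleContext context) {
    // Services registered via this context are unregistered automatically on stop.
  }
}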




I assumed that it would be easiest to get OSGi in and integrated with
minimal disruption.  If nothing else, we can use that as a staging
point to play with more invasive designs and architectures.  In the
long run, I think something more invasive makes more sense.  Hopefully
that is effective risk management rather then a waste of time.

Please course correct me if any of this seems a bad idea.


It would be fantastic to have something like this as a starting point. 
Other developers can join this effort as soon as they understand the 
general design, and this prototype should provide a good example and 
verify the design.


--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Nutch dev. plans

2009-07-29 Thread Kirby Bohling
2009/7/29 Doğacan Güney :
> Hey guys,
>
> Kirby, thanks for all the insightful information! I couldn't comment
> much as most of
> the stuff went right over my head :) (don't know much about OSGI).
>

A bit of a progress report to make sure I'm heading in the proper direction.

I'm learning that what I know is mostly Eclipse RCP, which hid lots of details
about OSGi from me.  No public code yet; I'm hoping that happens this
weekend.

Got Felix downloaded and started an embedded OSGi environment
successfully.  I chose Felix because it's Apache licensed.  I'm not
clear if the CPL/EPL is acceptable to the ASF for inclusion and
distribution.  Sounds like Equinox is more full featured.  I'll
probably integrate with both just for sanity checking portability.

My quick research indicates that Hadoop isn't OSGi application
friendly, but can host an embedded OSGi environment.  I got bogged down
attempting to integrate the bnd tool to run inside of Ant.  I think I
have that resolved, so I can just wrap third-party jars in the Ant
scripts.  So hopefully I can make more meaningful progress soon.

The current mental architecture I have is to make all the libraries in
./lib/*.jar end up in the top level classloader outside the OSGi
environment (I forget the technical OSGi name, I think it is the
System Classloader).  Then, turn the Nutch Core into a single bundle.
Turn each current plugin into an OSGi bundle.  Each plugin registers a
"service" which is a factory capable of creating whatever is currently
inside of the plugin.xml as an "implementation" attribute.  Modify the
core to use the factories inside of the extension point code.  I think
that is the minimally invasive way to get to OSGi.
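
A rough sketch of that minimally invasive step; the ExtensionFactory interface
and the activator below are hypothetical, standing in for whatever plugin.xml's
"implementation" attribute currently points at:

import java.util.Hashtable;
import org.osgi.framework.BundleActivator;
import org.osgi.framework.BundleContext;

/** Hypothetical stand-in for the factory each converted plugin would expose,
 *  so the core can obtain instances without doing Class.forName itself. */
interface ExtensionFactory {
  Object newInstance();
}

/** Minimal activator a converted plugin bundle could ship: it registers its
 *  factory so the core resolves extensions via the service registry. */
public class ParseWordActivator implements BundleActivator {
  public void start(BundleContext context) {
    Hashtable<String, Object> props = new Hashtable<String, Object>();
    props.put("nutch.extension.point", "org.apache.nutch.parse.Parser");  // illustrative id
    context.registerService(ExtensionFactory.class.getName(), new ExtensionFactory() {
      public Object newInstance() {
        return new Object();  // a real bundle would return its Parser implementation
      }
    }, props);
  }

  public void stop(BundleContext context) {
    // Service registrations made through this context are cleaned up on stop.
  }
}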

I assumed that it would be easiest to get OSGi in and integrated with
minimal disruption.  If nothing else, we can use that as a staging
point to play with more invasive designs and architectures.  In the
long run, I think something more invasive makes more sense.  Hopefully
that is effective risk management rather than a waste of time.

Please course correct me if any of this seems a bad idea.

> Andrzej, would OSGI make creating plugins easier?

I'm not sure it will.  Hopefully we can accomplish that in some way or
another, especially out of tree plugins.  I know I have plenty of out
of tree plugins in my future at work.

>
> On Sun, Jul 26, 2009 at 19:09, Andrzej Bialecki wrote:
> [..snipping thread as it has gone too long.]
>
> --
> Doğacan Güney
>


Re: Nutch dev. plans

2009-07-29 Thread Andrzej Bialecki

Doğacan Güney wrote:

Hey guys,

Kirby, thanks for all the insightful information! I couldn't comment
much as most of
the stuff went right over my head :) (don't know much about OSGI).

Andrzej, would OSGI make creating plugins easier? One of the things
that bug me most
about our plugin system is the xml files that need to be created for
every plugin. These files have to be written manually and nutch
doesn't report errors very good here so this process is extremely
error-prone. Do you have something in mind for making this part any
simpler?


I don't think this will become much simpler, just different. OSGi
descriptors (not sure yet if we will go with extensions or dynamic
services) can be much more complex than our plugin descriptors, but I
think that in practice we would use a well-defined subset.


--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Nutch dev. plans

2009-07-29 Thread Doğacan Güney
Hey guys,

Kirby, thanks for all the insightful information! I couldn't comment
much as most of
the stuff went right over my head :) (don't know much about OSGI).

Andrzej, would OSGi make creating plugins easier? One of the things that bugs
me most about our plugin system is the XML files that need to be created for
every plugin. These files have to be written manually, and Nutch doesn't report
errors very well here, so this process is extremely error-prone. Do you have
something in mind for making this part any simpler?

On Sun, Jul 26, 2009 at 19:09, Andrzej Bialecki wrote:
[..snipping thread as it has gone too long.]

-- 
Doğacan Güney


Re: Nutch dev. plans

2009-07-26 Thread Andrzej Bialecki

Kirby Bohling wrote:


I think you're correct about it being worth while.  I've got a git
repository that I use for my work, I'll see about setting up a github
and start to use that as a public place to get some of my stuff so you
can see it.  Unfortunately, I have some proprietary stuff that I can't
contribute back (most of which you don't want anyways).  I do have
bugfixes for core issues that I do have permission to contribute.
It'd be much easier for me to use Git to migrate the work back and
forth between work and there.  It's also much smoother for me to
develop a series of "easy to review" patches using it.


This is ok at this early stage - although sooner or later the patches 
need to appear in JIRA and be submitted with a grant of the ASL license.




I'm guessing that Tika isn't ready for this.  Given that it's an
Apache and/or Lucene project, it can probably be addressed.  My guess
is that a number of the libraries they depend upon won't be.

I think we would like Tika to function as an OSGI plugin (or a group of
plugins?) out of the box so that we could avoid having to wrap it ourselves.



I think Tika as one plugin would lead to a charge of "bloat", given
all the formats it currently supports that you now ship as plugins.


The cumulative weight of our plugins is also significant.


Long term, do you see Nutch just supporting everything Tika does "out
of the box" and including all of the dependencies, thus folding most
of the parser plug-ins into one?  My understanding is that Tika is
nothing more than a port of the Nutch parsing library into a single unified,
reusable library.  We might need help/support from Tika if the
answer is to split them up.


IMHO it would be good to include all parsers, but provide a mechanism 
for a la carte configuration of active parsers, and a mechanism for 
using other parsers packaged as OSGI plugins instead of the Tika ones.




I'd love to help.  I've mostly fought along the edges of this problem,
rather then worked on it directly.  I've written an OSGi service or
two, but I'm not sure it correctly handled all of the lifecycle issues
and other critical details.

I've played with your current system, and I know you'll have problems
with OSGi, pretty much straight out of the box.  I wanted a docx
parser, so I upgraded to Tika 0.3 and packaged the latest POI jars in
a new plug-in, and I had pretty much exactly the problem I described
with Class.forName() with the current plug-in system, because Tika
uses Class.forName().  Tika was in the core class-loader, and the
classes I needed were only in my docx plugin (core can't see system
plugins), so Tika 0.3 couldn't find them.  There are also a couple of
small bug fixes for the core API that I have; it'd be nice to
see those get integrated, then we could upgrade to Tika 0.3 at least.


Tika is already at 0.4, maybe some things changed.



I'll go hack on this tonight and tomorrow and see where I get.  I
think it's likely that Tika (or the dependent libraries), will need
significant work on packaging and the like.  I'm assuming that Felix
is the OSGi implementation you'd like to use by default?


No idea - I only played briefly with both, the key word being "played" ..
;) Equinox has fewer dependencies, if I'm not mistaken?




I know somebody was fairly well along with this conversion 3-4 years
ago.  Sami Siren is the name I associated with that.  Anybody know
where all of that ended up?  If nothing else, the boiler plate Ant
changes would be nice to have.
(http://wiki.apache.org/nutch/NutchOSGi)

How do you feel about build system modifications?  It'd be much nicer
to use OSGi in a toolchain where dependency resolution was done for
us.  I've looked at Ivy, but I couldn't seem to get it working.  The
documentation and tutorials was just a bit terse, and I know how to
deal with Maven.  I use Maven at my work all the time.  When it works
it's glorious, when you've hit a bug, it can be a show stopper.
However, I know for a lot of folks it is a non-starter.


I acknowledge that maven may be superior to ant at tracking dependencies 
... let's leave it at that ;)



--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Nutch dev. plans

2009-07-25 Thread Kirby Bohling
Comments inline below:

On Sat, Jul 25, 2009 at 2:23 PM, Andrzej Bialecki wrote:
> Kirby Bohling wrote:
>>
>> On Fri, Jul 17, 2009 at 5:21 PM, Andrzej Bialecki wrote:
>>>
>>> Doğacan Güney wrote:
>>>
> There's no specific design yet except I can't stand the existing plugin
> framework anymore ... ;) I started reading on OSGI and it seems that it
> supports the functionality that we need, and much more - it certainly
> looks
> like a better alternative than maintaining our plugin system beyond 1.x
> ...
>
 Couldn't agree more with the "can't stand plugin framework" :D

 Any good links on OSGI stuff?
>>>
>>> I found this:
>>>
>>> http://neilbartlett.name/blog/osgi-articles
>
> Hi Kirby,
>
> Thanks for your insights - please see my comments below.
>
>> Plugins are called Bundles in OSGi parlance, but I'll use plugin as
>> that's the term used by Nutch.
>>
>> I have done quite a bit of OSGi work (I used to develop RCP
>> applications for a living).  OSGi is great, as long as you plan on not
>> using reflection to retrieve classes directly, and you don't plan on
>> using a library that uses it directly.
>>
>> Pretty much every usage like this:
>>
>> Class clazz = Class.forName(stringFromConfig);
>> // Code to create an object using this class...
>>
>> Will fail, unless the code is very classloader aware.  So if you're
>> going to switch over to using OSGi (which I think would be wonderful),
>> you'll want to ensure that you can deal with all of the third-party
>> libraries.  I haven't played much with any of the Declarative Services
>> stuff (I think that was slated for OSGi, but it might have just been
>> an Eclipse extension).
>
> This is an important issue - so I think we need first to do some
> experiments, and continue development on a branch for a while ... Still the
> whole ecosystem that OSGI offers is worth the trouble IMHO.
>

I think you're correct about it being worth while.  I've got a git
repository that I use for my work, I'll see about setting up a github
and start to use that as a public place to get some of my stuff so you
can see it.  Unfortunately, I have some proprietary stuff that I can't
contribute back (most of which you don't want anyways).  I do have
bugfixes for core issues that I do have permission to contribute.
It'd be much easier for me to use Git to migrate the work back and
forth between work and there.  It's also much smoother for me to
develop a series of "easy to review" patches using it.


>
>
>> OSGi uses classloader segmentation to allow multiple conflicting
>> versions of the same code inside the same project.  So having a
>> pattern like:
>>
>> Plugin A: nutch.api (Which contains say the interface Parser { })
>> Plugin B: parser.word (which has class WordParser implements Parser)
>>
>> Plugin B has to depend on Plugin A so it can see the parser.  In this
>> case, Plugin A can't have code that uses Class.forName("WordParser");
>>
>> OSGi changes the default classloader delegation, you can only see
>> classes in plugins you depend upon, and cycles in the dependencies are
>> not allowed.
>
> If I understand it correctly, this is pretty much how it's supposed to work
> in our current plugin system ... only it's more primitive and it's got some
> warts ;)

That's a fair and accurate statement.

>
>>
>> If you want to do that, you end up having to do:
>>
>> ClassLoader loader = ParserRegistry.lookupPlugin("WordParser");
>> Class.forName("WordParser", true, loader);
>>
>> OSGi has some SPI-like way to have a plugin note the fact that it
>> contributes an implementation of the Parser interface.  Eclipse builds
>> on top of it, and that's what Eclipse 3.x implemented the
>> Extension/ExtensionPoint system on top of.  I believe they are called
>> services in "raw" OSGi.
>>
>> It's not a huge deal to write that yourself for APIs you implement.
>> The problem is that it can be difficult to integrate really useful
>> third-party libraries that don't account for this change in
>> classloader behaviour.  At times it can be very problematic to use a
>> specific XML parser that has the features you want (or that some
>> library you want to use requires), because they do this sort of
>> thing all the time.
>
> This doesn't sound too much different from what we do already in Nutch
> plugins.

Yes.  I think that's accurate.

>
>>
>> I'm guessing that Tika isn't ready for this.  Given that it's an
>> Apache and/or Lucene project, it can probably be addressed.  My guess
>> is that a number of the libraries they depend upon won't be.
>
> I think we would like Tika to function as an OSGI plugin (or a group of
> plugins?) out of the box so that we could avoid having to wrap it ourselves.
>

I think Tika as one plugin would lead to a charge of "bloat", given
all the formats it currently supports that you now ship as plugins.
Long term, do you see Nutch just supporting everything Tika does "out
of the box" and including all of the dependencies?

Re: Nutch dev. plans

2009-07-25 Thread Andrzej Bialecki

Kirby Bohling wrote:

On Fri, Jul 17, 2009 at 5:21 PM, Andrzej Bialecki wrote:

Doğacan Güney wrote:


There's no specific design yet except I can't stand the existing plugin
framework anymore ... ;) I started reading on OSGI and it seems that it
supports the functionality that we need, and much more - it certainly
looks
like a better alternative than maintaining our plugin system beyond 1.x
...


Couldn't agree more with the "can't stand plugin framework" :D

Any good links on OSGI stuff?

I found this:

http://neilbartlett.name/blog/osgi-articles


Hi Kirby,

Thanks for your insights - please see my comments below.


Plugins are called Bundles in OSGi parlance, but I'll use plugin as
that's the term used by Nutch.

I have done quite a bit of OSGi work (I used to develop RCP
applications for a living).  OSGi is great, as long as you plan on not
using reflection to retrieve classes directly, and you don't plan on
using a library that uses it directly.

Pretty much every usage like this:

Class clazz = Class.forName(stringFromConfig);
// Code to create an object using this class...

Will fail, unless the code is very classloader aware.  So if you're
going to switch over to using OSGi (which I think would be wonderful),
you'll want to ensure that you can deal with all of the third-party
libraries.  I haven't played much with any of the Declarative Services
stuff (I think that was slated for OSGi, but it might have just been
an Eclipse extension).


This is an important issue - so I think we need first to do some 
experiments, and continue development on a branch for a while ... Still 
the whole ecosystem that OSGI offers is worth the trouble IMHO.





OSGi uses classloader segmentation to allow multiple conflicting
versions of the same code inside the same project.  So having a
pattern like:

Plugin A: nutch.api (Which contains say the interface Parser { })
Plugin B: parser.word (which has class WordParser implements Parser)

Plugin B has to depend on Plugin A so it can see the parser.  In this
case, Plugin A can't have code that uses Class.forName("WordParser");

OSGi changes the default classloader delegation, you can only see
classes in plugins you depend upon, and cycles in the dependencies are
not allowed.


If I understand it correctly, this is pretty much how it's supposed to 
work in our current plugin system ... only it's more primitive and it's 
got some warts ;)




If you want to do that, you end up having to do:

ClassLoader loader = ParserRegistry.lookupPlugin("WordParser");
Class.forName("WordParser", true, loader);

OSGi has some SPI-like way to have a plugin note the fact that it
contributes an implementation of the Parser interface.  Eclipse builds
on top of it, and that's what Eclipse 3.x implemented the
Extension/ExtensionPoint system on top of.  I believe they are called
services in "raw" OSGi.

It's not a huge deal to write that yourself for APIs you implement.
The problem is that it can be difficult to integrate really useful
third-party libraries that don't account for this change in
classloader behaviour.  At times it can be very problematic to use a
specific XML parser that has the features you want (or that some
library you want to use requires), because they do this sort of
thing all the time.


This doesn't sound too much different from what we do already in Nutch 
plugins.




I'm guessing that Tika isn't ready for this.  Given that it's an
Apache and/or Lucene project, it can probably be addressed.  My guess
is that a number of the libraries they depend upon won't be.


I think we would like Tika to function as an OSGI plugin (or a group of 
plugins?) out of the box so that we could avoid having to wrap it ourselves.



You can use fragments to get away from that (a fragment requires a
host bundle, the fragment's classes are loaded using the same
classloader as the host), but doing that defeats a lot of the
reason for using OSGi (at least in terms of allowing you to use
multiple conflicting libraries in the same application).


Thank you again for the comments - I'm a newbie to OSGI, so I'll 
probably start with small experiments and see how it goes. If you think 
you could help us with this by providing some guidance or help with the 
design then that would be great.


--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Nutch dev. plans

2009-07-22 Thread Enis Soztutar

Andrzej Bialecki wrote:

Hi all,

I think we should be creating a sandbox area, where we can collaborate
on various subprojects, such as HBase, OSGI, Tika parsers, etc. 
Dogacan will be importing his HBase work as 'nutchbase'. Tika work is 
the least disruptive, so it could occur even on trunk. OSGI plugins 
work (which I'd like to tackle) means significant refactoring so I'd 
rather put this on a branch too.


Dogacan, you mentioned that you would like to work on Katta 
integration. Could you shed some light on how this fits with the 
abstract indexing & searching layer that we now have, and how 
distributed Solr fits into this picture?


In Hadoop, we do branch for huge patches that affect everything else, like 
HADOOP-4687 (project split) and HADOOP-3628 (lifecycles). Having branches 
for HBase integration and OSGI might help a lot. Committing the patches 
immediately and then resolving remaining issues, keeping up with 
trunk, etc. seems to be easier that way.





Re: Nutch dev. plans

2009-07-20 Thread Ken Krugler

[snip]


 > Dogacan, you mentioned that you would like to work on Katta integration.

 Could you shed some light on how this fits with the abstract indexing &
 searching layer that we now have, and how distributed Solr fits into this
 picture?



I haven't yet given much thought to Katta integration. But basically,
I am thinking of
indexing newly-crawled documents as lucene shards and uploading them
to katta for searching. This should be very possible with the new
indexing system. But so far, I have neither studied katta too much nor
given much thought to integration. So I may be missing obvious stuff.


I've got some experience in this area, so let me know what questions, 
if any, you've got.


But the basic approach is very simple - just create N indexes (one 
per reducer), move these to HDFS, S3, or some other location where the 
Katta master & slaves can all access the shards, and then use the 
Katta "addIndex" command or supporting Java code to deploy the index.



About distributed solr: I very much like to do this and again, I
think, this should be possible to
do within nutch. However, distributed solr is ultimately uninteresting
to me because (AFAIK) it doesn't have the reliability and
high-availability that hadoop&hbase have, i.e. if a machine dies you
lose that part of the index.

Are there any projects going on that are live indexing systems like
solr, yet are backed up by hadoop HDFS like katta?


Note that Katta doesn't use HDFS as a backing store - the shards are 
copied to the local disks of the slaves for performance reasons.


There has been work on making Katta work better for near-real time 
updating, versus the currently very batch-oriented approach. See the 
Katta list for more details.


-- Ken
--
Ken Krugler
+1 530-210-6378


Re: Nutch dev. plans

2009-07-17 Thread Kirby Bohling
On Fri, Jul 17, 2009 at 5:21 PM, Andrzej Bialecki wrote:
> Doğacan Güney wrote:
>
>>> There's no specific design yet except I can't stand the existing plugin
>>> framework anymore ... ;) I started reading on OSGI and it seems that it
>>> supports the functionality that we need, and much more - it certainly
>>> looks
>>> like a better alternative than maintaining our plugin system beyond 1.x
>>> ...
>>>
>>
>> Couldn't agree more with the "can't stand plugin framework" :D
>>
>> Any good links on OSGI stuff?
>
> I found this:
>
> http://neilbartlett.name/blog/osgi-articles
>

Plugins are called Bundles in OSGi parlance, but I'll use plugin as
that's the term used by Nutch.

I have done quite a bit of OSGi work (I used to develop RCP
applications for a living).  OSGi is great, as long as you plan on not
using reflection to retrieve classes directly, and you don't plan on
using a library that uses it directly.

Pretty much every usage like this:

Class clazz = Class.forName(stringFromConfig);
// Code to create an object using this class...

Will fail, unless the code is very classloader aware.  So if you're
going to switch over to using OSGi (which I think would be wonderful),
you'll want to ensure that you can deal with all of the third-party
libraries.  I haven't played much with any of the Declarative Services
stuff (I think that was slated for OSGi, but it might have just been
an Eclipse extension).

We managed to get most of the code to play nice, and had a few
horrific hacks for allowing the use of Spring if necessary.

OSGi uses classloader segmentation to allow multiple conflicting
versions of the same code inside the same project.  So having a
pattern like:

Plugin A: nutch.api (Which contains say the interface Parser { })
Plugin B: parser.word (which has class WordParser implements Parser)

Plugin B has to depend on Plugin A so it can see the parser.  In this
case, Plugin A can't have code that uses Class.forName("WordParser");

OSGi changes the default classloader delegation, you can only see
classes in plugins you depend upon, and cycles in the dependencies are
not allowed.

If you want to do that, you end up having to do:

ClassLoader loader = ParserRegistry.lookupPlugin("WordParser");
Class.forName("WordParser", true, loader);

OSGi has some SPI-like way to have a plugin note the fact that it
contributes an implementation of the Parser interface.  Eclipse builds
on top of it, and that's what Eclipse 3.x implemented the
Extension/ExtensionPoint system on top of.  I believe they are called
services in "raw" OSGi.

It's not a huge deal to write that yourself for APIs you implement.
The problem is that it can be difficult to integrate really useful
third-party libraries that don't account for this change in
classloader behaviour.  At times it can be very problematic to use a
specific XML parser that has the features you want (or that some
library you want to use requires), because they do this sort of
thing all the time.

I'm guessing that Tika isn't ready for this.  Given that it's an
Apache and/or Lucene project, it can probably be addressed.  My guess
is that a number of the libraries they depend upon won't be.

You can use fragments to get away from that (a fragment requires a
host bundle, the fragment's classes are loaded using the same
classloader as the host), but doing that defeats a lot of the
reason for using OSGi (at least in terms of allowing you to use
multiple conflicting libraries in the same application).

Thanks,
Kirby

>
> --
> Best regards,
> Andrzej Bialecki     <><
>  ___. ___ ___ ___ _ _   __
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>
>


Re: Nutch dev. plans

2009-07-17 Thread Andrzej Bialecki

Doğacan Güney wrote:


There's no specific design yet except I can't stand the existing plugin
framework anymore ... ;) I started reading on OSGI and it seems that it
supports the functionality that we need, and much more - it certainly looks
like a better alternative than maintaining our plugin system beyond 1.x ...



Couldn't agree more with the "can't stand plugin framework" :D

Any good links on OSGI stuff?


I found this:

http://neilbartlett.name/blog/osgi-articles


--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Nutch dev. plans

2009-07-17 Thread Dennis Kubes



Doğacan Güney wrote:

On Fri, Jul 17, 2009 at 21:32, Andrzej Bialecki wrote:

Doğacan Güney wrote:

Hey list,

On Fri, Jul 17, 2009 at 16:55, Andrzej Bialecki wrote:

Hi all,

I think we should be creating a sandbox area, where we can collaborate
on various subprojects, such as HBase, OSGI, Tika parsers, etc. Dogacan
will
be importing his HBase work as 'nutchbase'. Tika work is the least
disruptive, so it could occur even on trunk. OSGI plugins work (which I'd
like to tackle) means significant refactoring so I'd rather put this on a
branch too.


Thanks for starting the discussion, Andrzej.

Can you detail your OSGI plugin framework design? Maybe I missed the
discussion but
updating the plugin system has been something that I wanted to do for
a long time :)
so I am very much interested in your design.

There's no specific design yet except I can't stand the existing plugin
framework anymore ... ;) I started reading on OSGI and it seems that it
supports the functionality that we need, and much more - it certainly looks
like a better alternative than maintaining our plugin system beyond 1.x ...


I think I remember a conversation a while back about this :)  Not OSGI 
specifically but changing the plugin framework.  I am all for changing 
it to something like OSGI though.


Dennis





Couldn't agree more with the "can't stand plugin framework" :D

Any good links on OSGI stuff?


Oh, an additional comment about the scoring API: I don't think the claimed
benefits of OPIC outweigh the widespread complications that it caused in the
API. Besides, getting the static scoring right is very very tricky, so from
the engineer's point of view IMHO it's better to do the computation offline,
where you have more control over the process and can easily re-run the
computation, rather than rely on an online unstable algorithm that modifies
scores in place ...



Yeah, I am convinced :) . I am not done yet, but I think OPIC-like scoring will
feel very natural in a hbase-backed nutch. Give me a couple more days to polish
the scoring API then we can change it if you are not happy with it.


Dogacan, you mentioned that you would like to work on Katta integration.
Could you shed some light on how this fits with the abstract indexing &
searching layer that we now have, and how distributed Solr fits into this
picture?


I haven't yet given much thought to Katta integration. But basically,
I am thinking of
indexing newly-crawled documents as lucene shards and uploading them
to katta for searching. This should be very possible with the new
indexing system. But so far, I have neither studied katta too much nor
given much thought to integration. So I may be missing obvious stuff.

Me too..


About distributed solr: I very much like to do this and again, I
think, this should be possible to
do within nutch. However, distributed solr is ultimately uninteresting
to me because (AFAIK) it doesn't have the reliability and
high-availability that hadoop&hbase have, i.e. if a machine dies you
lose that part of the index.

Grant Ingersoll is doing some initial work on integrating distributed Solr
and Zookeeper, once this is in a usable shape then I think perhaps it's more
or less equivalent to Katta. I have a patch in my queue that adds direct
Hadoop->Solr indexing, using Hadoop OutputFormat. So there will be many
options to push index updates to distributed indexes. We just need to offer
the right API to implement the integration, and the current API is IMHO
quite close.


Are there any projects going on that are live indexing systems like
solr, yet are backed up by hadoop HDFS like katta?

There is the Bailey.sf.net project that fits this description, but it's
dormant - either it was too early, or there were just too many design
questions (or simply the committers moved to other things).


--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com








Re: Nutch dev. plans

2009-07-17 Thread Doğacan Güney
On Fri, Jul 17, 2009 at 21:32, Andrzej Bialecki wrote:
> Doğacan Güney wrote:
>>
>> Hey list,
>>
>> On Fri, Jul 17, 2009 at 16:55, Andrzej Bialecki wrote:
>>>
>>> Hi all,
>>>
>>> I think we should be creating a sandbox area, where we can collaborate
>>> on various subprojects, such as HBase, OSGI, Tika parsers, etc. Dogacan
>>> will
>>> be importing his HBase work as 'nutchbase'. Tika work is the least
>>> disruptive, so it could occur even on trunk. OSGI plugins work (which I'd
>>> like to tackle) means significant refactoring so I'd rather put this on a
>>> branch too.
>>>
>>
>> Thanks for starting the discussion, Andrzej.
>>
>> Can you detail your OSGI plugin framework design? Maybe I missed the
>> discussion but
>> updating the plugin system has been something that I wanted to do for
>> a long time :)
>> so I am very much interested in your design.
>
> There's no specific design yet except I can't stand the existing plugin
> framework anymore ... ;) I started reading on OSGI and it seems that it
> supports the functionality that we need, and much more - it certainly looks
> like a better alternative than maintaining our plugin system beyond 1.x ...
>

Couldn't agree more with the "can't stand plugin framework" :D

Any good links on OSGI stuff?

> Oh, an additional comment about the scoring API: I don't think the claimed
> benefits of OPIC outweigh the widespread complications that it caused in the
> API. Besides, getting the static scoring right is very very tricky, so from
> the engineer's point of view IMHO it's better to do the computation offline,
> where you have more control over the process and can easily re-run the
> computation, rather than rely on an online unstable algorithm that modifies
> scores in place ...
>

Yeah, I am convinced :) . I am not done yet, but I think OPIC-like scoring will
feel very natural in a hbase-backed nutch. Give me a couple more days to polish
the scoring API then we can change it if you are not happy with it.

>
>>
>>> Dogacan, you mentioned that you would like to work on Katta integration.
>>> Could you shed some light on how this fits with the abstract indexing &
>>> searching layer that we now have, and how distributed Solr fits into this
>>> picture?
>>>
>>
>> I haven't yet given much thought to Katta integration. But basically,
>> I am thinking of
>> indexing newly-crawled documents as lucene shards and uploading them
>> to katta for searching. This should be very possible with the new
>> indexing system. But so far, I have neither studied katta too much nor
>> given much thought to integration. So I may be missing obvious stuff.
>
> Me too..
>
>> About distributed solr: I very much like to do this and again, I
>> think, this should be possible to
>> do within nutch. However, distributed solr is ultimately uninteresting
>> to me because (AFAIK) it doesn't have the reliability and
>> high-availability that hadoop&hbase have, i.e. if a machine dies you
>> lose that part of the index.
>
> Grant Ingersoll is doing some initial work on integrating distributed Solr
> and Zookeeper, once this is in a usable shape then I think perhaps it's more
> or less equivalent to Katta. I have a patch in my queue that adds direct
> Hadoop->Solr indexing, using Hadoop OutputFormat. So there will be many
> options to push index updates to distributed indexes. We just need to offer
> the right API to implement the integration, and the current API is IMHO
> quite close.
>
>>
>> Are there any projects going on that are live indexing systems like
>> solr, yet are backed up by hadoop HDFS like katta?
>
> There is the Bailey.sf.net project that fits this description, but it's
> dormant - either it was too early, or there were just too many design
> questions (or simply the committers moved to other things).
>
>
> --
> Best regards,
> Andrzej Bialecki     <><
>  ___. ___ ___ ___ _ _   __
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>
>



-- 
Doğacan Güney


Re: Nutch dev. plans

2009-07-17 Thread Andrzej Bialecki

Doğacan Güney wrote:

Hey list,

On Fri, Jul 17, 2009 at 16:55, Andrzej Bialecki wrote:

Hi all,

I think we should be creating a sandbox area, where we can collaborate
on various subprojects, such as HBase, OSGI, Tika parsers, etc. Dogacan will
be importing his HBase work as 'nutchbase'. Tika work is the least
disruptive, so it could occur even on trunk. OSGI plugins work (which I'd
like to tackle) means significant refactoring so I'd rather put this on a
branch too.



Thanks for starting the discussion, Andrzej.

Can you detail your OSGI plugin framework design? Maybe I missed the
discussion but
updating the plugin system has been something that I wanted to do for
a long time :)
so I am very much interested in your design.


There's no specific design yet except I can't stand the existing plugin 
framework anymore ... ;) I started reading on OSGI and it seems that it 
supports the functionality that we need, and much more - it certainly 
looks like a better alternative than maintaining our plugin system 
beyond 1.x ...


Oh, an additional comment about the scoring API: I don't think the 
claimed benefits of OPIC outweigh the widespread complications that it 
caused in the API. Besides, getting the static scoring right is very 
very tricky, so from the engineer's point of view IMHO it's better to do 
the computation offline, where you have more control over the process 
and can easily re-run the computation, rather than rely on an online 
unstable algorithm that modifies scores in place ...






Dogacan, you mentioned that you would like to work on Katta integration.
Could you shed some light on how this fits with the abstract indexing &
searching layer that we now have, and how distributed Solr fits into this
picture?



I haven't yet given much thought to Katta integration. But basically,
I am thinking of
indexing newly-crawled documents as lucene shards and uploading them
to katta for searching. This should be very possible with the new
indexing system. But so far, I have neither studied katta too much nor
given much thought to integration. So I may be missing obvious stuff.


Me too..


About distributed solr: I very much like to do this and again, I
think, this should be possible to
do within nutch. However, distributed solr is ultimately uninteresting
to me because (AFAIK) it doesn't have the reliability and
high-availability that hadoop&hbase have, i.e. if a machine dies you
lose that part of the index.


Grant Ingersoll is doing some initial work on integrating distributed 
Solr and Zookeeper, once this is in a usable shape then I think perhaps 
it's more or less equivalent to Katta. I have a patch in my queue that 
adds direct Hadoop->Solr indexing, using Hadoop OutputFormat. So there 
will be many options to push index updates to distributed indexes. We 
just need to offer the right API to implement the integration, and the 
current API is IMHO quite close.
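
To give a flavour of the OutputFormat idea, here's a very rough sketch - it
is NOT the patch in my queue, and the "solr.server.url" property name and the
Text/Text key/value types are just assumptions for illustration:

import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.OutputFormat;
import org.apache.hadoop.mapred.RecordWriter;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.util.Progressable;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class SolrOutputFormatSketch implements OutputFormat<Text, Text> {

  public void checkOutputSpecs(FileSystem ignored, JobConf job) throws IOException {
    // Nothing to verify in this sketch.
  }

  public RecordWriter<Text, Text> getRecordWriter(FileSystem ignored, JobConf job,
      String name, Progressable progress) throws IOException {
    final SolrServer solr = new CommonsHttpSolrServer(job.get("solr.server.url"));
    return new RecordWriter<Text, Text>() {
      public void write(Text url, Text text) throws IOException {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", url.toString());
        doc.addField("content", text.toString());
        try {
          solr.add(doc);
        } catch (Exception e) {
          throw new IOException(e.toString());
        }
      }
      public void close(Reporter reporter) throws IOException {
        try {
          solr.commit();  // one commit per reducer, when the task finishes
        } catch (Exception e) {
          throw new IOException(e.toString());
        }
      }
    };
  }
}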




Are there any projects going on that are live indexing systems like
solr, yet are backed up by hadoop HDFS like katta?


There is the Bailey.sf.net project that fits this description, but it's 
dormant - either it was too early, or there were just too many design 
questions (or simply the committers moved to other things).



--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Nutch dev. plans

2009-07-17 Thread Doğacan Güney
Hey list,

On Fri, Jul 17, 2009 at 16:55, Andrzej Bialecki wrote:
> Hi all,
>
> I think we should be creating a sandbox area, where we can collaborate
> on various subprojects, such as HBase, OSGI, Tika parsers, etc. Dogacan will
> be importing his HBase work as 'nutchbase'. Tika work is the least
> disruptive, so it could occur even on trunk. OSGI plugins work (which I'd
> like to tackle) means significant refactoring so I'd rather put this on a
> branch too.
>

Thanks for starting the discussion, Andrzej.

Can you detail your OSGI plugin framework design? Maybe I missed the
discussion but
updating the plugin system has been something that I wanted to do for
a long time :)
so I am very much interested in your design.

> Dogacan, you mentioned that you would like to work on Katta integration.
> Could you shed some light on how this fits with the abstract indexing &
> searching layer that we now have, and how distributed Solr fits into this
> picture?
>

I haven't yet given much thought to Katta integration. But basically,
I am thinking of
indexing newly-crawled documents as lucene shards and uploading them
to katta for searching. This should be very possible with the new
indexing system. But so far, I have neither studied katta too much nor
given much thought to integration. So I may be missing obvious stuff.

About distributed solr: I very much like to do this and again, I
think, this should be possible to
do within nutch. However, distributed solr is ultimately uninteresting
to me because (AFAIK) it doesn't have the reliability and
high-availability that hadoop&hbase have, i.e. if a machine dies you
lose that part of the index.

Are there any projects going on that are live indexing systems like
solr, yet are backed up by hadoop HDFS like katta?

> --
> Best regards,
> Andrzej Bialecki     <><
>  ___. ___ ___ ___ _ _   __
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>
>
>



-- 
Doğacan Güney


Re: Nutch/Solr: storing the page cache in Solr

2009-05-15 Thread Siddhartha Reddy
Thanks a lot, Andrzej. I only need handling of String content at the moment;
so this should suffice. But if someone would like to store other content as
well, they can take a look at the Binary FieldType that is in the works for
Solr (https://issues.apache.org/jira/browse/SOLR-1116).

Thanks,
Siddhartha

On Thu, May 14, 2009 at 5:29 PM, Andrzej Bialecki  wrote:

> Siddhartha Reddy wrote:
>
>> I'm trying to patch Nutch to allow the page cache to be added to the Solr
>> index when using the SolrIndexer tool. Is there any reason this is not done
>> by default? The Solr schema even has the "cache" field but it is left empty.
>>
>>
> This issue is more complicated. We would need to handle also non-string
> content such as various binary formats (PDF, Office, images, etc), and there
> is no support for this in Solr (yet).
>
> Additionally, storing large binary blobs in Lucene index has some
> performance consequences.
>
> Currently Nutch uses Solr for searching, and a separate (set of) segment
> servers for content serving.
>
>  I'm enclosing a patch of the changes I have made. I have done some testing
>> and this seems to work fine. Can someone please take a look at it and let me
>> know if I'm doing anything wrong? I'm especially not sure about the
>> character encoding to assume when converting the Content (which is stored as
>> byte[]) to a String; I'm getting the encoding from Metadata (using the key
>> Metadata.ORIGINAL_CHAR_ENCODING) but it is always null.
>>
>
> The patch looks ok, if handling String content is all you need. Char
> encoding should be available in ParseData.getMeta().
>
> --
> Best regards,
> Andrzej Bialecki <><
>  ___. ___ ___ ___ _ _   __
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>
>


Re: Nutch/Solr: storing the page cache in Solr

2009-05-14 Thread Andrzej Bialecki

Siddhartha Reddy wrote:
I'm trying to patch Nutch to allow the page cache to be added to the 
Solr index when using the SolrIndexer tool. Is there any reason this is 
not done by default? The Solr schema even has the "cache" field but it 
is left empty.




This issue is more complicated. We would need to handle also non-string 
content such as various binary formats (PDF, Office, images, etc), and 
there is no support for this in Solr (yet).


Additionally, storing large binary blobs in Lucene index has some 
performance consequences.


Currently Nutch uses Solr for searching, and a separate (set of) segment 
servers for content serving.


I'm enclosing a patch of the changes I have made. I have done some 
testing and this seems to work fine. Can someone please take a look at 
it and let me know if I'm doing anything wrong? I'm especially not sure 
about the character encoding to assume when converting the Content 
(which is stored as byte[]) to a String; I'm getting the encoding from 
Metadata (using the key Metadata.ORIGINAL_CHAR_ENCODING) but it is 
always null.


The patch looks ok, if handling String content is all you need. Char 
encoding should be available in ParseData.getMeta().
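
For what it's worth, a minimal sketch of that conversion (my illustration, not
the patch itself) - it assumes the encoding really is reachable via
ParseData.getMeta() and simply falls back to UTF-8 when nothing was recorded:

import java.io.UnsupportedEncodingException;
import org.apache.nutch.metadata.Metadata;
import org.apache.nutch.parse.ParseData;
import org.apache.nutch.protocol.Content;

public class CachedContent {
  /** Decode the raw page bytes using the encoding recorded at parse time, if any. */
  public static String toString(Content content, ParseData parseData) {
    String encoding = parseData.getMeta(Metadata.ORIGINAL_CHAR_ENCODING);
    if (encoding == null) {
      encoding = "UTF-8";  // assumption: a reasonable default when detection failed
    }
    try {
      return new String(content.getContent(), encoding);
    } catch (UnsupportedEncodingException e) {
      return new String(content.getContent());  // last resort: platform default
    }
  }
}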


--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Nutch crawled results for Clustering with Carrot2

2009-05-07 Thread Dawid Weiss


Gaurang,

You can fetch documents from Nutch indexes (which are Lucene indexes) and then 
feed them to the clustering algorithm directly, as explained in Carrot2 examples 
here:


http://download.carrot2.org/head/manual/index.html#section.integration

There are several examples you can choose to start from -- some of them accept 
raw data, some of them use Lucene document source.


http://fisheye3.atlassian.com/browse/carrot2/branches/stable/applications/carrot2-examples/src/org/carrot2/examples/clustering

If you need ultimate flexibility, go with the raw-data example:

http://fisheye3.atlassian.com/browse/carrot2/branches/stable/applications/carrot2-examples/src/org/carrot2/examples/clustering/ClusteringDocumentList.java?r=3345

Dawid


Gaurang Patel wrote:

Hi all,

Does anyone know how I can use the Nutch crawl results for clustering them
with the Carrot2 clustering engine? What I want is different from the Carrot2
clustering plugin that comes with Nutch. I want to write my own code for
retrieving a document list from the Nutch crawl results, and then supply
this list to the Carrot2 algorithm.

Any kind of quick help will be appreciated.

Regards,
Gaurang



Re: Nutch Topical / Focused Crawl

2009-04-02 Thread Ken Krugler

Hi @ all,

I'd like to turn Nutch into a focused / topical crawler. It's 
part of my final year thesis. Further, I'd like others to be able to 
benefit from my work. I started to analyze the code and think 
that I found the right piece of code. I just wanted to know if I am 
on the right track. I think the right place to implement a 
decision about whether to fetch further is in the output method of the 
Fetcher class, every time we call the collect method of the OutputCollector 
object.


private ParseStatus output(Text key, CrawlDatum datum, Content content,
ProtocolStatus pstatus, int status) {
...
output.collect(...);
...
}

Would you mind letting me know the best way to turn this decision 
into a plugin? I was thinking of going a similar way to the scoring 
filters. Thanks in advance.


Don't have the code in front of me right now, but we did something 
like this for a focused tech pages crawl with Krugle a few years 
back. Our goal was to influence the OPIC scores to ensure that pages 
we thought were likely to be "good" technical pages got fetched 
sooner.


Assuming you're using the scoring-opic plugin, then you'd create a 
custom ScoringFilter that gets executed after the scoring-opic plugin.


But the actual process of hooking everything up was pretty complicated and 
error prone, unfortunately. We had to define our own keys for storing 
our custom scores inside of the parse_data Metadata, the content 
Metadata, and the CrawlDB Metadata.


And we had to implement the following methods for our scoring plugin:

setConf()
injectScore()
initialScore();
generateSortValue();
passScoreBeforeParsing();
passScoreAfterParsing();
shouldHarvestOutlinks();
distributeScoreToOutlink();
updateDbScore();
indexerScore();

-- Ken
--
Ken Krugler
+1 530-210-6378


Re: NUTCH-722 is resolved

2009-03-23 Thread Andrzej Bialecki

Sami Siren wrote:
I think we are good to go for rc2 and it also seems that the smartest 
thing to do with the package contents at this point is "do not touch them".


I agree.



I will roll out the new rc later today.


Great, thanks.


--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Nutch on Eclipse How To?

2009-03-20 Thread Bartosz Gadzimski

Sherjeel Niazi pisze:

I am working on Windows.


Ok, so you have to download:
cygwin: http://www.cygwin.com/setup.exe
nutch (from trunk) 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/758/artifact/trunk/build/nutch-2009-03-20_04-01-47.tar.gz


Install Cygwin and set the PATH variable for it.

It's in Control Panel, System, Advanced tab, Environment Variables; 
edit/add PATH.

My PATH looks something like:
C:\Sun\SDK\bin;C:\cygwin\bin

If you run "bash" from Start->Run->cmd.exe, it should work.

Then you can follow the instructions on the wiki:
http://wiki.apache.org/nutch/RunNutchInEclipse0.9

Good luck,
Bartosz


Re: Nutch on Eclipse How To?

2009-03-20 Thread Sherjeel Niazi
I am working on Windows.


Re: Nutch on Eclipse How To?

2009-03-20 Thread Bartosz Gadzimski

Sherjeel Niazi pisze:

Hi there,

I want to configure nutch on Eclipse.
Can you please help me with how I can do so? From where can I download 
the code, jar files, etc.?



Thanks,
Sherjeel.

Windows or linux ?


Re: Nutch ML cleanup

2009-03-10 Thread Otis Gospodnetic

Thanks for the clarification.  I'm now a happy Nutch SF-list unsubscriber.  
Sorry for the confusion.

Otis



- Original Message 
> From: Doug Cutting 
> To: nutch-dev@lucene.apache.org
> Sent: Tuesday, March 10, 2009 4:11:03 PM
> Subject: Re: Nutch ML cleanup
> 
> ogjunk-nu...@yahoo.com is a member of nutch-...@lists.sourceforge.net and 
> nutch-gene...@lists.sourceforge.net.  These lists do not otherwise appear to 
> forward to Apache lists.  They used to perhaps forward through nutch.org 
> lists, 
> but that domain no longer forwards any email.  Please check the message 
> headers 
> to see how this message is routed to you.  If it is indeed routed through 
> Apache 
> servers then please send the headers to me.
> 
> Doug
> 
> Andrzej Bialecki wrote:
> > Otis Gospodnetic wrote:
> >> Hi,
> >> 
> >> This has been bugging me for a while now.  For some reason Nutch MLs get 
> >> the 
> most "junk" emails - both rude/rudeish emails, as well as clear spam (with 
> "SPAM" in the subject - something must be detecting it). I just looked at the 
> headers of the clearly labeled spam messages and found that they all seem to 
> come from SF:
> >> 
> >>  To: nutch-...@lists.sourceforge.net
> >>  To: nutch-gene...@lists.sourceforge.net
> >> 
> >> I assume there is some kind of a mail forward from the old Nutch MLs on SF 
> >> to 
> the "new" Nutch MLs at ASF.
> >> Do you think we could remove this forwarding and get rid of this spam?
> >> 
> >> Sami & Andrzej seem to be members who might be able to make this change:
> >> 
> >> http://sourceforge.net/project/memberlist.php?group_id=59548
> > 
> > Actually, only Doug and Mike Cafarella are admins of that project.
> > 
> > Doug, could you please disable this forwarding?
> > 
> > 



Re: Nutch ML cleanup

2009-03-10 Thread Doug Cutting
ogjunk-nu...@yahoo.com is a member of nutch-...@lists.sourceforge.net 
and nutch-gene...@lists.sourceforge.net.  These lists do not otherwise 
appear to forward to Apache lists.  They used to perhaps forward through 
nutch.org lists, but that domain no longer forwards any email.  Please 
check the message headers to see how this message is routed to you.  If 
it is indeed routed through Apache servers then please send the headers 
to me.


Doug

Andrzej Bialecki wrote:

Otis Gospodnetic wrote:

Hi,

This has been bugging me for a while now.  For some reason Nutch MLs 
get the most "junk" emails - both rude/rudeish emails, as well as 
clear spam (with "SPAM" in the subject - something must be detecting 
it). 
I just looked at the headers of the clearly labeled spam messages and 
found that they all seem to come from SF:


 To: nutch-...@lists.sourceforge.net
 To: nutch-gene...@lists.sourceforge.net

I assume there is some kind of a mail forward from the old Nutch MLs 
on SF to the "new" Nutch MLs at ASF.

Do you think we could remove this forwarding and get rid of this spam?

Sami & Andrzej seem to be members who might be able to make this 
change:


http://sourceforge.net/project/memberlist.php?group_id=59548


Actually, only Doug and Mike Cafarella are admins of that project.

Doug, could you please disable this forwarding?




Re: Nutch ML cleanup

2009-03-10 Thread Andrzej Bialecki

Otis Gospodnetic wrote:

Hi,

This has been bugging me for a while now.  For some reason Nutch MLs get the most "junk" emails - both rude/rudeish emails, as well as clear spam (with "SPAM" in the subject - something must be detecting it).  


I just looked at the headers of the clearly labeled spam messages and found 
that they all seem to come from SF:

 To: nutch-...@lists.sourceforge.net
 To: nutch-gene...@lists.sourceforge.net

I assume there is some kind of a mail forward from the old Nutch MLs on SF to the 
"new" Nutch MLs at ASF.
Do you think we could remove this forwarding and get rid of this spam?

Sami & Andrzej seem to be members who might be able to make this change:

http://sourceforge.net/project/memberlist.php?group_id=59548


Actually, only Doug and Mike Cafarella are admins of that project.

Doug, could you please disable this forwarding?


--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Nutch ML cleanup

2009-03-09 Thread Sami Siren
Like I suspected: I have no power to do or view any admin stuff there. 
Btw. I am not seeing any spam, perhaps Google takes care of that for me?


--
Sami Siren

Sami Siren wrote:
I'll take a look at this, I am pretty sure we have to ask Doug at the 
end :)


--
Sami Siren

Otis Gospodnetic wrote:

Hi,

This has been bugging me for a while now.  For some reason Nutch MLs 
get the most "junk" emails - both rude/rudeish emails, as well as 
clear spam (with "SPAM" in the subject - something must be detecting 
it). 
I just looked at the headers of the clearly labeled spam messages and 
found that they all seem to come from SF:


 To: nutch-...@lists.sourceforge.net
 To: nutch-gene...@lists.sourceforge.net

I assume there is some kind of a mail forward from the old Nutch MLs 
on SF to the "new" Nutch MLs at ASF.

Do you think we could remove this forwarding and get rid of this spam?

Sami & Andrzej seem to be members who might be able to make this 
change:


http://sourceforge.net/project/memberlist.php?group_id=59548

Otis
  






Re: Nutch ML cleanup

2009-03-09 Thread Sami Siren

I'll take a look at this, I am pretty sure we have to ask Doug at the end :)

--
Sami Siren

Otis Gospodnetic wrote:

Hi,

This has been bugging me for a while now.  For some reason Nutch MLs get the most "junk" emails - both rude/rudeish emails, as well as clear spam (with "SPAM" in the subject - something must be detecting it).  


I just looked at the headers of the clearly labeled spam messages and found 
that they all seem to come from SF:

 To: nutch-...@lists.sourceforge.net
 To: nutch-gene...@lists.sourceforge.net

I assume there is some kind of a mail forward from the old Nutch MLs on SF to the 
"new" Nutch MLs at ASF.
Do you think we could remove this forwarding and get rid of this spam?

Sami & Andrzej seem to be members who might be able to make this change:

http://sourceforge.net/project/memberlist.php?group_id=59548

Otis
  




Re: NUTCH-684 [was: Re: [VOTE] Release Apache Nutch 1.0]

2009-03-09 Thread Doğacan Güney
On Mon, Mar 9, 2009 at 17:46, Sami Siren  wrote:
> Doğacan Güney wrote:
>>
>>
>> On 09.Mar.2009, at 11:05, Sami Siren wrote:
>>
>>> Doğacan Güney wrote:

 On Sun, Mar 8, 2009 at 20:25, Sami Siren wrote:

>
> Hello,
>
> I have packaged the first release candidate for Apache Nutch 1.0
> release at
>
> http://people.apache.org/~siren/nutch-1.0/rc0/
> 
>
> See the included CHANGES.txt file for details on release contents and
> latest
> changes. The release was made from tag:
>
> http://svn.apache.org/viewvc/lucene/nutch/tags/release-1.0-rc0/?pathrev=751480
>
> Please vote on releasing this package as Apache Nutch 1.0. The vote is
> open
> for the next 72 hours. Only votes from Lucene PMC members are binding,
> but
> everyone is welcome to check the release candidate and voice their
> approval
> or disapproval. The vote  passes if at least three binding +1 votes are
> cast.
>
> [ ] +1 Release the packages as Apache Nutch 1.0
> [ ] -1 Do not release the packages because...
>
> Thanks!
>
>

 That's great!

 I would like to see NUTCH-684 in but I guess I was too late :)

 Anyway, my non-binding +1.

>>>
>>> uh, I missed that one, sorry. Do you think it's ready to be included?
>>> (IMO that's an important feature) It's not a big deal for me to rebuild the
>>> package with that feature included.
>>>
>>
>> I only tested it on a small crawl. Still, I believe it is important too so
>> I would like to include it. Worst case we release a 1.0.1 soon after:)
>
> I am fine either way. So if you think it's good enough to go in just commit
> it and I'll build another rc. If not then we can release it later too when
> it's ready.
>

Committed, thanks for waiting :)

> --
> Sami Siren
>
>
>>
>>> --
>>>  Sami Siren
>>>
>
>



-- 
Doğacan Güney


Re: NUTCH-684 [was: Re: [VOTE] Release Apache Nutch 1.0]

2009-03-09 Thread Sami Siren

Doğacan Güney wrote:



On 09.Mar.2009, at 11:05, Sami Siren wrote:



Doğacan Güney wrote:

On Sun, Mar 8, 2009 at 20:25, Sami Siren wrote:
  

Hello,

I have packaged the first release candidate for Apache Nutch 1.0 release at

http://people.apache.org/~siren/nutch-1.0/rc0/ 


See the included CHANGES.txt file for details on release contents and latest
changes. The release was made from tag:
http://svn.apache.org/viewvc/lucene/nutch/tags/release-1.0-rc0/?pathrev=751480

Please vote on releasing this package as Apache Nutch 1.0. The vote is open
for the next 72 hours. Only votes from Lucene PMC members are binding, but
everyone is welcome to check the release candidate and voice their approval
or disapproval. The vote  passes if at least three binding +1 votes are
cast.

[ ] +1 Release the packages as Apache Nutch 1.0
[ ] -1 Do not release the packages because...

Thanks!



That's great!

I would like to see NUTCH-684 in but I guess I was too late :)

Anyway, my non-binding +1.
  


uh, I missed that one, sorry. Do you think it's ready to be included? 
(IMO that's an important feature) It's not a big deal for me to 
rebuild the package with that feature included.




I only tested it on a small crawl. Still, I believe it is important 
too so I would like to include it. Worst case we release a 1.0.1 soon 
after:)
I am fine either way. So if you think it's good enough to go in just 
commit it and I'll build another rc. If not then we can release it later 
too when it's ready.


--
Sami Siren





--
 Sami Siren





Re: NUTCH-684 [was: Re: [VOTE] Release Apache Nutch 1.0]

2009-03-09 Thread Doğacan Güney



On 09.Mar.2009, at 11:05, Sami Siren  wrote:


Doğacan Güney wrote:


On Sun, Mar 8, 2009 at 20:25, Sami Siren  wrote:


Hello,

I have packaged the first release candidate for Apache Nutch 1.0  
release at


http://people.apache.org/~siren/nutch-1.0/rc0/

See the included CHANGES.txt file for details on release contents  
and latest

changes. The release was made from tag:
http://svn.apache.org/viewvc/lucene/nutch/tags/release-1.0-rc0/?pathrev=751480

Please vote on releasing this package as Apache Nutch 1.0. The  
vote is open
for the next 72 hours. Only votes from Lucene PMC members are  
binding, but
everyone is welcome to check the release candidate and voice their  
approval
or disapproval. The vote  passes if at least three binding +1  
votes are

cast.

[ ] +1 Release the packages as Apache Nutch 1.0
[ ] -1 Do not release the packages because...

Thanks!



That's great!

I would like to see NUTCH-684 in but I guess I was too late :)

Anyway, my non-binding +1.



uh, I missed that one, sorry. Do you think it's ready to be  
included? (IMO that's an important feature) It's not a big deal for  
me to rebuild the package with that feature included.




I only tested it on a small crawl. Still, I believe it is important  
too so I would like to include it. Worst case we release a 1.0.1 soon  
after:)



--
 Sami Siren



Re: [Nutch Wiki] Update of "InstallingWeb2" by SamiSiren

2009-02-20 Thread Sami Siren

Andrzej Bialecki wrote:

Apache Wiki wrote:

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" 
for change notification.


The following page has been changed by SamiSiren:
http://wiki.apache.org/nutch/InstallingWeb2

-- 


+ == NOTE: Web2 module is no longer part of Nutch ==
+ So these instructions do no longer apply.


Are you going to remove web2/ now? I'm +1 on this - the application is 
certainly nice, but it's not actively maintained and too complex for 
casual users to tweak. And in my experience people don't use Nutch 
webapp in production - either they roll their own or use 
OpenSearchServlet instead.


gone already :)

--
Sami Siren


Re: [Nutch Wiki] Update of "InstallingWeb2" by SamiSiren

2009-02-20 Thread Andrzej Bialecki

Apache Wiki wrote:

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.

The following page has been changed by SamiSiren:
http://wiki.apache.org/nutch/InstallingWeb2

--
+ == NOTE: Web2 module is no longer part of Nutch ==
+ So these instructions do no longer apply.


Are you going to remove web2/ now? I'm +1 on this - the application is 
certainly nice, but it's not actively maintained and too complex for 
casual users to tweak. And in my experience people don't use Nutch 
webapp in production - either they roll their own or use 
OpenSearchServlet instead.



--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Nutch ScoringFilter plugin problems

2009-01-26 Thread Doğacan Güney
On Mon, Jan 26, 2009 at 2:17 PM, Pau  wrote:
> Hello,
> I still have the same problem. I have the following piece of code
>
>   if (linkdb == null) {
> System.out.println("Null linkdb");
>   } else {
> System.out.println("LinkDB not null");
>   }
>   Inlinks inlinks = linkdb.getInlinks(url);
>   System.out.println("a");
>
> On the output I can see it always prints "LinkDB not null", so linkdb is not
> null. But "a" never gets printed, so I guess that at: " Inlinks inlinks =
> linkdb.getInlinks(url); " there is some error. Maybe the getInlinks function
> throws an IOException?
> I do catch the IOException, but the catch block is never executed either.
>

It is very difficult to guess without seeing the exception. Maybe you can try
catching everything (i.e. Throwable) and printing it?
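
Something along these lines (just a debugging sketch around your existing
linkdb and url variables, not a fix):

import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.crawl.LinkDbReader;

class InlinksDebug {
  // Surfaces whatever getInlinks() really throws, including Errors such as
  // NoClassDefFoundError, which a catch (IOException) block would never see.
  static Inlinks getInlinksOrNull(LinkDbReader linkdb, Text url) {
    try {
      return linkdb.getInlinks(url);
    } catch (Throwable t) {
      t.printStackTrace();
      return null;
    }
  }
}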

> One question, how should I create the LinkDBReader? I do it the following
> way:
>  linkdb = new LinkDbReader(getConf(), new Path("crawl/linkdb"));
> Is it right? Thanks.
>
>
> On Wed, Jan 21, 2009 at 10:16 AM, Pau  wrote:
>>
>> Ok, I think you are right, maybe "inlinks" is null. I will try it now.
>> Thank you!
>> I have no information about the exception. It seems that simply the
>> program skips this part of the code... maybe a ScoringFilterException is
>> thrown?
>>
>> On Wed, Jan 21, 2009 at 9:47 AM, Doğacan Güney  wrote:
>>>
>>> On Tue, Jan 20, 2009 at 7:18 PM, Pau  wrote:
>>> > Hello,
>>> > I want to create a new ScoringFilter plugin. In order to evaluate how
>>> > interesting a web page is, I need information about the link structure
>>> > in
>>> > the LinkDB.
>>> > In the method updateDBScore, I have the following lines (among others):
>>> >
>>> > 88linkdb = new LinkDbReader(getConf(),
>>> > new
>>> > Path("crawl/linkdb"));
>>> > ...
>>> > 99System.out.println("Inlinks to " +
>>> > url);
>>> >100Inlinks inlinks =
>>> > linkdb.getInlinks(url);
>>> >101System.out.println("a");
>>> >102Iterator iIt =
>>> > inlinks.iterator();
>>> >103System.out.println("b");
>>> >
>>> > "a" always gets printed, but "b" rarely gets printed, so this seems
>>> > that in
>>> > line 102 an error happens, and an exception is raised. Do you know why
>>> > this
>>> > is happening? What am I doing wrong? Thanks.
>>> >
>>>
>>> Maybe there are no inlinks to that page so "inlinks" is null? What is
>>> the exception
>>> exactly?
>>>
>>> >
>>>
>>>
>>>
>>> --
>>> Doğacan Güney
>>
>
>



-- 
Doğacan Güney


Re: Nutch ScoringFilter plugin problems

2009-01-26 Thread Pau
Hello,
I still have the same problem. I have the following piece of code

  if (linkdb == null) {
System.out.println("Null linkdb");
  } else {
System.out.println("LinkDB not null");
  }
  Inlinks inlinks = linkdb.getInlinks(url);
  System.out.println("a");

On the output I can see it always prints "LinkDB not null", so linkdb is not
null. But "a" never gets printed, so I guess that at: " Inlinks inlinks =
linkdb.getInlinks(url); " there is some error. Maybe the getInlinks function
throws an IOException?
I do catch the IOException, but the catch block is never executed either.

One question, how should I create the LinkDBReader? I do it the following
way:
 linkdb = new LinkDbReader(getConf(), new Path("crawl/linkdb"));
Is it right? Thanks.


On Wed, Jan 21, 2009 at 10:16 AM, Pau  wrote:

> Ok, I think you are right, maybe "inlinks" is null. I will try it now.
> Thank you!
> I have no information about the exception. It seems that simply the program
> skips this part of the code... maybe a ScoringFilterException is thrown?
>
>
> On Wed, Jan 21, 2009 at 9:47 AM, Doğacan Güney  wrote:
>
>> On Tue, Jan 20, 2009 at 7:18 PM, Pau  wrote:
>> > Hello,
>> > I want to create a new ScoringFilter plugin. In order to evaluate how
>> > interesting a web page is, I need information about the link structure
>> in
>> > the LinkDB.
>> > In the method updateDBScore, I have the following lines (among others):
>> >
>> > 88linkdb = new LinkDbReader(getConf(),
>> new
>> > Path("crawl/linkdb"));
>> > ...
>> > 99System.out.println("Inlinks to " +
>> url);
>> >100Inlinks inlinks =
>> linkdb.getInlinks(url);
>> >101System.out.println("a");
>> >102Iterator iIt =
>> inlinks.iterator();
>> >103System.out.println("b");
>> >
>> > "a" always gets printed, but "b" rarely gets printed, so this seems that
>> in
>> > line 102 an error happens, and an exception is raised. Do you know why
>> this
>> > is happening? What am I doing wrong? Thanks.
>> >
>>
>> Maybe there are no inlinks to that page so "inlinks" is null? What is
>> the exception
>> exactly?
>>
>> >
>>
>>
>>
>> --
>> Doğacan Güney
>>
>
>


Re: Nutch ScoringFilter plugin problems

2009-01-21 Thread Pau
Ok, I think you are right, maybe "inlinks" is null. I will try it now. Thank
you!
I have no information about the exception. It seems that simply the program
skips this part of the code... maybe a ScoringFilterException is thrown?

On Wed, Jan 21, 2009 at 9:47 AM, Doğacan Güney  wrote:

> On Tue, Jan 20, 2009 at 7:18 PM, Pau  wrote:
> > Hello,
> > I want to create a new ScoringFilter plugin. In order to evaluate how
> > interesting a web page is, I need information about the link structure in
> > the LinkDB.
> > In the method updateDBScore, I have the following lines (among others):
> >
> > 88linkdb = new LinkDbReader(getConf(),
> new
> > Path("crawl/linkdb"));
> > ...
> > 99System.out.println("Inlinks to " +
> url);
> >100Inlinks inlinks =
> linkdb.getInlinks(url);
> >101System.out.println("a");
> >102Iterator iIt =
> inlinks.iterator();
> >103System.out.println("b");
> >
> > "a" always gets printed, but "b" rarely gets printed, so this seems that
> in
> > line 102 an error happens, and an exception is raised. Do you know why
> this
> > is happening? What am I doing wrong? Thanks.
> >
>
> Maybe there are no inlinks to that page so "inlinks" is null? What is
> the exception
> exactly?
>
> >
>
>
>
> --
> Doğacan Güney
>


Re: Nutch ScoringFilter plugin problems

2009-01-21 Thread Doğacan Güney
On Tue, Jan 20, 2009 at 7:18 PM, Pau  wrote:
> Hello,
> I want to create a new ScoringFilter plugin. In order to evaluate how
> interesting a web page is, I need information about the link structure in
> the LinkDB.
> In the method updateDBScore, I have the following lines (among others):
>
> 88linkdb = new LinkDbReader(getConf(), new
> Path("crawl/linkdb"));
> ...
> 99System.out.println("Inlinks to " + url);
>100Inlinks inlinks = linkdb.getInlinks(url);
>101System.out.println("a");
>102Iterator iIt = inlinks.iterator();
>103System.out.println("b");
>
> "a" always gets printed, but "b" rarely gets printed, so this seems that in
> line 102 an error happens, and an exception is raised. Do you know why this
> is happening? What am I doing wrong? Thanks.
>

Maybe there are no inlinks to that page so "inlinks" is null? What is
the exception
exactly?

>



-- 
Doğacan Güney


Re: nutch segment format

2009-01-05 Thread Todd Lipcon
Hi Matt,

The nutch segments are stored as Hadoop SequenceFiles and MapFiles. MapFile
is made up of multiple SequenceFiles. I'm not certain if the format is
documented anywhere, but the source is in org.apache.hadoop.io. I doubt
you'll find a PHP library for reading them, so you'll probably have to write
something yourself.

-Todd
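
As a rough illustration of the "write something yourself" route on the Java
side, the sketch below walks one part file of a segment with the plain
SequenceFile reader and prints each record. The example path and the assumption
that keys are URL strings are typical but can differ between Nutch versions, so
treat this as a starting point rather than a reference reader.

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.SequenceFile;
  import org.apache.hadoop.io.Writable;
  import org.apache.hadoop.io.WritableComparable;
  import org.apache.hadoop.util.ReflectionUtils;

  public class SegmentDump {
    public static void main(String[] args) throws Exception {
      // e.g. crawl/segments/20090105123456/content/part-00000/data
      Path data = new Path(args[0]);
      Configuration conf = new Configuration();
      FileSystem fs = FileSystem.get(conf);

      SequenceFile.Reader reader = new SequenceFile.Reader(fs, data, conf);
      try {
        WritableComparable key = (WritableComparable)
            ReflectionUtils.newInstance(reader.getKeyClass(), conf);
        Writable value = (Writable)
            ReflectionUtils.newInstance(reader.getValueClass(), conf);
        while (reader.next(key, value)) {
          // key is typically the URL; the value class depends on the
          // sub-directory (Content, CrawlDatum, ParseText, ...). Re-serialize
          // here into CSV, JSON or a database -- whatever PHP can consume.
          System.out.println(key + "\t" + value);
        }
      } finally {
        reader.close();
      }
    }
  }

For PHP the usual approach is not to parse the binary files directly but to run
a small exporter like this (or, if your Nutch version ships it, the
SegmentReader tool behind bin/nutch readseg -dump, which writes a plain-text
dump) and consume its output.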

On Mon, Jan 5, 2009 at 10:32 AM, Matt Pearson  wrote:

>  Hi Everyone,
>
>
>
> I'm looking into reading data from Nutch segments with PHP is there
> anywhere where I can get information on the format in which the data is
> stored?
>
>
>
> Thanks and apologies if this isn't the right place to ask this question.
>
>
>
>
>
> Matt Pearson
>
>
>
>
>


Re: NUTCH-92

2008-11-27 Thread Doğacan Güney
On Thu, Nov 27, 2008 at 11:40 PM, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:
> Doğacan Güney wrote:
>
>>
>> It seems I wrote the patch in NUTCH-92. My recollection was that you
>> wrote it, Andrzej :D
>
> No, I didn't - you did! :) I only came up with the proposal, after
> discussing it with Doug.
>
>> Anyway, I have no idea what I did in that patch, don't know if it
>> works or applies etc. Really,
>> I am just curious. Did anyone test it? Does it really work :) ?
>
> Not me. I shied away from the patch because I didn't like the 2 RPC-s per
> search. I still don't like it, but I may have to accept it as an interim
> solution.
>
> That was my question, really - for release 1.0:
>
> * are we better off not having this patch, and just be careful how we split
> indexes among searchers as we do it now, or
>
> * should we apply the patch, pay the price of 2 RPCs, and wait for the patch
> implementing the approach that I proposed?
>
> * or make an effort to implement the new approach, and postpone the release
> until this is ready.
>

3rd approach sounds the best, especially if new approach is not
difficult to implement.
(I may even give it a try if I have the time)

>
>>
>> I haven't read the paper yet but the proposed approach sounds better
>> to me. Do you have any
>> code ready, Andrzej? Or how difficult is it to implement it?
>
> No code yet, just thinking aloud. But it's not really anything complicated,
> chunks of code already exist that implement almost all building blocks of
> the algorithm.
>
> --
> Best regards,
> Andrzej Bialecki <><
>  ___. ___ ___ ___ _ _   __
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>
>



-- 
Doğacan Güney


Re: NUTCH-92

2008-11-27 Thread Andrzej Bialecki

Doğacan Güney wrote:



It seems I wrote the patch in NUTCH-92. My recollection was that you
wrote it, Andrzej :D


No, I didn't - you did! :) I only came up with the proposal, after 
discussing it with Doug.



Anyway, I have no idea what I did in that patch, don't know if it
works or applies etc. Really,
I am just curious. Did anyone test it? Does it really work :) ?


Not me. I shied away from the patch because I didn't like the 2 RPC-s 
per search. I still don't like it, but I may have to accept it as an 
interim solution.


That was my question, really - for release 1.0:

* are we better off not having this patch, and just be careful how we 
split indexes among searchers as we do it now, or


* should we apply the patch, pay the price of 2 RPCs, and wait for the 
patch implementing the approach that I proposed?


* or make an effort to implement the new approach, and postpone the 
release until this is ready.





I haven't read the paper yet but the proposed approach sounds better
to me. Do you have any
code ready, Andrzej? Or how difficult is it to implement it?


No code yet, just thinking aloud. But it's not really anything 
complicated, chunks of code already exist that implement almost all 
building blocks of the algorithm.


--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: NUTCH-92

2008-11-27 Thread Doğacan Güney
Hi,

On Wed, Nov 26, 2008 at 3:04 AM, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:
> Hi all,
>
> After reading this paper:
>
> http://wortschatz.uni-leipzig.de/~fwitschel/papers/ipm1152.pdf
>
> I came up with the following idea of implementing global IDF in Nutch. The
> upside of the approach I propose is that it brings back the cost of making a
> search query to 1 RPC call. The downside is that the search servers need to
> cache global IDF estimates as computed by the DS.Client, which ties them to
> a single query front-end (DistributedSearch.Client), or requires keeping a
> map of  on each search server.
>
> -
>
> First, as the paper above claims, we don't really need exact IDF values of
> all terms from every index. We should get acceptable quality if we only
> learn the top-N frequent terms, and for the rest of them we apply a
> smoothing function that is based on global characteristics of each index
> (such as the number of terms in the index).
>
> This means that the data that needs to be collected by the query integrator
> (DS.Client in Nutch) from shard servers (DS.Server in Nutch) would consist
> of a list of e.g. top 500 local terms with their frequency, plus the local
> smoothing factor as a single value.
>
> We could further reduce the amount of data to be sent from/to shard servers
> by encoding this information in a counted Bloom filter with a single-byte
> resolution (or a spectral Bloom filter, whichever yields a better precision
> / bit in our case).
>
> The query integrator would ask all active shard servers to provide their
> local IDF data, and it would compute global IDFs for these terms, plus a
> global smoothing factor, and send back the updated information to each shard
> server. This would happen once per lifetime of a local shard, and is needed
> because of the local query rewriting (and expansion of terms from Nutch
> Query to Lucene Query).
>
> Shard servers would then process incoming queries using the IDF estimates
> for terms included in the global IDF data, or the global smoothing factors
> for terms missing from that data (or use local IDFs).
>
> The global IDF data would have to be recomputed each time the set of shards
> available to a DS.Client changes, and then it needs to be broadcast back
> from the client to all servers - which is the downside of this solution,
> because servers need to keep a cache of this information for every DS.Client
> (each of them possibly having a different list of shard servers, hence
> different IDFs). Also, as shard servers come and go, the IDF data keeps
> being recomputed and broadcast, which increases the traffic between the
> client and servers.
>
> Still I believe the amount of additional traffic should be minimal in a
> typical scenario, where changes to the shards are much less frequent than
> the frequency of sending user queries. :)
>
> --
>
> Now, if this approach seems viable (please comment on this), what should we
> do with the patches in NUTCH-92 ?
>
> 1. skip them for now, and wait until the above approach is implemented, and
> pay the penalty of using skewed local IDFs.
>
> 2. apply them now, and pay the penalty of additional RPC call / search, and
> replace this mechanism with the one described above, whenever that becomes
> available.
>

It seems I wrote the patch in NUTCH-92. My recollection was that you
wrote it, Andrzej :D
Anyway, I have no idea what I did in that patch, don't know if it
works or applies etc. Really,
I am just curious. Did anyone test it? Does it really work :) ?

I haven't read the paper yet but the proposed approach sounds better
to me. Do you have any
code ready, Andrzej? Or how difficult is it to implement it?

> --
> Best regards,
> Andrzej Bialecki <><
>  ___. ___ ___ ___ _ _   __
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>
>



-- 
Doğacan Güney


Re: NUTCH-92

2008-11-25 Thread Sean Dean
This method of calculating global IDF values certainly sounds more efficient 
than the currently proposed method. The reduction of 1 RPC call during the 
search query (so that only 1 RPC call is made in total) should reduce the 
overall load on each search server. I prefer the idea of having network 
broadcasts going out during the initial startup and thereafter only during a 
topology-changing event.

To me this kind of sounds like network routing tables: the initial table is 
set up during startup and checked periodically for changes. When a change is 
detected the table is modified (sometimes regenerated completely) and the 
network continues to operate. The alternative (based on the current patch) is 
to check the table every time a packet (or maybe connection) is sent to one of 
the devices listed inside. This method may be faster to detect any problem but 
the additional load would be substantial.

With all this said, though, researching and developing this new method may 
take an extended period of time depending on developer availability. 
older code that may only need a quick refresh to work with trunk (and the 
future 1.0 release). I would personally like to see NUTCH-92 (or some form of 
it) included in trunk for a legitimate evaluation before the next release.


Sean Dean





From: Andrzej Bialecki <[EMAIL PROTECTED]>
To: nutch-dev@lucene.apache.org
Sent: Tuesday, November 25, 2008 8:04:22 PM
Subject: NUTCH-92

Hi all,

After reading this paper:

http://wortschatz.uni-leipzig.de/~fwitschel/papers/ipm1152.pdf

I came up with the following idea of implementing global IDF in Nutch. 
The upside of the approach I propose is that it brings back the cost of 
making a search query to 1 RPC call. The downside is that the search 
servers need to cache global IDF estimates as computed by the DS.Client, 
which ties them to a single query front-end (DistributedSearch.Client), 
or requires keeping a map of  on each search server.

-

First, as the paper above claims, we don't really need exact IDF values 
of all terms from every index. We should get acceptable quality if we 
only learn the top-N frequent terms, and for the rest of them we apply a 
smoothing function that is based on global characteristics of each index 
(such as the number of terms in the index).

This means that the data that needs to be collected by the query 
integrator (DS.Client in Nutch) from shard servers (DS.Server in Nutch) 
would consist of a list of e.g. top 500 local terms with their 
frequency, plus the local smoothing factor as a single value.

We could further reduce the amount of data to be sent from/to shard 
servers by encoding this information in a counted Bloom filter with a 
single-byte resolution (or a spectral Bloom filter, whichever yields a 
better precision / bit in our case).

The query integrator would ask all active shard servers to provide their 
local IDF data, and it would compute global IDFs for these terms, plus a 
global smoothing factor, and send back the updated information to each 
shard server. This would happen once per lifetime of a local shard, and 
is needed because of the local query rewriting (and expansion of terms 
from Nutch Query to Lucene Query).

Shard servers would then process incoming queries using the IDF 
estimates for terms included in the global IDF data, or the global 
smoothing factors for terms missing from that data (or use local IDFs).
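
(To make the bookkeeping concrete, here is a toy sketch of just the merge step
on the integrator side. This is not Nutch code: it ignores the RPC and Bloom
filter parts entirely, and the class and field names are invented.)

  import java.util.HashMap;
  import java.util.List;
  import java.util.Map;

  // Each shard reports its document count plus the doc frequencies of its
  // top-N terms; the integrator merges them into global IDF estimates and a
  // single smoothing value for terms outside every shard's top-N list.
  public class GlobalIdfSketch {

    public static class ShardStats {
      public long numDocs;                  // documents in this shard
      public Map<String, Long> topTermDf;   // top-N term -> doc frequency
    }

    public static Map<String, Double> merge(List<ShardStats> shards) {
      long totalDocs = 0;
      Map<String, Long> globalDf = new HashMap<String, Long>();
      for (ShardStats s : shards) {
        totalDocs += s.numDocs;
        for (Map.Entry<String, Long> e : s.topTermDf.entrySet()) {
          Long seen = globalDf.get(e.getKey());
          globalDf.put(e.getKey(), (seen == null ? 0L : seen) + e.getValue());
        }
      }
      Map<String, Double> globalIdf = new HashMap<String, Double>();
      for (Map.Entry<String, Long> e : globalDf.entrySet()) {
        globalIdf.put(e.getKey(), Math.log((double) totalDocs / (1 + e.getValue())));
      }
      // crude smoothing value for terms that made no shard's top-N list
      globalIdf.put("__SMOOTHING__", Math.log(totalDocs / 2.0));
      return globalIdf;
    }
  }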

The global IDF data would have to be recomputed each time the set of 
shards available to a DS.Client changes, and then it needs to be 
broadcast back from the client to all servers - which is the downside of 
this solution, because servers need to keep a cache of this information 
for every DS.Client (each of them possibly having a different list of 
shard servers, hence different IDFs). Also, as shard servers come and 
go, the IDF data keeps being recomputed and broadcast, which increases 
the traffic between the client and servers.

Still I believe the amount of additional traffic should be minimal in a 
typical scenario, where changes to the shards are much less frequent 
than the frequency of sending user queries. :)

--

Now, if this approach seems viable (please comment on this), what should 
we do with the patches in NUTCH-92 ?

1. skip them for now, and wait until the above approach is implemented, 
and pay the penalty of using skewed local IDFs.

2. apply them now, and pay the penalty of additional RPC call / search, 
and replace this mechanism with the one described above, whenever that 
becomes available.

-- 
Best regards,
Andrzej Bialecki <><
  ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

Re: NUTCH-92 - DistributedSearch incorrectly scores results

2008-11-21 Thread Dennis Kubes

Sean Dean wrote:

Folks,
 
I was wondering if anyone could shed some light on the status of this 
issue heading into a potential 1.0 (or 0.x) release over the next few months?


Looks like a patch has been sitting there for a long time.  Don't know 
if it is still applicable or not.  Checking on it.


 
I realize many upgrades have been made to Hadoop and Lucene, and in 
addition to that bug fixes in just about every element of the system, but 
does this issue not prevent Nutch from being a truly scalable system?


I don't think so.  Especially with some of the new link analysis and 
indexing stuff we have run production systems up to 100 million and 150 
nodes and not seen problems (I think).


Dennis

 
My current situation limits me from providing development work, but I can 
(and will) be ready to test any solution submitted against the latest 
code-base. I believe getting the distributed search functionality 
working correctly should be a requirement for any 1.0 release candidate.
 
What does the rest of the community think?


Re: nutch 2.0

2008-06-19 Thread Dennis Kubes
Currently the only sources are on my local machine and svn.  Hope to 
have more and put it into a public svn soon.  If you want to take a look 
at the current, non-documented, code, please let me know off list and I 
will get you a copy.


Dennis

Marko Bauhardt wrote:

Hi all,
I found the Nutch 2.0 documentation in the wiki 
(http://wiki.apache.org/nutch/Nutch2Architecture) and I'm very 
interested in seeing the sources of this new Nutch architecture.

But I did not find a link to the Nutch 2.0 sources.

Do sources for the Nutch 2.0 project exist, or is there only documentation?
If sources exist, where can I get them? That would be very nice.



thanks
marko




Re: nutch-0.9 and hadoop-0.15.0

2008-06-09 Thread ogjunk-nutch
Use nutch-user mailing list, please.  I'll reply there.


Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


- Original Message 
> From: m.harig <[EMAIL PROTECTED]>
> To: nutch-dev@lucene.apache.org
> Sent: Monday, June 9, 2008 12:17:07 PM
> Subject: nutch-0.9 and hadoop-0.15.0
> 
> 
> hi all
> 
> I upgraded nutch-0.9 to hadoop-0.15.0 on Windows. When I start the crawl I
> am getting the following error.
> Can anyone tell me what this error is?
> 
> 
> 2008-06-09 15:43:28,906 WARN  mapred.LocalJobRunner - job_local_1
> java.io.IOException: CreateProcess: df -k F:\tmp\hadoop-Admin\mapred\local
> error=2
> at java.lang.ProcessImpl.create(Native Method)
> at java.lang.ProcessImpl.(ProcessImpl.java:81)
> at java.lang.ProcessImpl.start(ProcessImpl.java:30)
> at java.lang.ProcessBuilder.start(ProcessBuilder.java:451)
> at java.lang.Runtime.exec(Runtime.java:591)
> at java.lang.Runtime.exec(Runtime.java:464)
> at org.apache.hadoop.fs.ShellCommand.runCommand(ShellCommand.java:48)
> at org.apache.hadoop.fs.ShellCommand.run(ShellCommand.java:42)
> at org.apache.hadoop.fs.DF.getAvailable(DF.java:72)
> at
> org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:264)
> at
> org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:124)
> at
> org.apache.hadoop.mapred.MapOutputFile.getSpillFileForWrite(MapOutputFile.java:88)
> at
> org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpillToDisk(MapTask.java:382)
> at 
> org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:604)
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:193)
> at 
> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:132)
> -- 
> View this message in context: 
> http://www.nabble.com/nutch-0.9-and-hadoop-0.15.0-tp17729948p17729948.html
> Sent from the Nutch - Dev mailing list archive at Nabble.com.



Re: nutch file content limit

2008-06-08 Thread m.harig

Then how do I index files that are larger than 15MB? Please let me know.




ogjunk-nutch wrote:
> 
> Right, that is what I tried saying below.  I don't think you can index
> partially fetched doc/xls/rtf documents.
> 
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> 
> 
> - Original Message 
>> From: m.harig <[EMAIL PROTECTED]>
>> To: nutch-dev@lucene.apache.org
>> Sent: Friday, June 6, 2008 3:56:30 AM
>> Subject: Re: nutch file content limit
>> 
>> 
>> is there any way to index partial content of doc/xls/rtf . if its not
>> possible let me know.
>> 
>> 
>> ogjunk-nutch wrote:
>> > 
>> > I *think* you have to fetch the *full* content of MS Word docs (and
>> PDFs
>> > and RTFs and ...) if you want parsers that handle those documents to be
>> > able to parse them.  A partial MS Word/PDF/RTF/... document is
>> considered
>> > invalid/broken.  Try opening it with MS Word, for example -- it will
>> not
>> > work.
>> > 
>> > 
>> > Otis
>> > --
>> > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>> > 
>> > 
>> > - Original Message 
>> >> From: m.harig 
>> >> To: nutch-dev@lucene.apache.org
>> >> Sent: Thursday, June 5, 2008 3:27:18 AM
>> >> Subject: Re: nutch file content limit
>> >> 
>> >> 
>> >> thanks
>> >> 
>> >> my situation is this.. i've 100 MS-WORD files . each has 15MB in
>> size...
>> >> 
>> >> if i set file.content.limit as 5MB. when nutch goes for fetching it
>> can't
>> >> parse the content. it says Can't handle as Microsoft document. and its
>> >> failed.. how do i index partial content of those documents. any1 help
>> me
>> >> out
>> >> of this
>> >> 
>> >> 
>> >> this is my error
>> >> 
>> >> Can't be handled as Microsoft document. java.io.IOException: Cannot
>> >> remove
>> >> block[ 20839 ]; out of range
>> >> -- 
>> >> View this message in context: 
>> >>
>> http://www.nabble.com/nutch-file-content-limit-tp17640376p17663787.html
>> >> Sent from the Nutch - Dev mailing list archive at Nabble.com.
>> > 
>> > 
>> > 
>> 
>> -- 
>> View this message in context: 
>> http://www.nabble.com/nutch-file-content-limit-tp17640376p17686729.html
>> Sent from the Nutch - Dev mailing list archive at Nabble.com.
> 
> 
> 

-- 
View this message in context: 
http://www.nabble.com/nutch-file-content-limit-tp17640376p17727247.html
Sent from the Nutch - Dev mailing list archive at Nabble.com.



Re: nutch file content limit

2008-06-06 Thread ogjunk-nutch
Right, that is what I tried saying below.  I don't think you can index 
partially fetched doc/xls/rtf documents.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


- Original Message 
> From: m.harig <[EMAIL PROTECTED]>
> To: nutch-dev@lucene.apache.org
> Sent: Friday, June 6, 2008 3:56:30 AM
> Subject: Re: nutch file content limit
> 
> 
> is there any way to index partial content of doc/xls/rtf . if its not
> possible let me know.
> 
> 
> ogjunk-nutch wrote:
> > 
> > I *think* you have to fetch the *full* content of MS Word docs (and PDFs
> > and RTFs and ...) if you want parsers that handle those documents to be
> > able to parse them.  A partial MS Word/PDF/RTF/... document is considered
> > invalid/broken.  Try opening it with MS Word, for example -- it will not
> > work.
> > 
> > 
> > Otis
> > --
> > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> > 
> > 
> > - Original Message ----
> >> From: m.harig 
> >> To: nutch-dev@lucene.apache.org
> >> Sent: Thursday, June 5, 2008 3:27:18 AM
> >> Subject: Re: nutch file content limit
> >> 
> >> 
> >> thanks
> >> 
> >> my situation is this.. i've 100 MS-WORD files . each has 15MB in size...
> >> 
> >> if i set file.content.limit as 5MB. when nutch goes for fetching it can't
> >> parse the content. it says Can't handle as Microsoft document. and its
> >> failed.. how do i index partial content of those documents. any1 help me
> >> out
> >> of this
> >> 
> >> 
> >> this is my error
> >> 
> >> Can't be handled as Microsoft document. java.io.IOException: Cannot
> >> remove
> >> block[ 20839 ]; out of range
> >> -- 
> >> View this message in context: 
> >> http://www.nabble.com/nutch-file-content-limit-tp17640376p17663787.html
> >> Sent from the Nutch - Dev mailing list archive at Nabble.com.
> > 
> > 
> > 
> 
> -- 
> View this message in context: 
> http://www.nabble.com/nutch-file-content-limit-tp17640376p17686729.html
> Sent from the Nutch - Dev mailing list archive at Nabble.com.



Re: nutch file content limit

2008-06-06 Thread m.harig

Is there any way to index partial content of doc/xls/rtf? If it's not
possible, let me know.


ogjunk-nutch wrote:
> 
> I *think* you have to fetch the *full* content of MS Word docs (and PDFs
> and RTFs and ...) if you want parsers that handle those documents to be
> able to parse them.  A partial MS Word/PDF/RTF/... document is considered
> invalid/broken.  Try opening it with MS Word, for example -- it will not
> work.
> 
> 
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> 
> 
> - Original Message 
>> From: m.harig <[EMAIL PROTECTED]>
>> To: nutch-dev@lucene.apache.org
>> Sent: Thursday, June 5, 2008 3:27:18 AM
>> Subject: Re: nutch file content limit
>> 
>> 
>> thanks
>> 
>> my situation is this.. i've 100 MS-WORD files . each has 15MB in size...
>> 
>> if i set file.content.limit as 5MB. when nutch goes for fetching it can't
>> parse the content. it says Can't handle as Microsoft document. and its
>> failed.. how do i index partial content of those documents. any1 help me
>> out
>> of this
>> 
>> 
>> this is my error
>> 
>> Can't be handled as Microsoft document. java.io.IOException: Cannot
>> remove
>> block[ 20839 ]; out of range
>> -- 
>> View this message in context: 
>> http://www.nabble.com/nutch-file-content-limit-tp17640376p17663787.html
>> Sent from the Nutch - Dev mailing list archive at Nabble.com.
> 
> 
> 

-- 
View this message in context: 
http://www.nabble.com/nutch-file-content-limit-tp17640376p17686729.html
Sent from the Nutch - Dev mailing list archive at Nabble.com.



Re: nutch file content limit

2008-06-05 Thread ogjunk-nutch
I *think* you have to fetch the *full* content of MS Word docs (and PDFs and 
RTFs and ...) if you want parsers that handle those documents to be able to 
parse them.  A partial MS Word/PDF/RTF/... document is considered 
invalid/broken.  Try opening it with MS Word, for example -- it will not work.


Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


- Original Message 
> From: m.harig <[EMAIL PROTECTED]>
> To: nutch-dev@lucene.apache.org
> Sent: Thursday, June 5, 2008 3:27:18 AM
> Subject: Re: nutch file content limit
> 
> 
> thanks
> 
> my situation is this.. i've 100 MS-WORD files . each has 15MB in size...
> 
> if i set file.content.limit as 5MB. when nutch goes for fetching it can't
> parse the content. it says Can't handle as Microsoft document. and its
> failed.. how do i index partial content of those documents. any1 help me out
> of this
> 
> 
> this is my error
> 
> Can't be handled as Microsoft document. java.io.IOException: Cannot remove
> block[ 20839 ]; out of range
> -- 
> View this message in context: 
> http://www.nabble.com/nutch-file-content-limit-tp17640376p17663787.html
> Sent from the Nutch - Dev mailing list archive at Nabble.com.



Re: nutch file content limit

2008-06-05 Thread m.harig

Thanks.

My situation is this: I have 100 MS Word files, each 15MB in size.

If I set file.content.limit to 5MB, then when Nutch goes to fetch them it can't
parse the content; it says "Can't handle as Microsoft document" and the fetch
fails. How do I index partial content of those documents? Can anyone help me
out with this?


this is my error

Can't be handled as Microsoft document. java.io.IOException: Cannot remove
block[ 20839 ]; out of range
-- 
View this message in context: 
http://www.nabble.com/nutch-file-content-limit-tp17640376p17663787.html
Sent from the Nutch - Dev mailing list archive at Nabble.com.



Re: nutch file content limit

2008-06-04 Thread ogjunk-nutch
How's this:

$ grep -n content nutch/conf/nutch-default.xml 
28:  file.content.limit
30:  The length limit for downloaded content, in bytes.
31:  If this value is nonnegative (>=0), content longer than it will be 
truncated;
37:  file.content.ignored
39:  If true, no file content will be saved during fetch.
42:  and indexing stages. Otherwise file contents will be saved.
146:  http.content.limit
148:  The length limit for downloaded content, in bytes.
149:  If this value is nonnegative (>=0), content longer than it will be 
truncated;
246:  ftp.content.limit
248:  The length limit for downloaded content, in bytes.
249:  If this value is nonnegative (>=0), content longer than it will be 
truncated;
595:  If true, fetcher will parse content.
599:  fetcher.store.content
601:  If true, fetcher will store content.
864:  Defines if the mime content type detector uses magic 
resolution.
913:  content-types and parsers.
933:  content
935:  that it should not be shown as cached content, apply this policy. 
Currently
937:  "content" doesn't show the content, but shows summaries (snippets).
938:  "all" doesn't show either content or summaries.
1179:  the language (0 means full content analysis).


Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
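
Roughly speaking, the protocol plugins read these *.content.limit values from
the configuration and truncate what they download once the limit is reached.
The snippet below is a simplified stand-in for that pattern, not the actual
Nutch code; for the 15MB Word files discussed earlier in this thread the
practical route is therefore to raise file.content.limit in nutch-site.xml (or
set it to -1 to disable truncation) rather than trying to parse truncated files.

  import java.io.ByteArrayOutputStream;
  import java.io.IOException;
  import java.io.InputStream;

  import org.apache.hadoop.conf.Configuration;

  // Illustrative only: property name as in nutch-default.xml above, but the
  // truncation logic is a simplified stand-in for what the protocol plugins do.
  public class ContentLimitSketch {
    public static byte[] readAtMost(Configuration conf, InputStream in)
        throws IOException {
      int limit = conf.getInt("file.content.limit", 64 * 1024); // < 0 means no limit
      ByteArrayOutputStream out = new ByteArrayOutputStream();
      byte[] buf = new byte[4096];
      int total = 0, n;
      while ((n = in.read(buf)) != -1) {
        if (limit >= 0 && total + n > limit) {
          out.write(buf, 0, limit - total);   // keep only up to the limit
          break;
        }
        out.write(buf, 0, n);
        total += n;
      }
      return out.toByteArray();
    }
  }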


- Original Message 
> From: m.harig <[EMAIL PROTECTED]>
> To: nutch-dev@lucene.apache.org
> Sent: Wednesday, June 4, 2008 2:53:36 AM
> Subject: nutch file content limit
> 
> 
> hi all.
> 
> In the Nutch source, where is the place where Nutch applies the file content
> limit that it reads from nutch-site.xml? Could anyone help me regarding
> this?
> -- 
> View this message in context: 
> http://www.nabble.com/nutch-file-content-limit-tp17640376p17640376.html
> Sent from the Nutch - Dev mailing list archive at Nabble.com.



RE: Nutch Crawling - Failed for internet crawling

2008-05-25 Thread Sivakumar Sivagnanam NCS
Hi,

Please find the files attached as requested. Thanks for the reply.

 


 
 
 
 
Thanks & Regards
Siva
65567233

-Original Message-
From: All day coders [mailto:[EMAIL PROTECTED] 
Sent: Saturday, May 24, 2008 11:13 PM
To: nutch-dev@lucene.apache.org
Subject: Re: Nutch Crawling - Failed for internet crawling

Do you mind attaching the configuration files? That way it is more
human-readable. The hadoop.log file will be useful too (if it's too big, please
compress it).

On Wed, May 21, 2008 at 1:27 AM, Sivakumar_NCS <[EMAIL PROTECTED]>
wrote:

>
> Hi,
>
> I am a newbie to crawling and am exploring the possibilities of crawling
> internet websites from my work PC. My work environment has a proxy to
> access the web.
> So I have configured the proxy information under the /conf/ directory by
> overriding nutch-site.xml. Attached is the XML for reference.
>
> 
> 
>
> 
>
> 
> 
>  http.agent.name
>  ABC
>  ABC
> 
> 
>  http.agent.description
>  Acompany
>  A company
> 
> 
>  http.agent.url
>  
>  
> 
> 
> http.agent.email
>  
>  
> 
> 
>  http.timeout
>  1
>  The default network timeout, in
milliseconds.
> 
> 
>  http.max.delays
>  100
>  The number of times a thread will delay when trying to
>  fetch a page.  Each time it finds that a host is busy, it will wait
>  fetcher.server.delay.  After http.max.delays attempts, it will give
>  up on the page for now.
> 
> 
>  plugin.includes
>
>
>
protocol-httpclient|protocol-http|urlfilter-regex|parse-(text|htm
l|js)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urln
ormalizer-(pass|regex|basic)
>  Regular expression naming plugin directory names to
>  include.  Any plugin not matching this expression is excluded.
>  In any case you need at least include the nutch-extensionpoints
plugin. By
>  default Nutch includes crawling just HTML and plain text via HTTP,
>  and basic indexing and search plugins. In order to use HTTPS please
enable
>  protocol-httpclient, but be aware of possible intermittent problems
with
> the
>  underlying commons-httpclient library.
>  
> 
> 
>  http.proxy.host
>  proxy.ABC.COM
>  The proxy hostname.  If empty, no proxy is
> used.
> 
> 
>  http.proxy.port
>  8080
>  The proxy port.
> 
> 
>  http.proxy.username
>  ABCUSER
>  Username for proxy. This will be used by
>  'protocol-httpclient', if the proxy server requests basic, digest
>  and/or NTLM authentication. To use this, 'protocol-httpclient' must
>  be present in the value of 'plugin.includes' property.
>  NOTE: For NTLM authentication, do not prefix the username with the
>  domain, i.e. 'susam' is correct whereas 'DOMAIN\susam' is incorrect.
>  
> 
> 
>  http.proxy.password
>  X
>  Password for proxy. This will be used by
>  'protocol-httpclient', if the proxy server requests basic, digest
>  and/or NTLM authentication. To use this, 'protocol-httpclient' must
>  be present in the value of 'plugin.includes' property.
>  
> 
> 
>  http.proxy.realm
>  ABC
>  Authentication realm for proxy. Do not define a value
>  if realm is not required or authentication should take place for any
>  realm. NTLM does not use the notion of realms. Specify the domain
name
>  of NTLM authentication as the value for this property. To use this,
>  'protocol-httpclient' must be present in the value of
>  'plugin.includes' property.
>  
> 
> 
>  http.agent.host
>  xxx.xxx.xxx.xx
>  Name or IP address of the host on which the Nutch
crawler
>  would be running. Currently this is used by 'protocol-httpclient'
>  plugin.
>  
> 
> 
>
> my crawl-urlfilter.txt is as follows:
> # The url filter file used by the crawl command.
>
> # Better for intranet crawling.
> # Be sure to change MY.DOMAIN.NAME to your domain name.
>
> # Each non-comment, non-blank line contains a regular expression
> # prefixed by '+' or '-'.  The first matching pattern in the file
> # determines whether a URL is included or ignored.  If no pattern
> # matches, the URL is ignored.
>
> # skip file:, ftp:, & mailto: urls
> -^(file|ftp|mailto):
>
> # skip image and other suffixes we can't yet parse
>
>
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|r
pm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
>
> # skip URLs containing certain characters as probable queries, etc.
> [EMAIL PROTECTED]
>
> # skip URLs with slash-delimited segment that repeats 3+ times, to
break
> loops
> -.*(/.+?)/.*?\1/.*?\1/
>
> # accept hosts in MY.DOMAIN.NAME
> #+^http://([a-z0-9]

Re: Nutch Crawling - Failed for internet crawling

2008-05-24 Thread All day coders
Do you mind attaching the configuration files? That way it is more
human-readable. The hadoop.log file will be useful too (if it's too big, please
compress it).

On Wed, May 21, 2008 at 1:27 AM, Sivakumar_NCS <[EMAIL PROTECTED]> wrote:

>
> Hi,
>
> I am a newbie to crawling and am exploring the possibilities of crawling
> internet websites from my work PC. My work environment has a proxy to
> access the web.
> So I have configured the proxy information under the /conf/ directory by
> overriding nutch-site.xml. Attached is the XML for reference.
>
> 
> 
>
> 
>
> 
> 
>  http.agent.name
>  ABC
>  ABC
> 
> 
>  http.agent.description
>  Acompany
>  A company
> 
> 
>  http.agent.url
>  
>  
> 
> 
> http.agent.email
>  
>  
> 
> 
>  http.timeout
>  1
>  The default network timeout, in milliseconds.
> 
> 
>  http.max.delays
>  100
>  The number of times a thread will delay when trying to
>  fetch a page.  Each time it finds that a host is busy, it will wait
>  fetcher.server.delay.  After http.max.delays attempts, it will give
>  up on the page for now.
> 
> 
>  plugin.includes
>
>
> protocol-httpclient|protocol-http|urlfilter-regex|parse-(text|html|js)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)
>  Regular expression naming plugin directory names to
>  include.  Any plugin not matching this expression is excluded.
>  In any case you need at least include the nutch-extensionpoints plugin. By
>  default Nutch includes crawling just HTML and plain text via HTTP,
>  and basic indexing and search plugins. In order to use HTTPS please enable
>  protocol-httpclient, but be aware of possible intermittent problems with
> the
>  underlying commons-httpclient library.
>  
> 
> 
>  http.proxy.host
>  proxy.ABC.COM
>  The proxy hostname.  If empty, no proxy is
> used.
> 
> 
>  http.proxy.port
>  8080
>  The proxy port.
> 
> 
>  http.proxy.username
>  ABCUSER
>  Username for proxy. This will be used by
>  'protocol-httpclient', if the proxy server requests basic, digest
>  and/or NTLM authentication. To use this, 'protocol-httpclient' must
>  be present in the value of 'plugin.includes' property.
>  NOTE: For NTLM authentication, do not prefix the username with the
>  domain, i.e. 'susam' is correct whereas 'DOMAIN\susam' is incorrect.
>  
> 
> 
>  http.proxy.password
>  X
>  Password for proxy. This will be used by
>  'protocol-httpclient', if the proxy server requests basic, digest
>  and/or NTLM authentication. To use this, 'protocol-httpclient' must
>  be present in the value of 'plugin.includes' property.
>  
> 
> 
>  http.proxy.realm
>  ABC
>  Authentication realm for proxy. Do not define a value
>  if realm is not required or authentication should take place for any
>  realm. NTLM does not use the notion of realms. Specify the domain name
>  of NTLM authentication as the value for this property. To use this,
>  'protocol-httpclient' must be present in the value of
>  'plugin.includes' property.
>  
> 
> 
>  http.agent.host
>  xxx.xxx.xxx.xx
>  Name or IP address of the host on which the Nutch crawler
>  would be running. Currently this is used by 'protocol-httpclient'
>  plugin.
>  
> 
> 
>
> my crawl-urlfilter.txt is as follows:
> # The url filter file used by the crawl command.
>
> # Better for intranet crawling.
> # Be sure to change MY.DOMAIN.NAME to your domain name.
>
> # Each non-comment, non-blank line contains a regular expression
> # prefixed by '+' or '-'.  The first matching pattern in the file
> # determines whether a URL is included or ignored.  If no pattern
> # matches, the URL is ignored.
>
> # skip file:, ftp:, & mailto: urls
> -^(file|ftp|mailto):
>
> # skip image and other suffixes we can't yet parse
>
> -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
>
> # skip URLs containing certain characters as probable queries, etc.
> [EMAIL PROTECTED]
>
> # skip URLs with slash-delimited segment that repeats 3+ times, to break
> loops
> -.*(/.+?)/.*?\1/.*?\1/
>
> # accept hosts in MY.DOMAIN.NAME
> #+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
> +^http://([a-z0-9]*\.)*yahoo.com/
>
> # skip everything else
> -.
>
>
> my regex-urlfilter.txt is as follows:
> # The url filter file used by the crawl command.
>
> # Better for intranet crawling.
> # Be sure to change MY.DOMAIN.NAME to your domain name.
>
> # Each non-comment, non-blank line contains a regular expression
> # prefixed by '+' or '-'.  The first matching pattern in the file
> # determines whether a URL is included or ignored.  If no pattern
> # matches, the URL is ignored.
>
> # skip file:, ftp:, & mailto: urls
> -^(file|ftp|mailto):
>
> # skip image and other suffixes we can't yet parse
>
> -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
>
> # skip URLs containing certain characters as probable queries, etc.
> [EMAIL PROTECTED]
>
> # skip URLs with slash-delimited segment that repeats 3+ times

Re: Nutch 2 Architecture

2008-04-28 Thread Dennis Kubes
Hard to say.  It is in the initial dev stages right now.  Currently I 
have a listing of tools and stages for nutch 2 and I am working on 
backporting some of those tools into nutch 1 as there is an immediate 
need.  A good guess would be a stable version in 1-2 months.


Dennis

[EMAIL PROTECTED] wrote:

Hi

When will N2A (Nutch 2 Architecture) coming out?

Thanks
Paul



Re: nutch latest build - inject operation failing

2008-02-27 Thread esmithers

Was there any resolution to this problem? I just tried installing on Windows
and I'm getting the same error.


Susam Pal wrote:
> 
> I tried setting hadoop.tmp.dir to /cygdrive/d/tmp and it created
> D:\cygdrive\d\tmp\mapred\temp\inject-temp-1365510909\_reduce_n7v9vq.
> 
> The same error occurred:-
> 
> 2008-02-15 10:19:22,833 WARN  mapred.LocalJobRunner - job_local_1
> java.io.IOException: Target
> file:/D:/cygdrive/d/tmp/mapred/temp/inject-temp-1365
> 510909/_reduce_n7v9vq/part-0 already exists
>at org.apache.hadoop.fs.FileUtil.checkDest(FileUtil.java:246)
>at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:125)
>at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:116)
>at
> org.apache.hadoop.fs.RawLocalFileSystem.rename(RawLocalFileSystem.java:180)
>at
> org.apache.hadoop.fs.ChecksumFileSystem.rename(ChecksumFileSystem.java:394)
>at org.apache.hadoop.mapred.Task.moveTaskOutputs(Task.java:452)
>at org.apache.hadoop.mapred.Task.moveTaskOutputs(Task.java:469)
>at org.apache.hadoop.mapred.Task.saveTaskOutput(Task.java:426)
>at
> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:165)
> 
> Regards,
> Susam Pal
> 
> On Thu, Feb 14, 2008 at 10:07 PM, Susam Pal <[EMAIL PROTECTED]> wrote:
>> What I did try was setting hadoop.tmp.dir to /opt/tmp. I found the
>>  behavior strange. I had an /opt/tmp directory in my Cygwin
>>  installation (Absolute Windows path: D:\Cygwin\opt\tmp) and I was
>>  expecting Hadoop to use it. However, it created a new D:\opt\tmp and
>>  wrote the temp files there. Of course this failed with the same error.
>>
>>  Right now I don't have a Windows system with me. I will try setting it
>>  as /cygdrive/d/tmp/ tomorrow when I again have access to a Windows
>>  system and then I'll update the mailing list with the observations.
>>  Thanks for the suggestion.
>>
>>  Regards,
>>  Susam Pal
>>
>>
>>
>>  On Thu, Feb 14, 2008 at 9:41 PM, Dennis Kubes <[EMAIL PROTECTED]> wrote:
>>  > I think what might be occurring is a file path issue with hadoop.  I
>>  >  have seen it in the past.  Can you try on windows using the cygdrive
>>  >  path and see if that works?  For below it would be /cygdrive/D/tmp/
>> ...
>>  >
>>  >  Dennis
>>  >
>>  >
>>  >
>>  >  Susam Pal wrote:
>>  >  > I can confirm this error as I just tried running the last revision
>> of
>>  >  > Nutch, rev-620818 on Debian as well as Cygwin on Windows.
>>  >  >
>>  >  > It works fine on Debian but fails on Cygwin with this error:-
>>  >  >
>>  >  > 2008-02-14 19:49:47,756 WARN  regex.RegexURLNormalizer - can\'t
>> find
>>  >  > rules for scope \'inject\', using default
>>  >  > 2008-02-14 19:49:48,381 WARN  mapred.LocalJobRunner - job_local_1
>>  >  > java.io.IOException: Target
>>  >  >
>> file:/D:/tmp/hadoop-guest/mapred/temp/inject-temp-322737506/_reduce_bjm6rw/part-0
>>  >  > already exists
>>  >  >   at org.apache.hadoop.fs.FileUtil.checkDest(FileUtil.java:246)
>>  >  >   at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:125)
>>  >  >   at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:116)
>>  >  >   at
>> org.apache.hadoop.fs.RawLocalFileSystem.rename(RawLocalFileSystem.java:196)
>>  >  >   at
>> org.apache.hadoop.fs.ChecksumFileSystem.rename(ChecksumFileSystem.java:394)
>>  >  >   at
>> org.apache.hadoop.mapred.Task.moveTaskOutputs(Task.java:452)
>>  >  >   at
>> org.apache.hadoop.mapred.Task.moveTaskOutputs(Task.java:469)
>>  >  >   at
>> org.apache.hadoop.mapred.Task.saveTaskOutput(Task.java:426)
>>  >  >   at
>> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:165)
>>  >  > 2008-02-14 19:49:49,225 FATAL crawl.Injector - Injector:
>>  >  > java.io.IOException: Job failed!
>>  >  >   at
>> org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:831)
>>  >  >   at org.apache.nutch.crawl.Injector.inject(Injector.java:162)
>>  >  >   at org.apache.nutch.crawl.Injector.run(Injector.java:192)
>>  >  >   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>  >  >   at org.apache.hadoop.util.ToolBase.doMain(ToolBase.java:54)
>>  >  >   at org.apache.nutch.crawl.Injector.main(Injector.java:182)
>>  >  >
>>  >  > Indeed the \'inject-temp-322737506\' is present in the specified
>>  >  > folder of D drive and doesn\'t get deleted.
>>  >  >
>>  >  > Is this because multiple map/reduce is running and one of them is
>>  >  > finding the directory to be present and therefore fails?
>>  >  >
>>  >  > So, I also tried setting this in \'conf/hadoop-site.xml\':-
>>  >  >
>>  >  > 
>>  >  > mapred.speculative.execution
>>  >  > false
>>  >  > 
>>  >  > 
>>  >  >
>>  >  > I wonder why the same issue doesn\'t occur in Linux. I am not well
>>  >  > acquainted with the Hadoop code yet. Could someone throw light on
>> what
>>  >  > might be going wrong?
>>  >  >
>>  >  > Regards,
>>  >  > Susam Pal
>>  >  >
>>  >  > On 2/7/08, DS jha <[EMAIL PROTECTED]> wrote:
>>  >  > 

Re: nutch latest build - inject operation failing

2008-02-15 Thread Susam Pal
I tried setting hadoop.tmp.dir to /cygdrive/d/tmp and it created
D:\cygdrive\d\tmp\mapred\temp\inject-temp-1365510909\_reduce_n7v9vq.

The same error occurred:-

2008-02-15 10:19:22,833 WARN  mapred.LocalJobRunner - job_local_1
java.io.IOException: Target file:/D:/cygdrive/d/tmp/mapred/temp/inject-temp-1365
510909/_reduce_n7v9vq/part-0 already exists
   at org.apache.hadoop.fs.FileUtil.checkDest(FileUtil.java:246)
   at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:125)
   at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:116)
   at 
org.apache.hadoop.fs.RawLocalFileSystem.rename(RawLocalFileSystem.java:180)
   at 
org.apache.hadoop.fs.ChecksumFileSystem.rename(ChecksumFileSystem.java:394)
   at org.apache.hadoop.mapred.Task.moveTaskOutputs(Task.java:452)
   at org.apache.hadoop.mapred.Task.moveTaskOutputs(Task.java:469)
   at org.apache.hadoop.mapred.Task.saveTaskOutput(Task.java:426)
   at 
org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:165)

Regards,
Susam Pal

On Thu, Feb 14, 2008 at 10:07 PM, Susam Pal <[EMAIL PROTECTED]> wrote:
> What I did try was setting hadoop.tmp.dir to /opt/tmp. I found the
>  behavior strange. I had an /opt/tmp directory in my Cygwin
>  installation (Absolute Windows path: D:\Cygwin\opt\tmp) and I was
>  expecting Hadoop to use it. However, it created a new D:\opt\tmp and
>  wrote the temp files there. Of course this failed with the same error.
>
>  Right now I don't have a Windows system with me. I will try setting it
>  as /cygdrive/d/tmp/ tomorrow when I again have access to a Windows
>  system and then I'll update the mailing list with the observations.
>  Thanks for the suggestion.
>
>  Regards,
>  Susam Pal
>
>
>
>  On Thu, Feb 14, 2008 at 9:41 PM, Dennis Kubes <[EMAIL PROTECTED]> wrote:
>  > I think what might be occurring is a file path issue with hadoop.  I
>  >  have seen it in the past.  Can you try on windows using the cygdrive
>  >  path and see if that works?  For below it would be /cygdrive/D/tmp/ ...
>  >
>  >  Dennis
>  >
>  >
>  >
>  >  Susam Pal wrote:
>  >  > I can confirm this error as I just tried running the last revision of
>  >  > Nutch, rev-620818 on Debian as well as Cygwin on Windows.
>  >  >
>  >  > It works fine on Debian but fails on Cygwin with this error:-
>  >  >
>  >  > 2008-02-14 19:49:47,756 WARN  regex.RegexURLNormalizer - can\'t find
>  >  > rules for scope \'inject\', using default
>  >  > 2008-02-14 19:49:48,381 WARN  mapred.LocalJobRunner - job_local_1
>  >  > java.io.IOException: Target
>  >  > 
> file:/D:/tmp/hadoop-guest/mapred/temp/inject-temp-322737506/_reduce_bjm6rw/part-0
>  >  > already exists
>  >  >   at org.apache.hadoop.fs.FileUtil.checkDest(FileUtil.java:246)
>  >  >   at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:125)
>  >  >   at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:116)
>  >  >   at 
> org.apache.hadoop.fs.RawLocalFileSystem.rename(RawLocalFileSystem.java:196)
>  >  >   at 
> org.apache.hadoop.fs.ChecksumFileSystem.rename(ChecksumFileSystem.java:394)
>  >  >   at org.apache.hadoop.mapred.Task.moveTaskOutputs(Task.java:452)
>  >  >   at org.apache.hadoop.mapred.Task.moveTaskOutputs(Task.java:469)
>  >  >   at org.apache.hadoop.mapred.Task.saveTaskOutput(Task.java:426)
>  >  >   at 
> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:165)
>  >  > 2008-02-14 19:49:49,225 FATAL crawl.Injector - Injector:
>  >  > java.io.IOException: Job failed!
>  >  >   at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:831)
>  >  >   at org.apache.nutch.crawl.Injector.inject(Injector.java:162)
>  >  >   at org.apache.nutch.crawl.Injector.run(Injector.java:192)
>  >  >   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>  >  >   at org.apache.hadoop.util.ToolBase.doMain(ToolBase.java:54)
>  >  >   at org.apache.nutch.crawl.Injector.main(Injector.java:182)
>  >  >
>  >  > Indeed the \'inject-temp-322737506\' is present in the specified
>  >  > folder of D drive and doesn\'t get deleted.
>  >  >
>  >  > Is this because multiple map/reduce is running and one of them is
>  >  > finding the directory to be present and therefore fails?
>  >  >
>  >  > So, I also tried setting this in \'conf/hadoop-site.xml\':-
>  >  >
>  >  > 
>  >  > mapred.speculative.execution
>  >  > false
>  >  > 
>  >  > 
>  >  >
>  >  > I wonder why the same issue doesn\'t occur in Linux. I am not well
>  >  > acquainted with the Hadoop code yet. Could someone throw light on what
>  >  > might be going wrong?
>  >  >
>  >  > Regards,
>  >  > Susam Pal
>  >  >
>  >  > On 2/7/08, DS jha <[EMAIL PROTECTED]> wrote:
>  >  > Hi -
>  >  >> Looks like latest trunk version of nutch is failing with the following
>  >  >> exception when trying to perform inject operation:
>  >  >>
>  >  >> java.io.IOException: Target
>  >  >> 
> file:/tmp/hadoop-user/mapred/temp/inject-temp-1280136828/_reduce_dv90x0/

Re: nutch latest build - inject operation failing

2008-02-14 Thread Susam Pal
What I did try was setting hadoop.tmp.dir to /opt/tmp. I found the
behavior strange. I had an /opt/tmp directory in my Cygwin
installation (Absolute Windows path: D:\Cygwin\opt\tmp) and I was
expecting Hadoop to use it. However, it created a new D:\opt\tmp and
wrote the temp files there. Of course this failed with the same error.

Right now I don't have a Windows system with me. I will try setting it
as /cygdrive/d/tmp/ tomorrow when I again have access to a Windows
system and then I'll update the mailing list with the observations.
Thanks for the suggestion.

Regards,
Susam Pal

On Thu, Feb 14, 2008 at 9:41 PM, Dennis Kubes <[EMAIL PROTECTED]> wrote:
> I think what might be occurring is a file path issue with hadoop.  I
>  have seen it in the past.  Can you try on windows using the cygdrive
>  path and see if that works?  For below it would be /cygdrive/D/tmp/ ...
>
>  Dennis
>
>
>
>  Susam Pal wrote:
>  > I can confirm this error as I just tried running the last revision of
>  > Nutch, rev-620818 on Debian as well as Cygwin on Windows.
>  >
>  > It works fine on Debian but fails on Cygwin with this error:-
>  >
>  > 2008-02-14 19:49:47,756 WARN  regex.RegexURLNormalizer - can\'t find
>  > rules for scope \'inject\', using default
>  > 2008-02-14 19:49:48,381 WARN  mapred.LocalJobRunner - job_local_1
>  > java.io.IOException: Target
>  > 
> file:/D:/tmp/hadoop-guest/mapred/temp/inject-temp-322737506/_reduce_bjm6rw/part-0
>  > already exists
>  >   at org.apache.hadoop.fs.FileUtil.checkDest(FileUtil.java:246)
>  >   at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:125)
>  >   at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:116)
>  >   at 
> org.apache.hadoop.fs.RawLocalFileSystem.rename(RawLocalFileSystem.java:196)
>  >   at 
> org.apache.hadoop.fs.ChecksumFileSystem.rename(ChecksumFileSystem.java:394)
>  >   at org.apache.hadoop.mapred.Task.moveTaskOutputs(Task.java:452)
>  >   at org.apache.hadoop.mapred.Task.moveTaskOutputs(Task.java:469)
>  >   at org.apache.hadoop.mapred.Task.saveTaskOutput(Task.java:426)
>  >   at 
> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:165)
>  > 2008-02-14 19:49:49,225 FATAL crawl.Injector - Injector:
>  > java.io.IOException: Job failed!
>  >   at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:831)
>  >   at org.apache.nutch.crawl.Injector.inject(Injector.java:162)
>  >   at org.apache.nutch.crawl.Injector.run(Injector.java:192)
>  >   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>  >   at org.apache.hadoop.util.ToolBase.doMain(ToolBase.java:54)
>  >   at org.apache.nutch.crawl.Injector.main(Injector.java:182)
>  >
>  > Indeed the \'inject-temp-322737506\' is present in the specified
>  > folder of D drive and doesn\'t get deleted.
>  >
>  > Is this because multiple map/reduce is running and one of them is
>  > finding the directory to be present and therefore fails?
>  >
>  > So, I also tried setting this in \'conf/hadoop-site.xml\':-
>  >
>  > 
>  > mapred.speculative.execution
>  > false
>  > 
>  > 
>  >
>  > I wonder why the same issue doesn\'t occur in Linux. I am not well
>  > acquainted with the Hadoop code yet. Could someone throw light on what
>  > might be going wrong?
>  >
>  > Regards,
>  > Susam Pal
>  >
>  > On 2/7/08, DS jha <[EMAIL PROTECTED]> wrote:
>  > Hi -
>  >> Looks like latest trunk version of nutch is failing with the following
>  >> exception when trying to perform inject operation:
>  >>
>  >> java.io.IOException: Target
>  >> 
> file:/tmp/hadoop-user/mapred/temp/inject-temp-1280136828/_reduce_dv90x0/part-0
>  >> already exists
>  >> at org.apache.hadoop.fs.FileUtil.checkDest(FileUtil.java:246)
>  >> at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:125)
>  >> at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:116)
>  >> at 
> org.apache.hadoop.fs.RawLocalFileSystem.rename(RawLocalFileSystem.java:196)
>  >> at 
> org.apache.hadoop.fs.ChecksumFileSystem.rename(ChecksumFileSystem.java:394)
>  >> at org.apache.hadoop.mapred.Task.moveTaskOutputs(Task.java:452)
>  >> at org.apache.hadoop.mapred.Task.moveTaskOutputs(Task.java:469)
>  >> at org.apache.hadoop.mapred.Task.saveTaskOutput(Task.java:426)
>  >> at 
> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:165)
>  >>
>  >> Any thoughts?
>  >>
>  >> Thanks
>  >> Jha
>  >>
>


Re: nutch latest build - inject operation failing

2008-02-14 Thread Dennis Kubes
I think what might be occurring is a file path issue with hadoop.  I 
have seen it in the past.  Can you try on windows using the cygdrive 
path and see if that works?  For below it would be /cygdrive/D/tmp/ ...


Dennis
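
For quick experiments with the two knobs discussed in this thread, the same
overrides can also be applied programmatically instead of editing
conf/hadoop-site.xml. A minimal sketch (the cygdrive-style path is only an
example and must exist on the machine):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.nutch.util.NutchConfiguration;

  // Sketch only: build a configuration with the overrides discussed above and
  // hand it to whatever tool is being tested (e.g. the injector).
  public class WindowsTmpDirOverride {
    public static Configuration create() {
      Configuration conf = NutchConfiguration.create();
      conf.set("hadoop.tmp.dir", "/cygdrive/d/tmp");       // path style suggested above
      conf.set("mapred.speculative.execution", "false");   // Susam's experiment
      return conf;
    }
  }

Since the same code path works on Debian and only fails under Cygwin, the
Windows path handling looks like a more likely culprit than speculative
execution.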

Susam Pal wrote:

I can confirm this error as I just tried running the last revision of
Nutch, rev-620818 on Debian as well as Cygwin on Windows.

It works fine on Debian but fails on Cygwin with this error:-

2008-02-14 19:49:47,756 WARN  regex.RegexURLNormalizer - can't find
rules for scope 'inject', using default
2008-02-14 19:49:48,381 WARN  mapred.LocalJobRunner - job_local_1
java.io.IOException: Target
file:/D:/tmp/hadoop-guest/mapred/temp/inject-temp-322737506/_reduce_bjm6rw/part-0
already exists
at org.apache.hadoop.fs.FileUtil.checkDest(FileUtil.java:246)
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:125)
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:116)
at 
org.apache.hadoop.fs.RawLocalFileSystem.rename(RawLocalFileSystem.java:196)
at 
org.apache.hadoop.fs.ChecksumFileSystem.rename(ChecksumFileSystem.java:394)
at org.apache.hadoop.mapred.Task.moveTaskOutputs(Task.java:452)
at org.apache.hadoop.mapred.Task.moveTaskOutputs(Task.java:469)
at org.apache.hadoop.mapred.Task.saveTaskOutput(Task.java:426)
at 
org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:165)
2008-02-14 19:49:49,225 FATAL crawl.Injector - Injector:
java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:831)
at org.apache.nutch.crawl.Injector.inject(Injector.java:162)
at org.apache.nutch.crawl.Injector.run(Injector.java:192)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.hadoop.util.ToolBase.doMain(ToolBase.java:54)
at org.apache.nutch.crawl.Injector.main(Injector.java:182)

Indeed the 'inject-temp-322737506' is present in the specified
folder of D drive and doesn't get deleted.

Is this because multiple map/reduce is running and one of them is
finding the directory to be present and therefore fails?

So, I also tried setting this in 'conf/hadoop-site.xml':-

<property>
  <name>mapred.speculative.execution</name>
  <value>false</value>
</property>

I wonder why the same issue doesn't occur in Linux. I am not well
acquainted with the Hadoop code yet. Could someone throw light on what
might be going wrong?

Regards,
Susam Pal

On 2/7/08, DS jha <[EMAIL PROTECTED]> wrote:
Hi -

Looks like latest trunk version of nutch is failing with the following
exception when trying to perform inject operation:

java.io.IOException: Target
file:/tmp/hadoop-user/mapred/temp/inject-temp-1280136828/_reduce_dv90x0/part-0
already exists
at org.apache.hadoop.fs.FileUtil.checkDest(FileUtil.java:246)
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:125)
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:116)
at 
org.apache.hadoop.fs.RawLocalFileSystem.rename(RawLocalFileSystem.java:196)
at 
org.apache.hadoop.fs.ChecksumFileSystem.rename(ChecksumFileSystem.java:394)
at org.apache.hadoop.mapred.Task.moveTaskOutputs(Task.java:452)
at org.apache.hadoop.mapred.Task.moveTaskOutputs(Task.java:469)
at org.apache.hadoop.mapred.Task.saveTaskOutput(Task.java:426)
at 
org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:165)

Any thoughts?

Thanks
Jha



Re: nutch latest build - inject operation failing

2008-02-14 Thread Susam Pal
I can confirm this error as I just tried running the last revision of
Nutch, rev-620818 on Debian as well as Cygwin on Windows.

It works fine on Debian but fails on Cygwin with this error:-

2008-02-14 19:49:47,756 WARN  regex.RegexURLNormalizer - can't find
rules for scope 'inject', using default
2008-02-14 19:49:48,381 WARN  mapred.LocalJobRunner - job_local_1
java.io.IOException: Target
file:/D:/tmp/hadoop-guest/mapred/temp/inject-temp-322737506/_reduce_bjm6rw/part-0
already exists
at org.apache.hadoop.fs.FileUtil.checkDest(FileUtil.java:246)
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:125)
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:116)
at 
org.apache.hadoop.fs.RawLocalFileSystem.rename(RawLocalFileSystem.java:196)
at 
org.apache.hadoop.fs.ChecksumFileSystem.rename(ChecksumFileSystem.java:394)
at org.apache.hadoop.mapred.Task.moveTaskOutputs(Task.java:452)
at org.apache.hadoop.mapred.Task.moveTaskOutputs(Task.java:469)
at org.apache.hadoop.mapred.Task.saveTaskOutput(Task.java:426)
at 
org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:165)
2008-02-14 19:49:49,225 FATAL crawl.Injector - Injector:
java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:831)
at org.apache.nutch.crawl.Injector.inject(Injector.java:162)
at org.apache.nutch.crawl.Injector.run(Injector.java:192)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.hadoop.util.ToolBase.doMain(ToolBase.java:54)
at org.apache.nutch.crawl.Injector.main(Injector.java:182)

Indeed the 'inject-temp-322737506' directory is present in the specified
folder on the D drive and doesn't get deleted.

Is this because multiple map/reduce is running and one of them is
finding the directory to be present and therefore fails?

So, I also tried setting this in 'conf/hadoop-site.xml':-

<property>
  <name>mapred.speculative.execution</name>
  <value>false</value>
</property>


I wonder why the same issue doesn't occur in Linux. I am not well
acquainted with the Hadoop code yet. Could someone throw light on what
might be going wrong?

Regards,
Susam Pal

On 2/7/08, DS jha <[EMAIL PROTECTED]> wrote:
Hi -
>
> Looks like latest trunk version of nutch is failing with the following
> exception when trying to perform inject operation:
>
> java.io.IOException: Target
> file:/tmp/hadoop-user/mapred/temp/inject-temp-1280136828/_reduce_dv90x0/part-0
> already exists
> at org.apache.hadoop.fs.FileUtil.checkDest(FileUtil.java:246)
> at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:125)
> at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:116)
> at 
> org.apache.hadoop.fs.RawLocalFileSystem.rename(RawLocalFileSystem.java:196)
> at 
> org.apache.hadoop.fs.ChecksumFileSystem.rename(ChecksumFileSystem.java:394)
> at org.apache.hadoop.mapred.Task.moveTaskOutputs(Task.java:452)
> at org.apache.hadoop.mapred.Task.moveTaskOutputs(Task.java:469)
> at org.apache.hadoop.mapred.Task.saveTaskOutput(Task.java:426)
> at 
> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:165)
>
> Any thoughts?
>
> Thanks
> Jha
>


Re: nutch latest build - inject operation failing

2008-02-07 Thread DS jha
Local filesystem. No changes to default hadoop-site.xml (it is empty).

Thanks





On Feb 7, 2008 10:54 AM, Dennis Kubes <[EMAIL PROTECTED]> wrote:
> Would need more info on your configuration.  Local or DFS,
> hadoop-site.xml changes.
>
> Dennis
>
>
> DS jha wrote:
> > I tried setting it to false but it was still throwing the same error.
> >
> > Looks like when I am using older version of hadoop (0.14.4) it is working 
> > fine.
> >
> > Thanks
> >
> >
> >
> > On Feb 7, 2008 10:37 AM, Dennis Kubes <[EMAIL PROTECTED]> wrote:
> >> Do you have speculative execution turned on?  If so, turn it off.
> >>
> >> Dennis
> >>
> >>
> >> DS jha wrote:
> >>> This is running on Windows/Cygwin, with username 'user' - and it is
> >>> using default hadoop-site.xml
> >>>
> >>> Thanks,
> >>> Jha
> >>>
> >>> On Feb 7, 2008 10:03 AM, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:
>  DS jha wrote:
> > Yeah - it is using hadoop v 0.15.3 jar file - strange!
> >
> >
> > Thanks
> > Jha
> >
> >
> > On Feb 7, 2008 8:11 AM, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:
> >> DS jha wrote:
> >>> Hi -
> >>>
> >>> Looks like latest trunk version of nutch is failing with the following
> >>> exception when trying to perform inject operation:
> >>>
> >>> java.io.IOException: Target
> >>> file:/tmp/hadoop-user/mapred/temp/inject-temp-1280136828/_reduce_dv90x0/part-0
> >>> already exists
>  Hmm, wait - this path is strange in itself, because it starts with
>  /tmp/hadoop-user ... Are you running on *nix or Windows/Cygwin? Did you
>  change hadoop-site.xml to redefine hadoop.tmp.dir ? Or perhaps you are
>  running as a user with username "user" ?
> 
> 
>  --
> 
>  Best regards,
>  Andrzej Bialecki <><
>    ___. ___ ___ ___ _ _   __
>  [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
>  ___|||__||  \|  ||  |  Embedded Unix, System Integration
>  http://www.sigram.com  Contact: info at sigram dot com
> 
> 
>


Re: nutch latest build - inject operation failing

2008-02-07 Thread Dennis Kubes
Would need more info on your configuration.  Local or DFS, 
hadoop-site.xml changes.


Dennis

DS jha wrote:

I tried setting it to false but it was still throwing the same error.

Looks like when I am using older version of hadoop (0.14.4) it is working fine.

Thanks



On Feb 7, 2008 10:37 AM, Dennis Kubes <[EMAIL PROTECTED]> wrote:

Do you have speculative execution turned on?  If so, turn it off.

Dennis


DS jha wrote:

This is running on Windows/Cygwin, with username 'user' - and it is
using default hadoop-site.xml

Thanks,
Jha

On Feb 7, 2008 10:03 AM, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:

DS jha wrote:

Yeah - it is using hadoop v 0.15.3 jar file - strange!


Thanks
Jha


On Feb 7, 2008 8:11 AM, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:

DS jha wrote:

Hi -

Looks like latest trunk version of nutch is failing with the following
exception when trying to perform inject operation:

java.io.IOException: Target
file:/tmp/hadoop-user/mapred/temp/inject-temp-1280136828/_reduce_dv90x0/part-0
already exists

Hmm, wait - this path is strange in itself, because it starts with
/tmp/hadoop-user ... Are you running on *nix or Windows/Cygwin? Did you
change hadoop-site.xml to redefine hadoop.tmp.dir ? Or perhaps you are
running as a user with username "user" ?


--

Best regards,
Andrzej Bialecki <><
  ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com




Re: nutch latest build - inject operation failing

2008-02-07 Thread DS jha
I tried setting it to false but it was still throwing the same error.

Looks like when I am using older version of hadoop (0.14.4) it is working fine.

Thanks



On Feb 7, 2008 10:37 AM, Dennis Kubes <[EMAIL PROTECTED]> wrote:
> Do you have speculative execution turned on?  If so, turn it off.
>
> Dennis
>
>
> DS jha wrote:
> > This is running on Windows/Cygwin, with username 'user' - and it is
> > using default hadoop-site.xml
> >
> > Thanks,
> > Jha
> >
> > On Feb 7, 2008 10:03 AM, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:
> >> DS jha wrote:
> >>> Yeah - it is using hadoop v 0.15.3 jar file - strange!
> >>>
> >>>
> >>> Thanks
> >>> Jha
> >>>
> >>>
> >>> On Feb 7, 2008 8:11 AM, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:
>  DS jha wrote:
> > Hi -
> >
> > Looks like latest trunk version of nutch is failing with the following
> > exception when trying to perform inject operation:
> >
> > java.io.IOException: Target
> > file:/tmp/hadoop-user/mapred/temp/inject-temp-1280136828/_reduce_dv90x0/part-0
> > already exists
> >> Hmm, wait - this path is strange in itself, because it starts with
> >> /tmp/hadoop-user ... Are you running on *nix or Windows/Cygwin? Did you
> >> change hadoop-site.xml to redefine hadoop.tmp.dir ? Or perhaps you are
> >> running as a user with username "user" ?
> >>
> >>
> >> --
> >>
> >> Best regards,
> >> Andrzej Bialecki <><
> >>   ___. ___ ___ ___ _ _   __
> >> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> >> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> >> http://www.sigram.com  Contact: info at sigram dot com
> >>
> >>
>


Re: nutch latest build - inject operation failing

2008-02-07 Thread Dennis Kubes

Do you have speculative execution turned on?  If so, turn it off.

Dennis

DS jha wrote:

This is running on Windows/Cygwin, with username 'user' - and it is
using default hadoop-site.xml

Thanks,
Jha

On Feb 7, 2008 10:03 AM, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:

DS jha wrote:

Yeah - it is using hadoop v 0.15.3 jar file - strange!


Thanks
Jha


On Feb 7, 2008 8:11 AM, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:

DS jha wrote:

Hi -

Looks like latest trunk version of nutch is failing with the following
exception when trying to perform inject operation:

java.io.IOException: Target
file:/tmp/hadoop-user/mapred/temp/inject-temp-1280136828/_reduce_dv90x0/part-0
already exists

Hmm, wait - this path is strange in itself, because it starts with
/tmp/hadoop-user ... Are you running on *nix or Windows/Cygwin? Did you
change hadoop-site.xml to redefine hadoop.tmp.dir ? Or perhaps you are
running as a user with username "user" ?


--

Best regards,
Andrzej Bialecki <><
  ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com




Re: nutch latest build - inject operation failing

2008-02-07 Thread DS jha
This is running on Windows/Cygwin, with username 'user' - and it is
using default hadoop-site.xml

Thanks,
Jha

On Feb 7, 2008 10:03 AM, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:
> DS jha wrote:
> > Yeah - it is using hadoop v 0.15.3 jar file - strange!
> >
> >
> > Thanks
> > Jha
> >
> >
> > On Feb 7, 2008 8:11 AM, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:
> >> DS jha wrote:
> >>> Hi -
> >>>
> >>> Looks like latest trunk version of nutch is failing with the following
> >>> exception when trying to perform inject operation:
> >>>
> >>> java.io.IOException: Target
> >>> file:/tmp/hadoop-user/mapred/temp/inject-temp-1280136828/_reduce_dv90x0/part-0
> >>> already exists
>
> Hmm, wait - this path is strange in itself, because it starts with
> /tmp/hadoop-user ... Are you running on *nix or Windows/Cygwin? Did you
> change hadoop-site.xml to redefine hadoop.tmp.dir ? Or perhaps you are
> running as a user with username "user" ?
>
>
> --
>
> Best regards,
> Andrzej Bialecki <><
>   ___. ___ ___ ___ _ _   __
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>
>


Re: nutch latest build - inject operation failing

2008-02-07 Thread Andrzej Bialecki

DS jha wrote:

Yeah - it is using hadoop v 0.15.3 jar file - strange!


Thanks
Jha


On Feb 7, 2008 8:11 AM, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:

DS jha wrote:

Hi -

Looks like latest trunk version of nutch is failing with the following
exception when trying to perform inject operation:

java.io.IOException: Target
file:/tmp/hadoop-user/mapred/temp/inject-temp-1280136828/_reduce_dv90x0/part-0
already exists


Hmm, wait - this path is strange in itself, because it starts with 
/tmp/hadoop-user ... Are you running on *nix or Windows/Cygwin? Did you 
change hadoop-site.xml to redefine hadoop.tmp.dir ? Or perhaps you are 
running as a user with username "user" ?



--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: nutch latest build - inject operation failing

2008-02-07 Thread DS jha
Yeah - it is using hadoop v 0.15.3 jar file - strange!


Thanks
Jha


On Feb 7, 2008 8:11 AM, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:
> DS jha wrote:
> > Hi -
> >
> > Looks like latest trunk version of nutch is failing with the following
> > exception when trying to perform inject operation:
> >
> > java.io.IOException: Target
> > file:/tmp/hadoop-user/mapred/temp/inject-temp-1280136828/_reduce_dv90x0/part-0
> > already exists
>
> Is this really the latest trunk? Can you check the version of
> lib/hadoop*.jar? It should be 0.15.3. And make sure you have no other
> older hadoop libraries on the classpath.
>
> --
> Best regards,
> Andrzej Bialecki <><
>   ___. ___ ___ ___ _ _   __
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>
>


Re: nutch latest build - inject operation failing

2008-02-07 Thread Andrzej Bialecki

DS jha wrote:

Hi -

Looks like latest trunk version of nutch is failing with the following
exception when trying to perform inject operation:

java.io.IOException: Target
file:/tmp/hadoop-user/mapred/temp/inject-temp-1280136828/_reduce_dv90x0/part-0
already exists


Is this really the latest trunk? Can you check the version of 
lib/hadoop*.jar? It should be 0.15.3. And make sure you have no other 
older hadoop libraries on the classpath.
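
A quick way to double-check that, assuming a standard source checkout:

$ ls lib/hadoop*.jar   # should list hadoop-0.15.3.jar and nothing older
$ echo $CLASSPATH      # make sure no stale hadoop jar sneaks in from the environment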


--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: nutch and future

2008-01-10 Thread Dennis Kubes
Nutch development is still continuing, and will be increasing in the
future. The upgrade to Hadoop 0.15.2 will come shortly. Are there OpicScoring problems?


Dennis Kubes

tigger . wrote:

Hi All

As development seems to have stopped with Nutch, it looks like not much is going on :(

Is Nutch getting HBase, being upgraded to Hadoop 0.15.2, and what about OpicScoring?

All the best
Tom


Re: Nutch/Lucene unique ID for every item crawled?

2007-10-21 Thread Sagar Naik

hey

CRAWL 1:
   url: http://foo.com
   doc id =X
CRAWL 2:
   url: http://foo.com
   doc id =Y
X may be equal to Y

And yes, the segment ID is different for different crawls. It is a timestamp
value: the time at which the Generator was executed.

Maybe if you could tell us about your ultimate aim, we might be able to help you
more appropriately.
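
A self-contained sketch of the "hash of the url plus segment timestamp" idea;
the class and method names below are illustrative, not part of the Nutch API:

import java.security.MessageDigest;

public class UrlIdExample {

  // Combine a stable hash of the URL with the segment name (the Generator's
  // timestamp) so the same page fetched in different crawls gets distinct ids.
  public static String uniqueId(String url, String segmentName) throws Exception {
    MessageDigest md = MessageDigest.getInstance("MD5");
    StringBuilder hex = new StringBuilder();
    for (byte b : md.digest(url.getBytes("UTF-8"))) {
      hex.append(String.format("%02x", b & 0xff));
    }
    return hex.toString() + "-" + segmentName;
  }

  public static void main(String[] args) throws Exception {
    // prints something like 9bc2...-20071021123456
    System.out.println(uniqueId("http://foo.com", "20071021123456"));
  }
}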





Sagar Vibhute wrote:

Hash value of the url does sound useful. Thanks! :-)

But well, is the segment ID different for every crawl? In which case the
segment ID + Doc Id can become a unique mapping. Trouble is, I don't know
how to extract the doc id of a particular document while it is being
crawled. I found a method which, given a doc Id gives the document, but
that's not what I need, I kinda need the opposite.

Any leads?

- Sagar


On 10/21/07, Sagar Naik <[EMAIL PROTECTED]> wrote:
  

Hey,
The Lucene document id, an integer, may not be the same for 2 different
crawls.
I am not sure if this is what you are looking for, but you can store a hash
value of the url crawled ;)

- Sagar

Sagar Vibhute wrote:


Hello,

Does nutch/lucene provide for a unique ID for every item that it has
crawled?

I checked the Lucene docid but from what I understood, the lucene docid
  

is


not unique for every item crawled. Is that so?

How can I get this unique ID, if it is available?

Thanks.

- Sagar


  






  






Re: Nutch/Lucene unique ID for every item crawled?

2007-10-21 Thread Sagar Vibhute
Hash value of the url does sound useful. Thanks! :-)

But well, is the segment ID different for every crawl? In which case the
segment ID + Doc Id can become a unique mapping. Trouble is, I don't know
how to extract the doc id of a particular document while it is being
crawled. I found a method which, given a doc Id gives the document, but
that's not what I need, I kinda need the opposite.

Any leads?

- Sagar


On 10/21/07, Sagar Naik <[EMAIL PROTECTED]> wrote:
>
> Hey,
> The Lucene document id, an integer, may not be the same for 2 different
> crawls.
> I am not sure if this is what you are looking for, but you can store a hash
> value of the url crawled ;)
>
> - Sagar
>
> Sagar Vibhute wrote:
> > Hello,
> >
> > Does nutch/lucene provide for a unique ID for every item that it has
> > crawled?
> >
> > I checked the Lucene docid but from what I understood, the lucene docid
> is
> > not unique for every item crawled. Is that so?
> >
> > How can I get this unique ID, if it is available?
> >
> > Thanks.
> >
> > - Sagar
> >
> >
>
>
>
>


Re: Nutch/Lucene unique ID for every item crawled?

2007-10-20 Thread Sagar Naik

Hey,
The Lucene document id, an integer, may not be the same for 2 different
crawls.
I am not sure if this is what you are looking for, but you can store a hash
value of the url crawled ;)


- Sagar

Sagar Vibhute wrote:

Hello,

Does nutch/lucene provide for a unique ID for every item that it has
crawled?

I checked the Lucene docid but from what I understood, the lucene docid is
not unique for every item crawled. Is that so?

How can I get this unique ID, if it is available?

Thanks.

- Sagar

  






Re: nutch trunk filtering URLs in invertlinks even if -noFilter is on?

2007-09-23 Thread Brian Whitman


On Sep 23, 2007, at 11:38 AM, Brian Whitman wrote:


2) If so, why is there a -noFilter option for readlinkdb?



mistake, change this to


2) If so, why is there a -noFilter option for invertlinks?





Re: nutch trunk filtering URLs in invertlinks even if -noFilter is on?

2007-09-23 Thread Brian Whitman

(Copied from nutch-user, this is more a dev topic now)
It's not an issue with readseg or readlinkdb themselves, because a  
segment fetched in the older nutch (using the exact same  
configuration) expels png links in trunk's readlinkdb. It appears  
the fetcher now only parses URLs that pass the filters into the  
segment.



I checked the diffs from my old version (mid-December 06) and trunk  
ParseOutputFormat. It appears now that the parse puts the outlink  
URLs through the URLFilters. I confirmed this by taking out .png from  
my URLFilters and re-running a crawl -- pngs now appear in the  
readlinkdb.


1) Was it a bug that URLs that would not pass URLFilters got into the  
linkdb for analysis?


2) If so, why is there a -noFilter option for readlinkdb? The linkdb  
has already been filtered whether you like it or not. -noFilter will  
never have any effect.


There needs to be a way to have the linkdb reflect all URLs  
(unfiltered) for further analysis. I suggest a -noFilterOutlinks  
(default off) in the fetch command (as the default behavior of fetch  
is to parse.) This would simply not call the filter in  
ParseOutputFormat, if my theory is correct.
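
A library-free sketch of the behaviour such a switch would give; filter() below
stands in for Nutch's URLFilters.filter(url) (returning null for rejected urls),
and all names are illustrative:

import java.util.ArrayList;
import java.util.List;

public class OutlinkFilterSketch {

  // Stand-in for URLFilters.filter(url): null means the url was rejected.
  static String filter(String url) {
    return url.endsWith(".png") ? null : url;
  }

  // Keep every outlink when filtering is disabled (the proposed -noFilterOutlinks
  // behaviour), otherwise keep only the ones that survive the filters.
  static List<String> keptOutlinks(String[] outlinks, boolean filterOutlinks) {
    List<String> kept = new ArrayList<String>();
    for (String toUrl : outlinks) {
      String candidate = filterOutlinks ? filter(toUrl) : toUrl;
      if (candidate != null) {
        kept.add(candidate);
      }
    }
    return kept;
  }

  public static void main(String[] args) {
    String[] links = { "http://foo.com/a.html", "http://foo.com/logo.png" };
    System.out.println(keptOutlinks(links, true));   // [http://foo.com/a.html]
    System.out.println(keptOutlinks(links, false));  // both urls, ready for linkdb analysis
  }
}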







Re: NUTCH-251(Administration gui) and next version

2007-09-20 Thread karthik085

Too bad. It would be a nice tool for novice users.


Andrzej Bialecki wrote:
> 
> karthik085 wrote:
>> Will nutch have NUTCH-251(Administration gui) for the next version? If
>> not,
>> is there a plan to release it for future releases?
>> 
> 
> If there is enough interest ... So far very few people expressed their 
> interest in having a GUI, and even fewer people (close to zero, to be 
> exact ;) )  provided patches to integrate it with the current code base.
> 
> 
> -- 
> Best regards,
> Andrzej Bialecki <><
>   ___. ___ ___ ___ _ _   __
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
> 
> 
> 




Re: NUTCH-251(Administration gui) and next version

2007-09-20 Thread Andrzej Bialecki

karthik085 wrote:

Will nutch have NUTCH-251(Administration gui) for the next version? If not,
is there a plan to release it for future releases?



If there is enough interest ... So far very few people expressed their 
interest in having a GUI, and even fewer people (close to zero, to be 
exact ;) )  provided patches to integrate it with the current code base.



--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Nutch nightly build and NUTCH-505 draft patch

2007-07-10 Thread Doğacan Güney

Hi,

On 7/2/07, Kai_testing Middleton <[EMAIL PROTECTED]> wrote:

Recently I successfully applied NUTCH-505_draft_v2.patch as follows:

$ svn co http://svn.apache.org/repos/asf/lucene/nutch/trunk nutch
$ cd nutch
$ wget 
https://issues.apache.org/jira/secure/attachment/12360411/NUTCH-505_draft_v2.patch
 --no-check-certificate
$ sudo patch -p0 < NUTCH-505_draft_v2.patch
$ ant clean
$ ant

However, I also needed other recent nutch functionality, so I downloaded a 
nightly build:

$ wget 
http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/lastStableBuild/artifact/trunk/build/nutch-2007-06-27_06-52-44.tar.gz

I then attempted to apply the patch to that build using the successive steps.  I was able to run 
"ant clean" but "ant" failed with

build.xml:61: Specify at least one source--a file or resource collection

Do I need to get a source checkout of a nightly build?  How would I do that?



Once you checkout nutch trunk with "svn checkout", you can use "svn
up" to get the latest code changes. You can also use "svn st -u" which
compares your local version against trunk and shows you what changed.
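
For completeness, the update cycle being described, using only commands that
already appear in this thread (the directory name assumes the svn checkout shown above):

$ cd nutch             # working copy created earlier with 'svn co'
$ svn st -u            # compare local files against trunk
$ svn up               # pull the latest trunk changes
$ ant clean && ant     # rebuild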








--
Doğacan Güney


Re: NUTCH-119 :: how hard to fix

2007-06-27 Thread Doğacan Güney

On 6/27/07, Kai_testing Middleton <[EMAIL PROTECTED]> wrote:

wow, setting db.max.outlinks.per.page immediately fixed my problem.  It looks 
like I totally mis-diagnosed things.

May I pose two questions:
1) how did you view all the outlinks?


bin/nutch plugin parse-html org.apache.nutch.parse.html.HtmlParser 
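
If memory serves, the HtmlParser class of that era could be run standalone with
the path of a local HTML file as its argument, printing the parse, including the
outlinks; treat the exact argument as an assumption:

$ bin/nutch plugin parse-html org.apache.nutch.parse.html.HtmlParser page.html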


2) how severe is NUTCH-119 - does it occur on a lot of sites?


AFAIK, HtmlParser doesn't extract urls with regexps. Nutch uses a
regexp to extract outlinks from files that have no markup information
(such as plain text). See OutlinkExtractor.java.





- Original Message 
From: Doğacan Güney <[EMAIL PROTECTED]>
To: nutch-dev@lucene.apache.org
Sent: Tuesday, June 26, 2007 10:56:32 PM
Subject: Re: NUTCH-119 :: how hard to fix

On 6/27/07, Kai_testing Middleton <[EMAIL PROTECTED]> wrote:
> I am evaluating nutch+lucene as a crawl and search solution.
>
> However, I am finding major bugs in nutch right off the bat.
>
> In particular, NUTCH-119: nutch is not crawling relative URLs.  I have some 
discussion of it here:
> http://www.mail-archive.com/[EMAIL PROTECTED]/msg08644.html
>
> Most of the links off www.variety.com, one of my main test sites, have 
relative URLs.  It seems incredible that nutch, which is capable of mapreduce, 
cannot fetch these URLs.
>
> It could be that I would fix this bug if, for other reasons, I decide to go 
with nutch+lucene.  Has anyone tried fixing this problem?  Is it intractable?  Or 
are the developers, who are just volunteers anyway, more interested in fixing 
other problems?
>
> Could someone outline the issue for me a bit more clearly so I would know how 
to evaluate it?

Both this one and the other site you were mentioning (sf911truth) have
more than 100 outlinks. Nutch, by default, only stores 100 outlinks
per page (db.max.outlinks.per.page). Link about.html happens to be
105th link or so, so nutch doesn't store it. All you have to do is
either increase db.max.outlinks.per.page or set it  to -1 (which
means, store all outlinks).

>
>
>
>
>   



--
Doğacan Güney












--
Doğacan Güney


Re: NUTCH-119 :: how hard to fix

2007-06-27 Thread Kai_testing Middleton
wow, setting db.max.outlinks.per.page immediately fixed my problem.  It looks 
like I totally mis-diagnosed things.

May I pose two questions:
1) how did you view all the outlinks?
2) how severe is NUTCH-119 - does it occur on a lot of sites?


- Original Message 
From: Doğacan Güney <[EMAIL PROTECTED]>
To: nutch-dev@lucene.apache.org
Sent: Tuesday, June 26, 2007 10:56:32 PM
Subject: Re: NUTCH-119 :: how hard to fix

On 6/27/07, Kai_testing Middleton <[EMAIL PROTECTED]> wrote:
> I am evaluating nutch+lucene as a crawl and search solution.
>
> However, I am finding major bugs in nutch right off the bat.
>
> In particular, NUTCH-119: nutch is not crawling relative URLs.  I have some 
> discussion of it here:
> http://www.mail-archive.com/[EMAIL PROTECTED]/msg08644.html
>
> Most of the links off www.variety.com, one of my main test sites, have 
> relative URLs.  It seems incredible that nutch, which is capable of 
> mapreduce, cannot fetch these URLs.
>
> It could be that I would fix this bug if, for other reasons, I decide to go 
> with nutch+lucene.  Has anyone tried fixing this problem?  Is it intractable? 
>  Or are the developers, who are just volunteers anyway, more interested in 
> fixing other problems?
>
> Could someone outline the issue for me a bit more clearly so I would know how 
> to evaluate it?

Both this one and the other site you were mentioning (sf911truth) have
more than 100 outlinks. Nutch, by default, only stores 100 outlinks
per page (db.max.outlinks.per.page). Link about.html happens to be
105th link or so, so nutch doesn't store it. All you have to do is
either increase db.max.outlinks.per.page or set it  to -1 (which
means, store all outlinks).

>
>
>
>
>   
> 


-- 
Doğacan Güney







   


Re: NUTCH-119 :: how hard to fix

2007-06-26 Thread Doğacan Güney

On 6/27/07, Kai_testing Middleton <[EMAIL PROTECTED]> wrote:

I am evaluating nutch+lucene as a crawl and search solution.

However, I am finding major bugs in nutch right off the bat.

In particular, NUTCH-119: nutch is not crawling relative URLs.  I have some 
discussion of it here:
http://www.mail-archive.com/[EMAIL PROTECTED]/msg08644.html

Most of the links off www.variety.com, one of my main test sites, have relative 
URLs.  It seems incredible that nutch, which is capable of mapreduce, cannot 
fetch these URLs.

It could be that I would fix this bug if, for other reasons, I decide to go 
with nutch+lucene.  Has anyone tried fixing this problem?  Is it intractable?  
Or are the developers, who are just volunteers anyway, more interested in 
fixing other problems?

Could someone outline the issue for me a bit more clearly so I would know how 
to evaluate it?


Both this one and the other site you were mentioning (sf911truth) have
more than 100 outlinks. Nutch, by default, only stores 100 outlinks
per page (db.max.outlinks.per.page). Link about.html happens to be
105th link or so, so nutch doesn't store it. All you have to do is
either increase db.max.outlinks.per.page or set it  to -1 (which
means, store all outlinks).
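
For reference, that override would look something like this in conf/nutch-site.xml
(the property name comes straight from the advice above; -1 means keep every outlink):

<property>
  <name>db.max.outlinks.per.page</name>
  <value>-1</value>
  <description>Maximum number of outlinks processed per page; -1 stores all of them.</description>
</property>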






  




--
Doğacan Güney


Re: NUTCH-348 and Nutch-0.7.2

2007-05-24 Thread Doug Cutting

karthik085 wrote:

How do you find when a revision was released?


Look at the tags in subversion:

http://svn.apache.org/viewvc/lucene/nutch/tags/

Doug


Re: [Nutch-dev] Creating a new scoring filter

2007-04-24 Thread Lorenzo

Very briefly, with an HtmlParseFilter and a list of weighted words.
This filter examines the Parse text and adds a boost value if it finds
one of the words in the list.

This boost value is added to the ParseData MetaData.
Then, a ScoringPlugin reads this MetaData (passScoreAfterParsing) and
updates the CrawlDatum, both of the outlinked pages (to focus the search more)
and of the current page (the difficult part, as explained on the mailing list;
however, with NUTCH-468 it should be easier now).


If you need other information, please ask!

Lorenzo
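
A minimal, library-free sketch of the weighted-word boost computation described
above; the names are illustrative and the plugin wiring (HtmlParseFilter /
ScoringFilter) is deliberately left out:

import java.util.HashMap;
import java.util.Map;

public class WeightedWordBoost {

  // Sum the weights of the configured words that occur in the parse text.
  public static float boost(String parseText, Map<String, Float> weightedWords) {
    String text = parseText.toLowerCase();
    float boost = 0.0f;
    for (Map.Entry<String, Float> e : weightedWords.entrySet()) {
      if (text.contains(e.getKey().toLowerCase())) {
        boost += e.getValue();
      }
    }
    return boost;
  }

  public static void main(String[] args) {
    Map<String, Float> words = new HashMap<String, Float>();
    words.put("nutch", 0.5f);
    words.put("crawler", 0.3f);
    System.out.println(boost("A focused Nutch crawler example", words)); // 0.8
  }
}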


Briggs wrote:

Yes.  I too need to alter the score based on attributes and such of
the particular url passed.
May I ask what you have done?


On 4/22/07, Lorenzo <[EMAIL PROTECTED]> wrote:

Perfect! Now I have it working, and it performs quite well for a focused
search engine like ours!
Do you think it could be an interesting plug-in to add to nutch?

Lorenzo


Doğacan Güney wrote:
> On 4/21/07, Lorenzo <[EMAIL PROTECTED]> wrote:
>>
>> Uhmm... so, suppose I decided, from its content, that the current 
page

>> http://foo/bar.htm is really desirable.
>> I have put in ParseData's metadata a flag to mark it.
>> In distributeScoreToOutlink(s) I read it from the ParseData param, 
and

>> put it in the adjust CrawlData metadata
>>
>>   MapWritable adjustMap = adjust.getMetaData();
>>   adjustMap.put(key, new FloatWritable(bootsValue));
>>   return adjust;
>>
>> So in updateDbScore(Text url, CrawlDatum old, CrawlDatum datum, List
>> inlinked)
>> the adjust CrawlData will be between the inlinked List. Is it 
right? How

>> do I distinguish it?
>> I can put the URL in metadata too, and scroll through the list, but
>> maybe there is a better method?
>
>
>
> Best approach is yours, you should put a flag in adjust datum's
> metadata to
> mark it, then process it in updateDbScore.
>
> Also, this CrawlDatum will be the same that is passed to indexerScore?
>
>
> You get 2 CrawlDatum's in indexerScore. First is fetchDatum which is
> the one
> in crawl_fetch that contains the fetching status. Second is dbDatum 
which

> comes from crawldb. This dbDatum is the one that you set in
> updateDbScore(The 'datum' argument of updateDbScore)
>
>
> Thanks a lot!
>>
>> Lorenzo
>>
>>
>
>










Re: [Nutch-dev] Creating a new scoring filter

2007-04-23 Thread Briggs

Yes.  I too need to alter the score based on attributes and such of
the particular url passed.
May I ask what you have done?


On 4/22/07, Lorenzo <[EMAIL PROTECTED]> wrote:

Perfect! Now I have it working, and it performs quite well for a focused
search engine like ours!
Do you think it could be an interesting plug-in to add to nutch?

Lorenzo


Doğacan Güney wrote:
> On 4/21/07, Lorenzo <[EMAIL PROTECTED]> wrote:
>>
>> Uhmm... so, suppose I decided, from its content, that the current page
>> http://foo/bar.htm is really desirable.
>> I have put in ParseData's metadata a flag to mark it.
>> In distributeScoreToOutlink(s) I read it from the ParseData param, and
>> put it in the adjust CrawlData metadata
>>
>>   MapWritable adjustMap = adjust.getMetaData();
>>   adjustMap.put(key, new FloatWritable(bootsValue));
>>   return adjust;
>>
>> So in updateDbScore(Text url, CrawlDatum old, CrawlDatum datum, List
>> inlinked)
>> the adjust CrawlData will be between the inlinked List. Is it right? How
>> do I distinguish it?
>> I can put the URL in metadata too, and scroll through the list, but
>> maybe there is a better method?
>
>
>
> Best approach is yours, you should put a flag in adjust datum's
> metadata to
> mark it, then process it in updateDbScore.
>
> Also, this CrawlDatum will be the same that is passed to indexerScore?
>
>
> You get 2 CrawlDatum's in indexerScore. First is fetchDatum which is
> the one
> in crawl_fetch that contains the fetching status. Second is dbDatum which
> comes from crawldb. This dbDatum is the one that you set in
> updateDbScore(The 'datum' argument of updateDbScore)
>
>
> Thanks a lot!
>>
>> Lorenzo
>>
>>
>
>





--
"Conscious decisions by concious minds are what make reality real"


Re: [Nutch-dev] Creating a new scoring filter

2007-04-22 Thread Lorenzo
Perfect! Now I have it working, and it performs quite well for a focused 
search engine like ours!

Do you think it could be an interesting plug-in to add to nutch?

Lorenzo


Doğacan Güney wrote:

On 4/21/07, Lorenzo <[EMAIL PROTECTED]> wrote:


Uhmm... so, suppose I decided, from its content, that the current page
http://foo/bar.htm is really desirable.
I have put in ParseData's metadata a flag to mark it.
In distributeScoreToOutlink(s) I read it from the ParseData param, and
put it in the adjust CrawlData metadata

  MapWritable adjustMap = adjust.getMetaData();
  adjustMap.put(key, new FloatWritable(bootsValue));
  return adjust;

So in updateDbScore(Text url, CrawlDatum old, CrawlDatum datum, List
inlinked)
the adjust CrawlData will be between the inlinked List. Is it right? How
do I distinguish it?
I can put the URL in metadata too, and scroll through the list, but
maybe there is a better method?




Best approach is yours, you should put a flag in adjust datum's 
metadata to

mark it, then process it in updateDbScore.

Also, this CrawlDatum will be the same that is passed to indexerScore?


You get 2 CrawlDatum's in indexerScore. First is fetchDatum which is 
the one

in crawl_fetch that contains the fetching status. Second is dbDatum which
comes from crawldb. This dbDatum is the one that you set in
updateDbScore(The 'datum' argument of updateDbScore)


Thanks a lot!


Lorenzo









Re: [Nutch-dev] Creating a new scoring filter

2007-04-21 Thread Doğacan Güney

On 4/21/07, Lorenzo <[EMAIL PROTECTED]> wrote:


Doğacan Güney wrote:
> On 4/19/07, Lorenzo <[EMAIL PROTECTED]> wrote:
>>
>> Hi,
>> sorry to re-open this thread, but I am facing the same problem of
>> Nicolás.
>> I like both yours (Doğacan) and Nicolas' ideas, more yours as I think
>> abstract
>> classes are not good extension points.
>> Anyway, is any of these implemented? I really need it!
>
>
> Well, I have implemented a subset of what we discussed in
> 
> NUTCH-468. There is
> a lot
> more to be done but IMHO, NUTCH-468 may be a good starting point.
>
> Also, I can't understand from the docs what does it means that the
>> adjust datum
>> will update the score of the original datum in updatedb.
>> Update or adjusted in which way? I obtain strange values..
>
>
> In ScoringFilter.updateDbScore you get a list of inlinked datums that
you
> can use to change score. Now, if in distributeScoreToOutlink(s) you
> return a
> datum with a status of STATUS_LINKED, you will get this datum as one
> of the
> inlinked datums in updateDbScore.
>
> I hope, this clears it up a bit.
>
Uhmm... so, suppose I decided, from its content, that the current page
http://foo/bar.htm is really desirable.
I have put in ParseData's metadata a flag to mark it.
In distributeScoreToOutlink(s) I read it from the ParseData param, and
put it in the adjust CrawlData metadata

  MapWritable adjustMap = adjust.getMetaData();
  adjustMap.put(key, new FloatWritable(bootsValue));
  return adjust;

So in updateDbScore(Text url, CrawlDatum old, CrawlDatum datum, List
inlinked)
the adjust CrawlData will be between the inlinked List. Is it right? How
do I distinguish it?
I can put the URL in metadata too, and scroll through the list, but
maybe there is a better method?




Best approach is yours, you should put a flag in adjust datum's metadata to
mark it, then process it in updateDbScore.

Also, this CrawlDatum will be the same that is passed to indexerScore?


You get 2 CrawlDatum's in indexerScore. First is fetchDatum which is the one
in crawl_fetch that contains the fetching status. Second is dbDatum which
comes from crawldb. This dbDatum is the one that you set in
updateDbScore(The 'datum' argument of updateDbScore)
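
A library-free sketch of the flag-passing pattern described here; Datum below
stands in for CrawlDatum (score plus metadata map), FLAG_KEY is a hypothetical
key name, and the two methods mirror the ScoringFilter hooks quoted in this thread:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class FlagPassingSketch {

  // Stand-in for CrawlDatum: a score plus a metadata map.
  static class Datum {
    float score;
    Map<String, Float> meta = new HashMap<String, Float>();
  }

  static final String FLAG_KEY = "content.boost";   // hypothetical key name

  // distributeScoreToOutlink(s): mark the 'adjust' datum for the current page
  // (in the real API it would also get STATUS_LINKED and be returned).
  static Datum markAdjust(float boostValue) {
    Datum adjust = new Datum();
    adjust.meta.put(FLAG_KEY, boostValue);
    return adjust;
  }

  // updateDbScore: the marked adjust datum shows up among the inlinked datums.
  static void updateDbScore(Datum dbDatum, List<Datum> inlinked) {
    for (Datum in : inlinked) {
      Float boost = in.meta.get(FLAG_KEY);
      if (boost != null) {            // this inlinked datum is our own 'adjust'
        dbDatum.score += boost;
      }
    }
  }

  public static void main(String[] args) {
    Datum page = new Datum();
    List<Datum> inlinked = new ArrayList<Datum>();
    inlinked.add(markAdjust(2.0f));
    updateDbScore(page, inlinked);
    System.out.println(page.score);   // 2.0
  }
}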


Thanks a lot!


Lorenzo





--
Doğacan Güney


Re: [Nutch-dev] Creating a new scoring filter

2007-04-21 Thread Lorenzo

Doğacan Güney wrote:

On 4/19/07, Lorenzo <[EMAIL PROTECTED]> wrote:


Hi,
sorry to re-open this thread, but I am facing the same problem of 
Nicolás.

I like both yours (Doğacan) and Nicolas' ideas, more yours as I think
abstract
classes are not good extension points.
Anyway, is any of these implemented? I really need it!



Well, I have implemented a subset of what we discussed in

NUTCH-468. There is
a lot

more to be done but IMHO, NUTCH-468 may be a good starting point.

Also, I can't understand from the docs what does it means that the

adjust datum
will update the score of the original datum in updatedb.
Update or adjusted in which way? I obtain strange values..



In ScoringFilter.updateDbScore you get a list of inlinked datums that you
can use to change score. Now, if in distributeScoreToOutlink(s) you 
return a
datum with a status of STATUS_LINKED, you will get this datum as one 
of the

inlinked datums in updateDbScore.

I hope, this clears it up a bit.

Uhmm... so, suppose I decided, from its content, that the current page 
http://foo/bar.htm is really desirable.

I have put in ParseData's metadata a flag to mark it.
In distributeScoreToOutlink(s) I read it from the ParseData param, and 
put it in the adjust CrawlData metadata


 MapWritable adjustMap = adjust.getMetaData();
 adjustMap.put(key, new FloatWritable(bootsValue));
 return adjust;

So in updateDbScore(Text url, CrawlDatum old, CrawlDatum datum, List 
inlinked)
the adjust CrawlData will be between the inlinked List. Is it right? How 
do I distinguish it?
I can put the URL in metadata too, and scroll through the list, but 
maybe there is a better method?


Also, this CrawlDatum will be the same that is passed to indexerScore?
Thanks a lot!

Lorenzo


