Re: Pending Commits for Nutch Issues

2008-12-02 Thread Susam Pal
I agree with John too. You probably meant $0.02, since 0.02 cents is too
little. It is usually 2 cents. :-P

Regards,
Susam Pal

On Tue, Dec 2, 2008 at 6:09 PM, John Martyniak <[EMAIL PROTECTED]> wrote:

> Is NUTCH-442 going to be part of the 1.0 release?  I hope so, Nutch/Solr
> integration would be huge.
>
> just my .02 cents.
>
> -John
>
> On Nov 27, 2008, at 12:10 PM, Doğacan Güney wrote:
>
>> And here is a list of issues from me that need more discussion/review:
>>
>> NUTCH-442 - Integrate Nutch/Solr: If NUTCH-442 is too complex for
>> people to review, for now we can just write a SolrIndexer like Sami
>> Siren's and deal with 442 after 1.0. I would be happy to provide such
>> a patch.
>>
>> NUTCH-631 - MoreIndexingFilter fails with NoSuchElementException: I
>> don't know how to fix this one but indexing almost always fails with
>> index-more enabled.
>>
>> NUTCH-652 - AdaptiveFetchSchedule#setFetchSchedule doesn't calculate
>> fetch interval correctly: I botched it once so now I am afraid to
>> commit it :D
>>
>> NUTCH-626 - fetcher2 breaks out the domain with
>> db.ignore.external.links set at cross domain redirects: I am going to
>> update the patch and commit it if no objections.
>>
>> Also, I think NUTCH-658 would be a nice feature for 1.0.
>>
>> There are some others but these are the most recent and we really
>> should push 1.0 out the door already :D
>>
>> Oh and finally we should do a review of all libraries in nutch
>> (libraries in plugins included) and update them to latest versions. I
>> am going to open an issue with the intention of updating all the
>> libraries that do not require code changes.
>>
>> --
>> Doğacan Güney
>>
>
>


Re: Pending Commits for Nutch Issues

2008-12-02 Thread Julien Nioche
I agree with John. NUTCH-442 is by far the most popular/watched item in JIRA
and, I think, has been already used by quite a lot of different people to be
deemed reliable.

Julien


2008/12/2 John Martyniak <[EMAIL PROTECTED]>

> Is NUTCH-442 going to be part of the 1.0 release?  I hope so, Nutch/Solr
> integration would be huge.
>
> just my .02 cents.
>
> -John
>
>
> On Nov 27, 2008, at 12:10 PM, Doğacan Güney wrote:
>
>> And here is a list of issues from me that need more discussion/review:
>>
>> NUTCH-442 - Integrate Nutch/Solr: If NUTCH-442 is too complex for
>> people to review, for now we can just write a SolrIndexer like Sami
>> Siren's and deal with 442 after 1.0. I would be happy to provide such
>> a patch.
>>
>> NUTCH-631 - MoreIndexingFilter fails with NoSuchElementException: I
>> don't know how to fix this one but indexing almost always fails with
>> index-more enabled.
>>
>> NUTCH-652 - AdaptiveFetchSchedule#setFetchSchedule doesn't calculate
>> fetch interval correctly: I botched it once so now I am afraid to
>> commit it :D
>>
>> NUTCH-626 - fetcher2 breaks out the domain with
>> db.ignore.external.links set at cross domain redirects: I am going to
>> update the patch and commit it if no objections.
>>
>> Also, I think NUTCH-658 would be a nice feature for 1.0.
>>
>> There are some others but these are the most recent and we really
>> should push 1.0 out the door already :D
>>
>> Oh and finally we should do a review of all libraries in nutch
>> (libraries in plugins included) and update them to latest versions. I
>> am going to open an issue with the intention of updating all the
>> libraries that do not require code changes.
>>
>> --
>> Doğacan Güney
>>
>
>


-- 
DigitalPebble Ltd
http://www.digitalpebble.com


Re: Pending Commits for Nutch Issues

2008-12-02 Thread John Martyniak
Is NUTCH-442 going to be part of the 1.0 release?  I hope so, Nutch/Solr
integration would be huge.


just my .02 cents.

-John

On Nov 27, 2008, at 12:10 PM, Doğacan Güney wrote:

> And here is a list of issues from me that need more discussion/review:
>
> NUTCH-442 - Integrate Nutch/Solr: If NUTCH-442 is too complex for
> people to review, for now we can just write a SolrIndexer like Sami
> Siren's and deal with 442 after 1.0. I would be happy to provide such
> a patch.
>
> NUTCH-631 - MoreIndexingFilter fails with NoSuchElementException: I
> don't know how to fix this one but indexing almost always fails with
> index-more enabled.
>
> NUTCH-652 - AdaptiveFetchSchedule#setFetchSchedule doesn't calculate
> fetch interval correctly: I botched it once so now I am afraid to
> commit it :D
>
> NUTCH-626 - fetcher2 breaks out the domain with
> db.ignore.external.links set at cross domain redirects: I am going to
> update the patch and commit it if no objections.
>
> Also, I think NUTCH-658 would be a nice feature for 1.0.
>
> There are some others but these are the most recent and we really
> should push 1.0 out the door already :D
>
> Oh and finally we should do a review of all libraries in nutch
> (libraries in plugins included) and update them to latest versions. I
> am going to open an issue with the intention of updating all the
> libraries that do not require code changes.
>
> --
> Doğacan Güney




Re: Pending Commits for Nutch Issues

2008-11-27 Thread Dennis Kubes



Doğacan Güney wrote:
> Hi Dennis,
>
> On Wed, Nov 26, 2008 at 11:42 PM, Dennis Kubes <[EMAIL PROTECTED]> wrote:
>> If nobody has a problem with them I would like to commit the following
>> issues in the next day or two:
>>
>> NUTCH-663: Upgrade Nutch to the most recent Hadoop version (0.19)
>> NUTCH-662: Upgrade Nutch to the most recent Lucene version (2.4)
>> NUTCH-647: Resolve URLs tool
>> NUTCH-665: Search Load Testing Tool
>> NUTCH-667: Input Format for working with Content in Hadoop Streaming
>>
>> And I would like to commit these in < a week:
>>
>> NUTCH-635: LinkAnalysis Tool for Nutch
>> NUTCH-646: New Indexing framework for Nutch
>> NUTCH-594: Serve Nutch search results in XML and JSON
>> NUTCH-666: Analysis plugins and new language identifier.
>>
>> There are others too but these are the ones I am trying to get moved into
>> trunk right now.
>
> I am OK with all but NUTCH-666... Why a new language identifier? (or
> if a new one, why keep old one around?)

I haven't pushed the code out yet.  I do have a production version
running, but I need to make it play nice with the Apache licensing
requirements.  The library I am currently using is under the GPL.  The
reason I switched was that the old one wasn't working correctly for me.

I don't know the accuracy levels of the old language identifier, but I
found that it would often classify pages containing both English and
another language as English.  The new language identifier I am
currently using has an accuracy rate of 97% and is trainable, as before,
for multiple languages.  Currently we have models for 20-30 languages.

Also, the new language identifier works with the new indexing framework
and with new functionality for custom fields.  The only reason I would
keep the old one around would be backwards compatibility for people
currently using it.

I will push out a patch shortly and we can review it.  If we don't want
it to make it into this release, I am OK with that.

>> Dennis

Dennis







Re: Pending Commits for Nutch Issues

2008-11-27 Thread Andrzej Bialecki

Doğacan Güney wrote:

> I forgot: I think there is a huge bug with MapWritable in nutch. I
> haven't yet figured out what it is exactly, but it has something to do
> with the fact that id->class maps are static.


Hadoop now has its own implementation of MapWritable, which doesn't use
static mappings. We should probably switch to this implementation,
although we would have to solve the back-compat issues of accessing old
data produced with Nutch's MapWritable.
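
To illustrate why a static id->class table can be fragile in general, here is a
minimal sketch. It is purely hypothetical: the Registry class and its methods
are made up for this example and do not reproduce Nutch's or Hadoop's
MapWritable, nor do they diagnose the actual bug, which is still unknown. Each
Registry object stands in for the static table of one JVM, and ids are assumed
to be handed out in order of first use.

```java
import java.util.ArrayList;
import java.util.List;

public class StaticIdRegistryDemo {
    /** Hypothetical stand-in for the static id->class table of one JVM. */
    static final class Registry {
        private final List<String> classNames = new ArrayList<>();

        /** Returns the id for a class name, assigning the next free id on first use. */
        int idFor(String className) {
            int id = classNames.indexOf(className);
            if (id < 0) {
                classNames.add(className);
                id = classNames.size() - 1;
            }
            return id;
        }

        /** Resolves an id back to a class name, or null if the id is unknown. */
        String classFor(int id) {
            return id >= 0 && id < classNames.size() ? classNames.get(id) : null;
        }
    }

    public static void main(String[] args) {
        // The "writer JVM" happens to see Text first, then IntWritable.
        Registry writerJvm = new Registry();
        writerJvm.idFor("Text");
        int written = writerJvm.idFor("IntWritable");

        // The "reader JVM" happens to see the same classes in the opposite order.
        Registry readerJvm = new Registry();
        readerJvm.idFor("IntWritable");
        readerJvm.idFor("Text");

        // The same id resolves to different classes in the two tables.
        System.out.println(writerJvm.classFor(written)); // IntWritable
        System.out.println(readerJvm.classFor(written)); // Text
    }
}
```

Under these assumptions, an id that is stored with the data only makes sense
together with the table that produced it, which is one reason to prefer a
mapping that travels with each map instance instead of living in static state.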



--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Pending Commits for Nutch Issues

2008-11-27 Thread Doğacan Güney
OK one last thing: Get rid of Fetcher and promote Fetcher2 to be the
default fetcher.

On Thu, Nov 27, 2008 at 7:15 PM, Doğacan Güney <[EMAIL PROTECTED]> wrote:
> I forgot: I think there is a huge bug with MapWritable in nutch. I
> haven't yet figured out what it is exactly, but it has something to do
> with the fact that id->class maps are static.
>
> On Thu, Nov 27, 2008 at 7:10 PM, Doğacan Güney <[EMAIL PROTECTED]> wrote:
>> And here is a list of issues from me that need more discussion/review:
>>
>> NUTCH-442 - Integrate Nutch/Solr: If NUTCH-442 is too complex for
>> people to review, for now we can just write a SolrIndexer like Sami
>> Siren's and deal with 442 after 1.0. I would be happy to provide such
>> a patch.
>>
>> NUTCH-631 - MoreIndexingFilter fails with NoSuchElementException: I
>> don't know how to fix this one but indexing almost always fails with
>> index-more enabled.
>>
>> NUTCH-652 - AdaptiveFetchSchedule#setFetchSchedule doesn't calculate
>> fetch interval correctly: I botched it once so now I am afraid to
>> commit it :D
>>
>> NUTCH-626 - fetcher2 breaks out the domain with
>> db.ignore.external.links set at cross domain redirects: I am going to
>> update the patch and commit it if no objections.
>>
>> Also, I think NUTCH-658 would be a nice feature for 1.0.
>>
>> There are some others but these are the most recent and we really
>> should push 1.0 out the door already :D
>>
>> Oh and finally we should do a review of all libraries in nutch
>> (libraries in plugins included) and update them to latest versions. I
>> am going to open an issue with the intention of updating all the
>> libraries that do not require code changes.
>>
>> --
>> Doğacan Güney
>>
>
>
>
> --
> Doğacan Güney
>



-- 
Doğacan Güney


Re: Pending Commits for Nutch Issues

2008-11-27 Thread Doğacan Güney
I forgot: I think there is a huge bug with MapWritable in nutch. I
haven't yet figured out what it is exactly, but it has something to do
with the fact that id->class maps are static.

On Thu, Nov 27, 2008 at 7:10 PM, Doğacan Güney <[EMAIL PROTECTED]> wrote:
> And here is a list of issues from me that need more discussion/review:
>
> NUTCH-442 - Integrate Nutch/Solr: If NUTCH-442 is too complex for
> people to review, for now we can just write a SolrIndexer like Sami
> Siren's and deal with 442 after 1.0. I would be happy to provide such
> a patch.
>
> NUTCH-631 - MoreIndexingFilter fails with NoSuchElementException: I
> don't know how to fix this one but indexing almost always fails with
> index-more enabled.
>
> NUTCH-652 - AdaptiveFetchSchedule#setFetchSchedule doesn't calculate
> fetch interval correctly: I botched it once so now I am afraid to
> commit it :D
>
> NUTCH-626 - fetcher2 breaks out the domain with
> db.ignore.external.links set at cross domain redirects: I am going to
> update the patch and commit it if no objections.
>
> Also, I think NUTCH-658 would be a nice feature for 1.0.
>
> There are some others but these are the most recent and we really
> should push 1.0 out the door already :D
>
> Oh and finally we should do a review of all libraries in nutch
> (libraries in plugins included) and update them to latest versions. I
> am going to open an issue with the intention of updating all the
> libraries that do not require code changes.
>
> --
> Doğacan Güney
>



-- 
Doğacan Güney


Re: Pending Commits for Nutch Issues

2008-11-27 Thread Doğacan Güney
And here is a list of issues from me that need more discussion/review:

NUTCH-442 - Integrate Nutch/Solr: If NUTCH-442 is too complex for
people to review, for now we can just write a SolrIndexer like Sami
Siren's and deal with 442 after 1.0. I would be happy to provide such
a patch.

NUTCH-631 - MoreIndexingFilter fails with NoSuchElementException: I
don't know how to fix this one but indexing almost always fails with
index-more enabled.

NUTCH-652 - AdaptiveFetchSchedule#setFetchSchedule doesn't calculate
fetch interval correctly: I botched it once so now I am afraid to
commit it :D

NUTCH-626 - fetcher2 breaks out the domain with
db.ignore.external.links set at cross domain redirects: I am going to
update the patch and commit it if no objections.

Also, I think NUTCH-658 would be a nice feature for 1.0.

There are some others but these are the most recent and we really
should push 1.0 out the door already :D

Oh and finally we should do a review of all libraries in nutch
(libraries in plugins included) and update them to latest versions. I
am going to open an issue with the intention of updating all the
libraries that do not require code changes.

-- 
Doğacan Güney


Re: Pending Commits for Nutch Issues

2008-11-27 Thread Doğacan Güney
Hi Dennis,

On Wed, Nov 26, 2008 at 11:42 PM, Dennis Kubes <[EMAIL PROTECTED]> wrote:
> If nobody has a problem with them I would like to commit the following
> issues in the next day or two:
>
> NUTCH-663: Upgrade Nutch to the most recent Hadoop version (0.19)
> NUTCH-662: Upgrade Nutch to the most recent Lucene version (2.4)
> NUTCH-647: Resolve URLs tool
> NUTCH-665: Search Load Testing Tool
> NUTCH-667: Input Format for working with Content in Hadoop Streaming
>
> And I would like to commit these in < a week:
>
> NUTCH-635: LinkAnalysis Tool for Nutch
> NUTCH-646: New Indexing framework for Nutch
> NUTCH-594: Serve Nutch search results in XML and JSON
> NUTCH-666: Analysis plugins and new language identifier.
>
> There are others too but these are the ones I am trying to get moved into
> trunk right now.
>

I am OK with all but NUTCH-666... Why a new language identifier? (or
if a new one, why keep old one around?)

> Dennis
>



-- 
Doğacan Güney


Pending Commits for Nutch Issues

2008-11-26 Thread Dennis Kubes
If nobody has a problem with them I would like to commit the following 
issues in the next day or two:


NUTCH-663: Upgrade Nutch to the most recent Hadoop version (0.19)
NUTCH-662: Upgrade Nutch to the most recent Lucene version (2.4)
NUTCH-647: Resolve URLs tool
NUTCH-665: Search Load Testing Tool
NUTCH-667: Input Format for working with Content in Hadoop Streaming

And I would like to commit these in < a week:

NUTCH-635: LinkAnalysis Tool for Nutch
NUTCH-646: New Indexing framework for Nutch
NUTCH-594: Serve Nutch search results in XML and JSON
NUTCH-666: Analysis plugins and new language identifier.

There are others too but these are the ones I am trying to get moved 
into trunk right now.


Dennis