Re: [DISCUSS] Roadmap for 2.3 Release

2014-05-01 Thread Talat Uyarer
Thank you, Julien. I agree that we should make our releases as robust as
possible. I will work on your comments.

Talat


Re: [DISCUSS] Roadmap for 2.3 Release

2014-05-01 Thread Julien Nioche
Hi Talat,

Comments below :

> NUTCH-1753 Eclipse dependecy problem for 2.x
>

=> trivial, please see my comments on it


> NUTCH-1748 urlfilter-validator to allow .. (two dots) inside file names
> (path elements)
>

=> still under discussion - leave it for 2.4


> NUTCH-1740 BatchId parameter is not set in DbUpdaterJob
>

=> duplicate


> NUTCH-1728 indexer-solr plugin is not delete docs from solr
>

=> trivial enough to be committed for 2.3


> NUTCH-1725 CleaningJob's reducer does not commit deleted docs.
>

=> trivial enough to be committed for 2.3


> NUTCH-1662 NUTCH-1568 Indexer Plugin for Solr Cloud
>

=> I think we did something pretty similar in 1.x and would like to make
sure that both versions are as similar as possible.


> NUTCH-1661 Language based crawling
>

=> This is definitely not being committed. You haven't replied to Otis's
questions and this has to be properly reviewed first and discussed.


> NUTCH-1660 Index filter for Page's latitude and longitude
>

=> same. You haven't replied to the comments on this one.


> NUTCH-1657 ORIGINAL_CHAR_ENCODING and CHAR_ENCODING_FOR_CONVERSION never
> set in HTMLParser
>

=> trivial indeed, +1 thanks


> NUTCH-1643 Unnecessary fetching with http.content.limit when using
> protocol-http
>

=> needs reviewing first, let's leave it for later


> NUTCH-1618 Fetches some websites multiple times for long lasting queues
>

=> trivial indeed, please change the title to something more explicit like
"Turn speculative execution off for Fetching"

I have added NUTCH-1679 (UpdateDb using batchId, link may override crawled
page.) to 2.3 as it must be fixed ASAP.

Thanks for pointing out these issues. I think the focus for 2.3 should be
to get everything as robust as possible; we can always add new
functionality in another release after that ("release often" etc.). One
thing we should definitely have, though, is to leverage the brand new GORA
filtering so that we get only the entries marked for a given job - see the
discussion on NUTCH-1714. This should make Nutch 2.x a lot faster.

We haven't released 2.x for some time and loads of interesting stuff has
been done to it. It will be an exciting release!

Thanks for your contributions and pushing things forward!

Julien



Re: [DISCUSS] Roadmap for 2.3 Release

2014-05-01 Thread Talat Uyarer
Hi Julien,

Sorry, you are right. I guess I did not express myself clearly. What I
meant is that some of the issues currently assigned to 2.4 should be part
of 2.3.

The issues:
NUTCH-1753 Eclipse dependecy problem for 2.x
NUTCH-1748 urlfilter-validator to allow .. (two dots) inside file names
(path elements)
NUTCH-1740 BatchId parameter is not set in DbUpdaterJob
NUTCH-1728 indexer-solr plugin is not delete docs from solr
NUTCH-1725 CleaningJob's reducer does not commit deleted docs.
NUTCH-1662 NUTCH-1568 Indexer Plugin for Solr Cloud
NUTCH-1661 Language based crawling
NUTCH-1660 Index filter for Page's latitude and longitude
NUTCH-1657 ORIGINAL_CHAR_ENCODING and CHAR_ENCODING_FOR_CONVERSION never
set in HTMLParser
NUTCH-1643 Unnecessary fetching with http.content.limit when using
protocol-http
NUTCH-1618 Fetches some websites multiple times for long lasting queues

WDYT?

Talat



-- 
Talat UYARER
Website: http://talat.uyarer.com
Twitter: http://twitter.com/talatuyarer
Linkedin: http://tr.linkedin.com/pub/talat-uyarer/10/142/304


Re: [DISCUSS] Roadmap for 2.3 Release

2014-05-01 Thread Julien Nioche
Hi Talat

Not clear what you mean here. "I need them" is not really an explanation as
to why they should be part of the next release. [If you want your own
repository then open an account on GitHub (or somewhere else) and clone the
2.x branch to add the patches of your choice].

Lewis suggested a roadmap for the next release and the changes he made
reflect his suggestions. If you think some of the issues should be part of
the 2.3 release then please explain why. BTW I don't think you agree with
me as I was suggesting we stick to the ones already listed minus 1741.

Thanks

Julien



-- 

Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble


Re: [DISCUSS] Roadmap for 2.3 Release

2014-05-01 Thread Talat Uyarer
I agree with you, Julien. Today Lewis changed the fix version of some issues
from 2.3 to 2.4. Most of them are mine :) May I ask: if I update these
issues, can I change their fix version back to 2.3? I need them.

Thanks
Talat



-- 
Talat UYARER
Website: http://talat.uyarer.com
Twitter: http://twitter.com/talatuyarer
Linkedin: http://tr.linkedin.com/pub/talat-uyarer/10/142/304


Re: [DISCUSS] Roadmap for 2.3 Release

2014-04-30 Thread Julien Nioche
I'd exclude NUTCH-1741 for now and focus on the core updates (GORA,
filters, etc...). See comments on NUTCH-1714.



-- 

Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble


Re: [DISCUSS] Roadmap for 2.3 Release

2014-04-30 Thread Lewis John Mcgibbney
Hi Alparslan & Folks,

OK, so you can see the roadmap here:

http://s.apache.org/Xqk

As you can see, in the 2.3 development drive we've addressed 66 of 71
issues. The remaining ones are as follows:

NUTCH-1741 Support of Sitemaps in Nutch 2.x
NUTCH-1714 Nutch 2.x upgrade to Gora 0.4
NUTCH-1709 Generated classes o.a.n.storage.Host and
o.a.n.storage.ProtocolStatus contain methods not defined in source .avsc
NUTCH-1674 Use batchId filter to enable scan (GORA-119) for
Fetch,Parse,Update,Index
NUTCH-1570 Add filtering capability to Datastore Queries

I think if we addressed the above then we could push an RC.
Any comments?
I'll be able to crack on with this final push relatively soon.



Re: [DISCUSS] Roadmap for 2.3 Release

2014-04-28 Thread Alparslan Avcı
Hi Lewis,

I think we can also add https://issues.apache.org/jira/browse/NUTCH-1674.
This issue was waiting for the stable release of gora-0.4.

And IMHO, we can add https://issues.apache.org/jira/browse/NUTCH-1741, if
anyone could review and test it.

Thanks,
Alparslan



-- 
Alparslan Avcı


[DISCUSS] Roadmap for 2.3 Release

2014-04-27 Thread Lewis John Mcgibbney
Hi Folks,
I suggest we get in
https://issues.apache.org/jira/browse/NUTCH-1714
and then push a release. Does anyone have any other suggestions/additions?
Unless someone else wants to do RM, I can put time in if required.
Thanks
Lewis

-- 
*Lewis*