Jenkins build is back to normal : Nutch-trunk #2377

2013-10-05 Thread Apache Jenkins Server
See 



Jenkins build is back to normal : Nutch-nutchgora #779

2013-10-05 Thread Apache Jenkins Server
See 



[jira] [Updated] (NUTCH-1562) Order of execution for scoring filters

2013-10-05 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-1562:
---

Attachment: NUTCH-1562-trunk.patch.v3

> Order of execution for scoring filters
> --
>
> Key: NUTCH-1562
> URL: https://issues.apache.org/jira/browse/NUTCH-1562
> Project: Nutch
>  Issue Type: Bug
>  Components: documentation
>Affects Versions: 1.6, 2.1
>Reporter: Julien Nioche
> Fix For: 2.3, 1.8
>
> Attachments: NUTCH-1562-trunk.patch, NUTCH-1562-trunk.patch.v2, 
> NUTCH-1562-trunk.patch.v3
>
>
> The documentation in nutch-default.xml states that:
> {quote}
> <property>
>   <name>scoring.filter.order</name>
>   <value></value>
>   <description>The order in which scoring filters are applied.
>   This may be left empty (in which case all available scoring
>   filters will be applied in the order defined in plugin-includes
>   and plugin-excludes), or a space separated list of implementation
>   classes.
>   </description>
> </property>
> {quote}
> However, if no order is specified, the filters are ordered randomly and not 
> in the order defined in plugin-includes.
> The other *order parameters (e.g. urlfilter.order) are documented 
> differently: they "are loaded and applied in system defined order", which 
> corresponds to what the code does.
> The patch attached is for 1.x and puts the code in accordance with the 
> documentation by ordering the filters according to the order of the plugins, 
> which gives users more control without having to specify the classes 
> explicitly in scoring.filter.order.
> We could extend the same idea to the other *order params.
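
The ordering rule discussed in this issue can be sketched as follows. This is a minimal illustration only, not the Nutch code or the attached patch: the class FilterOrderSketch, the method orderFilters, and the filter class names are hypothetical, and the list "available" merely stands in for the order in which the plugins were loaded. It also assumes that, when an explicit order is given, only the listed filters are applied.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; not the Nutch implementation or the attached patch.
class FilterOrderSketch {

    /**
     * Orders filter class names according to a scoring.filter.order-style
     * property (whitespace-separated class names). If the property is null
     * or blank, the filters keep the order of 'available', which stands in
     * for the order in which the plugins were loaded.
     */
    static List<String> orderFilters(List<String> available, String orderProperty) {
        if (orderProperty == null || orderProperty.trim().isEmpty()) {
            return new ArrayList<>(available);
        }
        List<String> ordered = new ArrayList<>();
        for (String className : orderProperty.trim().split("\\s+")) {
            // Assumption: classes that are listed but not loaded are skipped.
            if (available.contains(className)) {
                ordered.add(className);
            }
        }
        return ordered;
    }

    public static void main(String[] args) {
        // Hypothetical filter class names, listed in plugin load order.
        List<String> available = Arrays.asList("example.DepthFilter", "example.OpicFilter");
        System.out.println(orderFilters(available, ""));
        System.out.println(orderFilters(available, "example.OpicFilter example.DepthFilter"));
    }
}
```

With an empty property the sketch falls back to the plugin order, which is the behaviour the documentation quoted above describes.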



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (NUTCH-1562) Order of execution for scoring filters

2013-10-05 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13787361#comment-13787361
 ] 

Sebastian Nagel commented on NUTCH-1562:


Hi Julien,
originally, this issue was only about ordering scoring filters in the "order 
defined in plugin-includes and plugin-excludes". Is this ever possible? It 
seems that the order of filter plugins does not depend on how "plugin.includes" 
is written: the order is stable but "random". The property "plugin.includes" is 
a regular expression that is only used to filter plugins. Unrolling a regex 
into an ordered list is not simple, and sometimes almost impossible, because 
both {{scoring-(depth|opic)}} and {{scoring-(d[Ee]pth|.p.c)}} are valid and 
cause exactly the same plugins to be loaded (until you start implementing a 
{{scoring-apoc}} plugin). Maybe we should simply fix the description in 
nutch-default.xml?

+1 to fix the NPE. But this could be done in one place for all filter plugins 
(scoring/url/parse/indexing). I have attached a new patch which tries to 
"centralize" the code that loads filter plugins in an order defined by a 
property.
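
The point about the two regular expressions can be checked with a small sketch. It assumes that a plugin is included when the whole plugin id matches the expression; the class PluginRegexSketch and the method select are hypothetical, and scoring-apoc is the imaginary plugin from the comment above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch; it only illustrates the regex argument made above.
class PluginRegexSketch {

    /** Returns the plugin ids fully matched by the regex, in the order of 'ids'. */
    static List<String> select(String regex, List<String> ids) {
        Pattern pattern = Pattern.compile(regex);
        List<String> selected = new ArrayList<>();
        for (String id : ids) {
            if (pattern.matcher(id).matches()) {
                selected.add(id);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        List<String> installed = Arrays.asList("scoring-opic", "scoring-depth", "scoring-link");
        // Both expressions select the same plugins, in the order of 'installed'.
        System.out.println(select("scoring-(depth|opic)", installed));
        System.out.println(select("scoring-(d[Ee]pth|.p.c)", installed));

        // Once an imaginary scoring-apoc plugin exists, the two expressions differ.
        List<String> withApoc = Arrays.asList("scoring-opic", "scoring-depth", "scoring-apoc");
        System.out.println(select("scoring-(depth|opic)", withApoc));
        System.out.println(select("scoring-(d[Ee]pth|.p.c)", withApoc));
    }
}
```

In this sketch the result order comes from the list of installed plugins, not from the order of the alternatives inside the expression, which is why the regex alone cannot define an execution order.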



[jira] [Commented] (NUTCH-1568) port pluggable indexing architecture to 2.x

2013-10-05 Thread Talat UYARER (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13787321#comment-13787321
 ] 

Talat UYARER commented on NUTCH-1568:
-

You are welcome, Lewis. Actually, I should be the one saying thank you. You 
developed a very good project.

I will do the upgrade for Solr 4.x compatibility in NUTCH-1486.

 

> port pluggable indexing architecture to 2.x
> ---
>
> Key: NUTCH-1568
> URL: https://issues.apache.org/jira/browse/NUTCH-1568
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer
>Affects Versions: 2.2
>Reporter: Lewis John McGibbney
> Fix For: 2.3
>
>
> I would like to port the work done by Julien on NUTCH-1047 to 2.x. This issue 
> should track that. It would be nice to do the upgrade in NUTCH-1486 before we 
> do this port so that people can start using Solr 4.x ASAP.





[jira] [Commented] (NUTCH-1568) port pluggable indexing architecture to 2.x

2013-10-05 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13787250#comment-13787250
 ] 

Lewis John McGibbney commented on NUTCH-1568:
-

Talat, your contributions are being noticed and acknowledged within the Nutch
community.
Great work and thank you for the patches.



-- 
*Lewis*




[jira] [Commented] (NUTCH-1568) port pluggable indexing architecture to 2.x

2013-10-05 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13787205#comment-13787205
 ] 

Julien Nioche commented on NUTCH-1568:
--

Great! I don't think anyone is working on this so feel free to do it. Thanks 
Talat



[jira] [Commented] (NUTCH-1645) Junit Test Case for Adaptive Fetch Schedule class

2013-10-05 Thread Talat UYARER (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13787203#comment-13787203
 ] 

Talat UYARER commented on NUTCH-1645:
-

This is not a JUnit test, Yasin. It is good, but not enough. Maybe you can 
write something more like a unit test.

> Junit Test Case for Adaptive Fetch Schedule class
> -
>
> Key: NUTCH-1645
> URL: https://issues.apache.org/jira/browse/NUTCH-1645
> Project: Nutch
>  Issue Type: Test
>Affects Versions: 2.2.1
>Reporter: Talat UYARER
>Priority: Minor
> Fix For: 2.3
>
> Attachments: NUTCH-1645.patch
>
>
> Currently there is no test case for Adaptive Fetch Schedule. This issue is 
> for writing a JUnit test for it.





[jira] [Commented] (NUTCH-1568) port pluggable indexing architecture to 2.x

2013-10-05 Thread Talat UYARER (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13787191#comment-13787191
 ] 

Talat UYARER commented on NUTCH-1568:
-

Is anybody working on this issue? I would like to work on it.



[jira] [Updated] (NUTCH-1588) Port NUTCH-1245 URL gone with 404 after db.fetch.interval.max stays db_unfetched in CrawlDb and is generated over and over again to 2.x

2013-10-05 Thread Talat UYARER (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Talat UYARER updated NUTCH-1588:


Attachment: NUTCH-1588.patch

I developed this patch for 2.x.

> Port NUTCH-1245 URL gone with 404 after db.fetch.interval.max stays 
> db_unfetched in CrawlDb and is generated over and over again to 2.x
> ---
>
> Key: NUTCH-1588
> URL: https://issues.apache.org/jira/browse/NUTCH-1588
> Project: Nutch
>  Issue Type: Bug
>Reporter: Lewis John McGibbney
>Priority: Critical
> Fix For: 2.3
>
> Attachments: NUTCH-1588.patch
>
>
> A document gone with 404 after db.fetch.interval.max (90 days) has passed
> is fetched over and over again. Although its fetch status is fetch_gone,
> its status in CrawlDb remains db_unfetched. Consequently, this document will
> be generated and fetched in every cycle from now on.
> To reproduce:
> # create a CrawlDatum in CrawlDb whose retry interval hits 
> db.fetch.interval.max (I manipulated shouldFetch() in 
> AbstractFetchSchedule to achieve this)
> # now this URL is fetched again
> # but when updating CrawlDb with the fetch_gone, the CrawlDatum is reset to 
> db_unfetched and the retry interval is fixed to 0.9 * db.fetch.interval.max 
> (81 days)
> # this does not change with any generate-fetch-update cycle; here for two 
> segments:
> {noformat}
> /tmp/testcrawl/segments/20120105161430
> SegmentReader: get 'http://localhost/page_gone'
> Crawl Generate::
> Status: 1 (db_unfetched)
> Fetch time: Thu Jan 05 16:14:21 CET 2012
> Modified time: Thu Jan 01 01:00:00 CET 1970
> Retries since fetch: 0
> Retry interval: 6998400 seconds (81 days)
> Metadata: _ngt_: 1325776461784_pst_: notfound(14), lastModified=0: 
> http://localhost/page_gone
> Crawl Fetch::
> Status: 37 (fetch_gone)
> Fetch time: Thu Jan 05 16:14:48 CET 2012
> Modified time: Thu Jan 01 01:00:00 CET 1970
> Retries since fetch: 0
> Retry interval: 6998400 seconds (81 days)
> Metadata: _ngt_: 1325776461784_pst_: notfound(14), lastModified=0: 
> http://localhost/page_gone
> /tmp/testcrawl/segments/20120105161631
> SegmentReader: get 'http://localhost/page_gone'
> Crawl Generate::
> Status: 1 (db_unfetched)
> Fetch time: Thu Jan 05 16:16:23 CET 2012
> Modified time: Thu Jan 01 01:00:00 CET 1970
> Retries since fetch: 0
> Retry interval: 6998400 seconds (81 days)
> Metadata: _ngt_: 1325776583451_pst_: notfound(14), lastModified=0: 
> http://localhost/page_gone
> Crawl Fetch::
> Status: 37 (fetch_gone)
> Fetch time: Thu Jan 05 16:20:05 CET 2012
> Modified time: Thu Jan 01 01:00:00 CET 1970
> Retries since fetch: 0
> Retry interval: 6998400 seconds (81 days)
> Metadata: _ngt_: 1325776583451_pst_: notfound(14), lastModified=0: 
> http://localhost/page_gone
> {noformat}
> As far as I can see it's caused by setPageGoneSchedule() in 
> AbstractFetchSchedule. Some pseudo-code:
> {code}
> setPageGoneSchedule (called from update / CrawlDbReducer.reduce):
>     datum.fetchInterval = 1.5 * datum.fetchInterval // now 1.5 * 0.9 * maxInterval
>     datum.fetchTime = fetchTime + datum.fetchInterval // see NUTCH-516
>     if (maxInterval < datum.fetchInterval) // necessarily true
>         forceRefetch()
> forceRefetch:
>     if (datum.fetchInterval > maxInterval) // true because it's 1.35 * maxInterval
>         datum.fetchInterval = 0.9 * maxInterval
>     datum.status = db_unfetched
> shouldFetch (called from generate / Generator.map):
>     if ((datum.fetchTime - curTime) > maxInterval)
>         // always true if the crawler is launched in short intervals
>         // (lower than 0.35 * maxInterval)
>         datum.fetchTime = curTime // forces a refetch
> {code}
> After setPageGoneSchedule is called via update the state is db_unfetched and 
> the retry interval 0.9 * db.fetch.interval.max (81 days). 
> Although the fetch time in the CrawlDb is far in the future
> {noformat}
> % nutch readdb testcrawl/crawldb -url http://localhost/page_gone
> URL: http://localhost/page_gone
> Version: 7
> Status: 1 (db_unfetched)
> Fetch time: Sun May 06 05:20:05 CEST 2012
> Modified time: Thu Jan 01 01:00:00 CET 1970
> Retries since fetch: 0
> Retry interval: 6998400 seconds (81 days)
> Score: 1.0
> Signature: null
> Metadata: _pst_: notfound(14), lastModified=0: http://localhost/page_gone
> {noformat}
> the URL is generated again because (fetch time - current time) is larger than 
> db.fetch.interval.max.
> The retry interval (datum.fetchInterval) oscillates between 0.9 and 1.35 times 
> db.fetch.interval.max, and the fetch time is always close to current time + 
> 1.35 * db.fetch.interval.max.
> It's possibly a side effect of NUTCH-516, and may be related to NUTCH-578.
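
The cycle described in the pseudo-code can be replayed with a small simulation. This is a sketch under stated assumptions, not Nutch source code: the class PageGoneSketch and its fields are hypothetical, times are plain seconds, the crawler is assumed to run once per day, and a datum is assumed to be due when its fetch time is not in the future. With db.fetch.interval.max at 90 days, 0.9 * 90 days is 81 days (6998400 seconds) and 1.35 * 90 days is 121.5 days (10497600 seconds).

```java
// Hypothetical simulation of the pseudo-code above; not Nutch source code.
class PageGoneSketch {
    static final long DAY = 86_400L;            // seconds per day
    static final long MAX_INTERVAL = 90 * DAY;  // db.fetch.interval.max

    long fetchInterval;                 // retry interval in seconds
    long fetchTime;                     // next fetch time in seconds
    String status = "db_unfetched";

    PageGoneSketch(long fetchInterval, long fetchTime) {
        this.fetchInterval = fetchInterval;
        this.fetchTime = fetchTime;
    }

    /** Update step after a fetch that returned fetch_gone at time fetchedAt. */
    void setPageGoneSchedule(long fetchedAt) {
        fetchInterval = fetchInterval * 3 / 2;       // 1.5 * interval
        fetchTime = fetchedAt + fetchInterval;
        if (MAX_INTERVAL < fetchInterval) {
            forceRefetch();
        }
    }

    void forceRefetch() {
        if (fetchInterval > MAX_INTERVAL) {
            fetchInterval = MAX_INTERVAL * 9 / 10;   // 0.9 * max interval
        }
        status = "db_unfetched";
    }

    /** Generate step. Assumption: a datum is due when fetchTime <= curTime. */
    boolean shouldFetch(long curTime) {
        if (fetchTime - curTime > MAX_INTERVAL) {
            fetchTime = curTime;                     // forces a refetch
        }
        return fetchTime <= curTime;
    }

    public static void main(String[] args) {
        // Start with an 81-day retry interval and a fetch time that is already due.
        PageGoneSketch datum = new PageGoneSketch(81 * DAY, 0L);
        for (long now = 0; now < 3 * DAY; now += DAY) {
            boolean due = datum.shouldFetch(now);
            datum.setPageGoneSchedule(now);
            System.out.printf("now=%d due=%b interval=%d fetchTime=%d status=%s%n",
                    now, due, datum.fetchInterval, datum.fetchTime, datum.status);
        }
    }
}
```

In this sketch the datum is due in every cycle, and after each update its interval is back at 81 days while its fetch time lies 121.5 days ahead, so the generate step resets the fetch time again on the next run.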


