[jira] [Commented] (NUTCH-1643) Unnecessary fetching with http.content.limit when using protocol-http

2013-10-29 Thread Talat UYARER (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13808006#comment-13808006
 ] 

Talat UYARER commented on NUTCH-1643:
-

Hi [~lewismc], I can look at every protocol for this improvement. But we should 
consider your second item. I looked at ParseUtil.java: when parsers return null, 
it writes a warning log. I think this is not a problem. What do you think? I 
will upload patches for the other protocols. I added the parse method from 
ParseUtil.java: 

{code:title=ParseUtil.java|borderStyle=solid}
  public Parse parse(String url, WebPage page) throws ParserNotFound,
      ParseException {
    Parser[] parsers = null;

    String contentType = TableUtil.toString(page.getContentType());

    parsers = this.parserFactory.getParsers(contentType, url);

    for (int i = 0; i < parsers.length; i++) {
      if (LOG.isDebugEnabled()) {
        LOG.debug("Parsing [" + url + "] with [" + parsers[i] + "]");
      }
      Parse parse = null;

      if (maxParseTime != -1)
        parse = runParser(parsers[i], url, page);
      else
        parse = parsers[i].getParse(url, page);

      if (parse != null && ParseStatusUtils.isSuccess(parse.getParseStatus())) {
        return parse;
      }
    }

    LOG.warn("Unable to successfully parse content " + url +
        " of type " + contentType);
    return ParseStatusUtils.getEmptyParse(new ParseException(
        "Unable to successfully parse content"), null);
  }
{code}

 Unnecessary fetching with http.content.limit when using protocol-http
 -

 Key: NUTCH-1643
 URL: https://issues.apache.org/jira/browse/NUTCH-1643
 Project: Nutch
  Issue Type: Bug
  Components: protocol
Affects Versions: 2.1, 2.2, 2.2.1
Reporter: Talat UYARER
Priority: Minor
 Fix For: 2.3

 Attachments: NUTCH-1643.patch, NUTCH-1643v2.patch


 In protocol-http, even if I have an http.content.limit value set, protocol-http 
 fetches files of all sizes (larger files are fetched up to the limit). 
 But when parsing, the parser skips incomplete files (if the parser.skip.truncated 
 configuration is true). It seems like unnecessary effort to partially 
 fetch contents larger than the limit if they are not going to be parsed.
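The check the issue asks for can be sketched roughly as follows. This is an illustrative sketch only, not the actual protocol-http code or the attached patch; the class and method names are assumptions. The idea: when the server reports a Content-Length above http.content.limit and truncated documents are skipped anyway, downloading the body is pointless.

```java
// Hypothetical early-exit check for protocol-http (names are illustrative).
public class ContentLimitCheck {

  /**
   * @param contentLength value of the Content-Length header, or -1 if absent
   * @param contentLimit  configured http.content.limit (-1 = unlimited)
   * @param skipTruncated value of parser.skip.truncated
   * @return true if fetching the body would be wasted effort
   */
  static boolean shouldAbortFetch(long contentLength, int contentLimit,
      boolean skipTruncated) {
    if (contentLimit < 0 || !skipTruncated) {
      return false; // no limit configured, or truncated content is still parsed
    }
    // The body would be truncated at contentLimit and then skipped by the parser.
    return contentLength > contentLimit;
  }
}
```

When the Content-Length header is absent (-1), the method falls back to the current behavior of fetching up to the limit, since the size cannot be known in advance.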



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (NUTCH-1651) modifiedTime and prevmodifiedTime never set

2013-10-29 Thread lufeng (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13808045#comment-13808045
 ] 

lufeng commented on NUTCH-1651:
---

Hi Talat

but I think getting the last modified time from the headers is not appropriate 
here. A user who wants to check whether a page has been modified should do it 
in a parser plugin through the content of that URL, not through the metadata in 
the HTTP headers, even if the value of Last-Modified in the headers has changed.

{code:java}
+Utf8 lastModified = page.getFromHeaders(new Utf8("Last-Modified"));
+if (lastModified != null) {
+  try {
+modifiedTime = HttpDateFormat.toLong(lastModified.toString());
+prevModifiedTime = page.getModifiedTime();
+  } catch (Exception e) {
+  }
+}
{code}

Maybe the appropriate way is to let the parser plugin defined by the user set 
the value of the modified time, not the DbUpdateReducer class.
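lufeng's point, content-based change detection rather than trusting headers, can be sketched like this. The class below is a hedged illustration, not the actual Nutch plugin API: it compares a digest of the parsed content against the previously stored digest, so a page counts as modified only when its content actually changed.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

// Illustrative sketch (class and method names are assumptions, not Nutch API):
// detect modification from the content itself instead of the Last-Modified header.
public class ContentChangeDetector {

  /** MD5 digest of the page content. */
  static byte[] digest(String content) {
    try {
      return MessageDigest.getInstance("MD5")
          .digest(content.getBytes(StandardCharsets.UTF_8));
    } catch (NoSuchAlgorithmException e) {
      throw new IllegalStateException(e); // MD5 is always available on the JVM
    }
  }

  /** True if the content differs from the previously stored digest. */
  static boolean contentChanged(String content, byte[] prevDigest) {
    // No previous digest means we have never seen the page: treat as changed.
    return prevDigest == null || !Arrays.equals(digest(content), prevDigest);
  }
}
```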

 modifiedTime and prevmodifiedTime never set 
 

 Key: NUTCH-1651
 URL: https://issues.apache.org/jira/browse/NUTCH-1651
 Project: Nutch
  Issue Type: Bug
Affects Versions: 2.2.1
Reporter: Talat UYARER
 Fix For: 2.3

 Attachments: NUTCH-1651.patch


 modifiedTime is never set. If you use DefaultFetchScheduler, modifiedTime is 
 always zero by default. But if you use AdaptiveFetchScheduler, modifiedTime 
 is set only once, at the beginning, by the zero check in AdaptiveFetchScheduler.
 But this is not sufficient, since modifiedTime needs to be updated whenever a 
 last modified time is available. We corrected this with a patch.
 Also we noticed that prevModifiedTime is not written to the database, and we 
 corrected that too.
 With this patch, whenever lastModifiedTime is available, we do two things. 
 First we set the modifiedTime in the Page object to prevModifiedTime. After 
 that we set lastModifiedTime to modifiedTime.
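The two-step update described above can be sketched as follows. The wrapper class is illustrative (the real patch works on the WebPage object); only the field names follow the getters quoted in this thread. The order matters: the old modifiedTime must be preserved as prevModifiedTime before it is overwritten.

```java
// Minimal sketch of the update order described in the issue (wrapper class
// is hypothetical; in Nutch the fields live on the WebPage object).
public class ModifiedTimeUpdate {

  long modifiedTime;
  long prevModifiedTime;

  void applyLastModified(long lastModifiedTime) {
    if (lastModifiedTime > 0) {          // only when a Last-Modified value exists
      prevModifiedTime = modifiedTime;   // step 1: preserve the old value
      modifiedTime = lastModifiedTime;   // step 2: store the new one
    }
  }
}
```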



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (NUTCH-1564) AdaptiveFetchSchedule: sync_delta forces immediate refetch for documents not modified

2013-10-29 Thread Talat UYARER (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13808122#comment-13808122
 ] 

Talat UYARER commented on NUTCH-1564:
-

[~amuseme.lu] How do you check this problem? Do you use the main method in 
AdaptiveFetchScheduler for checking? If you use that, it has some issues. 
[~icebergx5] fixed it in TestAdaptiveFetchScheduler.java; you can check that.


 AdaptiveFetchSchedule: sync_delta forces immediate refetch for documents not 
 modified
 -

 Key: NUTCH-1564
 URL: https://issues.apache.org/jira/browse/NUTCH-1564
 Project: Nutch
  Issue Type: Bug
  Components: crawldb
Affects Versions: 1.6, 2.1
Reporter: Sebastian Nagel
Priority: Critical

 In a continuous crawl with adaptive fetch scheduling, documents not modified 
 for a longer time may be fetched in every cycle.
 A continuous crawl is run daily with 3 cycles and the following scheduling 
 intervals (freshness matters):
 {code}
 db.fetch.schedule.class = org.apache.nutch.crawl.AdaptiveFetchSchedule
 db.fetch.schedule.adaptive.sync_delta   = true (default)
 db.fetch.schedule.adaptive.sync_delta_rate = 0.3 (default)
 db.fetch.interval.default   = 172800 (2 days)
 db.fetch.schedule.adaptive.min_interval =  86400 (1 day)
 db.fetch.schedule.adaptive.max_interval = 604800 (7 days)
 db.fetch.interval.max   = 604800 (7 days)
 {code}
 At Apr 18 a URL is generated and fetched (from segment dump):
 {code}
 Crawl Generate::
 Status: 2 (db_fetched)
 Fetch time: Mon Apr 15 19:43:22 CEST 2013
 Modified time: Tue Mar 19 01:07:42 CET 2013
 Retries since fetch: 0
 Retry interval: 604800 seconds (7 days)
 Crawl Fetch::
 Status: 33 (fetch_success)
 Fetch time: Thu Apr 18 01:23:51 CEST 2013
 Modified time: Tue Mar 19 01:07:42 CET 2013
 Retries since fetch: 0
 Retry interval: 604800 seconds (7 days)
 {code}
 Running CrawlDb update results in a next fetch time in the past (which forces 
 an immediate refetch in the next cycle):
 {code}
 Status: 6 (db_notmodified)
 Fetch time: Tue Apr 16 01:37:00 CEST 2013
 Modified time: Tue Mar 19 01:07:42 CET 2013
 Retries since fetch: 0
 Retry interval: 604800 seconds (7 days)
 {code}
 This behavior is caused by the sync_delta calculation in 
 AdaptiveFetchSchedule:
 {code}
   if (SYNC_DELTA) {
 // try to synchronize with the time of change
 long delta = (fetchTime - modifiedTime) / 1000L;
 if (delta > interval) interval = delta;
 refTime = fetchTime - Math.round(delta * SYNC_DELTA_RATE * 1000);
   }
   if (interval < MIN_INTERVAL) {
 interval = MIN_INTERVAL;
   } else if (interval > MAX_INTERVAL) {
 interval = MAX_INTERVAL;
   }
 ...
 datum.setFetchTime(refTime + Math.round(interval * 1000.0));
 {code}
 {{delta}} is 30 days (Apr 18 - Mar 19). {{refTime}} is then 9 days in the 
 past ({{delta}} * 0.3). After adding {{interval}} (adjusted to 
 {{MAX_INTERVAL}} = 7 days) to {{refTime}} the next fetch should take place 
 2 days in the past (Apr 16).
 According to the 
 [javadoc|http://nutch.apache.org/apidocs-2.1/org/apache/nutch/crawl/AdaptiveFetchSchedule.html]
  (if understood right), there are two aims of the sync_delta if we know that a 
 document hasn't been modified for long:
 * increase the fetch interval immediately (not step by step)
 * because we expect the document to be changed within the adaptive interval 
 (but it hasn't), we shift the reference time, i.e. we expect a change soon.
 These two aims are somehow in contradiction. In any case, the next fetch time 
 should be always within the range of (currentFetchTime + MIN_INTERVAL) and 
 (currentFetchTime + MAX_INTERVAL) and never in the past.
 This problem has been noted by [~pascaldimassimo] in 
 [1|http://pascaldimassimo.com/2010/06/11/how-to-re-crawl-with-nutch/] and 
 [2|http://lucene.472066.n3.nabble.com/Adaptive-sync-with-the-time-of-page-change-td870842.html#a897234].
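The arithmetic above can be checked with a few inlined numbers. The sketch below reproduces the quoted AdaptiveFetchSchedule calculation and also shows the clamp the issue asks for (keeping the next fetch within MIN_INTERVAL..MAX_INTERVAL of the current fetch time); it is a demonstration of the bug, not the actual patch.

```java
// Worked numbers for the NUTCH-1564 example: delta = 30 days, rate = 0.3,
// so refTime moves 9 days back and adding MAX_INTERVAL (7 days) still lands
// the next fetch 2 days in the past.
public class SyncDeltaDemo {
  static final long DAY_MS = 86400L * 1000L;
  static final double SYNC_DELTA_RATE = 0.3;    // default sync_delta_rate
  static final long MIN_INTERVAL = 86400L;      // 1 day, in seconds
  static final long MAX_INTERVAL = 7 * 86400L;  // 7 days, in seconds

  /** Offset of the scheduled next fetch from fetchTime, in whole days. */
  static long nextFetchOffsetDays(long fetchTime, long modifiedTime) {
    long delta = (fetchTime - modifiedTime) / 1000L;  // seconds since last change
    long interval = Math.min(delta, MAX_INTERVAL);    // interval after max clamp
    long refTime = fetchTime - Math.round(delta * SYNC_DELTA_RATE * 1000);
    long nextFetch = refTime + Math.round(interval * 1000.0);
    return (nextFetch - fetchTime) / DAY_MS;
  }

  /** Same, but clamped into the range the issue proposes: always in the future. */
  static long clampedNextFetchOffsetDays(long fetchTime, long modifiedTime) {
    long nextFetch = fetchTime + nextFetchOffsetDays(fetchTime, modifiedTime) * DAY_MS;
    long lo = fetchTime + MIN_INTERVAL * 1000L;
    long hi = fetchTime + MAX_INTERVAL * 1000L;
    return (Math.min(Math.max(nextFetch, lo), hi) - fetchTime) / DAY_MS;
  }
}
```

With fetchTime 30 days after modifiedTime, nextFetchOffsetDays returns -2 (the refetch-in-the-past bug); the clamped variant returns 1, i.e. MIN_INTERVAL ahead.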



--
This message was sent by Atlassian JIRA
(v6.1#6144)


Lucene SOLR Revolution Dublin

2013-10-29 Thread Julien Nioche
Hi guys

Anyone going to http://www.lucenerevolution.org/ next week? I'll be giving
a talk on Nutch, ping me on twitter or email if you want to meet up and
have a chat.

Julien

-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble


[jira] [Updated] (NUTCH-1564) AdaptiveFetchSchedule: sync_delta forces immediate refetch for documents not modified

2013-10-29 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-1564:
---

Description: 
In a continuous crawl with adaptive fetch scheduling, documents not modified for 
a longer time may be fetched in every cycle.

A continuous crawl is run daily with 3 cycles and the following scheduling 
intervals (freshness matters):
{code}
db.fetch.schedule.class = org.apache.nutch.crawl.AdaptiveFetchSchedule
db.fetch.schedule.adaptive.sync_delta   = true (default)
db.fetch.schedule.adaptive.sync_delta_rate = 0.3 (default)
db.fetch.interval.default   = 172800 (2 days)
db.fetch.schedule.adaptive.min_interval =  86400 (1 day)
db.fetch.schedule.adaptive.max_interval = 604800 (7 days)
db.fetch.interval.max   = 604800 (7 days)
{code}

At Apr 18 a URL is generated and fetched (from segment dump):
{code}
Crawl Generate::
Status: 2 (db_fetched)
Fetch time: Mon Apr 15 19:43:22 CEST 2013
Modified time: Tue Mar 19 01:07:42 CET 2013
Retries since fetch: 0
Retry interval: 604800 seconds (7 days)

Crawl Fetch::
Status: 33 (fetch_success)
Fetch time: Thu Apr 18 01:23:51 CEST 2013
Modified time: Tue Mar 19 01:07:42 CET 2013
Retries since fetch: 0
Retry interval: 604800 seconds (7 days)
{code}

Running CrawlDb update results in a next fetch time in the past (which forces 
an immediate refetch in the next cycle):
{code}
Status: 6 (db_notmodified)
Fetch time: Tue Apr 16 01:37:00 CEST 2013
Modified time: Tue Mar 19 01:07:42 CET 2013
Retries since fetch: 0
Retry interval: 604800 seconds (7 days)
{code}

This behavior is caused by the sync_delta calculation in AdaptiveFetchSchedule:
{code}
  if (SYNC_DELTA) {
// try to synchronize with the time of change
long delta = (fetchTime - modifiedTime) / 1000L;
if (delta > interval) interval = delta;
refTime = fetchTime - Math.round(delta * SYNC_DELTA_RATE * 1000);
  }
  if (interval < MIN_INTERVAL) {
interval = MIN_INTERVAL;
  } else if (interval > MAX_INTERVAL) {
interval = MAX_INTERVAL;
  }
...
datum.setFetchTime(refTime + Math.round(interval * 1000.0));
{code}
{{delta}} is 30 days (Apr 18 - Mar 19). {{refTime}} is then 9 days in the past 
({{delta}} * 0.3). After adding {{interval}} (adjusted to {{MAX_INTERVAL}} = 7 
days) to {{refTime}} the next fetch should take place 2 days in the past (Apr 
16).

According to the 
[javadoc|http://nutch.apache.org/apidocs-2.1/org/apache/nutch/crawl/AdaptiveFetchSchedule.html]
 (if understood right), there are two aims of the sync_delta if we know that a 
document hasn't been modified for long:
* increase the fetch interval immediately (not step by step)
* because we expect the document to be changed within the adaptive interval 
(but it hasn't), we shift the reference time, i.e. we expect a change soon.

These two aims are somehow in contradiction. In any case, the next fetch time 
should be always within the range of (currentFetchTime + MIN_INTERVAL) and 
(currentFetchTime + MAX_INTERVAL) and never in the past.

This problem has been noted by [~pascaldimassimo] in 
[1|http://pascaldimassimo.com/2010/06/11/how-to-re-crawl-with-nutch/] and 
[2|http://lucene.472066.n3.nabble.com/Adaptive-sync-with-the-time-of-page-change-td870842.html#a897234].



[jira] [Updated] (NUTCH-1564) AdaptiveFetchSchedule: sync_delta forces immediate refetch for documents not modified

2013-10-29 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-1564:
---

Description: 
In a continuous crawl with adaptive fetch scheduling documents not modified for 
a longer time may be fetched in every cycle.

A continuous crawl is run daily with 3 cycles and the following scheduling 
intervals (freshness matters):
{code}
db.fetch.schedule.class = org.apache.nutch.crawl.AdaptiveFetchSchedule
db.fetch.schedule.adaptive.sync_delta   = true (default)
db.fetch.schedule.adaptive.sync_delta_rate = 0.3 (default)
db.fetch.interval.default   = 172800 (2 days)
db.fetch.schedule.adaptive.min_interval =  86400 (1 day)
db.fetch.schedule.adaptive.max_interval = 604800 (7 days)
db.fetch.interval.max   = 604800 (7 days)
{code}

At Apr 18 a URL is generated and fetched (from segment dump):
{code}
Crawl Generate::
Status: 2 (db_fetched)
Fetch time: Mon Apr 15 19:43:22 CEST 2013
Modified time: Tue Mar 19 01:07:42 CET 2013
Retries since fetch: 0
Retry interval: 604800 seconds (7 days)

Crawl Fetch::
Status: 33 (fetch_success)
Fetch time: Thu Apr 18 01:23:51 CEST 2013
Modified time: Tue Mar 19 01:07:42 CET 2013
Retries since fetch: 0
Retry interval: 604800 seconds (7 days)
{code}

Running CrawlDb update results in a next fetch time in the past (which forces 
an immediate refetch in the next cycle):
{code}
Status: 6 (db_notmodified)
Fetch time: Tue Apr 16 01:37:00 CEST 2013
Modified time: Tue Mar 19 01:07:42 CET 2013
Retries since fetch: 0
Retry interval: 604800 seconds (7 days)
{code}

This behavior is caused by the sync_delta calculation in AdaptiveFetchSchedule:
{code}
  if (SYNC_DELTA) {
// try to synchronize with the time of change
long delta = (fetchTime - modifiedTime) / 1000L;
if (delta > interval) interval = delta;
refTime = fetchTime - Math.round(delta * SYNC_DELTA_RATE * 1000);
  }
  if (interval < MIN_INTERVAL) {
interval = MIN_INTERVAL;
  } else if (interval > MAX_INTERVAL) {
interval = MAX_INTERVAL;
  }
...
datum.setFetchTime(refTime + Math.round(interval * 1000.0));
{code}
{{delta}} is 30 days (Apr 18 - Mar 19). {{refTime}} is then 9 days in the past 
({{delta}} * 0.3). After adding {{interval}} (adjusted to {{MAX_INTERVAL}} = 7 
days) to {{refTime}} the next fetch should take place 2 days in the past (Apr 
16).

According to the 
[javadoc|http://nutch.apache.org/apidocs-2.1/org/apache/nutch/crawl/AdaptiveFetchSchedule.html]
 (if understood right), there are two aims of the sync_delta if we know that a 
document hasn't been modified for long:
* increase the fetch interval immediately (not step by step)
* because we expect the document to be changed within the adaptive interval 
(but it hasn't), we shift the reference time, i.e. we expect a change soon.

These two aims are somehow in contradiction. In any case, the next fetch time 
should be always within the range of (currentFetchTime + MIN_INTERVAL) and 
(currentFetchTime + MAX_INTERVAL) and never in the past.

This problem has been noted by [~pascaldimassimo] in 
[1|http://pascaldimassimo.com/2010/06/11/how-to-re-crawl-with-nutch/] and 
[2|http://lucene.472066.n3.nabble.com/Adaptive-sync-with-the-time-of-page-change-td870842.html#a897234].



RE: Lucene SOLR Revolution Dublin

2013-10-29 Thread Markus Jelsma
Won't be there as usual, but I'll be sure to check out your talk if it's 
available! :)

cheers mate

-Original message-
From: Julien Nioche <lists.digitalpeb...@gmail.com>
Sent: Tuesday 29th October 2013 16:27
To: u...@nutch.apache.org; dev@nutch.apache.org
Subject: Lucene SOLR Revolution Dublin

Hi guys

Anyone going to http://www.lucenerevolution.org/ next week? I'll be giving a 
talk on Nutch, ping me on twitter or email if you want to meet up and have a 
chat.

Julien

--

Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble