[jira] [Commented] (NUTCH-1948) Make the Selenium remote web driver specification, configuration and selection available via a Factory-type mechanism

2015-03-09 Thread Mo Omer (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14353191#comment-14353191 ] Mo Omer commented on NUTCH-1948: Yo Lewis, In addition to being able to configure the

[jira] [Commented] (NUTCH-1936) GSoC 2015 - Move Nutch to Hadoop 2.X

2015-03-09 Thread Ashwini Tokekar (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14353537#comment-14353537 ] Ashwini Tokekar commented on NUTCH-1936: Hi Lewis, I am interested in this

[Nutch Wiki] Update of CommonCrawlDataDumper by GiuseppeTotaro

2015-03-09 Thread Apache Wiki
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The CommonCrawlDataDumper page has been changed by GiuseppeTotaro: https://wiki.apache.org/nutch/CommonCrawlDataDumper New page: The CommonCrawlDataDumper is a Nutch tool able to dump out

Handling servers with wrong Last Modified HTTP header

2015-03-09 Thread Jorge Luis Betancourt González
Recently, in the search app we are working on, we've encountered a lot of websites that have a wrong and invalid date in the Last Modified HTTP header. For instance, an article posted on a news site back in 2010 has a Last Modified header of just a few days back; this could be for any