Re: NutchGora release, and Nutch 1.x trunk release

2012-04-03 Thread Markus Jelsma
Nutch 1.5 now ships with Tika 1.1. Thanks Julien! How about preparing for 1.5 and moving all but blocker issues to 1.6? On Thu, 8 Mar 2012 07:32:56 -0800, Mattmann, Chris A (388J) chris.a.mattm...@jpl.nasa.gov wrote: Hey Guys, OK, sounds good. Looks like we need to wait for the Tika 1.1

Re: NutchGora release, and Nutch 1.x trunk release

2012-04-03 Thread Lewis John Mcgibbney
+1 Lewis On Tue, Apr 3, 2012 at 11:29 AM, Markus Jelsma markus.jel...@openindex.io wrote: Nutch 1.5 now ships with Tika 1.1. Thanks Julien! How about preparing for 1.5 and moving all but blocker issues to 1.6? On Thu, 8 Mar 2012 07:32:56 -0800, Mattmann, Chris A (388J)

Re: GSoC : Web page scraper plugin

2012-04-03 Thread Lewis John Mcgibbney
Hi Aamir, Please excuse me not getting back to you off-list, the message is in my drafts and I got distracted yesterday. At this stage if you intend on applying for the issue then I would advise you to get registered with GSoC, and begin writing up a publicly viewable draft submission. You have

Re: GSoC : Web page scraper plugin

2012-04-03 Thread Aamir Khan
On Tue, Apr 3, 2012 at 4:31 PM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: Hi Aamir, Please excuse me not getting back to you off-list, the message is in my drafts and I got distracted yesterday. No problem. At this stage if you intend on applying for the issue then I would

Re: GSoC : Web page scraper plugin

2012-04-03 Thread Lewis John Mcgibbney
Hi Aamir, On Tue, Apr 3, 2012 at 12:05 PM, Aamir Khan syst3m.w...@gmail.com wrote: Exactly, I will have full summer to understand and get up to speed. But since my knowledge is very limited my proposal won't be too good.. :) This doesn't need to be the case. In fact it is crucial that the

Re: NutchGora release, and Nutch 1.x trunk release

2012-04-03 Thread Julien Nioche
Good idea. On 3 April 2012 11:29, Markus Jelsma markus.jel...@openindex.io wrote: Nutch 1.5 now ships with Tika 1.1. Thanks Julien! How about preparing for 1.5 and moving all but blocker issues to 1.6? On Thu, 8 Mar 2012 07:32:56 -0800, Mattmann, Chris A (388J)

[jira] [Resolved] (NUTCH-1222) Upgrade to new Hadoop 0.22.0

2012-04-03 Thread Markus Jelsma (Resolved) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma resolved NUTCH-1222. -- Resolution: Won't Fix Fix Version/s: (was: 1.5) Assignee: (was: Markus

[jira] [Resolved] (NUTCH-1225) Migrate CrawlDBScanner to MapReduce API

2012-04-03 Thread Markus Jelsma (Resolved) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma resolved NUTCH-1225. -- Resolution: Won't Fix Fix Version/s: (was: 1.5) Assignee: (was: Markus

[jira] [Updated] (NUTCH-717) Make Nutch Solr integration easier

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-717: Fix Version/s: (was: 1.5) (was: nutchgora) 1.6

[jira] [Updated] (NUTCH-1245) URL gone with 404 after db.fetch.interval.max stays db_unfetched in CrawlDb and is generated over and over again

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1245: - Fix Version/s: (was: 1.5) 1.6 20120304-push-1.6 URL

[jira] [Updated] (NUTCH-1318) Parse time outs crash parsing fetcher

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1318: - Fix Version/s: (was: 1.5) 1.6 20120304-push-1.6 Parse

[jira] [Updated] (NUTCH-1219) Upgrade all jobs to new MapReduce API

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1219: - Fix Version/s: (was: 1.5) 1.6 20120304-push-1.6 Upgrade

[jira] [Updated] (NUTCH-1251) Deletion of duplicates fails with org.apache.solr.client.solrj.SolrServerException

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1251: - Fix Version/s: (was: 1.5) 1.6 20120304-push-1.6

[jira] [Updated] (NUTCH-578) URL fetched with 403 is generated over and over again

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-578: Fix Version/s: (was: 1.5) 1.6 URL fetched with 403 is generated over

Re: GSoC : Web page scraper plugin

2012-04-03 Thread Aamir Khan
On Tue, Apr 3, 2012 at 4:45 PM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: Hi Aamir, On Tue, Apr 3, 2012 at 12:05 PM, Aamir Khan syst3m.w...@gmail.com wrote: Exactly, I will have full summer to understand and get up to speed. But since my knowledge is very limited my proposal

[jira] [Updated] (NUTCH-1249) Resolve all issues flagged up by adding javac -Xlint arguement

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1249: - Affects Version/s: (was: 1.5) Fix Version/s: (was: 1.5)

[jira] [Updated] (NUTCH-1273) Fix [deprecation] javac warnings

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1273: - Fix Version/s: (was: 1.5) (was: nutchgora) 1.6

[jira] [Updated] (NUTCH-1113) Merging segments causes URLs to vanish from crawldb/index?

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1113: - Fix Version/s: (was: 1.5) 1.6 20120304-push-1.6 Merging

[jira] [Updated] (NUTCH-1116) Write JUnit tests for all plugins

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1116: - Fix Version/s: (was: 1.5) 1.6 20120304-push-1.6 Write

[jira] [Updated] (NUTCH-1084) ReadDB url throws exception

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1084: - Fix Version/s: (was: 1.5) 1.6 20120304-push-1.6 ReadDB

[jira] [Updated] (NUTCH-1150) http.redirect.max can lead to multiple parses of the same url

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1150: - Fix Version/s: (was: 1.5) 1.6 20120304-push-1.6

[jira] [Updated] (NUTCH-1147) WebGraph nodeDumper uses only 1 reducer

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1147: - Fix Version/s: (was: 1.5) 1.6 20120304-push-1.6

[jira] [Updated] (NUTCH-1194) CrawlDB lock should be released earlier

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1194: - Fix Version/s: (was: 1.5) 1.6 20120304-push-1.6 CrawlDB

[jira] [Updated] (NUTCH-1201) Allow for different FetcherThread impls

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1201: - Fix Version/s: (was: 1.5) 1.6 20120304-push-1.6 Allow

[jira] [Updated] (NUTCH-1183) Summary task for adding command line usage instructions to webgraph classes

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1183: - Fix Version/s: (was: 1.5) (was: nutchgora) 1.6

[jira] [Updated] (NUTCH-1176) Fix all javadoc warnings from nightly builds

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1176: - Fix Version/s: (was: 1.5) (was: nutchgora) 1.6

[jira] [Updated] (NUTCH-1040) Backport REST-API from 2.0

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1040: - Fix Version/s: (was: 1.5) 1.6 20120304-push-1.6

[jira] [Updated] (NUTCH-1274) Fix [cast] javac warnings

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1274: - Fix Version/s: (was: 1.5) (was: nutchgora) 1.6

[jira] [Updated] (NUTCH-1233) Rely on Tika for outlink extraction

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1233: - Fix Version/s: (was: 1.5) 1.6 20120304-push-1.6 Rely on

[jira] [Updated] (NUTCH-1014) Migrate from Apache ORO to java.util.regex

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1014: - Fix Version/s: (was: 1.5) 1.6 20120304-push-1.6 Migrate

[jira] [Updated] (NUTCH-1063) OutlinkExtractor test generates an exception but does not fail

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1063: - Fix Version/s: (was: 1.5) 1.6 20120304-push-1.6

[jira] [Updated] (NUTCH-1220) Upgrade Solr deps

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1220: - Fix Version/s: (was: 1.5) 1.6 20120304-push-1.6 Upgrade

[jira] [Updated] (NUTCH-1123) JUnit test for scoring-link

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1123: - Fix Version/s: (was: 1.5) 1.6 20120304-push-1.6 JUnit

[jira] [Updated] (NUTCH-865) Format source code in unique style

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-865: Fix Version/s: (was: 1.5) 1.6 20120304-push-1.6 Format

[jira] [Updated] (NUTCH-1120) JUnit test for microformats-reltag

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1120: - Fix Version/s: (was: 1.5) 1.6 20120304-push-1.6 JUnit

[jira] [Updated] (NUTCH-1186) FreeGenerator always normalizes

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1186: - Fix Version/s: (was: 1.5) 1.6 20120304-push-1.6

[jira] [Updated] (NUTCH-1308) Unnecessary truncate content configuration, and logging in parse-zip/ZipParser

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1308: - Fix Version/s: (was: 1.5) (was: nutchgora) 1.6

[jira] [Updated] (NUTCH-1252) SegmentReader -get shows wrong data

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1252: - Fix Version/s: (was: 1.5) 1.6 20120304-push-1.6

[jira] [Updated] (NUTCH-1121) JUnit test for parse-js

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1121: - Fix Version/s: (was: 1.5) 1.6 20120304-push-1.6 JUnit

[jira] [Updated] (NUTCH-809) Parse-metatags plugin

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-809: Fix Version/s: (was: 1.5) 1.6 20120304-push-1.6

[jira] [Updated] (NUTCH-1046) Add tests for indexing to SOLR

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1046: - Fix Version/s: (was: 1.5) 1.6 20120304-push-1.6 Add

[jira] [Updated] (NUTCH-1228) Change mapred.task.timeout to mapreduce.task.timeout in fetcher

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1228: - Fix Version/s: (was: 1.5) 1.6 20120304-push-1.6 Change

[jira] [Updated] (NUTCH-1001) bin/nutch fetch/parse handle crawl/segments directory

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1001: - Fix Version/s: (was: 1.5) 1.6 20120304-push-1.6

[jira] [Updated] (NUTCH-1060) URL filters to produce regexes to be used by OutlinkExtractor.

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1060: - Fix Version/s: (was: 1.5) 1.6 20120304-push-1.6 URL

[jira] [Updated] (NUTCH-1100) SolrDedup broken

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1100: - Fix Version/s: (was: 1.5) 1.6 20120304-push-1.6

[jira] [Updated] (NUTCH-1124) JUnit test for scoring-opic

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1124: - Fix Version/s: (was: 1.5) 1.6 20120304-push-1.6 JUnit

[jira] [Updated] (NUTCH-1197) Add statically configured field values to solrindex-mapping.xml

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1197: - Fix Version/s: (was: 1.5) 1.6 20120304-push-1.6 Add

[jira] [Updated] (NUTCH-1122) JUnit test for protocol-ftp

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1122: - Fix Version/s: (was: 1.5) 1.6 20120304-push-1.6 JUnit

[jira] [Updated] (NUTCH-1127) JUnit test for urlfilter-validator

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1127: - Fix Version/s: (was: 1.5) 1.6 20120304-push-1.6 JUnit

[jira] [Updated] (NUTCH-1247) CrawlDatum.retries should be int

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1247: - Fix Version/s: (was: 1.5) 1.6 20120304-push-1.6

[jira] [Updated] (NUTCH-208) http: proxy exception list:

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-208: Fix Version/s: (was: 1.5) (was: nutchgora) 1.6

[jira] [Updated] (NUTCH-1031) Delegate parsing of robots.txt to crawler-commons

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1031: - Fix Version/s: (was: 1.5) 1.6 20120304-push-1.6

[jira] [Updated] (NUTCH-1107) Log slow parse entries

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1107: - Fix Version/s: (was: 1.5) 1.6 20120304-push-1.6 Log

[jira] [Updated] (NUTCH-585) [PARSE-HTML plugin] Block certain parts of HTML code from being indexed

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-585: Fix Version/s: (was: 1.5) 1.6 20120304-push-1.6

[jira] [Updated] (NUTCH-1320) IndexChecker and ParseChecker choke on IDN's

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1320: - Fix Version/s: (was: 1.5) 1.6 20120304-push-1.6

[jira] [Updated] (NUTCH-1126) JUnit test for urlfilter-prefix

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1126: - Fix Version/s: (was: 1.5) 1.6 20120304-push-1.6 JUnit

[jira] [Updated] (NUTCH-1087) Deprecate crawl command and replace with example script

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1087: - Fix Version/s: (was: 1.5) 1.6 20120304-push-1.6

[jira] [Updated] (NUTCH-1128) JUnit test for urlmeta

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1128: - Fix Version/s: (was: 1.5) 1.6 20120304-push-1.6 JUnit

[jira] [Updated] (NUTCH-1034) Create Solr Velocity templates

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1034: - Fix Version/s: (was: 1.5) (was: nutchgora) 1.6

[jira] [Updated] (NUTCH-1179) Option to restrict generated records by metadata

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1179: - Fix Version/s: (was: 1.5) 1.6 20120304-push-1.6 Option

[jira] [Updated] (NUTCH-1300) Indexer to normalize URL's

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1300: - Fix Version/s: (was: 1.5) 1.6 20120304-push-1.6 Indexer

[jira] [Updated] (NUTCH-1130) JUnit test for Any23 RDF plugin

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1130: - Fix Version/s: (was: 1.5) 1.6 20120304-push-1.6 JUnit

[jira] [Updated] (NUTCH-1047) Pluggable indexing backends

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1047: - Fix Version/s: (was: 1.5) 1.6 20120304-push-1.6

[jira] [Updated] (NUTCH-1035) Tune Solr config for Nutch users

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1035: - Fix Version/s: (was: 1.5) (was: nutchgora) 1.6

[jira] [Updated] (NUTCH-1226) Migrate CrawlDbReader to MapReduce API

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1226: - Fix Version/s: (was: 1.5) 1.6 20120304-push-1.6 Migrate

[jira] [Updated] (NUTCH-1223) Migrate WebGraph to MapReduce API

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1223: - Fix Version/s: (was: 1.5) 1.6 20120304-push-1.6 Migrate

[jira] [Updated] (NUTCH-1202) Fetcher timebomb kills long waiting fetch jobs

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1202: - Fix Version/s: (was: 1.5) 1.6 20120304-push-1.6 Fetcher

[jira] [Updated] (NUTCH-1079) StringBuffer converted to StringBuilder

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1079: - Fix Version/s: (was: 1.5) 1.6 20120304-push-1.6

[jira] [Updated] (NUTCH-1319) HostNormalizer

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1319: - Fix Version/s: (was: 1.5) 1.6 20120304-push-1.6

[jira] [Updated] (NUTCH-1140) index-more plugin, resetTitle method creates multiple values in the Title field

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1140: - Fix Version/s: (was: 1.5) 1.6 20120304-push-1.6

[jira] [Updated] (NUTCH-1021) Migrate OutlinkExtractor from Apache ORO to java.util.regex

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1021: - Fix Version/s: (was: 1.5) 1.6 20120304-push-1.6 Migrate

[jira] [Updated] (NUTCH-1039) Fetcher fails for pages without content-length header

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1039: - Fix Version/s: (was: 1.5) 1.6 20120304-push-1.6 Fetcher

[jira] [Updated] (NUTCH-1275) Fix [unchecked] javac warnings

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1275: - Fix Version/s: (was: 1.5) (was: nutchgora) 1.6

[jira] [Updated] (NUTCH-1151) Index-anchor to add numInlinks count

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1151: - Fix Version/s: (was: 1.5) 1.6 20120304-push-1.6

[jira] [Updated] (NUTCH-1053) Parsing of RSS feeds fails

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1053: - Fix Version/s: (was: 1.5) 1.6 20120304-push-1.6 Parsing

[jira] [Updated] (NUTCH-961) Expose Tika's boilerpipe support

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-961: Fix Version/s: (was: 1.5) 1.6 20120304-push-1.6 Expose

[jira] [Updated] (NUTCH-1118) JUnit test for index-basic

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1118: - Fix Version/s: (was: 1.5) 1.6 20120304-push-1.6 JUnit

[jira] [Updated] (NUTCH-1284) Add site fetcher.max.crawl.delay as log output by default.

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1284: - Fix Version/s: (was: 1.5) (was: nutchgora) 1.6

[jira] [Updated] (NUTCH-1149) DomainStats should process numeric CrawlDB metadata

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1149: - Fix Version/s: (was: 1.5) 1.6 20120304-push-1.6

[jira] [Updated] (NUTCH-1181) Indexer to use webgraph inlinks

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1181: - Fix Version/s: (was: 1.5) 1.6 20120304-push-1.6 Indexer

[jira] [Updated] (NUTCH-1117) JUnit test for index-anchor

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1117: - Fix Version/s: (was: 1.5) 1.6 20120304-push-1.6 JUnit

[jira] [Updated] (NUTCH-1024) Dynamically set fetchInterval by MIME-type

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1024: - Fix Version/s: (was: 1.5) 1.6 20120304-push-1.6

[jira] [Updated] (NUTCH-1317) Max content length by MIME-type

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1317: - Fix Version/s: (was: 1.5) 1.6 20120304-push-1.6 Max

[jira] [Updated] (NUTCH-1277) Fix [fallthrough] javac warnings

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1277: - Fix Version/s: (was: 1.5) 1.6 20120304-push-1.6 Fix

[jira] [Updated] (NUTCH-1215) UpdateDB should not require segment as input

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1215: - Fix Version/s: (was: 1.5) 1.6 20120304-push-1.6

[jira] [Updated] (NUTCH-1103) Port protocol-sftp to 1.4

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1103: - Fix Version/s: (was: 1.5) 1.6 20120304-push-1.6 Port

[jira] [Updated] (NUTCH-1088) Write Solr XML documents

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1088: - Fix Version/s: (was: 1.5) 1.6 20120304-push-1.6 Write

[jira] [Updated] (NUTCH-828) Fetch Filter

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-828: Fix Version/s: (was: 1.5) (was: nutchgora) 1.6

Re: NutchGora release, and Nutch 1.x trunk release

2012-04-03 Thread Markus Jelsma
Remaining issue for 1.5: NUTCH-1208 Don't include KEYS file in bin distribution. I obviously couldn't suppress e-mail notifications. My sincere apologies for the deluge of e-mail! On Tuesday 03 April 2012 13:22:17 Julien Nioche wrote: Good idea. On 3 April 2012 11:29, Markus Jelsma

[jira] [Commented] (NUTCH-1270) some of Deflate encoded pages not fetched

2012-04-03 Thread behnam nikbakht (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13245259#comment-13245259 ] behnam nikbakht commented on NUTCH-1270: for example, with the site:

Re: NutchGora release, and Nutch 1.x trunk release

2012-04-03 Thread Markus Jelsma
Cool! Next time I'll ask infra to allow me to suppress notifications. Chris, will you RM one RC? And if possible, list the detailed steps/commands in the process in case you don't have time to RM 1.6 when the time comes. The wiki is dated. I'm looking forward to yet another big release with lots

[jira] [Commented] (NUTCH-1208) Don't include KEYS file in bin distribution

2012-04-03 Thread Hudson (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13245295#comment-13245295 ] Hudson commented on NUTCH-1208: --- Integrated in nutch-trunk-maven #224 (See

Re: NutchGora release, and Nutch 1.x trunk release

2012-04-03 Thread Mattmann, Chris A (388J)
Hi Markus, On Apr 3, 2012, at 5:50 AM, Markus Jelsma wrote: Cool! Next time I'll ask infra to allow me to suppress notifications. Chris, will you RM one RC? And if possible, list the detailed steps/commands in the process in case you don't have time to RM 1.6 when the time comes. The wiki

Re: NutchGora release, and Nutch 1.x trunk release

2012-04-03 Thread Markus Jelsma
On Tuesday 03 April 2012 15:58:54 you wrote: Hi Markus, On Apr 3, 2012, at 5:50 AM, Markus Jelsma wrote: Cool! Next time I'll ask infra to allow me to suppress notifications. Chris, will you RM one RC? And if possible, list the detailed steps/commands in the process in case you don't

[Nutch Wiki] Trivial Update of FrontPage by LewisJohnMcgibbney

2012-04-03 Thread Apache Wiki
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The FrontPage page has been changed by LewisJohnMcgibbney: http://wiki.apache.org/nutch/FrontPage?action=diff&rev1=238&rev2=239 === Tutorials === * NutchTutorial - How to configure

[Nutch Wiki] Trivial Update of Release_HOWTO by LewisJohnMcgibbney

2012-04-03 Thread Apache Wiki
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The Release_HOWTO page has been changed by LewisJohnMcgibbney: http://wiki.apache.org/nutch/Release_HOWTO?action=diff&rev1=12&rev2=13 = Preparation = 1. Create a new release in

Re: NutchGora release, and Nutch 1.x trunk release

2012-04-03 Thread Lewis John Mcgibbney
Hi, On Tue, Apr 3, 2012 at 3:12 PM, Markus Jelsma markus.jel...@openindex.io wrote: Seems fine. Only updating KEYS is no longer necessary. Now sorted. Thanks whenever you can get round to this Chris. Best Lewis

Re: NutchGora release, and Nutch 1.x trunk release

2012-04-03 Thread Mattmann, Chris A (388J)
Thanks Lewis! Cheers, Chris P.S. Hopefully by this weekend... On Apr 3, 2012, at 7:23 AM, Lewis John Mcgibbney wrote: Hi, On Tue, Apr 3, 2012 at 3:12 PM, Markus Jelsma markus.jel...@openindex.io wrote: Seems fine. Only updating KEYS is no longer necessary. Now sorted. Thanks

[jira] [Commented] (NUTCH-1306) Commit after finished writing to solr index

2012-04-03 Thread Lewis John McGibbney (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13245798#comment-13245798 ] Lewis John McGibbney commented on NUTCH-1306: - Hi Dan. In trunk, we have a

[jira] [Commented] (NUTCH-1251) Deletion of duplicates fails with org.apache.solr.client.solrj.SolrServerException

2012-04-03 Thread Arkadi Kosmynin (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13245877#comment-13245877 ] Arkadi Kosmynin commented on NUTCH-1251: Thanks Markus!
