[jira] [Updated] (NUTCH-2215) Generator to restrict crawl to mime type

2016-02-11 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2215: - Attachment: NUTCH-2215.patch Patch for trunk. Unit test passes! > Generator to restrict cr

[jira] [Created] (NUTCH-2215) Generator to restrict crawl to mime type

2016-02-11 Thread Markus Jelsma (JIRA)
Markus Jelsma created NUTCH-2215: Summary: Generator to restrict crawl to mime type Key: NUTCH-2215 URL: https://issues.apache.org/jira/browse/NUTCH-2215 Project: Nutch Issue Type

[jira] [Created] (NUTCH-2214) Index clean to be flexible on what it deletes

2016-02-10 Thread Markus Jelsma (JIRA)
Markus Jelsma created NUTCH-2214: Summary: Index clean to be flexible on what it deletes Key: NUTCH-2214 URL: https://issues.apache.org/jira/browse/NUTCH-2214 Project: Nutch Issue Type

[jira] [Resolved] (NUTCH-2197) Add solr5 solrcloud indexer support

2016-02-03 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma resolved NUTCH-2197. -- Resolution: Fixed Committed to trunk in revision 1728313. Thanks Jurian Broertjes! >

[jira] [Updated] (NUTCH-2197) Add solr5 solrcloud indexer support

2016-02-03 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2197: - Fix Version/s: 1.12 > Add solr5 solrcloud indexer supp

[jira] [Updated] (NUTCH-2197) Add solr5 solrcloud indexer support

2016-02-03 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2197: - Affects Version/s: (was: 1.12) 1.11 > Add solr5 solrcloud inde

[jira] [Updated] (NUTCH-2197) Add solr5 solrcloud indexer support

2016-02-03 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2197: - Attachment: NUTCH-2197.patch Previous patch was missing a proper version in plugin.xml

[jira] [Created] (NUTCH-2212) Decrease memory consumption by tuning stack size

2016-02-03 Thread Markus Jelsma (JIRA)
Markus Jelsma created NUTCH-2212: Summary: Decrease memory consumption by tuning stack size Key: NUTCH-2212 URL: https://issues.apache.org/jira/browse/NUTCH-2212 Project: Nutch Issue Type

[jira] [Closed] (NUTCH-2211) Filter and normalizer checkers missing in bin/nutch

2016-02-03 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma closed NUTCH-2211. > Filter and normalizer checkers missing in bin/nu

[jira] [Created] (NUTCH-2211) Filter and normalizer checkers missing in bin/nutch

2016-02-03 Thread Markus Jelsma (JIRA)
Markus Jelsma created NUTCH-2211: Summary: Filter and normalizer checkers missing in bin/nutch Key: NUTCH-2211 URL: https://issues.apache.org/jira/browse/NUTCH-2211 Project: Nutch Issue Type

[jira] [Resolved] (NUTCH-2211) Filter and normalizer checkers missing in bin/nutch

2016-02-03 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma resolved NUTCH-2211. -- Resolution: Fixed Committed to trunk in revision 1728339. > Filter and normalizer check

[jira] [Updated] (NUTCH-2211) Filter and normalizer checkers missing in bin/nutch

2016-02-03 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2211: - Attachment: NUTCH-2211.patch Patch for trunk. > Filter and normalizer checkers missing in

[jira] [Updated] (NUTCH-2197) Add solr5 solrcloud indexer support

2016-02-02 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2197: - Attachment: NUTCH-2197.patch Here's the updated patch with Solr 5.4.1 > Add solr5 solrcl

[jira] [Created] (NUTCH-2210) Upgrade to Tika 1.12

2016-02-02 Thread Markus Jelsma (JIRA)
Markus Jelsma created NUTCH-2210: Summary: Upgrade to Tika 1.12 Key: NUTCH-2210 URL: https://issues.apache.org/jira/browse/NUTCH-2210 Project: Nutch Issue Type: Task Reporter

[jira] [Commented] (NUTCH-2197) Add solr5 solrcloud indexer support

2016-02-02 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15128366#comment-15128366 ] Markus Jelsma commented on NUTCH-2197: -- I am going to commit this soon unless objections. >

[jira] [Commented] (NUTCH-961) Expose Tika's boilerpipe support

2016-01-26 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15117024#comment-15117024 ] Markus Jelsma commented on NUTCH-961: - Yes! :) > Expose Tika's boilerpipe supp

[jira] [Commented] (NUTCH-961) Expose Tika's boilerpipe support

2016-01-26 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15116975#comment-15116975 ] Markus Jelsma commented on NUTCH-961: - With boilerpipe, you get only a very few outlinks, those found

[jira] [Updated] (NUTCH-1465) Support sitemaps in Nutch

2016-01-26 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1465: - Fix Version/s: 1.13 > Support sitemaps in Nu

[jira] [Commented] (NUTCH-961) Expose Tika's boilerpipe support

2016-01-25 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15114989#comment-15114989 ] Markus Jelsma commented on NUTCH-961: - That is probably due to the patch parsing twice. Once with BP

[jira] [Commented] (NUTCH-2205) Nutch solrdedup error in solrcloud for larger docs

2016-01-25 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15114991#comment-15114991 ] Markus Jelsma commented on NUTCH-2205: -- This looks like your cluster was down, not a Nutch error

[jira] [Updated] (NUTCH-1325) HostDB for Nutch

2016-01-21 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1325: - Patch Info: Patch Available Description: h1. HostDB for Apache Nutch 1.x * automatically

[jira] [Updated] (NUTCH-1325) HostDB for Nutch

2016-01-21 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1325: - Attachment: NUTCH-1325.patch Updated patch for trunk contains more thorough config descriptions

[jira] [Updated] (NUTCH-1325) HostDB for Nutch

2016-01-21 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1325: - Attachment: NUTCH-1325.patch Updated patch to use TDigest for streaming percentiles. But because

[jira] [Updated] (NUTCH-1325) HostDB for Nutch

2016-01-21 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1325: - Attachment: NUTCH-1325.patch TDigest is awesome! Here's with support for user configurable list

[jira] [Updated] (NUTCH-1325) HostDB for Nutch

2016-01-21 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1325: - Fix Version/s: 1.12 > HostDB for Nutch > > > Key

[jira] [Resolved] (NUTCH-1325) HostDB for Nutch

2016-01-21 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma resolved NUTCH-1325. -- Resolution: Fixed Committed to trunk in revision 1725952. Many thanks to all contributors

[jira] [Updated] (NUTCH-1325) HostDB for Nutch

2016-01-21 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1325: - Component/s: hostdb > HostDB for Nutch > > > Key

[jira] [Commented] (NUTCH-1233) Rely on Tika for outlink extraction

2016-01-21 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15110375#comment-15110375 ] Markus Jelsma commented on NUTCH-1233: -- Yes, we'll get this support with Tika 1.12. Timothy Allison

[jira] [Commented] (NUTCH-961) Expose Tika's boilerpipe support

2016-01-21 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15110373#comment-15110373 ] Markus Jelsma commented on NUTCH-961: - Hello - that doesn't seem related to this issue as it doesn't

[jira] [Updated] (NUTCH-2201) Remove loops program from webgraph package

2016-01-21 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2201: - Attachment: NUTCH-2201.patch Patch for trunk which removed the loops program and all references

[jira] [Updated] (NUTCH-2201) Remove loops program from webgraph package

2016-01-21 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2201: - Patch Info: Patch Available > Remove loops program from webgraph pack

[jira] [Commented] (NUTCH-2197) Add solr5 solrcloud indexer support

2016-01-21 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15110797#comment-15110797 ] Markus Jelsma commented on NUTCH-2197: -- This Solr 5 plugin is capable of indexing to Solr 5 in cloud

[jira] [Resolved] (NUTCH-2201) Remove loops program from webgraph package

2016-01-21 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma resolved NUTCH-2201. -- Resolution: Fixed Committed to trunk revision 1725981. Thanks Dennis! > Remove loops prog

[jira] [Commented] (NUTCH-2202) Integration of Anthelion (Focused Crawling Module) into Nutch

2016-01-21 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15110947#comment-15110947 ] Markus Jelsma commented on NUTCH-2202: -- Yes, a patch would be a good place to start. I've read

[jira] [Commented] (NUTCH-961) Expose Tika's boilerpipe support

2016-01-21 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15111292#comment-15111292 ] Markus Jelsma commented on NUTCH-961: - Some news, the upstream Tika issue has been committed

[jira] [Commented] (NUTCH-961) Expose Tika's boilerpipe support

2016-01-19 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15106570#comment-15106570 ] Markus Jelsma commented on NUTCH-961: - Yes but it requires NUTCH-1233. > Expose Tika's boilerp

[jira] [Updated] (NUTCH-1233) Rely on Tika for outlink extraction

2016-01-19 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1233: - Attachment: NUTCH-1233.patch Updated patch for trunk > Rely on Tika for outlink extract

[jira] [Updated] (NUTCH-1233) Rely on Tika for outlink extraction

2016-01-19 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1233: - Attachment: pre-1233.txt post-1233.txt Two lists of extracted URL's, before

[jira] [Updated] (NUTCH-1233) Rely on Tika for outlink extraction

2016-01-19 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1233: - Attachment: NUTCH-1233.patch Updated patch. Patch now contains the old link extraction commented

[jira] [Updated] (NUTCH-1233) Rely on Tika for outlink extraction

2016-01-19 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1233: - Attachment: pre-1233-2.txt post-1233-2.txt Here's another set to compare > R

[jira] [Comment Edited] (NUTCH-1233) Rely on Tika for outlink extraction

2016-01-19 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15106633#comment-15106633 ] Markus Jelsma edited comment on NUTCH-1233 at 1/19/16 11:57 AM: It seems

[jira] [Commented] (NUTCH-1233) Rely on Tika for outlink extraction

2016-01-19 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15106633#comment-15106633 ] Markus Jelsma commented on NUTCH-1233: -- It seems Tika's link extraction does not cover and elements

[jira] [Updated] (NUTCH-2201) Remove loops program from webgraph package

2016-01-19 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2201: - Affects Version/s: 1.11 > Remove loops program from webgraph pack

[jira] [Updated] (NUTCH-2201) Remove loops program from webgraph package

2016-01-19 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2201: - Fix Version/s: 1.12 > Remove loops program from webgraph pack

[jira] [Commented] (NUTCH-961) Expose Tika's boilerpipe support

2016-01-19 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15106783#comment-15106783 ] Markus Jelsma commented on NUTCH-961: - Update, i've updated NUTCH-1233 for current trunk as well

[jira] [Updated] (NUTCH-2203) Suffix URL filter can't handle trailing/leading whitespaces

2016-01-19 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2203: - Fix Version/s: 1.12 > Suffix URL filter can't handle trailing/leading whitespa

[jira] [Resolved] (NUTCH-2203) Suffix URL filter can't handle trailing/leading whitespaces

2016-01-19 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma resolved NUTCH-2203. -- Resolution: Fixed Committed to trunk in revision 1725538. Thanks Jurian Broertjes. > Suf

[jira] [Assigned] (NUTCH-2203) Suffix URL filter can't handle trailing/leading whitespaces

2016-01-19 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma reassigned NUTCH-2203: Assignee: Markus Jelsma > Suffix URL filter can't handle trailing/leading whitespa

[jira] [Assigned] (NUTCH-1325) HostDB for Nutch

2016-01-19 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma reassigned NUTCH-1325: Assignee: Markus Jelsma > HostDB for Nutch > > >

RE: Nutch/Solr communication problem

2016-01-18 Thread Markus Jelsma
ommand then please let me know. thanks On Mon, Jan 18, 2016 at 4:50 PM, Markus Jelsma <markus.jel...@openindex.io> wrote: Hi - This doesn't look like an HTTP basic authentication problem. Are you running Solr 5.x? Markus -Original message-

RE: Nutch/Solr communication problem

2016-01-18 Thread Markus Jelsma
Hi - can you post the log output? Markus -Original message- From: Zara Parst Sent: Monday 18th January 2016 2:06 To: dev@nutch.apache.org Subject: Nutch/Solr communication problem Hi everyone, I have situation here, I am using nutch 1.11 and solr 5.4 Solr is

RE: Nutch/Solr communication problem

2016-01-18 Thread Markus Jelsma
Job.java:145) at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:228) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70) at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:237) On Mon, Jan 18, 2016 at 4:15 PM, Markus Jelsma <markus.jel...@openindex.io <mailto:markus.jel...@op

[jira] [Closed] (NUTCH-1107) Log slow parse entries

2016-01-18 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma closed NUTCH-1107. Resolution: Won't Fix > Log slow parse entries > -- > >

[jira] [Created] (NUTCH-2201) Remove loops program from webgrapg package

2016-01-18 Thread Markus Jelsma (JIRA)
Markus Jelsma created NUTCH-2201: Summary: Remove loops program from webgrapg package Key: NUTCH-2201 URL: https://issues.apache.org/jira/browse/NUTCH-2201 Project: Nutch Issue Type: Task

[jira] [Resolved] (NUTCH-1838) Host and domain based regex and automaton filtering

2016-01-18 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma resolved NUTCH-1838. -- Resolution: Fixed > Host and domain based regex and automaton filter

[jira] [Assigned] (NUTCH-2197) Add solr5 solrcloud indexer support

2016-01-18 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma reassigned NUTCH-2197: Assignee: Markus Jelsma > Add solr5 solrcloud indexer supp

[jira] [Assigned] (NUTCH-2201) Remove loops program from webgraph package

2016-01-18 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma reassigned NUTCH-2201: Assignee: Markus Jelsma > Remove loops program from webgraph pack

RE: Nutch/Solr communication problem

2016-01-18 Thread Markus Jelsma
016 16:16 To: dev@nutch.apache.org Subject: Re: Nutch/Solr communication problem Mind to share that patch ? On Mon, Jan 18, 2016 at 8:28 PM, Markus Jelsma <markus.jel...@openindex.io> wrote: Yes i have used it, i made the damn patch myself yea

[jira] [Updated] (NUTCH-2201) Remove loops program from webgraph package

2016-01-18 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2201: - Summary: Remove loops program from webgraph package (was: Remove loops program from webgrapg

[jira] [Closed] (NUTCH-1149) DomainStats should process numeric CrawlDB metadata

2016-01-18 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma closed NUTCH-1149. Resolution: Won't Fix Will upload proper patch for NUTCH-1325 soon which already contains numeric

[jira] [Resolved] (NUTCH-2194) Run IndexingFilterChecker as simple Telnet server

2016-01-15 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma resolved NUTCH-2194. -- Resolution: Fixed Committed to trunk in revision 1724771. > Run IndexingFilterChec

[jira] [Updated] (NUTCH-2194) Run IndexingFilterChecker as simple Telnet server

2016-01-14 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2194: - Attachment: NUTCH-2194.patch Updated patch. Signature is now also added to CrawlDatum, in case

[jira] [Updated] (NUTCH-2195) IndexingFilterChecker to optionally follow N redirects

2016-01-13 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2195: - Priority: Trivial (was: Major) > IndexingFilterChecker to optionally follow N redire

[jira] [Updated] (NUTCH-2195) IndexingFilterChecker to optionally follow N redirects

2016-01-13 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2195: - Attachment: NUTCH-2195.patch Patch for trunk. -followRedirects now follow redirects a few times

[jira] [Updated] (NUTCH-2194) Run IndexingFilterChecker as simple Telnet server

2016-01-13 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2194: - Priority: Minor (was: Major) > Run IndexingFilterChecker as simple Telnet ser

[jira] [Updated] (NUTCH-2196) IndexingFilterChecker to optionally normalize

2016-01-13 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2196: - Priority: Trivial (was: Major) > IndexingFilterChecker to optionally normal

[jira] [Resolved] (NUTCH-2195) IndexingFilterChecker to optionally follow N redirects

2016-01-13 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma resolved NUTCH-2195. -- Resolution: Fixed Assignee: Markus Jelsma Committed to trunk in revision 1724409

[jira] [Updated] (NUTCH-2195) IndexingFilterChecker to optionally follow N redirects

2016-01-13 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2195: - Patch Info: Patch Available > IndexingFilterChecker to optionally follow N redire

[jira] [Commented] (NUTCH-2196) IndexingFilterChecker to optionally normalize

2016-01-13 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15096156#comment-15096156 ] Markus Jelsma commented on NUTCH-2196: -- Committed to trunk in revision 1724418

[jira] [Resolved] (NUTCH-2196) IndexingFilterChecker to optionally normalize

2016-01-13 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma resolved NUTCH-2196. -- Resolution: Fixed > IndexingFilterChecker to optionally normal

[jira] [Updated] (NUTCH-2196) IndexingFilterChecker to optionally normalize

2016-01-13 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2196: - Assignee: Markus Jelsma Patch Info: Patch Available > IndexingFilterChecker to optiona

[jira] [Updated] (NUTCH-2196) IndexingFilterChecker to optionally normalize

2016-01-13 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2196: - Attachment: NUTCH-2196.patch Patch for trunk introducing the -normalize flag. If enabled, input

[jira] [Updated] (NUTCH-2194) Run IndexingFilterChecker as simple Telnet server

2016-01-13 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2194: - Patch Info: Patch Available > Run IndexingFilterChecker as simple Telnet ser

[jira] [Commented] (NUTCH-2194) Run IndexingFilterChecker as simple Telnet server

2016-01-13 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15096263#comment-15096263 ] Markus Jelsma commented on NUTCH-2194: -- Please check it out :) > Run IndexingFilterChecker as sim

[jira] [Updated] (NUTCH-2194) Run IndexingFilterChecker as simple Telnet server

2016-01-13 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2194: - Description: We have used a customized IndexingFilterChecker running as server to be able

[jira] [Updated] (NUTCH-2194) Run IndexingFilterChecker as simple Telnet server

2016-01-13 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2194: - Attachment: NUTCH-2194.patch Patch for trunk. With default settings this server needs just about

[jira] [Commented] (NUTCH-1712) Use MultipleInputs in Injector to make it a single mapreduce job

2016-01-12 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15093697#comment-15093697 ] Markus Jelsma commented on NUTCH-1712: -- Nice! > Use MultipleInputs in Injector to make it a sin

[jira] [Reopened] (NUTCH-2190) Protocol normalizer

2016-01-12 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma reopened NUTCH-2190: -- Need to add the example config file. > Protocol normali

[jira] [Resolved] (NUTCH-2190) Protocol normalizer

2016-01-12 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma resolved NUTCH-2190. -- Resolution: Fixed Committed revision 1724199. > Protocol normali

[jira] [Resolved] (NUTCH-2190) Protocol normalizer

2016-01-11 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma resolved NUTCH-2190. -- Resolution: Fixed Assignee: Markus Jelsma Committed revision 1724085. > Proto

[jira] [Updated] (NUTCH-2190) Protocol normalizer

2016-01-11 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2190: - Attachment: NUTCH-2190.patch Final patch including all entries for build.xml

[jira] [Resolved] (NUTCH-1449) Optionally delete documents skipped by IndexingFilters

2016-01-08 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma resolved NUTCH-1449. -- Resolution: Fixed Committed revision 1723688. > Optionally delete documents skip

[jira] [Updated] (NUTCH-2178) DeduplicationJob to optionally group on host or domain

2016-01-08 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2178: - Summary: DeduplicationJob to optionally group on host or domain (was: DeduplicationJob

[jira] [Resolved] (NUTCH-2178) DeduplicationJob to optionally group on host or domain

2016-01-08 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma resolved NUTCH-2178. -- Resolution: Fixed Committed to trunk in revision 1723690. > DeduplicationJob to optiona

[jira] [Comment Edited] (NUTCH-1449) Optionally delete documents skipped by IndexingFilters

2016-01-08 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15089073#comment-15089073 ] Markus Jelsma edited comment on NUTCH-1449 at 1/8/16 11:16 AM: --- Committed

[jira] [Commented] (NUTCH-2190) Protocol normalizer

2016-01-08 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15089081#comment-15089081 ] Markus Jelsma commented on NUTCH-2190: -- I'll also get this one in soon unless objections of course

[jira] [Commented] (NUTCH-2191) Add protocol-htmlunit

2016-01-08 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15089297#comment-15089297 ] Markus Jelsma commented on NUTCH-2191: -- Hi - i've 'read' that discussion a couple of weeks ago

[jira] [Commented] (NUTCH-1838) Host and domain based regex and automaton filtering

2016-01-08 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15089121#comment-15089121 ] Markus Jelsma commented on NUTCH-1838: -- Committed to trunk in revision 1723710. > Host and dom

[jira] [Commented] (NUTCH-1449) Optionally delete documents skipped by IndexingFilters

2016-01-05 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15083109#comment-15083109 ] Markus Jelsma commented on NUTCH-1449: -- We have it nicely running for some years. I will commit

[jira] [Updated] (NUTCH-1321) IDNNormalizer

2016-01-05 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1321: - Patch Info: Patch Available > IDNNormalizer > - > > Key

[jira] [Created] (NUTCH-2196) IndexingFilterChecker to optionally normalize

2016-01-05 Thread Markus Jelsma (JIRA)
Markus Jelsma created NUTCH-2196: Summary: IndexingFilterChecker to optionally normalize Key: NUTCH-2196 URL: https://issues.apache.org/jira/browse/NUTCH-2196 Project: Nutch Issue Type

[jira] [Commented] (NUTCH-2191) Add protocol-htmlunit

2016-01-05 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15083273#comment-15083273 ] Markus Jelsma commented on NUTCH-2191: -- Hey Chris! An Ajax pattern handler is new to me. Can you

[jira] [Created] (NUTCH-2195) IndexingFilterChecker to optionally follow N redirects

2016-01-05 Thread Markus Jelsma (JIRA)
Markus Jelsma created NUTCH-2195: Summary: IndexingFilterChecker to optionally follow N redirects Key: NUTCH-2195 URL: https://issues.apache.org/jira/browse/NUTCH-2195 Project: Nutch Issue

[jira] [Commented] (NUTCH-2184) Enable IndexingJob to function with no crawldb

2016-01-05 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15083092#comment-15083092 ] Markus Jelsma commented on NUTCH-2184: -- Hello Lewis! * it should be no problem. But since

[jira] [Updated] (NUTCH-2191) Add protocol-htmlunit

2016-01-05 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2191: - Patch Info: Patch Available > Add protocol-htmlu

[jira] [Commented] (NUTCH-1838) Host and domain based regex and automaton filtering

2016-01-05 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15083101#comment-15083101 ] Markus Jelsma commented on NUTCH-1838: -- If no objections, i'll get this one in soon > H

[jira] [Created] (NUTCH-2194) Run IndexingFilterChecker as simple Telnet server

2016-01-05 Thread Markus Jelsma (JIRA)
Markus Jelsma created NUTCH-2194: Summary: Run IndexingFilterChecker as simple Telnet server Key: NUTCH-2194 URL: https://issues.apache.org/jira/browse/NUTCH-2194 Project: Nutch Issue Type

[jira] [Commented] (NUTCH-2191) Add protocol-htmlunit

2016-01-05 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15083104#comment-15083104 ] Markus Jelsma commented on NUTCH-2191: -- Does anyone have an idea on how to force the plugin to use

[jira] [Commented] (NUTCH-1257) Support for the x-robots-tag HTTP Header

2016-01-05 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15083114#comment-15083114 ] Markus Jelsma commented on NUTCH-1257: -- Hmm, there is no patch but i remember having had this support

[jira] [Updated] (NUTCH-2178) DeduplicationJob to optionall group on host or domain

2016-01-05 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2178: - Patch Info: Patch Available > DeduplicationJob to optionall group on host or dom
