[jira] [Created] (NUTCH-2202) Integration of Anthelion (Focused Crawling Module) into Nutch

2016-01-19 Thread Robert Meusel (JIRA)
Robert Meusel created NUTCH-2202: Summary: Integration of Anthelion (Focused Crawling Module) into Nutch Key: NUTCH-2202 URL: https://issues.apache.org/jira/browse/NUTCH-2202 Project: Nutch

[jira] [Commented] (NUTCH-961) Expose Tika's boilerpipe support

2016-01-19 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15106570#comment-15106570 ] Markus Jelsma commented on NUTCH-961: - Yes but it requires NUTCH-1233. > Expose Tika's boilerpipe

[jira] [Updated] (NUTCH-1233) Rely on Tika for outlink extraction

2016-01-19 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1233: - Attachment: NUTCH-1233.patch Updated patch for trunk > Rely on Tika for outlink extraction >

[jira] [Updated] (NUTCH-1233) Rely on Tika for outlink extraction

2016-01-19 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1233: - Attachment: pre-1233.txt post-1233.txt Two lists of extracted URLs, before and

[jira] [Updated] (NUTCH-1233) Rely on Tika for outlink extraction

2016-01-19 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1233: - Attachment: NUTCH-1233.patch Updated patch. Patch now contains the old link extraction commented

[jira] [Updated] (NUTCH-1233) Rely on Tika for outlink extraction

2016-01-19 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1233: - Attachment: pre-1233-2.txt post-1233-2.txt Here's another set to compare > Rely

[jira] [Comment Edited] (NUTCH-1233) Rely on Tika for outlink extraction

2016-01-19 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15106633#comment-15106633 ] Markus Jelsma edited comment on NUTCH-1233 at 1/19/16 11:57 AM: It seems

[jira] [Commented] (NUTCH-1233) Rely on Tika for outlink extraction

2016-01-19 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15106633#comment-15106633 ] Markus Jelsma commented on NUTCH-1233: -- It seems Tika's link extraction does not cover and

[jira] [Commented] (NUTCH-2203) Suffix URL filter can't handle trailing/leading whitespaces

2016-01-19 Thread Hudson (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15107543#comment-15107543 ] Hudson commented on NUTCH-2203: --- SUCCESS: Integrated in Nutch-trunk #3338 (See

[jira] [Updated] (NUTCH-2201) Remove loops program from webgraph package

2016-01-19 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2201: - Affects Version/s: 1.11 > Remove loops program from webgraph package >

[jira] [Updated] (NUTCH-2201) Remove loops program from webgraph package

2016-01-19 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2201: - Fix Version/s: 1.12 > Remove loops program from webgraph package >

[jira] [Commented] (NUTCH-1325) HostDB for Nutch

2016-01-19 Thread Otis Gospodnetic (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15108023#comment-15108023 ] Otis Gospodnetic commented on NUTCH-1325: - Median is the same as 50th percentile, isn't it? What

[jira] [Commented] (NUTCH-1233) Rely on Tika for outlink extraction

2016-01-19 Thread Otis Gospodnetic (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15108025#comment-15108025 ] Otis Gospodnetic commented on NUTCH-1233: - My opinion: better to have this in Nutch (the issue is

[jira] [Commented] (NUTCH-2201) Remove loops program from webgraph package

2016-01-19 Thread Dennis Kubes (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15106667#comment-15106667 ] Dennis Kubes commented on NUTCH-2201: - +1 on this. The loops program, iirc, is a factorial

[jira] [Commented] (NUTCH-961) Expose Tika's boilerpipe support

2016-01-19 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15106783#comment-15106783 ] Markus Jelsma commented on NUTCH-961: - Update: I've updated NUTCH-1233 for current trunk as well as a

Re: [MASSMAIL]Re: Nutch/Solr communication problem

2016-01-19 Thread Roannel Fernández Hernández
Hi, I think that your problem is not related to Solr authentication. The fields of the documents you sent to Solr and the fields defined in the Solr schema are different. Perhaps the Nutch document has a multivalued field that is defined in the Solr schema as a single-valued field, or in the Solr schema there is a

[jira] [Updated] (NUTCH-2203) Suffix URL filter can't handle trailing/leading whitespaces

2016-01-19 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2203: - Fix Version/s: 1.12 > Suffix URL filter can't handle trailing/leading whitespaces >

[jira] [Resolved] (NUTCH-2203) Suffix URL filter can't handle trailing/leading whitespaces

2016-01-19 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma resolved NUTCH-2203. -- Resolution: Fixed Committed to trunk in revision 1725538. Thanks Jurian Broertjes. > Suffix

[jira] [Updated] (NUTCH-2203) Suffix URL filter can't handle trailing/leading whitespaces

2016-01-19 Thread Jurian Broertjes (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jurian Broertjes updated NUTCH-2203: Attachment: NUTCH-2203.patch Attached a patch to fix this. > Suffix URL filter can't

[jira] [Assigned] (NUTCH-2203) Suffix URL filter can't handle trailing/leading whitespaces

2016-01-19 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma reassigned NUTCH-2203: Assignee: Markus Jelsma > Suffix URL filter can't handle trailing/leading whitespaces >

[jira] [Created] (NUTCH-2203) Suffix URL filter can't handle trailing/leading whitespaces

2016-01-19 Thread Jurian Broertjes (JIRA)
Jurian Broertjes created NUTCH-2203: --- Summary: Suffix URL filter can't handle trailing/leading whitespaces Key: NUTCH-2203 URL: https://issues.apache.org/jira/browse/NUTCH-2203 Project: Nutch

[jira] [Assigned] (NUTCH-1325) HostDB for Nutch

2016-01-19 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma reassigned NUTCH-1325: Assignee: Markus Jelsma > HostDB for Nutch > > > Key: