Robert Meusel created NUTCH-2202:
Summary: Integration of Anthelion (Focused Crawling Module) into
Nutch
Key: NUTCH-2202
URL: https://issues.apache.org/jira/browse/NUTCH-2202
Project: Nutch
[
https://issues.apache.org/jira/browse/NUTCH-961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15106570#comment-15106570
]
Markus Jelsma commented on NUTCH-961:
-
Yes, but it requires NUTCH-1233.
> Expose Tika's boilerpipe
[
https://issues.apache.org/jira/browse/NUTCH-1233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Markus Jelsma updated NUTCH-1233:
-
Attachment: NUTCH-1233.patch
Updated patch for trunk
> Rely on Tika for outlink extraction
>
[
https://issues.apache.org/jira/browse/NUTCH-1233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Markus Jelsma updated NUTCH-1233:
-
Attachment: pre-1233.txt
post-1233.txt
Two lists of extracted URLs, before and
[
https://issues.apache.org/jira/browse/NUTCH-1233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Markus Jelsma updated NUTCH-1233:
-
Attachment: NUTCH-1233.patch
Updated patch. Patch now contains the old link extraction commented
[
https://issues.apache.org/jira/browse/NUTCH-1233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Markus Jelsma updated NUTCH-1233:
-
Attachment: pre-1233-2.txt
post-1233-2.txt
Here's another set to compare
> Rely
[
https://issues.apache.org/jira/browse/NUTCH-1233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15106633#comment-15106633
]
Markus Jelsma edited comment on NUTCH-1233 at 1/19/16 11:57 AM:
It seems
[
https://issues.apache.org/jira/browse/NUTCH-1233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15106633#comment-15106633
]
Markus Jelsma commented on NUTCH-1233:
--
It seems Tika's link extraction does not cover and
[
https://issues.apache.org/jira/browse/NUTCH-2203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15107543#comment-15107543
]
Hudson commented on NUTCH-2203:
---
SUCCESS: Integrated in Nutch-trunk #3338 (See
[
https://issues.apache.org/jira/browse/NUTCH-2201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Markus Jelsma updated NUTCH-2201:
-
Affects Version/s: 1.11
> Remove loops program from webgraph package
>
[
https://issues.apache.org/jira/browse/NUTCH-2201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Markus Jelsma updated NUTCH-2201:
-
Fix Version/s: 1.12
> Remove loops program from webgraph package
>
[
https://issues.apache.org/jira/browse/NUTCH-1325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15108023#comment-15108023
]
Otis Gospodnetic commented on NUTCH-1325:
-
Median is the same as 50th percentile, isn't it? What
[
https://issues.apache.org/jira/browse/NUTCH-1233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15108025#comment-15108025
]
Otis Gospodnetic commented on NUTCH-1233:
-
My opinion: better to have this in Nutch (the issue is
[
https://issues.apache.org/jira/browse/NUTCH-2201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15106667#comment-15106667
]
Dennis Kubes commented on NUTCH-2201:
-
+1 on this.
The loops program, iirc, is a factorial
[
https://issues.apache.org/jira/browse/NUTCH-961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15106783#comment-15106783
]
Markus Jelsma commented on NUTCH-961:
-
Update: I've updated NUTCH-1233 for current trunk as well as a
Hi
I think that your problem is not related to Solr authentication. The fields
of the documents you send to Solr and the fields defined in the Solr schema are
different. Perhaps the Nutch document has a multivalued field that is defined in
the Solr schema as a single-valued field, or in the Solr schema there is a
[
https://issues.apache.org/jira/browse/NUTCH-2203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Markus Jelsma updated NUTCH-2203:
-
Fix Version/s: 1.12
> Suffix URL filter can't handle trailing/leading whitespaces
>
[
https://issues.apache.org/jira/browse/NUTCH-2203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Markus Jelsma resolved NUTCH-2203.
--
Resolution: Fixed
Committed to trunk in revision 1725538. Thanks Jurian Broertjes.
> Suffix
[
https://issues.apache.org/jira/browse/NUTCH-2203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jurian Broertjes updated NUTCH-2203:
Attachment: NUTCH-2203.patch
Attached a patch to fix this.
> Suffix URL filter can't
[
https://issues.apache.org/jira/browse/NUTCH-2203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Markus Jelsma reassigned NUTCH-2203:
Assignee: Markus Jelsma
> Suffix URL filter can't handle trailing/leading whitespaces
>
Jurian Broertjes created NUTCH-2203:
---
Summary: Suffix URL filter can't handle trailing/leading
whitespaces
Key: NUTCH-2203
URL: https://issues.apache.org/jira/browse/NUTCH-2203
Project: Nutch
[
https://issues.apache.org/jira/browse/NUTCH-1325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Markus Jelsma reassigned NUTCH-1325:
Assignee: Markus Jelsma
> HostDB for Nutch
>
>
> Key: