Better Parser Plugin

2014-05-02 Thread Talat Uyarer
Hi all, The parser plugin we currently use, NekoHTML, doesn't parse correctly. When I tested it on a huge website, it left HTML tags in the output. IMHO our parser is a little bit old. After doing some research, I found the Jsoup[1] and Gumbo[2] parsers. I did some tests on broken websites. I saw gumbo and jsoup parsed very simil

About RankingJob for Giraph

2014-05-02 Thread Talat Uyarer
Hi all, A long time ago, we talked with Julien and Lewis about major needs for 2.x on the mailing list. I know that Giraph uses only map slots as workers. At present, the architecture of our scoring plugins doesn't permit this. Giraph and OPIC have different work types. IMHO we should create a pluggable Ran

Giraph Integration

2014-05-02 Thread Talat Uyarer
Hi all, A long time ago, we talked with Julien and Lewis about major needs for 2.x on the mailing list. As far as I know, Giraph uses only map slots as workers. At present, the architecture of our scoring plugins doesn't permit this. IMHO we should create a pluggable RankingJob like IndexingJob. The Pluggable ar

[jira] [Commented] (NUTCH-1768) port NUTCH-1745 to Nutch 2.x (Upgrade to ElasticSearch 1.1.0)

2014-05-02 Thread JIRA
[ https://issues.apache.org/jira/browse/NUTCH-1768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13987861#comment-13987861 ] Rogério Pereira Araújo commented on NUTCH-1768: --- Yes, sure, set to correct n

[jira] [Commented] (NUTCH-1768) port NUTCH-1745 to Nutch 2.x (Upgrade to ElasticSearch 1.1.0)

2014-05-02 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13987809#comment-13987809 ] Julien Nioche commented on NUTCH-1768: -- Do you set the cluster name in the config (el

[jira] [Commented] (NUTCH-1674) Use batchId filter to enable scan (GORA-119) for Fetch,Parse,Update,Index

2014-05-02 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13987782#comment-13987782 ] Julien Nioche commented on NUTCH-1674: -- OK, so it *does* filter based on the Mark, wh

[jira] [Comment Edited] (NUTCH-1768) port NUTCH-1745 to Nutch 2.x (Upgrade to ElasticSearch 1.1.0)

2014-05-02 Thread JIRA
[ https://issues.apache.org/jira/browse/NUTCH-1768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13987757#comment-13987757 ] Rogério Pereira Araújo edited comment on NUTCH-1768 at 5/2/14 3:01 PM: -

[jira] [Commented] (NUTCH-1674) Use batchId filter to enable scan (GORA-119) for Fetch,Parse,Update,Index

2014-05-02 Thread JIRA
[ https://issues.apache.org/jira/browse/NUTCH-1674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13987758#comment-13987758 ] Alparslan Avcı commented on NUTCH-1674: --- Hi [~jnioche], You are right; the filter i

[jira] [Commented] (NUTCH-1768) port NUTCH-1745 to Nutch 2.x (Upgrade to ElasticSearch 1.1.0)

2014-05-02 Thread JIRA
[ https://issues.apache.org/jira/browse/NUTCH-1768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13987757#comment-13987757 ] Rogério Pereira Araújo commented on NUTCH-1768: --- Julien, When I set the hos

[jira] [Updated] (NUTCH-1741) Support of Sitemaps in Nutch 2.x

2014-05-02 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-1741: - Fix Version/s: (was: 2.3) 2.4 > Support of Sitemaps in Nutch 2.x > ---

[jira] [Commented] (NUTCH-1674) Use batchId filter to enable scan (GORA-119) for Fetch,Parse,Update,Index

2014-05-02 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13987605#comment-13987605 ] Julien Nioche commented on NUTCH-1674: -- Hi, I haven't played with the filtering in G

[jira] [Commented] (NUTCH-1714) Nutch 2.x upgrade to Gora 0.4

2014-05-02 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13987602#comment-13987602 ] Julien Nioche commented on NUTCH-1714: -- bq. I do not know if you have tested the patc

[jira] [Commented] (NUTCH-1714) Nutch 2.x upgrade to Gora 0.4

2014-05-02 Thread JIRA
[ https://issues.apache.org/jira/browse/NUTCH-1714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13987551#comment-13987551 ] Alparslan Avcı commented on NUTCH-1714: --- Hi [~jnioche], I do not know if you have t

[jira] [Resolved] (NUTCH-1728) indexer-solr plugin is not delete docs from solr

2014-05-02 Thread Talat UYARER (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Talat UYARER resolved NUTCH-1728. - Resolution: Fixed Committed revision 1591849. > indexer-solr plugin is not delete docs from solr

[jira] [Commented] (NUTCH-1622) Create Outlinks with metadata

2014-05-02 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13987520#comment-13987520 ] Julien Nioche commented on NUTCH-1622: -- Hi Daniel Sorry for not commenting on your p

[jira] [Assigned] (NUTCH-1622) Create Outlinks with metadata

2014-05-02 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche reassigned NUTCH-1622: Assignee: Julien Nioche > Create Outlinks with metadata > - > >

[jira] [Updated] (NUTCH-1622) Create Outlinks with metadata

2014-05-02 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-1622: - Fix Version/s: 2.4 > Create Outlinks with metadata > - > >

[jira] [Updated] (NUTCH-1622) Create Outlinks with metadata

2014-05-02 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-1622: - Fix Version/s: (was: 1.9) 1.8 > Create Outlinks with metadata > --

[jira] [Commented] (NUTCH-1714) Nutch 2.x upgrade to Gora 0.4

2014-05-02 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13987496#comment-13987496 ] Julien Nioche commented on NUTCH-1714: -- Hi [~alparslan.avci] It does not fix the is

[jira] [Resolved] (NUTCH-1753) Eclipse dependecy problem for 2.x

2014-05-02 Thread Talat UYARER (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Talat UYARER resolved NUTCH-1753. - Resolution: Fixed Fix Version/s: (was: 2.4) 2.3 Committed revision

[jira] [Created] (NUTCH-1769) API refactoring

2014-05-02 Thread Ivan Vershinin (JIRA)
Ivan Vershinin created NUTCH-1769: - Summary: API refactoring Key: NUTCH-1769 URL: https://issues.apache.org/jira/browse/NUTCH-1769 Project: Nutch Issue Type: Improvement Components:

[jira] [Commented] (NUTCH-1753) Eclipse dependecy problem for 2.x

2014-05-02 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13987476#comment-13987476 ] Julien Nioche commented on NUTCH-1753: -- then please mark it as resolved and while doi

[jira] [Commented] (NUTCH-1768) port NUTCH-1745 to Nutch 2.x (Upgrade to ElasticSearch 1.1.0)

2014-05-02 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13987475#comment-13987475 ] Julien Nioche commented on NUTCH-1768: -- hi Rogerio Patches are applied against the t

[jira] [Commented] (NUTCH-1753) Eclipse dependecy problem for 2.x

2014-05-02 Thread Talat UYARER (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13987473#comment-13987473 ] Talat UYARER commented on NUTCH-1753: - I committed this issue. > Eclipse dependecy p

[jira] [Updated] (NUTCH-1662) Indexer Plugin for Solr Cloud

2014-05-02 Thread Talat UYARER (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Talat UYARER updated NUTCH-1662: Patch Info: Patch Available > Indexer Plugin for Solr Cloud > - > >