[jira] [Commented] (NUTCH-2238) Indexer for Elasticsearch 2.x

2016-04-13 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15239811#comment-15239811
 ] 

Hudson commented on NUTCH-2238:
---

FAILURE: Integrated in Nutch-nutchgora #1552 (See 
[https://builds.apache.org/job/Nutch-nutchgora/1552/])
fix for NUTCH-2238 contributed by ptorrestr (pablo.torres: rev 
7e43de60bbace397aecabb7c0b960555aac313a8)
* src/plugin/indexer-elastic2/ivy.xml
* src/plugin/indexer-elastic2/howto_upgrade_es.txt
* src/plugin/indexer-elastic2/build.xml
* src/plugin/indexer-elastic2/build-ivy.xml
* 
src/plugin/indexer-elastic2/src/java/org/apache/nutch/indexwriter/elastic2/ElasticConstants.java
* src/plugin/indexer-elastic2/plugin.xml
* 
src/plugin/indexer-elastic2/src/java/org/apache/nutch/indexwriter/elastic2/package-info.java
* 
src/plugin/indexer-elastic2/src/java/org/apache/nutch/indexwriter/elastic2/ElasticIndexWriter.java
* src/plugin/build.xml


> Indexer for Elasticsearch 2.x
> -
>
> Key: NUTCH-2238
> URL: https://issues.apache.org/jira/browse/NUTCH-2238
> Project: Nutch
>  Issue Type: New Feature
>  Components: plugin
>Affects Versions: 2.3.1
>Reporter: Pablo Torres
>Assignee: Pablo Torres
> Fix For: 2.4
>
>
> Add an additional plugin for Elasticsearch 2.x



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Build failed in Jenkins: Nutch-nutchgora #1552

2016-04-13 Thread Apache Jenkins Server
See 

Changes:

[pablo.torres] fix for NUTCH-2238 contributed by ptorrestr

[pablo.torres] move to elastic2

[pablo.torres] remove flags

[pablo.torres] parametrized logging

--
[...truncated 4915 lines...]
  [javadoc] 
:38:
 error: package org.elasticsearch.client does not exist
  [javadoc] import org.elasticsearch.client.Client;
  [javadoc]^
  [javadoc] 
:39:
 error: package org.elasticsearch.client.transport does not exist
  [javadoc] import org.elasticsearch.client.transport.TransportClient;
  [javadoc]  ^
  [javadoc] 
:40:
 error: package org.elasticsearch.common.settings does not exist
  [javadoc] import org.elasticsearch.common.settings.ImmutableSettings;
  [javadoc] ^
  [javadoc] 
:41:
 error: package org.elasticsearch.common.settings.ImmutableSettings does not 
exist
  [javadoc] import org.elasticsearch.common.settings.ImmutableSettings.Builder;
  [javadoc]   ^
  [javadoc] 
:42:
 error: package org.elasticsearch.common.settings does not exist
  [javadoc] import org.elasticsearch.common.settings.Settings;
  [javadoc] ^
  [javadoc] 
:43:
 error: package org.elasticsearch.common.transport does not exist
  [javadoc] import 
org.elasticsearch.common.transport.InetSocketTransportAddress;
  [javadoc]  ^
  [javadoc] 
:44:
 error: package org.elasticsearch.node does not exist
  [javadoc] import org.elasticsearch.node.Node;
  [javadoc]  ^
  [javadoc] 
:56:
 error: cannot find symbol
  [javadoc]   private Client client;
  [javadoc]   ^
  [javadoc]   symbol:   class Client
  [javadoc]   location: class ElasticIndexWriter
  [javadoc] 
:57:
 error: cannot find symbol
  [javadoc]   private Node node;
  [javadoc]   ^
  [javadoc]   symbol:   class Node
  [javadoc]   location: class ElasticIndexWriter
  [javadoc] 
:62:
 error: cannot find symbol
  [javadoc]   private BulkRequestBuilder bulk;
  [javadoc]   ^
  [javadoc]   symbol:   class BulkRequestBuilder
  [javadoc]   location: class ElasticIndexWriter
  [javadoc] 
:63:
 error: cannot find symbol
  [javadoc]   private ListenableActionFuture<BulkResponse> execute;
  [javadoc]   ^
  [javadoc]   symbol:   class ListenableActionFuture
  [javadoc]   location: class ElasticIndexWriter
  [javadoc] 
:63:
 error: cannot find symbol
  [javadoc]   private ListenableActionFuture<BulkResponse> execute;
  [javadoc]  ^
  [javadoc]   symbol:   class BulkResponse
  [javadoc]   location: class ElasticIndexWriter
  [javadoc] 
:175:
 error: cannot find symbol
  [javadoc]   public static IOException makeIOException(ElasticsearchException 
e) {
  [javadoc] ^
  [javadoc]   symbol:   class ElasticsearchException
  [javadoc]   location: class ElasticIndexWriter
  [jav

[jira] [Updated] (NUTCH-2188) While crawling with solr url (kerberos enabled) Error: org.apache.solr.common.SolrException: Unauthorized

2016-04-13 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-2188:

Fix Version/s: (was: 1.9)
   1.12

> While crawling with solr url (kerberos enabled) Error: 
> org.apache.solr.common.SolrException: Unauthorized
> -
>
> Key: NUTCH-2188
> URL: https://issues.apache.org/jira/browse/NUTCH-2188
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.9
> Environment: Proof of Concept
>Reporter: Mohankumar K H
> Fix For: 1.12
>
>
> 15/12/16 21:49:22 INFO mapreduce.Job: Task Id : 
> attempt_1449548680888_0063_r_02_0, Status : FAILED
> Error: org.apache.solr.common.SolrException: Unauthorized
> Unauthorized
> request: 
> https://hdrdn001c.cps.intel.com:8985/solr/nutch_std_config/update?wt=javabin&version=2
> at 
> org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:430)
> at 
> org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:244)
> at 
> org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:105)
> at 
> org.apache.nutch.indexwriter.solr.SolrIndexWriter.close(SolrIndexWriter.java:155)
> at org.apache.nutch.indexer.IndexWriters.close(IndexWriters.java:118)
> at 
> org.apache.nutch.indexer.IndexerOutputFormat$1.close(IndexerOutputFormat.java:44)
> at 
> org.apache.hadoop.mapred.ReduceTask$OldTrackingRecordWriter.close(ReduceTask.java:502)
> at 
> org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:456)
> at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392)
> at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
> at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
> Container killed by the ApplicationMaster.
> Container killed on request. Exit code is 143
> Container exited with a non-zero exit code 143





[jira] [Updated] (NUTCH-2217) Crawl pages with specified language

2016-04-13 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-2217:

Fix Version/s: (was: 1.11)
   2.5

> Crawl pages with specified language
> ---
>
> Key: NUTCH-2217
> URL: https://issues.apache.org/jira/browse/NUTCH-2217
> Project: Nutch
>  Issue Type: Improvement
>  Components: plugin
>Reporter: Dawid Wolski
>Priority: Minor
>  Labels: language, plugin
> Fix For: 2.5
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Plugin to filter out pages in languages other than the one specified. It is 
> based on the language returned by the language-identifier plugin.





[jira] [Resolved] (NUTCH-2238) Indexer for Elasticsearch 2.x

2016-04-13 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney resolved NUTCH-2238.
-
Resolution: Fixed

Thank you [~ptorrestr]

> Indexer for Elasticsearch 2.x
> -
>
> Key: NUTCH-2238
> URL: https://issues.apache.org/jira/browse/NUTCH-2238
> Project: Nutch
>  Issue Type: New Feature
>  Components: plugin
>Affects Versions: 2.3.1
>Reporter: Pablo Torres
>Assignee: Pablo Torres
> Fix For: 2.4
>
>
> Add an additional plugin for Elasticsearch 2.x





[jira] [Updated] (NUTCH-1741) Support of Sitemaps in Nutch 2.x

2016-04-13 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1741:

Fix Version/s: (was: 2.3.2)
   2.4

> Support of Sitemaps in Nutch 2.x
> 
>
> Key: NUTCH-1741
> URL: https://issues.apache.org/jira/browse/NUTCH-1741
> Project: Nutch
>  Issue Type: New Feature
>  Components: fetcher, generator
>Reporter: Alparslan Avcı
>Assignee: cihad güzel
>  Labels: gsoc2015
> Fix For: 2.4
>
> Attachments: NUTCH-1741-v2.patch, NUTCH-1741-v3.patch, 
> NUTCH-1741-v4.patch, NUTCH-1741.patch, NUTCH-1741v5.patch, 
> NUTCH-1741v6.patch, NUTCH-1741v7.patch, SitemapCrawlerLifeCycle.pdf, 
> SitemapDevelopmentFor2x.pdf
>
>
> Sitemap support has to be implemented for the 2.x branch. It is being 
> discussed in NUTCH-1465 for trunk. 





[jira] [Updated] (NUTCH-2222) re-fetch deletes all metadata except _csh_ and _rs_

2016-04-13 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-2222:

Fix Version/s: (was: 2.3.2)
   2.4

> re-fetch deletes all  metadata except _csh_ and _rs_
> 
>
> Key: NUTCH-2222
> URL: https://issues.apache.org/jira/browse/NUTCH-2222
> Project: Nutch
>  Issue Type: Bug
>  Components: crawldb
>Affects Versions: 2.3.1
> Environment: Centos 6, mongodb 2.6 and mongodb 3.0 and 
> hbase-0.98.8-hadoop2
>Reporter: Adnane B.
>Assignee: Lewis John McGibbney
> Fix For: 2.4
>
> Attachments: TestReFetch.java, index.html
>
>
> This problem happens the second time I crawl a page:
> {code}
> bin/nutch inject urls/
> bin/nutch generate -topN 1000
> bin/nutch fetch  -all
> bin/nutch parse -force   -all
> bin/nutch updatedb  -all
> {code}
> Second time (re-fetch):
> {code}
> bin/nutch generate -topN 1000 --> batchid changes for all existing pages
> bin/nutch fetch  -all   -->  *** metadata is deleted for all pages already 
> crawled ***
> bin/nutch parse -force   -all
> bin/nutch updatedb  -all
> {code}
> I reproduced it with mongodb 2.6, mongodb 3.0, and hbase-0.98.8-hadoop2.
> It happens only if the page has not changed.
> To reproduce it easily, add the following to nutch-site.xml:
> {code}
> <property>
>   <name>db.fetch.interval.default</name>
>   <value>60</value>
>   <description>The default number of seconds between re-fetches of a page (1 
> minute)</description>
> </property>
> {code}





[jira] [Updated] (NUTCH-2238) Indexer for Elasticsearch 2.x

2016-04-13 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-2238:

Assignee: Pablo Torres

> Indexer for Elasticsearch 2.x
> -
>
> Key: NUTCH-2238
> URL: https://issues.apache.org/jira/browse/NUTCH-2238
> Project: Nutch
>  Issue Type: New Feature
>  Components: plugin
>Affects Versions: 2.3.1
>Reporter: Pablo Torres
>Assignee: Pablo Torres
> Fix For: 2.4
>
>
> Add an additional plugin for Elasticsearch 2.x





[jira] [Updated] (NUTCH-2238) Indexer for Elasticsearch 2.x

2016-04-13 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-2238:

Fix Version/s: (was: 2.3.2)

> Indexer for Elasticsearch 2.x
> -
>
> Key: NUTCH-2238
> URL: https://issues.apache.org/jira/browse/NUTCH-2238
> Project: Nutch
>  Issue Type: New Feature
>  Components: plugin
>Affects Versions: 2.3.1
>Reporter: Pablo Torres
>Assignee: Pablo Torres
> Fix For: 2.4
>
>
> Add an additional plugin for Elasticsearch 2.x





[jira] [Commented] (NUTCH-2238) Indexer for Elasticsearch 2.x

2016-04-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15239776#comment-15239776
 ] 

ASF GitHub Bot commented on NUTCH-2238:
---

Github user asfgit closed the pull request at:

https://github.com/apache/nutch/pull/96


> Indexer for Elasticsearch 2.x
> -
>
> Key: NUTCH-2238
> URL: https://issues.apache.org/jira/browse/NUTCH-2238
> Project: Nutch
>  Issue Type: New Feature
>  Components: plugin
>Affects Versions: 2.3.1
>Reporter: Pablo Torres
> Fix For: 2.4, 2.3.2
>
>
> Add an additional plugin for Elasticsearch 2.x





[GitHub] nutch pull request: fix for NUTCH-2238 contributed by ptorrestr

2016-04-13 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/nutch/pull/96


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[jira] [Updated] (NUTCH-2249) WordNet Integration for Cosine Similarity

2016-04-13 Thread Sujen Shah (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sujen Shah updated NUTCH-2249:
--
Fix Version/s: 1.12

> WordNet Integration for Cosine Similarity
> -
>
> Key: NUTCH-2249
> URL: https://issues.apache.org/jira/browse/NUTCH-2249
> Project: Nutch
>  Issue Type: New Feature
>  Components: plugin, scoring
>Reporter: Bhavya Sanghavi
>Assignee: Sujen Shah
>Priority: Minor
>  Labels: memex
> Fix For: 1.12
>
>
> Integrated WordNet database to enhance the cosine similarity plugin. 
> This helps in reducing the size of the vectors for calculating the cosine 
> similarity by mapping the synonymous words to the same entry in the vector. 
> Consequently, it would increase the accuracy of the scores given to the 
> webpages to be crawled. 
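
The effect described above can be illustrated with a small, self-contained
sketch. It is not the plugin's code: the synonym table below is a hypothetical
stand-in for a WordNet lookup, and the class and method names are invented for
illustration. It shows how folding synonyms into one vector entry both shrinks
the vector and lets two differently worded texts score as similar.

```java
import java.util.HashMap;
import java.util.Map;

class SynonymCosineSketch {

    // Hypothetical synonym table standing in for a WordNet lookup:
    // each listed word maps to one canonical vector entry.
    static final Map<String, String> CANONICAL = new HashMap<>();
    static {
        CANONICAL.put("automobile", "car");
        CANONICAL.put("auto", "car");
        CANONICAL.put("quick", "fast");
    }

    // Build a term-frequency vector; synonyms share an entry when fold is true.
    static Map<String, Integer> vector(String text, boolean fold) {
        Map<String, Integer> tf = new HashMap<>();
        for (String token : text.toLowerCase().split("\\s+")) {
            if (token.isEmpty()) {
                continue;
            }
            String key = fold ? CANONICAL.getOrDefault(token, token) : token;
            tf.merge(key, 1, Integer::sum);
        }
        return tf;
    }

    // Cosine similarity of two sparse term-frequency vectors.
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0);
            normA += e.getValue() * e.getValue();
        }
        for (int v : b.values()) {
            normB += v * v;
        }
        if (normA == 0 || normB == 0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        String page = "quick automobile";
        String goal = "fast car";
        // Without folding the texts share no terms; with folding they coincide.
        System.out.println(cosine(vector(page, false), vector(goal, false)));
        System.out.println(cosine(vector(page, true), vector(goal, true)));
    }
}
```

In this toy case "car automobile auto" occupies three vector entries without
folding and one entry with it, which is the size reduction the description
refers to.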





[jira] [Updated] (NUTCH-2249) WordNet Integration for Cosine Similarity

2016-04-13 Thread Sujen Shah (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sujen Shah updated NUTCH-2249:
--
Labels: memex  (was: )

> WordNet Integration for Cosine Similarity
> -
>
> Key: NUTCH-2249
> URL: https://issues.apache.org/jira/browse/NUTCH-2249
> Project: Nutch
>  Issue Type: New Feature
>  Components: plugin, scoring
>Reporter: Bhavya Sanghavi
>Assignee: Sujen Shah
>Priority: Minor
>  Labels: memex
>
> Integrated WordNet database to enhance the cosine similarity plugin. 
> This helps in reducing the size of the vectors for calculating the cosine 
> similarity by mapping the synonymous words to the same entry in the vector. 
> Consequently, it would increase the accuracy of the scores given to the 
> webpages to be crawled. 





[jira] [Assigned] (NUTCH-2249) WordNet Integration for Cosine Similarity

2016-04-13 Thread Sujen Shah (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sujen Shah reassigned NUTCH-2249:
-

Assignee: Sujen Shah

> WordNet Integration for Cosine Similarity
> -
>
> Key: NUTCH-2249
> URL: https://issues.apache.org/jira/browse/NUTCH-2249
> Project: Nutch
>  Issue Type: New Feature
>  Components: plugin, scoring
>Reporter: Bhavya Sanghavi
>Assignee: Sujen Shah
>Priority: Minor
>  Labels: memex
>
> Integrated WordNet database to enhance the cosine similarity plugin. 
> This helps in reducing the size of the vectors for calculating the cosine 
> similarity by mapping the synonymous words to the same entry in the vector. 
> Consequently, it would increase the accuracy of the scores given to the 
> webpages to be crawled. 





Adding a new field to Nutch + MongoDB datastore using plugin

2016-04-13 Thread Jean Vence
I am running Nutch 2.3.1 configured with MongoDB (using Gora) +
Elasticsearch and would like to add a new field to the storage
database, NOT the index.


I am able to add a field to the Elasticsearch index using a custom
plugin, but I would like to add it to the MongoDB record for each
website.


I've added the field to the ./conf/schema.xml file and to
./conf/gora-mongodb-mapping.xml. The field does appear in the index
but not in the MongoDB record.


Here's a snapshot of my plugin:


public class AddNewField implements IndexingFilter {
...
  @Override
  public NutchDocument filter(NutchDocument doc, String url, WebPage page)
      throws IndexingException {
    // adds the new field to the document
    doc.add("mynewField", "HelloWorld");
    return doc;
  }
}

Can this be achieved using a plugin or would I need to modify Nutch's
source code?

Thank you for any assistance you can provide.
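
One possible direction, offered only as an unverified sketch: an
IndexingFilter shapes the NutchDocument that is sent to the index, so a value
meant for the datastore would presumably have to be written onto the WebPage
itself in an earlier step (a parse filter, for example). The snippet below does
not use Nutch at all. It assumes the page metadata behaves like a
Map<CharSequence, ByteBuffer> and models that with a plain HashMap; the class
and helper names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

class PageMetadataSketch {

    // Store a string value as UTF-8 bytes. The map is a stand-in for the
    // metadata map assumed to exist on a stored page record.
    static void putField(Map<CharSequence, ByteBuffer> metadata,
                         String key, String value) {
        metadata.put(key, ByteBuffer.wrap(value.getBytes(StandardCharsets.UTF_8)));
    }

    // Read the value back, or null if the key is absent.
    static String getField(Map<CharSequence, ByteBuffer> metadata, String key) {
        ByteBuffer raw = metadata.get(key);
        if (raw == null) {
            return null;
        }
        // duplicate() so decoding does not move the stored buffer's position
        return StandardCharsets.UTF_8.decode(raw.duplicate()).toString();
    }

    public static void main(String[] args) {
        Map<CharSequence, ByteBuffer> metadata = new HashMap<>();
        putField(metadata, "mynewField", "HelloWorld");
        System.out.println(getField(metadata, "mynewField"));
    }
}
```

Whether such a value reaches MongoDB would still depend on the metadata field
being mapped in gora-mongodb-mapping.xml, which this sketch does not check.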