[jira] [Commented] (NUTCH-1480) SolrIndexer to write to multiple servers.

2018-07-10 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16538658#comment-16538658
 ] 

Hudson commented on NUTCH-1480:
---

SUCCESS: Integrated in Jenkins build Nutch-trunk #3543 (See 
[https://builds.apache.org/job/Nutch-trunk/3543/])
fixes for NUTCH-1514: Support for NUTCH-1480. (r0ann3l: 
[https://github.com/apache/nutch/commit/0176883bd663088da99ab54840987092066dc5ac])
* (edit) 
src/plugin/indexer-csv/src/java/org/apache/nutch/indexwriter/csv/CSVIndexWriter.java
Fix unit tests for changes related to NUTCH-1480 (snagel: 
[https://github.com/apache/nutch/commit/06221e069227b6e2f7b6b13eb0df6cb98ba21a46])
* (edit) 
src/plugin/indexer-csv/src/test/org/apache/nutch/indexwriter/csv/TestCSVIndexWriter.java


> SolrIndexer to write to multiple servers.
> -
>
> Key: NUTCH-1480
> URL: https://issues.apache.org/jira/browse/NUTCH-1480
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.15
>
> Attachments: NUTCH-1480-1.6.1.patch, 
> adding-support-for-sharding-indexer-for-solr.patch
>
>
> SolrUtils should return an array of SolrServers and read the SolrUrl as a 
> comma delimited list of URL's using Configuration.getString(). SolrWriter 
> should be able to handle this list of SolrServers.
> This is useful if you want to send documents to multiple servers if no 
> replication is available or if you want to send documents to multiple NOCs.
> edit:
> This does not replace NUTCH-1377 but complements it. With NUTCH-1377 this 
> issue allows you to index to multiple SolrCloud clusters at the same time.
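
The request above (read a comma-delimited URL list, then send each document to every server) can be sketched as follows. This is only an illustration under stated assumptions: `MultiServerSketch`, `DocumentSink`, `parseUrls` and `addToAll` are hypothetical names, not Nutch or SolrJ API, and the host names are placeholders.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

final class MultiServerSketch {

  /** Hypothetical stand-in for a client bound to one server URL. */
  interface DocumentSink {
    void add(String document);
  }

  /** Splits a comma-delimited URL list, trimming blanks and dropping empty entries. */
  static List<String> parseUrls(String commaDelimited) {
    return Arrays.stream(commaDelimited.split(","))
        .map(String::trim)
        .filter(url -> !url.isEmpty())
        .collect(Collectors.toList());
  }

  /** Sends one document to every configured sink, in list order. */
  static void addToAll(List<DocumentSink> sinks, String document) {
    for (DocumentSink sink : sinks) {
      sink.add(document);
    }
  }

  public static void main(String[] args) {
    // Placeholder URLs; a real setup would read the list from configuration.
    List<String> urls =
        parseUrls("http://host-a:8983/solr/nutch, http://host-b:8983/solr/nutch");
    List<DocumentSink> sinks = new ArrayList<>();
    for (String url : urls) {
      // Each sink only prints; a real sink would wrap a server client.
      sinks.add(doc -> System.out.println(url + " <- " + doc));
    }
    addToAll(sinks, "doc-1");
  }
}
```

The sketch keeps one sink per URL so that a failure policy (skip, retry, abort) could later be decided per server; the issue text itself does not specify one.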



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (NUTCH-1480) SolrIndexer to write to multiple servers.

2018-06-09 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16507142#comment-16507142
 ] 

ASF GitHub Bot commented on NUTCH-1480:
---

odisleysi commented on issue #218: fix for NUTCH-1480 contributed by r0ann3l
URL: https://github.com/apache/nutch/pull/218#issuecomment-327018446
 
 
   I like this idea. I work on a project that needs to save documents in Solr 
for searching and in Elasticsearch for statistics. This solves the problem.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Commented] (NUTCH-1480) SolrIndexer to write to multiple servers.

2018-06-02 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16499028#comment-16499028
 ] 

Hudson commented on NUTCH-1480:
---

SUCCESS: Integrated in Jenkins build Nutch-trunk #3528 (See 
[https://builds.apache.org/job/Nutch-trunk/3528/])
Fixes for NUTCH-2580: Support for NUTCH-1480. (roannel.fdez: 
[https://github.com/apache/nutch/commit/5d7d8167e350edd5bc37454cd73a412c570a13b1])
* (edit) conf/index-writers.xml.template
* (edit) src/java/org/apache/nutch/indexer/IndexWriterParams.java




[jira] [Commented] (NUTCH-1480) SolrIndexer to write to multiple servers.

2018-06-01 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16498408#comment-16498408
 ] 

Hudson commented on NUTCH-1480:
---

SUCCESS: Integrated in Jenkins build Nutch-trunk #3527 (See 
[https://builds.apache.org/job/Nutch-trunk/3527/])
Fixes for NUTCH-1480: Multiple index writer instances with different 
(roannel.fdez: 
[https://github.com/apache/nutch/commit/e4a7f871b1b03f901279e24cc1c626e5c1b67643])
* (edit) 
src/plugin/indexer-dummy/src/java/org/apache/nutch/indexwriter/dummy/DummyIndexWriter.java
* (edit) 
src/plugin/indexer-solr/src/java/org/apache/nutch/indexwriter/solr/SolrConstants.java
* (add) conf/index-writers.xsd
* (edit) src/java/org/apache/nutch/indexer/IndexWriter.java
* (edit) src/java/org/apache/nutch/indexer/NutchField.java
* (edit) src/java/org/apache/nutch/indexer/NutchDocument.java
* (edit) 
src/plugin/indexer-elastic/src/java/org/apache/nutch/indexwriter/elastic/ElasticIndexWriter.java
* (edit) 
src/plugin/indexer-solr/src/java/org/apache/nutch/indexwriter/solr/SolrUtils.java
* (add) src/java/org/apache/nutch/indexer/IndexWriterConfig.java
* (delete) 
src/plugin/indexer-solr/src/java/org/apache/nutch/indexwriter/solr/SolrMappingReader.java
* (add) conf/index-writers.xml.template
* (edit) 
src/plugin/indexer-rabbit/src/java/org/apache/nutch/indexwriter/rabbit/RabbitMQConstants.java
* (edit) 
src/plugin/indexer-rabbit/src/java/org/apache/nutch/indexwriter/rabbit/RabbitIndexWriter.java
* (edit) 
src/plugin/indexer-solr/src/java/org/apache/nutch/indexwriter/solr/SolrIndexWriter.java
* (edit) 
src/plugin/indexer-cloudsearch/src/java/org/apache/nutch/indexwriter/cloudsearch/CloudSearchIndexWriter.java
* (edit) src/java/org/apache/nutch/indexer/IndexWriters.java
* (add) src/java/org/apache/nutch/indexer/MappingReader.java
* (edit) 
src/plugin/indexer-elastic-rest/src/java/org/apache/nutch/indexwriter/elasticrest/ElasticRestIndexWriter.java
Fixes for NUTCH-1480: Some improvements based on reviewers feedback. 
(roannel.fdez: 
[https://github.com/apache/nutch/commit/86cd375e267036596f19376e2499e1d1c4ccdcbb])
* (edit) src/java/org/apache/nutch/indexer/MappingReader.java
* (edit) 
src/plugin/indexer-solr/src/java/org/apache/nutch/indexwriter/solr/SolrIndexWriter.java
* (edit) src/java/org/apache/nutch/indexer/IndexWriters.java
* (edit) 
src/plugin/indexer-rabbit/src/java/org/apache/nutch/indexwriter/rabbit/RabbitIndexWriter.java
* (edit) conf/index-writers.xml.template
* (edit) src/java/org/apache/nutch/indexer/IndexWriterConfig.java
* (edit) conf/index-writers.xsd
Fixes for NUTCH-1480: Sections for all indexer-* plugins, relaxed 
(roannel.fdez: 
[https://github.com/apache/nutch/commit/84246a9e8fb183a28983a70d3d30d7d9a474ce58])
* (edit) src/java/org/apache/nutch/indexer/IndexWriters.java
* (add) 
src/plugin/indexer-dummy/src/java/org/apache/nutch/indexwriter/dummy/DummyConstants.java
* (edit) 
src/plugin/indexer-rabbit/src/java/org/apache/nutch/indexwriter/rabbit/RabbitIndexWriter.java
* (edit) 
src/plugin/indexer-cloudsearch/src/java/org/apache/nutch/indexwriter/cloudsearch/CloudSearchConstants.java
* (edit) 
src/plugin/indexer-elastic/src/java/org/apache/nutch/indexwriter/elastic/ElasticIndexWriter.java
* (add) src/java/org/apache/nutch/indexer/IndexWriterParams.java
* (edit) 
src/plugin/indexer-elastic-rest/src/java/org/apache/nutch/indexwriter/elasticrest/ElasticRestConstants.java
* (edit) conf/index-writers.xml.template
* (edit) 
src/plugin/indexer-dummy/src/java/org/apache/nutch/indexwriter/dummy/DummyIndexWriter.java
* (edit) src/java/org/apache/nutch/indexer/IndexWriter.java
* (edit) 
src/plugin/indexer-cloudsearch/src/java/org/apache/nutch/indexwriter/cloudsearch/CloudSearchIndexWriter.java
* (edit) src/java/org/apache/nutch/indexer/IndexWriterConfig.java
* (edit) 
src/plugin/indexer-solr/src/java/org/apache/nutch/indexwriter/solr/SolrIndexWriter.java
* (edit) 
src/plugin/indexer-elastic/src/java/org/apache/nutch/indexwriter/elastic/ElasticConstants.java
* (edit) 
src/plugin/indexer-elastic-rest/src/java/org/apache/nutch/indexwriter/elasticrest/ElasticRestIndexWriter.java
Fixes for NUTCH-1480: Changes: - Logs for IndexerOutputFormat class to 
(roannel.fdez: 
[https://github.com/apache/nutch/commit/7e9d1df08817c54d50eed3945136033a7fd7af00])
* (edit) conf/log4j.properties
* (edit) src/java/org/apache/nutch/indexer/IndexerOutputFormat.java
* (edit) src/java/org/apache/nutch/indexer/IndexingJob.java
* (edit) src/java/org/apache/nutch/indexer/IndexerMapReduce.java
* (edit) src/java/org/apache/nutch/util/ObjectCache.java
Fixes for NUTCH-1480: Support for NUTCH-2484 and NUTCH-2380. (roannel.fdez: 
[https://github.com/apache/nutch/commit/d45510c186b3dbee3c3f7882c90ab3d28409a0b8])
* (edit) 
src/plugin/indexer-elastic-rest/src/java/org/apache/nutch/indexwriter/elasticrest/ElasticRestIndexWriter.java
* (edit) conf/index-writers.xml.template
* (edit) 
src/plugin/indexer-elastic-rest/s

[jira] [Commented] (NUTCH-1480) SolrIndexer to write to multiple servers.

2018-06-01 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16498332#comment-16498332
 ] 

ASF GitHub Bot commented on NUTCH-1480:
---

sebastian-nagel closed pull request #218: fix for NUTCH-1480 contributed by 
r0ann3l
URL: https://github.com/apache/nutch/pull/218
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:


diff --git a/conf/index-writers.xml.template b/conf/index-writers.xml.template
new file mode 100644
index 0..118c8bc88
--- /dev/null
+++ b/conf/index-writers.xml.template
@@ -0,0 +1,144 @@
+[XML content not recoverable: the archive stripped the markup of this
+ 144-line file. Surviving fragments: namespace URI
+ http://lucene.apache.org/nutch,
+ xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance",
+ xsi:schemaLocation="http://lucene.apache.org/nutch index-writers.xsd",
+ and a Solr URL value http://localhost:8983/solr/nutch.]
diff --git a/conf/index-writers.xsd b/conf/index-writers.xsd
new file mode 100644
index 0..50ab1f313
--- /dev/null
+++ b/conf/index-writers.xsd
@@ -0,0 +1,179 @@
+[Schema markup stripped in the archive. Surviving attributes of the
+ 179-line file: namespace URI http://www.w3.org/2001/XMLSchema,
+ targetNamespace="http://lucene.apache.org/nutch",
+ xmlns="http://lucene.apache.org/nutch", elementFormDefault="qualified".
+ The surviving documentation strings follow, in their original order.]
+Root tag of index-writers.xml document. It's a wrapper for the all index writers.
+Contains all the configuration of a particular index writer.
+This tag contains all the parameters that will be passed to the index writer implementation.
+It's a wrapper for the allowed actions over a document before it's indexed.
+Writer's ID.
+The class of the index writer implementation which will be used.
+One single parameter that will be pass to the index writer implementation.
+Parameter's name. It is used to identify the parameter.
+Parameter's value.
+Action of copy fields. Multiple comma-separated targets can be specified.
+Action of rename fields.
+Action of remove fields.
+One single field that will be mapped.
+Field's name before it's mapped.
+One single field that will be mapped.
+Field's name before it's mapped.
+Field's name after the action is applied.
+
diff --git a/conf/log4j.properties b/conf/log4j.properties
index 6fad2b5d3..1939014dc 100644
--- a/conf/log4j.properties
+++ b/conf/log4j.prop
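
The schema documentation in the diff above names three mapping actions applied to a document before it is indexed: copy (with comma-separated targets), rename, and remove. A minimal Java sketch of what such actions could do to a document's fields is given here. It is an assumption-laden illustration, not the Nutch implementation: `FieldMappingSketch` and its methods are hypothetical, and a document is modelled as a plain map with one string value per field.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class FieldMappingSketch {

  /** Copies the value of source into every target named in a comma-separated list. */
  static void copy(Map<String, String> doc, String source, String commaSeparatedTargets) {
    String value = doc.get(source);
    if (value == null) {
      return; // nothing to copy when the source field is absent
    }
    for (String target : commaSeparatedTargets.split(",")) {
      String name = target.trim();
      if (!name.isEmpty()) {
        doc.put(name, value);
      }
    }
  }

  /** Moves the value of source to dest; the source field disappears. */
  static void rename(Map<String, String> doc, String source, String dest) {
    String value = doc.remove(source);
    if (value != null) {
      doc.put(dest, value);
    }
  }

  /** Drops the source field from the document. */
  static void remove(Map<String, String> doc, String source) {
    doc.remove(source);
  }

  public static void main(String[] args) {
    // Field names below are illustrative only.
    Map<String, String> doc = new LinkedHashMap<>();
    doc.put("title", "Nutch");
    doc.put("content", "text");
    doc.put("segment", "s1");

    copy(doc, "title", "title_s, title_t");
    rename(doc, "content", "body");
    remove(doc, "segment");
    System.out.println(doc);
  }
}
```

Whether the real writers apply these actions in this order, or support multi-valued fields, is not stated in the surviving text, so the sketch leaves both open.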

[jira] [Commented] (NUTCH-1480) SolrIndexer to write to multiple servers.

2018-06-01 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16498326#comment-16498326
 ] 

ASF GitHub Bot commented on NUTCH-1480:
---

sebastian-nagel commented on a change in pull request #218: fix for NUTCH-1480 
contributed by r0ann3l
URL: https://github.com/apache/nutch/pull/218#discussion_r192467967
 
 

 ##
 File path: 
src/plugin/indexer-solr/src/java/org/apache/nutch/indexwriter/solr/SolrConstants.java
 ##
 @@ -19,19 +19,23 @@
 public interface SolrConstants {
   public static final String SOLR_PREFIX = "solr.";
 
 Review comment:
   It's still used for {{ZOOKEEPER_HOSTS = SOLR_PREFIX + "zookeeper.hosts"}}




[jira] [Commented] (NUTCH-1480) SolrIndexer to write to multiple servers.

2018-05-31 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16496678#comment-16496678
 ] 

ASF GitHub Bot commented on NUTCH-1480:
---

sebastian-nagel commented on issue #218: fix for NUTCH-1480 contributed by 
r0ann3l
URL: https://github.com/apache/nutch/pull/218#issuecomment-393561347
 
 
   Thanks, @r0ann3l! +1 - I've tested the solution in local and 
pseudo-distributed mode and was able to index into Solr (a single index). If 
there are no objections I'll commit/merge soon.




[jira] [Commented] (NUTCH-1480) SolrIndexer to write to multiple servers.

2018-05-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16482083#comment-16482083
 ] 

ASF GitHub Bot commented on NUTCH-1480:
---

r0ann3l commented on issue #218: fix for NUTCH-1480 contributed by r0ann3l
URL: https://github.com/apache/nutch/pull/218#issuecomment-390522266
 
 
   Hi @sebastian-nagel,
   
   I changed the use of `ObjectCache.class` to an internal CACHE object, to 
avoid changing the behavior of other functionality. In this case it is not 
necessary to have a different instance of `IndexWriters.class` for each 
`Configuration.class`, because the index writers' configuration is 
handled in a separate file.
   
   I also fixed an issue in `TestElasticIndexWriter.class` (associated with 
the use of `IndexWriterParams.class`) that caused the unit test to fail.




[jira] [Commented] (NUTCH-1480) SolrIndexer to write to multiple servers.

2018-01-17 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16328935#comment-16328935
 ] 

ASF GitHub Bot commented on NUTCH-1480:
---

r0ann3l commented on issue #218: fix for NUTCH-1480 contributed by r0ann3l
URL: https://github.com/apache/nutch/pull/218#issuecomment-358355731
 
 
   Hi @sebastian-nagel:
   In this case I propose to use an internal CACHE object, as the 
PluginRepository does, to store the IndexWriters object. The code could be 
something like this:
   ```
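   private static final WeakHashMap<String, IndexWriters> CACHE = new WeakHashMap<>();

   // Return the cached IndexWriters for this configuration, creating it on first use.
   public static synchronized IndexWriters get(Configuration conf) {
     String uuid = NutchConfiguration.getUUID(conf);
     if (uuid == null) {
       // No Nutch UUID on this configuration: fall back to a key built from its hash code.
       uuid = "nonNutchConf@" + conf.hashCode();
     }
     return CACHE.computeIfAbsent(uuid, k -> new IndexWriters(conf));
   }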
   ```
   What do you think?




[jira] [Commented] (NUTCH-1480) SolrIndexer to write to multiple servers.

2018-01-03 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16309960#comment-16309960
 ] 

ASF GitHub Bot commented on NUTCH-1480:
---

lewismc commented on issue #218: fix for NUTCH-1480 contributed by r0ann3l
URL: https://github.com/apache/nutch/pull/218#issuecomment-355074694
 
 
   @r0ann3l can you please update this PR in line with master?




[jira] [Commented] (NUTCH-1480) SolrIndexer to write to multiple servers.

2017-11-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16271543#comment-16271543
 ] 

ASF GitHub Bot commented on NUTCH-1480:
---

sebastian-nagel commented on issue #218: fix for NUTCH-1480 contributed by 
r0ann3l
URL: https://github.com/apache/nutch/pull/218#issuecomment-347996663
 
 
   Hi @r0ann3l,
   did you verify that using `NutchConfiguration.getUUID(conf)` actually 
changes the behavior? Cf. 
[NUTCH-2407](https://issues.apache.org/jira/browse/NUTCH-2407?focusedCommentId=16180780&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16180780),
 which makes me doubt it, as it's only a random UUID for each Configuration 
object. 
   
   Using the UUID in the ObjectCache makes the unit tests fail (TestGenerator): 
in fact, the ObjectCache now returns the same object even if the configuration 
is different. We really need to implement a hash value for Configuration 
objects.
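
One way to read "a hash value for Configuration objects" is a key derived from the property contents rather than from object identity or a random UUID. The sketch below is only an illustration of that idea, not what Nutch does: `ConfigKeySketch` and `contentKey` are hypothetical names, and the configuration is modelled as a plain `Map<String, String>` with non-null values so the example stays self-contained.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.TreeMap;

final class ConfigKeySketch {

  /** Returns a hex SHA-256 digest over all name/value pairs, independent of insertion order. */
  static String contentKey(Map<String, String> properties) {
    try {
      MessageDigest digest = MessageDigest.getInstance("SHA-256");
      // A TreeMap sorts by property name, so equal contents always hash equally.
      for (Map.Entry<String, String> entry : new TreeMap<>(properties).entrySet()) {
        digest.update(entry.getKey().getBytes(StandardCharsets.UTF_8));
        digest.update((byte) 0); // separator between name and value
        digest.update(entry.getValue().getBytes(StandardCharsets.UTF_8));
        digest.update((byte) 0); // separator between entries
      }
      StringBuilder hex = new StringBuilder();
      for (byte b : digest.digest()) {
        hex.append(String.format("%02x", b));
      }
      return hex.toString();
    } catch (NoSuchAlgorithmException e) {
      // SHA-256 must be available on every Java platform.
      throw new IllegalStateException(e);
    }
  }
}
```

With such a key, two configurations with identical properties would share a cache entry, while changing any value would produce a different key. Hashing every property on each lookup has a cost, so a real implementation would probably cache the key.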
   
   




[jira] [Commented] (NUTCH-1480) SolrIndexer to write to multiple servers.

2017-11-24 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16265473#comment-16265473
 ] 

ASF GitHub Bot commented on NUTCH-1480:
---

r0ann3l commented on issue #218: fix for NUTCH-1480 contributed by r0ann3l
URL: https://github.com/apache/nutch/pull/218#issuecomment-346834693
 
 
   Hi @sebastian-nagel, thank you very much for your comments! I agree with 
your suggestions and I included the changes you proposed from your fork.
   
   About indexer-dummy, I also tried to make it work, but it was not possible. 
In theory, you can build as many instances of `IndexWriters` as you want and 
you will always get the same instance, because it is taken from the cache. The 
first issue I found was that `ObjectCache` uses the `Configuration` object 
itself as the key, and this object is not the same in each call. As a result, 
two instances of `IndexWriters` write to the same file, as you say. So I 
replaced the key of `ObjectCache` with the UUID of the `Configuration` object.
   
   Now we have only one instance of `IndexWriters`, but there is another 
problem: when you try to commit the writers in 
`IndexingJob.index(IndexingJob.java:151)`, they have already been closed by 
`IndexerOutputFormat.getRecordWriter(IndexerOutputFormat.java:44)`. Therefore, 
I moved the `commit()` call from `IndexingJob` to `IndexerOutputFormat`, just 
before the `close()` method is called.
   
   I also moved the indexers description from `IndexingJob` to 
`IndexerOutputFormat`, to avoid building the `IndexWriters` instance twice.
   
   Thanks
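
The ordering problem described in the comment above (a commit attempted after the writers were already closed) can be sketched with a small stand-in. `CommitBeforeCloseSketch`, `RecordingWriter` and `finish` are hypothetical names used only for illustration; they are not the Nutch `IndexWriter` or `IndexerOutputFormat` classes.

```java
import java.util.ArrayList;
import java.util.List;

final class CommitBeforeCloseSketch {

  /** Hypothetical minimal writer that records the calls it receives. */
  static final class RecordingWriter {
    final List<String> calls = new ArrayList<>();
    private boolean closed;

    void write(String doc) {
      ensureOpen();
      calls.add("write:" + doc);
    }

    void commit() {
      ensureOpen();
      calls.add("commit");
    }

    void close() {
      closed = true;
      calls.add("close");
    }

    private void ensureOpen() {
      if (closed) {
        throw new IllegalStateException("writer already closed");
      }
    }
  }

  /** Finishes a writer in the order the comment proposes: commit first, then close. */
  static void finish(RecordingWriter writer) {
    try {
      writer.commit();
    } finally {
      // Close even if the commit fails, so resources are released.
      writer.close();
    }
  }
}
```

Keeping commit and close in one place means no caller can reach the writer between the two steps, which is the failure the comment reports.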




[jira] [Commented] (NUTCH-1480) SolrIndexer to write to multiple servers.

2017-11-24 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16265302#comment-16265302
 ] 

ASF GitHub Bot commented on NUTCH-1480:
---

r0ann3l commented on issue #218: fix for NUTCH-1480 contributed by r0ann3l
URL: https://github.com/apache/nutch/pull/218#issuecomment-346834693
 
 
   Hi @sebastian, thank you very much for your comments! I agree with your 
suggestions and have included the changes you proposed in your fork.
   
   About indexer-dummy: I also tried to make it work, but it was not possible. 
In theory, you can build as many instances of `IndexWriters` as you want and 
you will always get the same instance, because it is taken from the cache. The 
first issue I found was that `ObjectCache` uses the `Configuration` object 
itself as the key, and this object is not the same in each call. This is why 
there are two instances of `IndexWriters` writing to the same file, as you 
say. So I replaced the key of `ObjectCache` with the UUID of the 
`Configuration` object.
   
   Now we have only one instance of `IndexWriters`, but there is another 
problem: when you try to commit the writers in 
`IndexingJob.index(IndexingJob.java:151)`, they have already been closed from 
`IndexerOutputFormat.getRecordWriter(IndexerOutputFormat.java:44)`. Therefore, 
I moved the `commit()` call from `IndexingJob` to `IndexerOutputFormat`, just 
before the `close()` method is called.
   
   I also moved the indexers description from `IndexingJob` to 
`IndexerOutputFormat`, to avoid building the `IndexWriters` instance twice.
   
   Thanks
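
The cache-key problem described above can be illustrated with a minimal sketch. It assumes a configuration class with identity-based `equals`/`hashCode`; `FakeConf`, the `conf.uuid` property and both lookup methods are hypothetical stand-ins, not the actual `Configuration` or `ObjectCache` code.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

class CacheKeyDemo {

  /** Hypothetical stand-in for a configuration object with identity-based equals/hashCode. */
  static final class FakeConf {
    final Map<String, String> props = new HashMap<>();

    /** A fresh configuration gets its own UUID property. */
    FakeConf() {
      props.put("conf.uuid", UUID.randomUUID().toString());
    }

    /** Copy constructor: the copy keeps all properties, including the UUID. */
    FakeConf(FakeConf other) {
      props.putAll(other.props);
    }
  }

  static final Map<Object, Object> byIdentity = new HashMap<>();
  static final Map<String, Object> byUuid = new HashMap<>();

  /** Cache keyed by the configuration object itself: a copy is a different key. */
  static Object writersByIdentity(FakeConf conf) {
    return byIdentity.computeIfAbsent(conf, k -> new Object());
  }

  /** Cache keyed by the UUID stored in the configuration: a copy maps to the same entry. */
  static Object writersByUuid(FakeConf conf) {
    return byUuid.computeIfAbsent(conf.props.get("conf.uuid"), k -> new Object());
  }

  public static void main(String[] args) {
    FakeConf original = new FakeConf();
    FakeConf copy = new FakeConf(original);
    System.out.println(writersByIdentity(original) == writersByIdentity(copy)); // false
    System.out.println(writersByUuid(original) == writersByUuid(copy));         // true
  }
}
```

With the identity key, every copy of the configuration produces a new cached object; with the UUID key, copies share one entry, which is the behaviour the change aims for.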




[jira] [Commented] (NUTCH-1480) SolrIndexer to write to multiple servers.

2017-11-17 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16257012#comment-16257012
 ] 

ASF GitHub Bot commented on NUTCH-1480:
---

sebastian-nagel commented on issue #218: fix for NUTCH-1480 contributed by 
r0ann3l
URL: https://github.com/apache/nutch/pull/218#issuecomment-345254791
 
 
   Hi @r0ann3l, thanks! I've continued testing, and was able to feed two Solr 
indexes in parallel. Great! As far as I can see, all requested changes have 
been made (also that of @lewismc).
   
   To make the configuration work out of the box, I would suggest 3 changes:
   - use only field names defined in the default schema.xml
   `ERROR: [doc=http://nutch.apache.org/] unknown field 'search'`
   - default Solr core name should be "nutch" as described in the 
[tutorial](https://wiki.apache.org/nutch/NutchTutorial)
   
   I've tried to fix these issues in "[a fork of 
NUTCH-1480](https://github.com/sebastian-nagel/nutch/commits/NUTCH-1480)". Feel 
free to cherry pick it from there.
   
   I've also tried to make indexer-dummy work, without success: the file is 
created but then overwritten:
   
   - there are two instances of `IndexWriters` active, each having a separate 
instance of DummyIndexWriter.
  - the instance created from 
`IndexerOutputFormat.getRecordWriter(IndexerOutputFormat.java:39)` writes into 
the file
  - but later on the instance created from 
`IndexWriters.open(IndexWriters.java:187)` opens the file anew, so at the end 
there is an empty file. Because there are two instances, there is no way to 
check whether the file writer is already instantiated.
   
   I see two potential solutions:
   1. the IndexWriter interface method `open(job, name)` was defined with file 
indexers in mind (cf. 
NUTCH-1541/[CSVIndexWriter](https://github.com/sebastian-nagel/nutch/blob/NUTCH-1541/src/plugin/indexer-csv/src/java/org/apache/nutch/indexwriter/csv/CSVIndexWriter.java#L233)):
 an index writer can then decide to do nothing when called with name "commit".
   2. do not call the `commit()` method explicitly (possibly also remove it from 
the interface): it does not work safely in distributed mode because it is not 
run in the reducers (see the comment in RabbitIndexWriter).
   
   I lean towards the second solution. It would also solve the problem of having two 
IndexWriters instances active. What do you think?
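
The overwrite described above can be reproduced with a minimal sketch. It assumes both writer instances open the same path without append mode; the class name, method name and file name prefix are hypothetical and not part of indexer-dummy.

```java
import java.io.FileWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

class TruncatingWriterDemo {

  /**
   * Opens the same file twice, as two independent writer instances would,
   * and returns the file size after the second instance has been closed.
   */
  static long sizeAfterSecondOpen() {
    try {
      Path file = Files.createTempFile("dummy-index", ".txt");
      try {
        // First instance: writes a record and closes.
        try (FileWriter first = new FileWriter(file.toFile())) {
          first.write("add\thttp://example.org/\n");
        }
        // Second instance: opening without append mode truncates the file.
        try (FileWriter second = new FileWriter(file.toFile())) {
          second.flush(); // nothing is written, e.g. only a commit is issued
        }
        return Files.size(file);
      } finally {
        Files.deleteIfExists(file);
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println(sizeAfterSecondOpen()); // 0
  }
}
```

The second open discards what the first instance wrote, which matches the empty file observed at the end of the job.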




[jira] [Commented] (NUTCH-1480) SolrIndexer to write to multiple servers.

2017-11-08 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16244013#comment-16244013
 ] 

ASF GitHub Bot commented on NUTCH-1480:
---

r0ann3l commented on issue #218: fix for NUTCH-1480 contributed by r0ann3l
URL: https://github.com/apache/nutch/pull/218#issuecomment-342832072
 
 
   Thanks @sebastian-nagel for your review. Sections for all indexer-* plugins 
were added, so they work out-of-the-box as you requested in your comments. Also, 
it is no longer mandatory to specify fields for the actions (the schema is relaxed).
   
   I included a new change, to avoid duplicate values in a field when someone 
tries to copy to the same field, like: 
   
   ```
   

   
   ```
   
   In addition, I added a new class (IndexWriterParams) to facilitate the 
process of obtaining and parsing values from the index-writers.xml file. Now, 
an instance of IndexWriterParams is passed to each IndexWriter instead of a 
HashMap.
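
As a rough illustration of why a typed parameter holder is more convenient than a raw map, here is a minimal sketch. The `Params` class, its method names and the sample keys are assumptions made for this example; the actual IndexWriterParams API may differ.

```java
import java.util.HashMap;
import java.util.Map;

class WriterParamsDemo {

  /** Hypothetical typed view over the string parameters of a single index writer. */
  static final class Params {
    private final Map<String, String> values;

    Params(Map<String, String> values) {
      this.values = new HashMap<>(values);
    }

    String get(String key, String defaultValue) {
      return values.getOrDefault(key, defaultValue);
    }

    int getInt(String key, int defaultValue) {
      String raw = values.get(key);
      return raw == null ? defaultValue : Integer.parseInt(raw.trim());
    }

    boolean getBoolean(String key, boolean defaultValue) {
      String raw = values.get(key);
      return raw == null ? defaultValue : Boolean.parseBoolean(raw.trim());
    }

    /** Splits a comma-delimited value, e.g. a list of server URLs. */
    String[] getStrings(String key) {
      String raw = values.get(key);
      return raw == null ? new String[0] : raw.trim().split("\\s*,\\s*");
    }
  }

  /** Builds a sample parameter set; keys and values are illustrative only. */
  static Params sample() {
    Map<String, String> raw = new HashMap<>();
    raw.put("url", "http://localhost:8983/solr/nutch, http://localhost:8984/solr/nutch");
    raw.put("commitSize", "250");
    return new Params(raw);
  }

  public static void main(String[] args) {
    Params params = sample();
    System.out.println(params.getStrings("url").length);    // 2
    System.out.println(params.getInt("commitSize", 1000));  // 250
    System.out.println(params.getBoolean("durable", true)); // true
  }
}
```

Each writer then asks for a typed value with a default instead of repeating `Integer.parseInt(map.getOrDefault(...))` at every call site.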




[jira] [Commented] (NUTCH-1480) SolrIndexer to write to multiple servers.

2017-09-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16172924#comment-16172924
 ] 

ASF GitHub Bot commented on NUTCH-1480:
---

sebastian-nagel commented on a change in pull request #218: fix for NUTCH-1480 
contributed by r0ann3l
URL: https://github.com/apache/nutch/pull/218#discussion_r139908742
 
 

 ##
 File path: conf/index-writers.xml.template
 ##
 @@ -0,0 +1,75 @@
+
+
+http://lucene.apache.org/nutch";
+ xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance";
+ xsi:schemaLocation="http://lucene.apache.org/nutch index-writers.xsd">
+
+  
+
+  
+  http://localhost:8983/solr/core_name"/>
+  
+  
+  
+  
+  
+
+
+  
+
+
+  
+  
+
+
+  
+  
+
+  
+
+  
+  
+
+  
+  
+  
+  
+  
+  
+  
+  
+  
+  
+  
+
+
+  
+
+  
+  
+
+
+  
+  
+
+
+
+  
+
+  
 
 Review comment:
   Add stub sections for all indexer-* plugins so that they work 
out-of-the-box without requiring modifications of the index-writers.xml, e.g. 
for indexer-dummy:
   ```
 
   
 
   
   
 
   
 
 
 
   
 
   ```
   That's long for a dummy section, but the schema (index-writers.xsd) and the 
IndexWriters class require all the elements and attributes. Maybe it's better 
to "relax" the schema, make elements/attributes optional and make IndexWriters 
not fail with NPEs.
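
A minimal sketch of what NPE-free reading of an optional section could look like, using only the JDK DOM parser. The element and attribute names (`writer`, `parameters`, `param`, `name`, `value`) are assumptions made for this example, since the XML in this thread did not survive the archive; this is not the actual IndexWriters code.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class LenientWriterConfigDemo {

  /**
   * Reads the name/value pairs of all "param" elements in the given XML.
   * Returns an empty map if there are none, instead of failing.
   */
  static Map<String, String> readParams(String xml) {
    try {
      Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
          .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
      Map<String, String> params = new LinkedHashMap<>();
      // getElementsByTagName returns an empty list when nothing matches.
      NodeList nodes = doc.getElementsByTagName("param");
      for (int i = 0; i < nodes.getLength(); i++) {
        Element param = (Element) nodes.item(i);
        // getAttribute returns "" for a missing attribute, never null.
        if (!param.getAttribute("name").isEmpty()) {
          params.put(param.getAttribute("name"), param.getAttribute("value"));
        }
      }
      return params;
    } catch (Exception e) {
      throw new IllegalArgumentException("Cannot parse writer configuration", e);
    }
  }

  public static void main(String[] args) {
    // A stub section without any parameters is accepted.
    System.out.println(readParams("<writer id=\"indexer_dummy\"/>")); // {}
    System.out.println(readParams(
        "<writer id=\"indexer_dummy\"><parameters>"
            + "<param name=\"path\" value=\"/tmp/out.txt\"/>"
            + "</parameters></writer>")); // {path=/tmp/out.txt}
  }
}
```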
 



[jira] [Commented] (NUTCH-1480) SolrIndexer to write to multiple servers.

2017-09-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16172899#comment-16172899
 ] 

ASF GitHub Bot commented on NUTCH-1480:
---

sebastian-nagel commented on issue #218: fix for NUTCH-1480 contributed by 
r0ann3l
URL: https://github.com/apache/nutch/pull/218#issuecomment-330784463
 
 
   Looks good! I've tried to use indexer-dummy with this PR applied - it took a 
long time to configure the index-writers.xml properly, so we should definitely add 
"stub" sections for all index writers which are (still) based on configuration 
properties. All index writers should work out-of-the-box!
 



[jira] [Commented] (NUTCH-1480) SolrIndexer to write to multiple servers.

2017-09-14 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16166732#comment-16166732
 ] 

ASF GitHub Bot commented on NUTCH-1480:
---

lewismc commented on issue #218: fix for NUTCH-1480 contributed by r0ann3l
URL: https://github.com/apache/nutch/pull/218#issuecomment-329563051
 
 
   Tested this on Solr 6 and works well... any comments folks?
 



[jira] [Commented] (NUTCH-1480) SolrIndexer to write to multiple servers.

2017-09-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16164899#comment-16164899
 ] 

ASF GitHub Bot commented on NUTCH-1480:
---

r0ann3l commented on issue #218: fix for NUTCH-1480 contributed by r0ann3l
URL: https://github.com/apache/nutch/pull/218#issuecomment-329219778
 
 
   Thanks @lewismc and @jorgelbg for your reviews. All your comments have been 
addressed.
 



[jira] [Commented] (NUTCH-1480) SolrIndexer to write to multiple servers.

2017-09-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16164857#comment-16164857
 ] 

ASF GitHub Bot commented on NUTCH-1480:
---

r0ann3l commented on a change in pull request #218: fix for NUTCH-1480 
contributed by r0ann3l
URL: https://github.com/apache/nutch/pull/218#discussion_r138659435
 
 

 ##
 File path: 
src/plugin/indexer-rabbit/src/java/org/apache/nutch/indexwriter/rabbit/RabbitMQConstants.java
 ##
 @@ -17,28 +17,26 @@
 package org.apache.nutch.indexwriter.rabbit;
 
 interface RabbitMQConstants {
-String RABBIT_PREFIX = "rabbitmq.indexer";
 
 Review comment:
   Hi @lewismc. The prefix is no longer necessary. The new structure allows 
the same parameter key to be used by many index writers without ambiguity 
or confusion. The prefix only makes a parameter key longer, and I really do not 
believe that is necessary.
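
To illustrate the point about scoping, here is a minimal sketch in which each writer owns its parameter map, so an identical key cannot collide across writers. The writer ids, the `server.host` key and the host names are hypothetical values chosen for this example.

```java
import java.util.HashMap;
import java.util.Map;

class ScopedParamsDemo {

  /** Parameters grouped per writer id; ids, keys and hosts are illustrative only. */
  static Map<String, Map<String, String>> sampleConfig() {
    Map<String, String> rabbit = new HashMap<>();
    rabbit.put("server.host", "mq.example.org");

    Map<String, String> solr = new HashMap<>();
    solr.put("server.host", "solr.example.org");

    Map<String, Map<String, String>> byWriter = new HashMap<>();
    byWriter.put("indexer_rabbit_1", rabbit);
    byWriter.put("indexer_solr_1", solr);
    return byWriter;
  }

  /** Looks up a key inside one writer's own parameter map; null if writer or key is unknown. */
  static String lookup(Map<String, Map<String, String>> config, String writerId, String key) {
    Map<String, String> params = config.get(writerId);
    return params == null ? null : params.get(key);
  }

  public static void main(String[] args) {
    Map<String, Map<String, String>> config = sampleConfig();
    System.out.println(lookup(config, "indexer_rabbit_1", "server.host")); // mq.example.org
    System.out.println(lookup(config, "indexer_solr_1", "server.host"));   // solr.example.org
  }
}
```

With a single flat `Configuration`, the two entries would need distinct prefixed keys; with one map per writer, the writer id already provides the namespace.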
 



[jira] [Commented] (NUTCH-1480) SolrIndexer to write to multiple servers.

2017-09-05 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16154578#comment-16154578
 ] 

ASF GitHub Bot commented on NUTCH-1480:
---

jorgelbg commented on a change in pull request #218: fix for NUTCH-1480 
contributed by r0ann3l
URL: https://github.com/apache/nutch/pull/218#discussion_r137142167
 
 

 ##
 File path: conf/index-writers.xsd
 ##
 @@ -0,0 +1,179 @@
+
+
+http://www.w3.org/2001/XMLSchema";
+   targetNamespace="http://lucene.apache.org/nutch";
+   xmlns="http://lucene.apache.org/nutch";
+   elementFormDefault="qualified">
+  
+
+  
+Root tag of index-writers.xml document. It's a wrapper for the all 
index writers.
+  
+
+
+  
+
+  
+
+  Contains the all configuration of a particular index writer.
 
 Review comment:
   typo? Also, it would be a good idea to have empty lines at the end of this 
file and the `conf/index-writers.xml.template` for git/diff compatibility.
 



[jira] [Commented] (NUTCH-1480) SolrIndexer to write to multiple servers.

2017-09-05 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16154573#comment-16154573
 ] 

ASF GitHub Bot commented on NUTCH-1480:
---

jorgelbg commented on a change in pull request #218: fix for NUTCH-1480 
contributed by r0ann3l
URL: https://github.com/apache/nutch/pull/218#discussion_r137141458
 
 

 ##
 File path: 
src/plugin/indexer-rabbit/src/java/org/apache/nutch/indexwriter/rabbit/RabbitIndexWriter.java
 ##
 @@ -31,164 +31,182 @@
 import org.slf4j.Logger;
 import org.slf4j.LoggerFactory;
 
+import java.lang.invoke.MethodHandles;
 import java.util.*;
 import java.util.concurrent.TimeoutException;
 
 public class RabbitIndexWriter implements IndexWriter {
 
-private String serverHost;
-private int serverPort;
-private String serverVirtualHost;
-private String serverUsername;
-private String serverPassword;
+  private String serverHost;
+  private int serverPort;
+  private String serverVirtualHost;
+  private String serverUsername;
+  private String serverPassword;
 
-private String exchangeServer;
-private String exchangeType;
+  private String exchangeServer;
+  private String exchangeType;
 
-private String queueName;
-private boolean queueDurable;
-private String queueRoutingKey;
+  private String queueName;
+  private boolean queueDurable;
+  private String queueRoutingKey;
 
-private int commitSize;
+  private int commitSize;
 
-public static final Logger LOG = 
LoggerFactory.getLogger(RabbitIndexWriter.class);
+  private static final Logger LOG = LoggerFactory
+  .getLogger(MethodHandles.lookup().lookupClass());
 
-private Configuration config;
+  private Configuration config;
 
-private RabbitMessage rabbitMessage = new RabbitMessage();
+  private RabbitMessage rabbitMessage = new RabbitMessage();
 
-private Channel channel;
-private Connection connection;
+  private Channel channel;
+  private Connection connection;
 
-@Override
-public Configuration getConf() {
-return config;
-}
+  @Override
+  public Configuration getConf() {
+return config;
+  }
 
-@Override
-public void setConf(Configuration conf) {
-config = conf;
+  @Override
+  public void setConf(Configuration conf) {
+config = conf;
+  }
 
-serverHost = conf.get(RabbitMQConstants.SERVER_HOST, "localhost");
-serverPort = conf.getInt(RabbitMQConstants.SERVER_PORT, 15672);
-serverVirtualHost = conf.get(RabbitMQConstants.SERVER_VIRTUAL_HOST, 
null);
+  @Override
+  public void open(JobConf JobConf, String name) throws IOException {
+//Implementation not required
+  }
 
-serverUsername = conf.get(RabbitMQConstants.SERVER_USERNAME, "admin");
-serverPassword = conf.get(RabbitMQConstants.SERVER_PASSWORD, "admin");
+  /**
+   * Initializes the internal variables from a given index writer 
configuration.
+   *
+   * @param parameters Params from the index writer configuration.
+   * @throws IOException Some exception thrown by writer.
+   */
+  @Override
+  public void open(Map parameters) throws IOException {
+serverHost = parameters.getOrDefault(RabbitMQConstants.SERVER_HOST, 
"localhost");
+serverPort = 
Integer.parseInt(parameters.getOrDefault(RabbitMQConstants.SERVER_PORT, 
"5672"));
+serverVirtualHost = 
parameters.getOrDefault(RabbitMQConstants.SERVER_VIRTUAL_HOST, null);
 
-exchangeServer = conf.get(RabbitMQConstants.EXCHANGE_SERVER, 
"nutch.exchange");
-exchangeType = conf.get(RabbitMQConstants.EXCHANGE_TYPE, "direct");
+serverUsername = 
parameters.getOrDefault(RabbitMQConstants.SERVER_USERNAME, "admin");
+serverPassword = 
parameters.getOrDefault(RabbitMQConstants.SERVER_PASSWORD, "admin");
 
-queueName = conf.get(RabbitMQConstants.QUEUE_NAME, "nutch.queue");
-queueDurable = conf.getBoolean(RabbitMQConstants.QUEUE_DURABLE, true);
-queueRoutingKey = conf.get(RabbitMQConstants.QUEUE_ROUTING_KEY, 
"nutch.key");
+exchangeServer = 
parameters.getOrDefault(RabbitMQConstants.EXCHANGE_SERVER, "nutch.exchange");
+exchangeType = parameters.getOrDefault(RabbitMQConstants.EXCHANGE_TYPE, 
"direct");
 
-commitSize = conf.getInt(RabbitMQConstants.COMMIT_SIZE, 250);
-}
+queueName = parameters.getOrDefault(RabbitMQConstants.QUEUE_NAME, 
"nutch.queue");
+queueDurable = 
Boolean.parseBoolean(parameters.getOrDefault(RabbitMQConstants.QUEUE_DURABLE, 
"true"));
+queueRoutingKey = 
parameters.getOrDefault(RabbitMQConstants.QUEUE_ROUTING_KEY, "nutch.key");
 
-@Override
-public void open(JobConf JobConf, String name) throws IOException {
-ConnectionFactory factory = new ConnectionFactory();
-factory.setHost(serverHost);
-factory.setPort(serverPort);
+commitSize = 
Integer.parseInt(parameters.getOrDefault(RabbitMQConstants.COMMIT_SIZE, "25

[jira] [Commented] (NUTCH-1480) SolrIndexer to write to multiple servers.

2017-09-05 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16154358#comment-16154358
 ] 

ASF GitHub Bot commented on NUTCH-1480:
---

lewismc commented on a change in pull request #218: fix for NUTCH-1480 
contributed by r0ann3l
URL: https://github.com/apache/nutch/pull/218#discussion_r137120181
 
 

 ##
 File path: src/java/org/apache/nutch/indexer/IndexWriters.java
 ##
 @@ -16,132 +16,245 @@
  */
 package org.apache.nutch.indexer;
 
-import java.io.IOException;
-import java.lang.invoke.MethodHandles;
-import java.util.HashMap;
-
 import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.mapred.JobConf;
-import org.apache.nutch.indexer.NutchDocument;
 import org.apache.nutch.plugin.Extension;
 import org.apache.nutch.plugin.ExtensionPoint;
 import org.apache.nutch.plugin.PluginRepository;
 import org.apache.nutch.plugin.PluginRuntimeException;
 import org.apache.nutch.util.ObjectCache;
 import org.slf4j.Logger;
 import org.slf4j.LoggerFactory;
+import org.w3c.dom.Document;
+import org.w3c.dom.Element;
+import org.w3c.dom.NodeList;
+import org.xml.sax.InputSource;
+import org.xml.sax.SAXException;
+
+import javax.xml.parsers.DocumentBuilder;
+import javax.xml.parsers.DocumentBuilderFactory;
+import javax.xml.parsers.ParserConfigurationException;
+import java.io.IOException;
+import java.io.InputStream;
+import java.lang.invoke.MethodHandles;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
 
-/** Creates and caches {@link IndexWriter} implementing plugins. */
+/**
+ * Creates and caches {@link IndexWriter} implementing plugins.
+ */
 public class IndexWriters {
 
   private static final Logger LOG = LoggerFactory
-  .getLogger(MethodHandles.lookup().lookupClass());
+  .getLogger(MethodHandles.lookup().lookupClass());
 
-  private IndexWriter[] indexWriters;
+  private HashMap indexWriters;
 
   public IndexWriters(Configuration conf) {
 ObjectCache objectCache = ObjectCache.get(conf);
+
 synchronized (objectCache) {
-  this.indexWriters = (IndexWriter[]) objectCache
-  .getObject(IndexWriter.class.getName());
+  this.indexWriters = (HashMap) objectCache
+  .getObject(IndexWriterWrapper.class.getName());
+
+  //It's not cached yet
   if (this.indexWriters == null) {
 try {
   ExtensionPoint point = PluginRepository.get(conf).getExtensionPoint(
-  IndexWriter.X_POINT_ID);
-  if (point == null)
+  IndexWriter.X_POINT_ID);
+
+  if (point == null) {
 throw new RuntimeException(IndexWriter.X_POINT_ID + " not found.");
+  }
+
   Extension[] extensions = point.getExtensions();
-  HashMap indexerMap = new HashMap<>();
-  for (int i = 0; i < extensions.length; i++) {
-Extension extension = extensions[i];
-IndexWriter writer = (IndexWriter) 
extension.getExtensionInstance();
-LOG.info("Adding " + writer.getClass().getName());
-if (!indexerMap.containsKey(writer.getClass().getName())) {
-  indexerMap.put(writer.getClass().getName(), writer);
+
+  HashMap extensionMap = new HashMap<>();
+  for (Extension extension : extensions) {
+LOG.info("Index writer " + extension.getClazz() + " identified.");
 
 Review comment:
   Please use parameterized logging
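
For reference, with SLF4J the flagged line would typically become `LOG.info("Index writer {} identified.", extension.getClazz());`, so the message is only assembled when the level is enabled. The sketch below shows the same idea with `java.util.logging`, chosen only to keep the example free of external dependencies; note that it uses `{0}` placeholders rather than SLF4J's `{}`, and the `render` helper is a hypothetical addition for inspecting the result.

```java
import java.util.logging.Level;
import java.util.logging.LogRecord;
import java.util.logging.Logger;
import java.util.logging.SimpleFormatter;

class ParameterizedLoggingDemo {

  private static final Logger LOG =
      Logger.getLogger(ParameterizedLoggingDemo.class.getName());

  /** Renders a pattern with one argument the way the logging framework would. */
  static String render(String pattern, Object arg) {
    LogRecord record = new LogRecord(Level.INFO, pattern);
    record.setParameters(new Object[] { arg });
    return new SimpleFormatter().formatMessage(record);
  }

  public static void main(String[] args) {
    String clazz = "org.apache.nutch.indexwriter.solr.SolrIndexWriter";

    // Concatenation builds the string even if INFO is disabled:
    //   LOG.info("Index writer " + clazz + " identified.");
    // The parameterized form defers formatting to the logging framework:
    LOG.log(Level.INFO, "Index writer {0} identified.", clazz);

    System.out.println(render("Index writer {0} identified.", clazz));
  }
}
```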
 



[jira] [Commented] (NUTCH-1480) SolrIndexer to write to multiple servers.

2017-09-05 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16154359#comment-16154359
 ] 

ASF GitHub Bot commented on NUTCH-1480:
---

lewismc commented on a change in pull request #218: fix for NUTCH-1480 
contributed by r0ann3l
URL: https://github.com/apache/nutch/pull/218#discussion_r137120754
 
 

 ##
 File path: 
src/plugin/indexer-solr/src/java/org/apache/nutch/indexwriter/solr/SolrConstants.java
 ##
 @@ -19,19 +19,23 @@
 public interface SolrConstants {
   public static final String SOLR_PREFIX = "solr.";
 
 Review comment:
   Any reason to remove all of these as well?
 



[jira] [Commented] (NUTCH-1480) SolrIndexer to write to multiple servers.

2017-09-05 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16154354#comment-16154354
 ] 

ASF GitHub Bot commented on NUTCH-1480:
---

lewismc commented on a change in pull request #218: fix for NUTCH-1480 
contributed by r0ann3l
URL: https://github.com/apache/nutch/pull/218#discussion_r137121058
 
 

 ##
 File path: 
src/plugin/indexer-solr/src/java/org/apache/nutch/indexwriter/solr/SolrIndexWriter.java
 ##
 @@ -75,19 +70,61 @@
   private int totalUpdates = 0;
   private boolean delete = false;
 
+  @Override
   public void open(JobConf job, String name) throws IOException {
-    solrClients = SolrUtils.getSolrClients(job);
-    init(solrClients, job);
+    //Implementation not required
   }
 
-  // package protected for tests
-  void init(List<SolrClient> solrClients, JobConf job) throws IOException {
-    batchSize = job.getInt(SolrConstants.COMMIT_SIZE, 1000);
-    solrMapping = SolrMappingReader.getInstance(job);
-    delete = job.getBoolean(IndexerMapReduce.INDEXER_DELETE, false);
+  /**
+   * Initializes the internal variables from a given index writer configuration.
+   *
+   * @param parameters Params from the index writer configuration.
+   * @throws IOException Some exception thrown by writer.
+   */
+  @Override
+  public void open(Map<String, String> parameters) throws IOException {
+    String type = parameters.getOrDefault("type", "http");
+
+    String[] urls = StringUtils.getStrings(parameters.get("url"));
+
+    if (urls == null) {
+      String message = "Missing SOLR URL.\n" + describe();
+      LOG.error(message);
+      throw new RuntimeException(message);
+    }
+
+    this.solrClients = new ArrayList<>();
+
+    switch (type) {
+      case "http":
+        for (String url : urls) {
+          solrClients.add(SolrUtils.getHttpSolrClient(url));
+        }
+        break;
+      case "cloud":
+        for (String url : urls) {
+          CloudSolrClient sc = SolrUtils.getCloudSolrClient(url);
+          sc.setDefaultCollection(parameters.get(SolrConstants.COLLECTION));
+          solrClients.add(sc);
+        }
+        break;
+      case "concurrent":
 
 Review comment:
   Can you throw an unsupported-operation exception at this stage, and also add a default case?
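
   One possible shape for that request, as a sketch only: the class and method names are hypothetical, and the method returns a label where the real writer would build SolrJ clients.

```java
public class ClientTypeSketch {

  /** Returns a label for the client type; real code would create the clients instead. */
  static String describeClient(String type) {
    switch (type) {
      case "http":
        return "HTTP client";
      case "cloud":
        return "SolrCloud client";
      case "concurrent":
        // Not implemented yet: fail loudly instead of silently creating no client.
        throw new UnsupportedOperationException(
            "Client type 'concurrent' is not supported yet");
      default:
        // Catches typos and unknown values in the writer configuration.
        throw new IllegalArgumentException("Unknown Solr client type: " + type);
    }
  }

  public static void main(String[] args) {
    System.out.println(describeClient("http"));
    try {
      describeClient("concurrent");
    } catch (UnsupportedOperationException e) {
      System.out.println("Rejected: " + e.getMessage());
    }
  }
}
```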
 



[jira] [Commented] (NUTCH-1480) SolrIndexer to write to multiple servers.

2017-09-05 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16154357#comment-16154357
 ] 

ASF GitHub Bot commented on NUTCH-1480:
---

lewismc commented on a change in pull request #218: fix for NUTCH-1480 
contributed by r0ann3l
URL: https://github.com/apache/nutch/pull/218#discussion_r137120319
 
 

 ##
 File path: src/java/org/apache/nutch/indexer/MappingReader.java
 ##
 @@ -0,0 +1,93 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.nutch.indexer;
+
+import org.w3c.dom.Element;
+import org.w3c.dom.Node;
+import org.w3c.dom.NodeList;
+
+import java.util.*;
 
 Review comment:
   Do not use wildcard
 



[jira] [Commented] (NUTCH-1480) SolrIndexer to write to multiple servers.

2017-09-05 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16154355#comment-16154355
 ] 

ASF GitHub Bot commented on NUTCH-1480:
---

lewismc commented on a change in pull request #218: fix for NUTCH-1480 
contributed by r0ann3l
URL: https://github.com/apache/nutch/pull/218#discussion_r137119842
 
 

 ##
 File path: src/java/org/apache/nutch/indexer/IndexWriterConfig.java
 ##
 @@ -0,0 +1,95 @@
+package org.apache.nutch.indexer;
+
+import org.w3c.dom.Element;
+import org.w3c.dom.NodeList;
+
+import java.util.*;
 
 Review comment:
   Please use explicit imports instead of wildcard
 



[jira] [Commented] (NUTCH-1480) SolrIndexer to write to multiple servers.

2017-09-05 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16154353#comment-16154353
 ] 

ASF GitHub Bot commented on NUTCH-1480:
---

lewismc commented on a change in pull request #218: fix for NUTCH-1480 
contributed by r0ann3l
URL: https://github.com/apache/nutch/pull/218#discussion_r137120636
 
 

 ##
 File path: 
src/plugin/indexer-rabbit/src/java/org/apache/nutch/indexwriter/rabbit/RabbitMQConstants.java
 ##
 @@ -17,28 +17,26 @@
 package org.apache.nutch.indexwriter.rabbit;
 
 interface RabbitMQConstants {
-String RABBIT_PREFIX = "rabbitmq.indexer";
 
 Review comment:
   Why did you remove all of these?
 



[jira] [Commented] (NUTCH-1480) SolrIndexer to write to multiple servers.

2017-09-05 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16154356#comment-16154356
 ] 

ASF GitHub Bot commented on NUTCH-1480:
---

lewismc commented on a change in pull request #218: fix for NUTCH-1480 
contributed by r0ann3l
URL: https://github.com/apache/nutch/pull/218#discussion_r137119767
 
 

 ##
 File path: src/java/org/apache/nutch/indexer/IndexWriterConfig.java
 ##
 @@ -0,0 +1,95 @@
+package org.apache.nutch.indexer;
 
 Review comment:
   Please add license header
 



[jira] [Commented] (NUTCH-1480) SolrIndexer to write to multiple servers.

2017-09-04 Thread Jorge Luis Betancourt Gonzalez (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16152957#comment-16152957
 ] 

Jorge Luis Betancourt Gonzalez commented on NUTCH-1480:
---

[~markus17] do you mind taking a look at the linked PR? I think the PR covers 
more than the original intent of this issue. Since you've already worked on 
something similar, I think your input would be really valuable in this case.



[jira] [Commented] (NUTCH-1480) SolrIndexer to write to multiple servers.

2017-09-04 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16152885#comment-16152885
 ] 

ASF GitHub Bot commented on NUTCH-1480:
---

odisleysi commented on issue #218: fix for NUTCH-1480 contributed by r0ann3l
URL: https://github.com/apache/nutch/pull/218#issuecomment-327018446
 
 
   I like this idea. I work on a project that needs to save documents in Solr 
for searching and in Elasticsearch for statistics. This solves the problem.
 



[jira] [Commented] (NUTCH-1480) SolrIndexer to write to multiple servers.

2017-09-04 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16152883#comment-16152883
 ] 

ASF GitHub Bot commented on NUTCH-1480:
---

odisleysi commented on issue #218: fix for NUTCH-1480 contributed by r0ann3l
URL: https://github.com/apache/nutch/pull/218#issuecomment-327018446
 
 
   I like this idea. I work on a project that needs to save documents in Solr 
for searching and in Elasticsearch for statistics. This solves the problem.
 



[jira] [Commented] (NUTCH-1480) SolrIndexer to write to multiple servers.

2017-08-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16145516#comment-16145516
 ] 

ASF GitHub Bot commented on NUTCH-1480:
---

jorgelbg commented on issue #218: fix for NUTCH-1480 contributed by r0ann3l
URL: https://github.com/apache/nutch/pull/218#issuecomment-325707916
 
 
   This PR includes more changes than the original ticket and breaks backward 
compatibility (BC) with custom indexers. @r0ann3l could you squash all the 
changes into a single commit? That would help the review process, since this PR 
has a lot of changes.
 



[jira] [Commented] (NUTCH-1480) SolrIndexer to write to multiple servers.

2017-08-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16144081#comment-16144081
 ] 

ASF GitHub Bot commented on NUTCH-1480:
---

r0ann3l opened a new pull request #218: fix for NUTCH-1480 contributed by 
r0ann3l
URL: https://github.com/apache/nutch/pull/218
 
 
   With this patch we can have many instances of the same IndexWriter class, 
each with a different configuration. We can also copy, rename, or remove 
document fields for each index writer individually. In addition, the parameters 
needed by the index writers will live in separate XML files, so they will no 
longer be in nutch-site.xml.
 



[jira] [Commented] (NUTCH-1480) SolrIndexer to write to multiple servers.

2017-08-23 Thread JIRA

[ 
https://issues.apache.org/jira/browse/NUTCH-1480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16138433#comment-16138433
 ] 

Roannel Fernández Hernández commented on NUTCH-1480:


I’m testing a solution which uses this file [1] to configure the index writers. 
In this XML file, each "writer" tag holds the parameters used by that writer 
and a mapping for the fields of the Nutch documents. With this new way of using 
the writers in Nutch, we could have as many field mappings as we need, not only 
for the Solr index writer but for every index writer that we have. We will also 
be able to define different configurations for index writers, even for the same 
IndexWriter class. This solution applies to all types of index writers, not 
just the Solr index writer.

The structure of [1] is described in [2].

[1] 
https://github.com/r0ann3l/nutch/blob/NUTCH-1480/conf/index-writers.xml.template
[2] https://github.com/r0ann3l/nutch/blob/NUTCH-1480/conf/index-writers.xsd
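
To illustrate the idea of per-writer parameters, here is a rough sketch that reads writer ids and their parameters from an inline XML string using the JDK DOM parser. The element and attribute names (`writers`, `writer`, `param`, `id`, `name`, `value`), the writer ids, and the URLs are assumptions made for this example; the actual structure is the one defined in [2].

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

public class IndexWritersConfigSketch {

  /** Maps each writer id to its parameters. Element and attribute names are assumed. */
  static Map<String, Map<String, String>> readWriters(String xml) {
    Document doc;
    try {
      doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
          .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
    } catch (ParserConfigurationException | SAXException | IOException e) {
      throw new IllegalArgumentException("Cannot parse writer configuration", e);
    }
    Map<String, Map<String, String>> writers = new LinkedHashMap<>();
    NodeList writerNodes = doc.getElementsByTagName("writer");
    for (int i = 0; i < writerNodes.getLength(); i++) {
      Element writer = (Element) writerNodes.item(i);
      Map<String, String> params = new LinkedHashMap<>();
      NodeList paramNodes = writer.getElementsByTagName("param");
      for (int j = 0; j < paramNodes.getLength(); j++) {
        Element param = (Element) paramNodes.item(j);
        params.put(param.getAttribute("name"), param.getAttribute("value"));
      }
      writers.put(writer.getAttribute("id"), params);
    }
    return writers;
  }

  public static void main(String[] args) {
    // Two writers of the same (assumed) class with different configurations.
    String xml = "<writers>"
        + "<writer id=\"solr_search\" class=\"SolrIndexWriter\">"
        + "<param name=\"type\" value=\"http\"/>"
        + "<param name=\"url\" value=\"http://localhost:8983/solr/search\"/>"
        + "</writer>"
        + "<writer id=\"solr_stats\" class=\"SolrIndexWriter\">"
        + "<param name=\"type\" value=\"http\"/>"
        + "<param name=\"url\" value=\"http://localhost:8983/solr/stats\"/>"
        + "</writer>"
        + "</writers>";
    System.out.println(readWriters(xml));
  }
}
```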



[jira] [Commented] (NUTCH-1480) SolrIndexer to write to multiple servers.

2013-01-17 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13556106#comment-13556106
 ] 

Markus Jelsma commented on NUTCH-1480:
--

I think we should update to 4.0 in 1.7 (can't find the relevant issue). 4.x can 
work just like 3.x in a stand-alone mode, and NUTCH-1377 still allows indexing 
to a single node (or to multiple single nodes with this patch included).

SolrJ 4.x should work fine with 3.x series servers as, iirc, no javabin changes 
have been made recently.

> SolrIndexer to write to multiple servers.
> -
>
> Key: NUTCH-1480
> URL: https://issues.apache.org/jira/browse/NUTCH-1480
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.7
>
> Attachments: NUTCH-1480-1.6.1.patch
>
>
> SolrUtils should return an array of SolrServers and read the SolrUrl as a 
> comma delimited list of URL's using Configuration.getString(). SolrWriter 
> should be able to handle this list of SolrServers.
> This is useful if you want to send documents to multiple servers if no 
> replication is available or if you want to send documents to multiple NOCs.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (NUTCH-1480) SolrIndexer to write to multiple servers.

2013-01-17 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13556100#comment-13556100
 ] 

Julien Nioche commented on NUTCH-1480:
--

Probably depends on whether we want to support both SOLR 3.x and SOLR 4.x. Got 
your point about indexing to multiple clouds, thanks! 




[jira] [Commented] (NUTCH-1480) SolrIndexer to write to multiple servers.

2013-01-17 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13556098#comment-13556098
 ] 

Markus Jelsma commented on NUTCH-1480:
--

We leave all the shard key hashing up to SolrCloud and the cloud-aware SolrJ 
client we use, see NUTCH-1377. We use both patches and provide two lists of 
ZooKeeper addresses to index to multiple clouds at the same time. As far as I am 
concerned NUTCH-945 is obsolete because SolrCloud handles sharding automatically.



[jira] [Commented] (NUTCH-1480) SolrIndexer to write to multiple servers.

2013-01-17 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13556090#comment-13556090
 ] 

Julien Nioche commented on NUTCH-1480:
--

I'd rather it was implemented as an extension of NUTCH-945 where we'd have a 
partitioner that sends to all SOLR instances, which is I believe what 
NUTCH-1480 is about. There are many cases where we'd want to shard according to 
other criteria and NUTCH-945 would provide a more generic framework. Does this 
make sense?



[jira] [Commented] (NUTCH-1480) SolrIndexer to write to multiple servers.

2012-11-08 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13493163#comment-13493163
 ] 

Markus Jelsma commented on NUTCH-1480:
--

I'm fine with that too. I don't really care as long as it's obvious enough for 
the user. Any more thoughts to share? Opinions?

> SolrIndexer to write to multiple servers.
> -
>
> Key: NUTCH-1480
> URL: https://issues.apache.org/jira/browse/NUTCH-1480
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.6
>
> Attachments: NUTCH-1480-1.6.1.patch
>
>
> SolrUtils should return an array of SolrServers and read the SolrUrl as a 
> comma delimited list of URL's using Configuration.getString(). SolrWriter 
> should be able to handle this list of SolrServers.
> This is useful if you want to send documents to multiple servers if no 
> replication is available or if you want to send documents to multiple NOCs.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (NUTCH-1480) SolrIndexer to write to multiple servers.

2012-11-04 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13490316#comment-13490316
 ] 

Lewis John McGibbney commented on NUTCH-1480:
-

A suggestion for the dedup job is to read in the comma-separated Solr 
cores/servers and then drop everything after the first server in the list. 
Some simple logging could then indicate that dedup was only attempted, and 
subsequently executed, on the first server...
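
A rough sketch of that suggestion, assuming the server list arrives as a single comma-separated string; the class and method names are hypothetical and `java.util.logging` stands in for whatever logger the dedup job uses.

```java
import java.util.logging.Logger;

public class DedupTargetSketch {

  private static final Logger LOG = Logger.getLogger(DedupTargetSketch.class.getName());

  /** Returns the first server in a comma-separated list and logs when others are ignored. */
  static String firstServer(String solrUrls) {
    if (solrUrls == null || solrUrls.trim().isEmpty()) {
      throw new IllegalArgumentException("No Solr URL configured");
    }
    String[] parts = solrUrls.split(",");
    String first = parts[0].trim();
    if (first.isEmpty()) {
      throw new IllegalArgumentException("First Solr URL is empty: " + solrUrls);
    }
    if (parts.length > 1) {
      LOG.warning("Dedup runs only on the first server " + first
          + "; ignoring " + (parts.length - 1) + " other server(s)");
    }
    return first;
  }

  public static void main(String[] args) {
    // Example URLs are placeholders.
    System.out.println(firstServer("http://solr-a:8983/solr, http://solr-b:8983/solr"));
  }
}
```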



[jira] [Commented] (NUTCH-1480) SolrIndexer to write to multiple servers.

2012-11-01 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13488856#comment-13488856
 ] 

Markus Jelsma commented on NUTCH-1480:
--

Jul, yes, doing it via SolrJ and Zookeeper is much cheaper. We don't need to 
obtain hash ranges ourselves.

Lewis, I didn't think about that. That's a problem indeed. Although it is 
technically possible, we must not dedup with two indices at the same time; it's 
going to be trouble. Even with this patch it will still work until someone 
decides to use the crawl command with two Solr servers, comma-separated. 
Although I don't believe users of the crawl command will write to multiple 
locations, we must prevent them from doing so. Perhaps a simple check for a 
comma in the Solr URL and an exception?
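
A minimal sketch of such a check, assuming the URL arrives as a plain String. 
The class and method names are made up for illustration and do not come from 
the patch.

```java
// Hypothetical guard, not part of Nutch: rejects a comma-separated URL list
// where exactly one Solr URL is expected (for example before dedup).
final class SingleSolrUrlGuard {

  /** Returns the trimmed URL, or throws if it is empty or contains a comma. */
  static String requireSingleUrl(String solrUrl) {
    if (solrUrl == null || solrUrl.trim().isEmpty()) {
      throw new IllegalArgumentException("Solr URL must not be empty");
    }
    if (solrUrl.contains(",")) {
      throw new IllegalArgumentException(
          "Expected a single Solr URL but got a comma-separated list: " + solrUrl);
    }
    return solrUrl.trim();
  }
}
```

One caveat with a check this simple: a comma can legitimately appear inside a 
URL, so it would also reject such URLs.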

Something else? Drop deduplication altogether and rely on Solr's internal 
deduplication (which doesn't work for clouds though)? Other clever ideas?



> SolrIndexer to write to multiple servers.
> -
>
> Key: NUTCH-1480
> URL: https://issues.apache.org/jira/browse/NUTCH-1480
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.6
>
> Attachments: NUTCH-1480-1.6.1.patch
>
>
> SolrUtils should return an array of SolrServers and read the SolrUrl as a 
> comma delimited list of URL's using Configuration.getString(). SolrWriter 
> should be able to handle this list of SolrServers.
> This is useful if you want to send documents to multiple servers if no 
> replication is available or if you want to send documents to multiple NOCs.



[jira] [Commented] (NUTCH-1480) SolrIndexer to write to multiple servers.

2012-11-01 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13488852#comment-13488852
 ] 

Lewis John McGibbney commented on NUTCH-1480:
-

Mmm. Additionally, there is a problem with this when you run a continuous or 
large crawl with the crawl script. By default it attempts to dedup and the job 
fails. Fundamentally though, the patch works with trunk (and Solr 4.0) as you 
describe... after I've applied the patch in NUTCH-1486.

> SolrIndexer to write to multiple servers.
> -
>
> Key: NUTCH-1480
> URL: https://issues.apache.org/jira/browse/NUTCH-1480
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.6
>
> Attachments: NUTCH-1480-1.6.1.patch
>
>
> SolrUtils should return an array of SolrServers and read the SolrUrl as a 
> comma delimited list of URL's using Configuration.getString(). SolrWriter 
> should be able to handle this list of SolrServers.
> This is useful if you want to send documents to multiple servers if no 
> replication is available or if you want to send documents to multiple NOCs.



[jira] [Commented] (NUTCH-1480) SolrIndexer to write to multiple servers.

2012-11-01 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13488786#comment-13488786
 ] 

Julien Nioche commented on NUTCH-1480:
--

Nope. I meant implementing the distribution to the shards on the Nutch side 
without relying on the CloudSolrServer. Having said that, we want to move to 
SOLR4, and if we get that from SOLR for cheap then that's even better.

> SolrIndexer to write to multiple servers.
> -
>
> Key: NUTCH-1480
> URL: https://issues.apache.org/jira/browse/NUTCH-1480
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.6
>
> Attachments: NUTCH-1480-1.6.1.patch
>
>
> SolrUtils should return an array of SolrServers and read the SolrUrl as a 
> comma delimited list of URL's using Configuration.getString(). SolrWriter 
> should be able to handle this list of SolrServers.
> This is useful if you want to send documents to multiple servers if no 
> replication is available or if you want to send documents to multiple NOCs.



[jira] [Commented] (NUTCH-1480) SolrIndexer to write to multiple servers.

2012-11-01 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13488744#comment-13488744
 ] 

Markus Jelsma commented on NUTCH-1480:
--

I think you mean the just-linked issue NUTCH-1377? I just happen to be working 
on that issue. Using the CloudSolrServer will send the docs to the correct 
shard already. The way it works now is sending millions of records to a single 
node, which then distributes them again, a waste of IO. That issue will work 
with this issue, so I may want to push them in together. It's just a matter of 
returning the correct SolrServer instance and working around the HttpClient 
issues.

Agreed on deduplication. It would also be hard for this issue to work with 
dedup because not all indices may be identical.

> SolrIndexer to write to multiple servers.
> -
>
> Key: NUTCH-1480
> URL: https://issues.apache.org/jira/browse/NUTCH-1480
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.6
>
> Attachments: NUTCH-1480-1.6.1.patch
>
>
> SolrUtils should return an array of SolrServers and read the SolrUrl as a 
> comma delimited list of URL's using Configuration.getString(). SolrWriter 
> should be able to handle this list of SolrServers.
> This is useful if you want to send documents to multiple servers if no 
> replication is available or if you want to send documents to multiple NOCs.



[jira] [Commented] (NUTCH-1480) SolrIndexer to write to multiple servers.

2012-11-01 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13488738#comment-13488738
 ] 

Julien Nioche commented on NUTCH-1480:
--

OK thanks. What about having a mechanism for specifying a way of distributing 
the docs with the replicate-to-all being one of the options? Could do 
consistent hashing maybe? I expect that most people would want to shard.
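
To make the consistent hashing idea concrete, here is a small self-contained 
sketch. It is an assumption about how such a distribution option might look, 
not anything from the patch or from SolrCloud's own routing: the class name, 
the use of MD5, and the virtual-node count are all illustrative choices.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical consistent-hash ring mapping document ids to shard URLs.
final class ConsistentHashRing {
  private final TreeMap<Long, String> ring = new TreeMap<>();

  /** Places virtualNodes points on the ring for every shard URL. */
  ConsistentHashRing(List<String> shardUrls, int virtualNodes) {
    if (shardUrls.isEmpty() || virtualNodes < 1) {
      throw new IllegalArgumentException("Need at least one shard and one virtual node");
    }
    for (String url : shardUrls) {
      for (int i = 0; i < virtualNodes; i++) {
        ring.put(hash(url + "#" + i), url);
      }
    }
  }

  /** Returns the shard owning the first ring point at or after the id's hash. */
  String shardFor(String docId) {
    Map.Entry<Long, String> entry = ring.ceilingEntry(hash(docId));
    // Wrap around to the first point when the hash is past the last one.
    return (entry != null ? entry : ring.firstEntry()).getValue();
  }

  /** Folds the first 8 bytes of the MD5 digest into a long. */
  static long hash(String key) {
    try {
      byte[] digest = MessageDigest.getInstance("MD5")
          .digest(key.getBytes(StandardCharsets.UTF_8));
      long h = 0;
      for (int i = 0; i < 8; i++) {
        h = (h << 8) | (digest[i] & 0xffL);
      }
      return h;
    } catch (NoSuchAlgorithmException e) {
      throw new IllegalStateException("MD5 not available", e);
    }
  }
}
```

The point of the ring over a plain "hash modulo number of shards" is that 
adding a shard only moves the documents that land on the new shard; all other 
documents keep their previous shard.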

Off topic, re deduplication: I think we've hit the limits of the current 
mechanism, which I assume was based on the one we had when Nutch was managing 
its own Lucene indices. It's not reasonable to pump ALL the docs from SOLR into 
Hadoop to dedup, and I'd rather have map reduce jobs to find the duplicates 
based on the crawldb and send the deletion commands to SOLR. And this would 
work for ElasticSearch as well. I'm pretty sure there is a JIRA for this 
somewhere.
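
The core of that idea, stripped of Hadoop, is grouping by content signature and 
deleting all but one URL per group. The sketch below shows only that grouping 
step in plain Java, under stated assumptions: the class name is hypothetical, 
the input map stands in for crawldb records, and keeping the lexicographically 
first URL is an arbitrary rule chosen for the example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for a reduce step keyed on the content signature.
final class SignatureDedup {

  /**
   * Takes a URL -> signature map and returns the URLs to delete, keeping
   * one URL (the lexicographically first) for every signature.
   */
  static List<String> urlsToDelete(Map<String, String> urlToSignature) {
    // Group URLs by signature, as a reducer would receive them.
    Map<String, List<String>> bySignature = new TreeMap<>();
    for (Map.Entry<String, String> e : urlToSignature.entrySet()) {
      bySignature.computeIfAbsent(e.getValue(), k -> new ArrayList<>())
          .add(e.getKey());
    }
    List<String> deletions = new ArrayList<>();
    for (List<String> urls : bySignature.values()) {
      Collections.sort(urls);
      // Keep urls.get(0); everything after it is a duplicate.
      deletions.addAll(urls.subList(1, urls.size()));
    }
    return deletions;
  }
}
```

The returned list is what would be turned into delete commands for the index, 
whichever backend is in use.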

> SolrIndexer to write to multiple servers.
> -
>
> Key: NUTCH-1480
> URL: https://issues.apache.org/jira/browse/NUTCH-1480
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.6
>
> Attachments: NUTCH-1480-1.6.1.patch
>
>
> SolrUtils should return an array of SolrServers and read the SolrUrl as a 
> comma delimited list of URL's using Configuration.getString(). SolrWriter 
> should be able to handle this list of SolrServers.
> This is useful if you want to send documents to multiple servers if no 
> replication is available or if you want to send documents to multiple NOCs.



[jira] [Commented] (NUTCH-1480) SolrIndexer to write to multiple servers.

2012-11-01 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13488732#comment-13488732
 ] 

Markus Jelsma commented on NUTCH-1480:
--

Julien, yes. The Nutch tools are modified to work with a List<SolrServer> now. 
One or more SolrServers are returned depending on how many you specify 
comma-separated. Communication to Solr is simply done by iterating over the 
List and repeating the commands such as add or delete or commit.
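
The iterate-and-repeat approach described above can be sketched without SolrJ 
as follows. IndexTarget, RecordingTarget and FanOutWriter are hypothetical 
names invented for this example; in the patch the list elements are SolrServer 
instances, whose API is not reproduced here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a connection to one Solr server.
interface IndexTarget {
  void add(String docId);
  void delete(String docId);
  void commit();
}

// Test double that records every command it receives.
final class RecordingTarget implements IndexTarget {
  final List<String> log = new ArrayList<>();

  @Override public void add(String docId) { log.add("add " + docId); }
  @Override public void delete(String docId) { log.add("delete " + docId); }
  @Override public void commit() { log.add("commit"); }
}

// Repeats every command on each target in the list, in order.
final class FanOutWriter {
  private final List<IndexTarget> targets;

  FanOutWriter(List<? extends IndexTarget> targets) {
    this.targets = new ArrayList<>(targets);
  }

  void add(String docId) {
    for (IndexTarget target : targets) {
      target.add(docId);
    }
  }

  void delete(String docId) {
    for (IndexTarget target : targets) {
      target.delete(docId);
    }
  }

  void commit() {
    for (IndexTarget target : targets) {
      target.commit();
    }
  }
}
```

In this sketch a failure on one target aborts the loop, so later targets never 
see the command. Whether the real writer should behave that way is a design 
question this example does not answer.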

> SolrIndexer to write to multiple servers.
> -
>
> Key: NUTCH-1480
> URL: https://issues.apache.org/jira/browse/NUTCH-1480
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.6
>
> Attachments: NUTCH-1480-1.6.1.patch
>
>
> SolrUtils should return an array of SolrServers and read the SolrUrl as a 
> comma delimited list of URL's using Configuration.getString(). SolrWriter 
> should be able to handle this list of SolrServers.
> This is useful if you want to send documents to multiple servers if no 
> replication is available or if you want to send documents to multiple NOCs.



[jira] [Commented] (NUTCH-1480) SolrIndexer to write to multiple servers.

2012-11-01 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13488730#comment-13488730
 ] 

Markus Jelsma commented on NUTCH-1480:
--

Hi Lewis, Solr does not have a notion of pseudo-distributed mode. You can 
simply run one server with two cores, two servers on different ports or 
different machines, or have a SolrCloud cluster in multiple NOCs; all have a 
unique URL. A SolrCloud cluster is roughly the same as multiple stand-alone 
servers working together.

A core is similar to a table in a SQL server.

See http://wiki.apache.org/solr/CoreAdmin#Example on how to quickly set up two 
cores in a single Solr server. Copy Nutch's schema and you have two Nutch 
indices you can write to in a single go.

> SolrIndexer to write to multiple servers.
> -
>
> Key: NUTCH-1480
> URL: https://issues.apache.org/jira/browse/NUTCH-1480
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.6
>
> Attachments: NUTCH-1480-1.6.1.patch
>
>
> SolrUtils should return an array of SolrServers and read the SolrUrl as a 
> comma delimited list of URL's using Configuration.getString(). SolrWriter 
> should be able to handle this list of SolrServers.
> This is useful if you want to send documents to multiple servers if no 
> replication is available or if you want to send documents to multiple NOCs.



[jira] [Commented] (NUTCH-1480) SolrIndexer to write to multiple servers.

2012-11-01 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13488728#comment-13488728
 ] 

Julien Nioche commented on NUTCH-1480:
--

Hi Lewis

bq. Can I run multiple Solr servers in pseudo-distributed mode?

SOLR is completely separate from Hadoop and has nothing to do with local vs 
distributed mode. You can run several instances of SOLR on the same machine if 
that is your question. Just specify a different port when starting it from the 
command line with a separate SOLR home.

Markus,

Just to make sure I understand - this sends ALL the documents to ALL the SOLR 
instances specified, right? 

> SolrIndexer to write to multiple servers.
> -
>
> Key: NUTCH-1480
> URL: https://issues.apache.org/jira/browse/NUTCH-1480
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.6
>
> Attachments: NUTCH-1480-1.6.1.patch
>
>
> SolrUtils should return an array of SolrServers and read the SolrUrl as a 
> comma delimited list of URL's using Configuration.getString(). SolrWriter 
> should be able to handle this list of SolrServers.
> This is useful if you want to send documents to multiple servers if no 
> replication is available or if you want to send documents to multiple NOCs.



[jira] [Commented] (NUTCH-1480) SolrIndexer to write to multiple servers.

2012-11-01 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13488721#comment-13488721
 ] 

Lewis John McGibbney commented on NUTCH-1480:
-

Hi Markus. Can I run multiple Solr servers in pseudo-distributed mode? If so, 
can you provide a link to a (Solr) wiki entry and I will try this out. Also, 
can you clarify the difference between cores and servers? I don't see the 
server terminology listed on the Solr wiki [0]. Thanks
[0] http://wiki.apache.org/solr/SolrTerminology

> SolrIndexer to write to multiple servers.
> -
>
> Key: NUTCH-1480
> URL: https://issues.apache.org/jira/browse/NUTCH-1480
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.6
>
> Attachments: NUTCH-1480-1.6.1.patch
>
>
> SolrUtils should return an array of SolrServers and read the SolrUrl as a 
> comma delimited list of URL's using Configuration.getString(). SolrWriter 
> should be able to handle this list of SolrServers.
> This is useful if you want to send documents to multiple servers if no 
> replication is available or if you want to send documents to multiple NOCs.
