[jira] [Commented] (NUTCH-1480) SolrIndexer to write to multiple servers.
[ https://issues.apache.org/jira/browse/NUTCH-1480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16538658#comment-16538658 ]

Hudson commented on NUTCH-1480:
---

SUCCESS: Integrated in Jenkins build Nutch-trunk #3543 (See [https://builds.apache.org/job/Nutch-trunk/3543/])

fixes for NUTCH-1514: Support for NUTCH-1480. (r0ann3l: [https://github.com/apache/nutch/commit/0176883bd663088da99ab54840987092066dc5ac])
* (edit) src/plugin/indexer-csv/src/java/org/apache/nutch/indexwriter/csv/CSVIndexWriter.java

Fix unit tests for changes related to NUTCH-1480 (snagel: [https://github.com/apache/nutch/commit/06221e069227b6e2f7b6b13eb0df6cb98ba21a46])
* (edit) src/plugin/indexer-csv/src/test/org/apache/nutch/indexwriter/csv/TestCSVIndexWriter.java

> SolrIndexer to write to multiple servers.
> ---
>
> Key: NUTCH-1480
> URL: https://issues.apache.org/jira/browse/NUTCH-1480
> Project: Nutch
> Issue Type: Improvement
> Components: indexer
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Priority: Minor
> Fix For: 1.15
>
> Attachments: NUTCH-1480-1.6.1.patch, adding-support-for-sharding-indexer-for-solr.patch
>
>
> SolrUtils should return an array of SolrServers and read the SolrUrl as a
> comma delimited list of URL's using Configuration.getString(). SolrWriter
> should be able to handle this list of SolrServers.
> This is useful if you want to send documents to multiple servers if no
> replication is available or if you want to send documents to multiple NOCs.
> edit:
> This does not replace NUTCH-1377 but complements it. With NUTCH-1377 this
> issue allows you to index to multiple SolrCloud clusters at the same time.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
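The issue description above boils down to two steps: split the configured URL string on commas, then send every document to each resulting server. A minimal sketch in plain Java, with no Nutch or Solr dependency, might look like the following. `DocSink`, `parseUrls`, `FanOutWriter`, and the host names are hypothetical and exist only for illustration; this is not the API that was eventually committed for this issue.

```java
import java.util.ArrayList;
import java.util.List;

final class MultiServerSketch {

  /** Hypothetical stand-in for a per-server client (for example, one Solr client). */
  interface DocSink {
    void write(String doc);
  }

  /** Splits a comma-delimited URL list, trimming whitespace and skipping empty entries. */
  static List<String> parseUrls(String commaDelimited) {
    List<String> urls = new ArrayList<>();
    if (commaDelimited == null) {
      return urls;
    }
    for (String part : commaDelimited.split(",")) {
      String url = part.trim();
      if (!url.isEmpty()) {
        urls.add(url);
      }
    }
    return urls;
  }

  /** Forwards every document to all configured sinks, in order. */
  static final class FanOutWriter implements DocSink {
    private final List<DocSink> sinks;

    FanOutWriter(List<DocSink> sinks) {
      this.sinks = new ArrayList<>(sinks);
    }

    @Override
    public void write(String doc) {
      for (DocSink sink : sinks) {
        sink.write(doc);
      }
    }
  }

  public static void main(String[] args) {
    // Hypothetical URLs; a real setup would read this string from the configuration.
    List<String> urls = parseUrls("http://host1:8983/solr/nutch, http://host2:8983/solr/nutch");
    List<DocSink> sinks = new ArrayList<>();
    for (String url : urls) {
      sinks.add(doc -> System.out.println(url + " <- " + doc));
    }
    new FanOutWriter(sinks).write("doc-1");
  }
}
```

The sketch stops at the first sink that throws; whether a failure on one server should abort the others is a policy decision the description leaves open.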
[jira] [Commented] (NUTCH-1480) SolrIndexer to write to multiple servers.
[ https://issues.apache.org/jira/browse/NUTCH-1480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16507142#comment-16507142 ]

ASF GitHub Bot commented on NUTCH-1480:
---

odisleysi commented on issue #218: fix for NUTCH-1480 contributed by r0ann3l
URL: https://github.com/apache/nutch/pull/218#issuecomment-327018446

I like this idea. I work on a project that needs to save documents in Solr for searching and in Elasticsearch for statistics. This solves the problem.

This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Commented] (NUTCH-1480) SolrIndexer to write to multiple servers.
[ https://issues.apache.org/jira/browse/NUTCH-1480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16499028#comment-16499028 ]

Hudson commented on NUTCH-1480:
---

SUCCESS: Integrated in Jenkins build Nutch-trunk #3528 (See [https://builds.apache.org/job/Nutch-trunk/3528/])

Fixes for NUTCH-2580: Support for NUTCH-1480. (roannel.fdez: [https://github.com/apache/nutch/commit/5d7d8167e350edd5bc37454cd73a412c570a13b1])
* (edit) conf/index-writers.xml.template
* (edit) src/java/org/apache/nutch/indexer/IndexWriterParams.java
[jira] [Commented] (NUTCH-1480) SolrIndexer to write to multiple servers.
[ https://issues.apache.org/jira/browse/NUTCH-1480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16498408#comment-16498408 ]

Hudson commented on NUTCH-1480:
---

SUCCESS: Integrated in Jenkins build Nutch-trunk #3527 (See [https://builds.apache.org/job/Nutch-trunk/3527/])

Fixes for NUTCH-1480: Multiple index writer instances with different (roannel.fdez: [https://github.com/apache/nutch/commit/e4a7f871b1b03f901279e24cc1c626e5c1b67643])
* (edit) src/plugin/indexer-dummy/src/java/org/apache/nutch/indexwriter/dummy/DummyIndexWriter.java
* (edit) src/plugin/indexer-solr/src/java/org/apache/nutch/indexwriter/solr/SolrConstants.java
* (add) conf/index-writers.xsd
* (edit) src/java/org/apache/nutch/indexer/IndexWriter.java
* (edit) src/java/org/apache/nutch/indexer/NutchField.java
* (edit) src/java/org/apache/nutch/indexer/NutchDocument.java
* (edit) src/plugin/indexer-elastic/src/java/org/apache/nutch/indexwriter/elastic/ElasticIndexWriter.java
* (edit) src/plugin/indexer-solr/src/java/org/apache/nutch/indexwriter/solr/SolrUtils.java
* (add) src/java/org/apache/nutch/indexer/IndexWriterConfig.java
* (delete) src/plugin/indexer-solr/src/java/org/apache/nutch/indexwriter/solr/SolrMappingReader.java
* (add) conf/index-writers.xml.template
* (edit) src/plugin/indexer-rabbit/src/java/org/apache/nutch/indexwriter/rabbit/RabbitMQConstants.java
* (edit) src/plugin/indexer-rabbit/src/java/org/apache/nutch/indexwriter/rabbit/RabbitIndexWriter.java
* (edit) src/plugin/indexer-solr/src/java/org/apache/nutch/indexwriter/solr/SolrIndexWriter.java
* (edit) src/plugin/indexer-cloudsearch/src/java/org/apache/nutch/indexwriter/cloudsearch/CloudSearchIndexWriter.java
* (edit) src/java/org/apache/nutch/indexer/IndexWriters.java
* (add) src/java/org/apache/nutch/indexer/MappingReader.java
* (edit) src/plugin/indexer-elastic-rest/src/java/org/apache/nutch/indexwriter/elasticrest/ElasticRestIndexWriter.java

Fixes for NUTCH-1480: Some improvements based on reviewers feedback. (roannel.fdez: [https://github.com/apache/nutch/commit/86cd375e267036596f19376e2499e1d1c4ccdcbb])
* (edit) src/java/org/apache/nutch/indexer/MappingReader.java
* (edit) src/plugin/indexer-solr/src/java/org/apache/nutch/indexwriter/solr/SolrIndexWriter.java
* (edit) src/java/org/apache/nutch/indexer/IndexWriters.java
* (edit) src/plugin/indexer-rabbit/src/java/org/apache/nutch/indexwriter/rabbit/RabbitIndexWriter.java
* (edit) conf/index-writers.xml.template
* (edit) src/java/org/apache/nutch/indexer/IndexWriterConfig.java
* (edit) conf/index-writers.xsd

Fixes for NUTCH-1480: Sections for all indexer-* plugins, relaxed (roannel.fdez: [https://github.com/apache/nutch/commit/84246a9e8fb183a28983a70d3d30d7d9a474ce58])
* (edit) src/java/org/apache/nutch/indexer/IndexWriters.java
* (add) src/plugin/indexer-dummy/src/java/org/apache/nutch/indexwriter/dummy/DummyConstants.java
* (edit) src/plugin/indexer-rabbit/src/java/org/apache/nutch/indexwriter/rabbit/RabbitIndexWriter.java
* (edit) src/plugin/indexer-cloudsearch/src/java/org/apache/nutch/indexwriter/cloudsearch/CloudSearchConstants.java
* (edit) src/plugin/indexer-elastic/src/java/org/apache/nutch/indexwriter/elastic/ElasticIndexWriter.java
* (add) src/java/org/apache/nutch/indexer/IndexWriterParams.java
* (edit) src/plugin/indexer-elastic-rest/src/java/org/apache/nutch/indexwriter/elasticrest/ElasticRestConstants.java
* (edit) conf/index-writers.xml.template
* (edit) src/plugin/indexer-dummy/src/java/org/apache/nutch/indexwriter/dummy/DummyIndexWriter.java
* (edit) src/java/org/apache/nutch/indexer/IndexWriter.java
* (edit) src/plugin/indexer-cloudsearch/src/java/org/apache/nutch/indexwriter/cloudsearch/CloudSearchIndexWriter.java
* (edit) src/java/org/apache/nutch/indexer/IndexWriterConfig.java
* (edit) src/plugin/indexer-solr/src/java/org/apache/nutch/indexwriter/solr/SolrIndexWriter.java
* (edit) src/plugin/indexer-elastic/src/java/org/apache/nutch/indexwriter/elastic/ElasticConstants.java
* (edit) src/plugin/indexer-elastic-rest/src/java/org/apache/nutch/indexwriter/elasticrest/ElasticRestIndexWriter.java

Fixes for NUTCH-1480: Changes: - Logs for IndexerOutputFormat class to (roannel.fdez: [https://github.com/apache/nutch/commit/7e9d1df08817c54d50eed3945136033a7fd7af00])
* (edit) conf/log4j.properties
* (edit) src/java/org/apache/nutch/indexer/IndexerOutputFormat.java
* (edit) src/java/org/apache/nutch/indexer/IndexingJob.java
* (edit) src/java/org/apache/nutch/indexer/IndexerMapReduce.java
* (edit) src/java/org/apache/nutch/util/ObjectCache.java

Fixes for NUTCH-1480: Support for NUTCH-2484 and NUTCH-2380. (roannel.fdez: [https://github.com/apache/nutch/commit/d45510c186b3dbee3c3f7882c90ab3d28409a0b8])
* (edit) src/plugin/indexer-elastic-rest/src/java/org/apache/nutch/indexwriter/elasticrest/ElasticRestIndexWriter.java
* (edit) conf/index-writers.xml.template
* (edit) src/plugin/indexer-elastic-rest/s
[jira] [Commented] (NUTCH-1480) SolrIndexer to write to multiple servers.
[ https://issues.apache.org/jira/browse/NUTCH-1480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16498332#comment-16498332 ]

ASF GitHub Bot commented on NUTCH-1480:
---

sebastian-nagel closed pull request #218: fix for NUTCH-1480 contributed by r0ann3l
URL: https://github.com/apache/nutch/pull/218

This is a PR merged from a forked repository. As GitHub hides the original diff on merge, it is displayed below for the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied below (as it won't show otherwise due to GitHub magic):

diff --git a/conf/index-writers.xml.template b/conf/index-writers.xml.template
new file mode 100644
index 0..118c8bc88
--- /dev/null
+++ b/conf/index-writers.xml.template
@@ -0,0 +1,144 @@
[XML element markup of this file was stripped by the archive; only these fragments survive]
+ [...] http://lucene.apache.org/nutch"
+   xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
+   xsi:schemaLocation="http://lucene.apache.org/nutch index-writers.xsd">
+ [...] http://localhost:8983/solr/nutch"/>
+ [...]

diff --git a/conf/index-writers.xsd b/conf/index-writers.xsd
new file mode 100644
index 0..50ab1f313
--- /dev/null
+++ b/conf/index-writers.xsd
@@ -0,0 +1,179 @@
[XML element markup of this file was stripped by the archive; the schema attributes and documentation texts survive, in this order]
+ [...] http://www.w3.org/2001/XMLSchema"
+   targetNamespace="http://lucene.apache.org/nutch"
+   xmlns="http://lucene.apache.org/nutch"
+   elementFormDefault="qualified">
+ Root tag of index-writers.xml document. It's a wrapper for the all index writers.
+ Contains all the configuration of a particular index writer.
+ This tag contains all the parameters that will be passed to the index writer implementation.
+ It's a wrapper for the allowed actions over a document before it's indexed.
+ Writer's ID.
+ The class of the index writer implementation which will be used.
+ One single parameter that will be pass to the index writer implementation.
+ Parameter's name. It is used to identify the parameter.
+ Parameter's value.
+ Action of copy fields. Multiple comma-separated targets can be specified.
+ Action of rename fields.
+ Action of remove fields.
+ One single field that will be mapped.
+ Field's name before it's mapped.
+ One single field that will be mapped.
+ Field's name before it's mapped.
+ Field's name after the action is applied.

diff --git a/conf/log4j.properties b/conf/log4j.properties
index 6fad2b5d3..1939014dc 100644
--- a/conf/log4j.properties
+++ b/conf/log4j.prop
[jira] [Commented] (NUTCH-1480) SolrIndexer to write to multiple servers.
[ https://issues.apache.org/jira/browse/NUTCH-1480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16498327#comment-16498327 ]

ASF GitHub Bot commented on NUTCH-1480:
---

sebastian-nagel commented on a change in pull request #218: fix for NUTCH-1480 contributed by r0ann3l
URL: https://github.com/apache/nutch/pull/218#discussion_r192467967

## File path: src/plugin/indexer-solr/src/java/org/apache/nutch/indexwriter/solr/SolrConstants.java ##
@@ -19,19 +19,23 @@
 public interface SolrConstants {
   public static final String SOLR_PREFIX = "solr.";

Review comment:
It's still used for `ZOOKEEPER_HOSTS = SOLR_PREFIX + "zookeeper.hosts"`
[jira] [Commented] (NUTCH-1480) SolrIndexer to write to multiple servers.
[ https://issues.apache.org/jira/browse/NUTCH-1480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16496678#comment-16496678 ]

ASF GitHub Bot commented on NUTCH-1480:
---

sebastian-nagel commented on issue #218: fix for NUTCH-1480 contributed by r0ann3l
URL: https://github.com/apache/nutch/pull/218#issuecomment-393561347

Thanks, @r0ann3l! +1 - I've tested the solution in local and pseudo-distributed mode and was able to index into Solr (a single index). If there are no objections I'll commit/merge soon.
[jira] [Commented] (NUTCH-1480) SolrIndexer to write to multiple servers.
[ https://issues.apache.org/jira/browse/NUTCH-1480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16482083#comment-16482083 ]

ASF GitHub Bot commented on NUTCH-1480:
---

r0ann3l commented on issue #218: fix for NUTCH-1480 contributed by r0ann3l
URL: https://github.com/apache/nutch/pull/218#issuecomment-390522266

Hi @sebastian-nagel, I changed the use of `ObjectCache.class` to an internal CACHE object, to avoid changing the behavior of other functionality. In this case it is not necessary to have a different instance of `IndexWriters.class` for each `Configuration.class`, because the index writers' configuration is handled in a separate file.

I also fixed an issue in `TestElasticIndexWriter.class` (associated with the use of `IndexWriterParams.class`) that caused the unit test to fail.
[jira] [Commented] (NUTCH-1480) SolrIndexer to write to multiple servers.
[ https://issues.apache.org/jira/browse/NUTCH-1480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16328935#comment-16328935 ]

ASF GitHub Bot commented on NUTCH-1480:
---

r0ann3l commented on issue #218: fix for NUTCH-1480 contributed by r0ann3l
URL: https://github.com/apache/nutch/pull/218#issuecomment-358355731

Hi @sebastian-nagel: In this case I propose to use an internal CACHE object, as the PluginRepository does, to store the IndexWriters object. The code could be something like this:

```
// One IndexWriters instance per configuration UUID.
private static final WeakHashMap<String, IndexWriters> CACHE = new WeakHashMap<>();

public static synchronized IndexWriters get(Configuration conf) {
  String uuid = NutchConfiguration.getUUID(conf);
  if (uuid == null) {
    // Configuration was not created through NutchConfiguration.
    uuid = "nonNutchConf@" + conf.hashCode();
  }
  return CACHE.computeIfAbsent(uuid, k -> new IndexWriters(conf));
}
```

What do you think?
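To try the proposal above outside Nutch, the same pattern can be written as a self-contained sketch. `Conf` and `Writers` are hypothetical stand-ins for Hadoop's `Configuration` and Nutch's `IndexWriters`; only the caching logic mirrors the snippet in the comment.

```java
import java.util.Map;
import java.util.WeakHashMap;

final class WriterCacheSketch {

  /** Hypothetical stand-in for a Configuration; uuid == null models a non-Nutch configuration. */
  static final class Conf {
    final String uuid;

    Conf(String uuid) {
      this.uuid = uuid;
    }
  }

  /** Hypothetical stand-in for IndexWriters. */
  static final class Writers {
    final Conf conf;

    Writers(Conf conf) {
      this.conf = conf;
    }
  }

  // Keyed by UUID string rather than by the Conf object itself.
  private static final Map<String, Writers> CACHE = new WeakHashMap<>();

  /** Returns the cached instance for the configuration's UUID, creating it on first use. */
  static synchronized Writers get(Conf conf) {
    String key = conf.uuid != null ? conf.uuid : "nonNutchConf@" + conf.hashCode();
    return CACHE.computeIfAbsent(key, k -> new Writers(conf));
  }

  public static void main(String[] args) {
    Conf first = new Conf("uuid-1");
    Conf copy = new Conf("uuid-1");
    // Two configuration objects with the same UUID share one Writers instance.
    System.out.println(get(first) == get(copy));
  }
}
```

One property worth keeping in mind: `WeakHashMap` references its keys weakly, so an entry can be dropped once nothing else holds the key string. That is most relevant for the freshly built `"nonNutchConf@..."` fallback key, which only the map references.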
[jira] [Commented] (NUTCH-1480) SolrIndexer to write to multiple servers.
[ https://issues.apache.org/jira/browse/NUTCH-1480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16309960#comment-16309960 ]

ASF GitHub Bot commented on NUTCH-1480:
---

lewismc commented on issue #218: fix for NUTCH-1480 contributed by r0ann3l
URL: https://github.com/apache/nutch/pull/218#issuecomment-355074694

@r0ann3l can you please update this PR in line with master?

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
[jira] [Commented] (NUTCH-1480) SolrIndexer to write to multiple servers.
[ https://issues.apache.org/jira/browse/NUTCH-1480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16271543#comment-16271543 ]

ASF GitHub Bot commented on NUTCH-1480:
---

sebastian-nagel commented on issue #218: fix for NUTCH-1480 contributed by r0ann3l
URL: https://github.com/apache/nutch/pull/218#issuecomment-347996663

Hi @r0ann3l, did you verify that using `NutchConfiguration.getUUID(conf)` changes the behavior? Cf. [NUTCH-2407](https://issues.apache.org/jira/browse/NUTCH-2407?focusedCommentId=16180780&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16180780), which makes me doubt it, as it's only a random UUID for each Configuration object.

Using the UUID in the ObjectCache makes the unit tests fail (TestGenerator): in fact, the ObjectCache now returns the same object even if the configuration is different. We actually need to implement a hash value for Configuration objects.
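The last point above, a hash value that depends on what a configuration contains rather than on which object it is, can be sketched as follows. This is only an illustration under stated assumptions: a plain `Map<String, String>` stands in for the configuration's properties, values are assumed non-null, and `contentKey` is a hypothetical helper, not something Nutch or Hadoop provides.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.TreeMap;

final class ConfigKeySketch {

  /** Returns a hex SHA-256 digest that depends only on property names and values. */
  static String contentKey(Map<String, String> props) {
    try {
      MessageDigest digest = MessageDigest.getInstance("SHA-256");
      // Sort by property name so that insertion order does not affect the key.
      for (Map.Entry<String, String> e : new TreeMap<>(props).entrySet()) {
        update(digest, e.getKey());
        update(digest, e.getValue());
      }
      StringBuilder hex = new StringBuilder();
      for (byte b : digest.digest()) {
        hex.append(String.format("%02x", b));
      }
      return hex.toString();
    } catch (NoSuchAlgorithmException ex) {
      throw new IllegalStateException("SHA-256 not available", ex);
    }
  }

  /** Feeds a length-prefixed string so that ("ab", "c") and ("a", "bc") hash differently. */
  private static void update(MessageDigest digest, String s) {
    byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
    digest.update(ByteBuffer.allocate(4).putInt(bytes.length).array());
    digest.update(bytes);
  }

  public static void main(String[] args) {
    Map<String, String> conf = new TreeMap<>();
    conf.put("solr.server.url", "http://localhost:8983/solr/nutch");
    System.out.println(contentKey(conf));
  }
}
```

With such a key, two configurations with equal properties map to the same cache entry, and changing any property yields a different entry, which is the behavior the failing test appears to expect.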
[jira] [Commented] (NUTCH-1480) SolrIndexer to write to multiple servers.
[ https://issues.apache.org/jira/browse/NUTCH-1480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16265473#comment-16265473 ]

ASF GitHub Bot commented on NUTCH-1480:
---

r0ann3l commented on issue #218: fix for NUTCH-1480 contributed by r0ann3l
URL: https://github.com/apache/nutch/pull/218#issuecomment-346834693

Hi @sebastian-nagel, thank you very much for your comments!!! I agree with your suggestions and I included the changes you propose from your fork.

About indexer-dummy, I also tried to make it work, but it was not possible. In theory, you can build as many instances of `IndexWriters` as you want and you will always get the same instance, because it is fetched from the cache. So, the first issue I found was that `ObjectCache` uses the `Configuration` object itself as the key, and this object is not the same in each call. As a result, there are two instances of `IndexWriters` writing to the same file, as you say. So, I replaced the key of `ObjectCache` with the UUID of the `Configuration` object.

Now we have only one instance of `IndexWriters`, but there is another problem: when you try to commit the writers in `IndexingJob.index(IndexingJob.java:151)`, they are already closed by `IndexerOutputFormat.getRecordWriter(IndexerOutputFormat.java:44)`. Therefore, I moved the `commit()` call from `IndexingJob` to `IndexerOutputFormat`, just before the `close()` method is called. I also moved the indexers description from `IndexingJob` to `IndexerOutputFormat`, to avoid building the `IndexWriters` instance twice.

Thanks
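The ordering fix described above (commit inside the record writer's `close()`, just before the underlying writers are closed) can be illustrated with a small sketch. All types here are hypothetical simplifications: `Writer` reduces an index writer to three lifecycle calls, and `OutputRecordWriter` plays the role of the record writer returned by the output format; none of this is the actual Nutch code.

```java
import java.util.ArrayList;
import java.util.List;

final class CommitBeforeCloseSketch {

  /** Hypothetical, reduced view of an index writer's lifecycle. */
  interface Writer {
    void write(String doc);

    void commit();

    void close();
  }

  /** Records lifecycle calls and rejects a commit that arrives after close. */
  static final class RecordingWriter implements Writer {
    final List<String> calls = new ArrayList<>();
    private boolean closed;

    @Override
    public void write(String doc) {
      calls.add("write:" + doc);
    }

    @Override
    public void commit() {
      if (closed) {
        throw new IllegalStateException("commit after close");
      }
      calls.add("commit");
    }

    @Override
    public void close() {
      closed = true;
      calls.add("close");
    }
  }

  /** Owns the writer; committing here means no outside caller has to commit a closed writer. */
  static final class OutputRecordWriter {
    private final Writer writer;

    OutputRecordWriter(Writer writer) {
      this.writer = writer;
    }

    void write(String doc) {
      writer.write(doc);
    }

    /** Commits pending documents, then closes the writer even if the commit fails. */
    void close() {
      try {
        writer.commit();
      } finally {
        writer.close();
      }
    }
  }

  public static void main(String[] args) {
    RecordingWriter writer = new RecordingWriter();
    OutputRecordWriter out = new OutputRecordWriter(writer);
    out.write("doc-1");
    out.close();
    System.out.println(writer.calls);
  }
}
```

Keeping commit and close in one place makes the order explicit; a later `commit()` from the job driver, as in the failure described above, would hit the "commit after close" check in this sketch.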
[jira] [Commented] (NUTCH-1480) SolrIndexer to write to multiple servers.
[ https://issues.apache.org/jira/browse/NUTCH-1480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16257012#comment-16257012 ]
ASF GitHub Bot commented on NUTCH-1480:
---
sebastian-nagel commented on issue #218: fix for NUTCH-1480 contributed by r0ann3l
URL: https://github.com/apache/nutch/pull/218#issuecomment-345254791

Hi @r0ann3l, thanks! I've continued testing and was able to feed two Solr indexes in parallel. Great! Afaics, all requested changes have been made (including those of @lewismc).

To make the configuration work out of the box, I would suggest 3 changes:
- use only field names defined in the default schema.xml (otherwise: `ERROR: [doc=http://nutch.apache.org/] unknown field 'search'`)
- the default Solr core name should be "nutch", as described in the [tutorial](https://wiki.apache.org/nutch/NutchTutorial)

I've tried to fix these issues in "[a fork of NUTCH-1480](https://github.com/sebastian-nagel/nutch/commits/NUTCH-1480)". Feel free to cherry-pick from there.

I've also tried to make indexer-dummy work, without success: the file is created but then overwritten.
- there are two instances of `IndexWriters` active, each with a separate instance of DummyIndexWriter
- the instance created from `IndexerOutputFormat.getRecordWriter(IndexerOutputFormat.java:39)` writes into the file
- but later the instance created from `IndexWriters.open(IndexWriters.java:187)` opens the file anew, so at the end there is an empty file

Because there are two instances, there is no way to check whether the file writer is already instantiated. I see two potential solutions:
1. the IndexWriter interface method `open(job, name)` was defined with file indexers in mind (cf. NUTCH-1541/[CSVIndexWriter](https://github.com/sebastian-nagel/nutch/blob/NUTCH-1541/src/plugin/indexer-csv/src/java/org/apache/nutch/indexwriter/csv/CSVIndexWriter.java#L233)); an index writer can then decide to do nothing when called with the name "commit".
2. do not call the `commit()` method explicitly (possibly also remove it from the interface): it does not work safely in distributed mode because it is not run in the reducers (see the comment in RabbitIndexWriter).

I tend towards the second solution. It would also solve the problem of having two IndexWriters instances active. What do you think?
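The first of the two solutions above can be sketched as follows. This is a minimal illustration, not the actual DummyIndexWriter or CSVIndexWriter: the class `FileWriterSketch` and its methods are hypothetical, and the only behaviour relied on is that `Files.newBufferedWriter` without open options truncates an existing file.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical file-based writer used only to illustrate the "commit" no-op idea. */
final class FileWriterSketch {
  private final Path path;
  private BufferedWriter out;

  FileWriterSketch(Path path) {
    this.path = path;
  }

  /** Opens (and truncates) the output file unless called for the commit phase. */
  void open(String name) throws IOException {
    if ("commit".equals(name)) {
      return; // a second instance must not reopen and thereby empty the file
    }
    out = Files.newBufferedWriter(path, StandardCharsets.UTF_8);
  }

  void write(String line) throws IOException {
    out.write(line);
    out.newLine();
  }

  void close() throws IOException {
    if (out != null) {
      out.close();
    }
  }
}
```

With this guard, a second instance opened with the name "commit" leaves the content written by the first instance untouched. It does not remove the duplicate instance itself, which is what the second solution would address.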
[jira] [Commented] (NUTCH-1480) SolrIndexer to write to multiple servers.
[ https://issues.apache.org/jira/browse/NUTCH-1480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16244013#comment-16244013 ]
ASF GitHub Bot commented on NUTCH-1480:
---
r0ann3l commented on issue #218: fix for NUTCH-1480 contributed by r0ann3l
URL: https://github.com/apache/nutch/pull/218#issuecomment-342832072

Thanks @sebastian-nagel for your review. Sections for all indexer-* plugins have been added, so they work out of the box as you requested in your comments. Also, it is no longer mandatory to specify fields for the actions (the schema is relaxed).

I included a new change to avoid duplicate values in a field when someone tries to copy to the same field, like:
```
```
In addition, I added a new class (IndexWriterParams) to simplify obtaining and parsing values from the index-writers.xml file. An instance of IndexWriterParams is now passed to each IndexWriter instead of a HashMap.
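As a rough idea of what a parameter wrapper like the one mentioned above might provide, here is a minimal sketch. It is not the actual IndexWriterParams class: the class name `WriterParamsSketch`, its method names and the keys used in the usage note are assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical typed view over string parameters; not the actual IndexWriterParams. */
final class WriterParamsSketch {
  private final Map<String, String> values;

  WriterParamsSketch(Map<String, String> values) {
    this.values = new HashMap<>(values);
  }

  /** Returns the trimmed value, or the default when the key is absent. */
  String get(String key, String defaultValue) {
    String v = values.get(key);
    return v != null ? v.trim() : defaultValue;
  }

  int getInt(String key, int defaultValue) {
    String v = values.get(key);
    return v != null ? Integer.parseInt(v.trim()) : defaultValue;
  }

  boolean getBoolean(String key, boolean defaultValue) {
    String v = values.get(key);
    return v != null ? Boolean.parseBoolean(v.trim()) : defaultValue;
  }

  /** Splits a comma-separated value, e.g. a list of server URLs. */
  String[] getStrings(String key) {
    String v = values.get(key);
    if (v == null || v.trim().isEmpty()) {
      return new String[0];
    }
    return v.trim().split("\\s*,\\s*");
  }
}
```

A writer could then read, say, `params.getInt("commitSize", 1000)` or `params.getStrings("url")` instead of repeating the parsing and default handling in every plugin.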
[jira] [Commented] (NUTCH-1480) SolrIndexer to write to multiple servers.
[ https://issues.apache.org/jira/browse/NUTCH-1480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16172924#comment-16172924 ]
ASF GitHub Bot commented on NUTCH-1480:
---
sebastian-nagel commented on a change in pull request #218: fix for NUTCH-1480 contributed by r0ann3l
URL: https://github.com/apache/nutch/pull/218#discussion_r139908742

## File path: conf/index-writers.xml.template
## @@ -0,0 +1,75 @@
+http://lucene.apache.org/nutch"
+ xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
+ xsi:schemaLocation="http://lucene.apache.org/nutch index-writers.xsd">
+ http://localhost:8983/solr/core_name"/>

Review comment:
Add stub sections for all indexer-* plugins so that they work out of the box, without requiring modifications of index-writers.xml, e.g. for indexer-dummy:
```
```
That's long for a dummy section, but the schema (index-writers.xsd) and the IndexWriters class require all the elements and attributes. Maybe it's better to "relax" the schema, make elements/attributes optional, and make IndexWriters not fail with NPEs.
[jira] [Commented] (NUTCH-1480) SolrIndexer to write to multiple servers.
[ https://issues.apache.org/jira/browse/NUTCH-1480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16172899#comment-16172899 ]
ASF GitHub Bot commented on NUTCH-1480:
---
sebastian-nagel commented on issue #218: fix for NUTCH-1480 contributed by r0ann3l
URL: https://github.com/apache/nutch/pull/218#issuecomment-330784463

Looks good! I've tried to use indexer-dummy with this PR applied. It took a long time to configure index-writers.xml properly, so we should definitely add "stub" sections for all index writers which are (still) based on configuration properties. All index writers should work out of the box!
[jira] [Commented] (NUTCH-1480) SolrIndexer to write to multiple servers.
[ https://issues.apache.org/jira/browse/NUTCH-1480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16166732#comment-16166732 ]
ASF GitHub Bot commented on NUTCH-1480:
---
lewismc commented on issue #218: fix for NUTCH-1480 contributed by r0ann3l
URL: https://github.com/apache/nutch/pull/218#issuecomment-329563051

Tested this on Solr 6 and works well... any comments folks?
[jira] [Commented] (NUTCH-1480) SolrIndexer to write to multiple servers.
[ https://issues.apache.org/jira/browse/NUTCH-1480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16164899#comment-16164899 ]
ASF GitHub Bot commented on NUTCH-1480:
---
r0ann3l commented on issue #218: fix for NUTCH-1480 contributed by r0ann3l
URL: https://github.com/apache/nutch/pull/218#issuecomment-329219778

Thanks @lewismc and @jorgelbg for your reviews. All your comments have been addressed.
[jira] [Commented] (NUTCH-1480) SolrIndexer to write to multiple servers.
[ https://issues.apache.org/jira/browse/NUTCH-1480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16164857#comment-16164857 ]
ASF GitHub Bot commented on NUTCH-1480:
---
r0ann3l commented on a change in pull request #218: fix for NUTCH-1480 contributed by r0ann3l
URL: https://github.com/apache/nutch/pull/218#discussion_r138659435

## File path: src/plugin/indexer-rabbit/src/java/org/apache/nutch/indexwriter/rabbit/RabbitMQConstants.java
## @@ -17,28 +17,26 @@
 package org.apache.nutch.indexwriter.rabbit;

 interface RabbitMQConstants {
-  String RABBIT_PREFIX = "rabbitmq.indexer";

Review comment:
Hi @lewismc. The prefix is no longer necessary. The new structure lets many index writers use the same parameter key without ambiguity or confusion. The prefix only makes a parameter key longer, and I really do not believe it is necessary.
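The point about prefixes can be made concrete with a small sketch: once each index writer has its own parameter map, two writers can use the same key without a prefix such as `rabbitmq.indexer`. The writer ids, keys and host names below are made up for illustration and are not taken from the pull request.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative only: writer ids and parameter keys are made up for this sketch. */
final class ScopedParamsSketch {

  /** One parameter map per writer id, so equal keys cannot collide. */
  static Map<String, Map<String, String>> scoped() {
    Map<String, String> rabbit = new HashMap<>();
    rabbit.put("host", "rabbit.example.org");
    rabbit.put("port", "5672");

    Map<String, String> solr = new HashMap<>();
    solr.put("host", "solr.example.org");
    solr.put("port", "8983");

    Map<String, Map<String, String>> byWriter = new HashMap<>();
    byWriter.put("indexer_rabbit_1", rabbit);
    byWriter.put("indexer_solr_1", solr);
    return byWriter;
  }

  /** Looks up a key within one writer's own parameters; null if writer or key is unknown. */
  static String lookup(Map<String, Map<String, String>> byWriter, String writerId, String key) {
    Map<String, String> params = byWriter.get(writerId);
    return params == null ? null : params.get(key);
  }
}
```

In a single flat property file the same two settings would need distinct names, which is the role the removed prefix used to play.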
[jira] [Commented] (NUTCH-1480) SolrIndexer to write to multiple servers.
[ https://issues.apache.org/jira/browse/NUTCH-1480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16154578#comment-16154578 ]
ASF GitHub Bot commented on NUTCH-1480:
---
jorgelbg commented on a change in pull request #218: fix for NUTCH-1480 contributed by r0ann3l
URL: https://github.com/apache/nutch/pull/218#discussion_r137142167

## File path: conf/index-writers.xsd
## @@ -0,0 +1,179 @@
+http://www.w3.org/2001/XMLSchema"
+ targetNamespace="http://lucene.apache.org/nutch"
+ xmlns="http://lucene.apache.org/nutch"
+ elementFormDefault="qualified">
+Root tag of index-writers.xml document. It's a wrapper for the all index writers.
+ Contains the all configuration of a particular index writer.

Review comment:
typo? Also, it would be a good idea to have empty lines at the end of this file and of `conf/index-writers.xml.template` for git/diff compatibility.
[jira] [Commented] (NUTCH-1480) SolrIndexer to write to multiple servers.
[ https://issues.apache.org/jira/browse/NUTCH-1480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16154573#comment-16154573 ]
ASF GitHub Bot commented on NUTCH-1480:
---
jorgelbg commented on a change in pull request #218: fix for NUTCH-1480 contributed by r0ann3l
URL: https://github.com/apache/nutch/pull/218#discussion_r137141458

## File path: src/plugin/indexer-rabbit/src/java/org/apache/nutch/indexwriter/rabbit/RabbitIndexWriter.java
## @@ -31,164 +31,182 @@
 import org.slf4j.Logger;
 import org.slf4j.LoggerFactory;

+import java.lang.invoke.MethodHandles;
 import java.util.*;
 import java.util.concurrent.TimeoutException;

 public class RabbitIndexWriter implements IndexWriter {

-    private String serverHost;
-    private int serverPort;
-    private String serverVirtualHost;
-    private String serverUsername;
-    private String serverPassword;
+  private String serverHost;
+  private int serverPort;
+  private String serverVirtualHost;
+  private String serverUsername;
+  private String serverPassword;

-    private String exchangeServer;
-    private String exchangeType;
+  private String exchangeServer;
+  private String exchangeType;

-    private String queueName;
-    private boolean queueDurable;
-    private String queueRoutingKey;
+  private String queueName;
+  private boolean queueDurable;
+  private String queueRoutingKey;

-    private int commitSize;
+  private int commitSize;

-    public static final Logger LOG = LoggerFactory.getLogger(RabbitIndexWriter.class);
+  private static final Logger LOG = LoggerFactory
+      .getLogger(MethodHandles.lookup().lookupClass());

-    private Configuration config;
+  private Configuration config;

-    private RabbitMessage rabbitMessage = new RabbitMessage();
+  private RabbitMessage rabbitMessage = new RabbitMessage();

-    private Channel channel;
-    private Connection connection;
+  private Channel channel;
+  private Connection connection;

-    @Override
-    public Configuration getConf() {
-        return config;
-    }
+  @Override
+  public Configuration getConf() {
+    return config;
+  }

-    @Override
-    public void setConf(Configuration conf) {
-        config = conf;
+  @Override
+  public void setConf(Configuration conf) {
+    config = conf;
+  }

-        serverHost = conf.get(RabbitMQConstants.SERVER_HOST, "localhost");
-        serverPort = conf.getInt(RabbitMQConstants.SERVER_PORT, 15672);
-        serverVirtualHost = conf.get(RabbitMQConstants.SERVER_VIRTUAL_HOST, null);
+  @Override
+  public void open(JobConf JobConf, String name) throws IOException {
+    //Implementation not required
+  }

-        serverUsername = conf.get(RabbitMQConstants.SERVER_USERNAME, "admin");
-        serverPassword = conf.get(RabbitMQConstants.SERVER_PASSWORD, "admin");
+  /**
+   * Initializes the internal variables from a given index writer configuration.
+   *
+   * @param parameters Params from the index writer configuration.
+   * @throws IOException Some exception thrown by writer.
+   */
+  @Override
+  public void open(Map parameters) throws IOException {
+    serverHost = parameters.getOrDefault(RabbitMQConstants.SERVER_HOST, "localhost");
+    serverPort = Integer.parseInt(parameters.getOrDefault(RabbitMQConstants.SERVER_PORT, "5672"));
+    serverVirtualHost = parameters.getOrDefault(RabbitMQConstants.SERVER_VIRTUAL_HOST, null);

-        exchangeServer = conf.get(RabbitMQConstants.EXCHANGE_SERVER, "nutch.exchange");
-        exchangeType = conf.get(RabbitMQConstants.EXCHANGE_TYPE, "direct");
+    serverUsername = parameters.getOrDefault(RabbitMQConstants.SERVER_USERNAME, "admin");
+    serverPassword = parameters.getOrDefault(RabbitMQConstants.SERVER_PASSWORD, "admin");

-        queueName = conf.get(RabbitMQConstants.QUEUE_NAME, "nutch.queue");
-        queueDurable = conf.getBoolean(RabbitMQConstants.QUEUE_DURABLE, true);
-        queueRoutingKey = conf.get(RabbitMQConstants.QUEUE_ROUTING_KEY, "nutch.key");
+    exchangeServer = parameters.getOrDefault(RabbitMQConstants.EXCHANGE_SERVER, "nutch.exchange");
+    exchangeType = parameters.getOrDefault(RabbitMQConstants.EXCHANGE_TYPE, "direct");

-        commitSize = conf.getInt(RabbitMQConstants.COMMIT_SIZE, 250);
-    }
+    queueName = parameters.getOrDefault(RabbitMQConstants.QUEUE_NAME, "nutch.queue");
+    queueDurable = Boolean.parseBoolean(parameters.getOrDefault(RabbitMQConstants.QUEUE_DURABLE, "true"));
+    queueRoutingKey = parameters.getOrDefault(RabbitMQConstants.QUEUE_ROUTING_KEY, "nutch.key");

-    @Override
-    public void open(JobConf JobConf, String name) throws IOException {
-        ConnectionFactory factory = new ConnectionFactory();
-        factory.setHost(serverHost);
-        factory.setPort(serverPort);
+    commitSize = Integer.parseInt(parameters.getOrDefault(RabbitMQConstants.COMMIT_SIZE, "25
[jira] [Commented] (NUTCH-1480) SolrIndexer to write to multiple servers.
[ https://issues.apache.org/jira/browse/NUTCH-1480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16154358#comment-16154358 ]
ASF GitHub Bot commented on NUTCH-1480:
---
lewismc commented on a change in pull request #218: fix for NUTCH-1480 contributed by r0ann3l
URL: https://github.com/apache/nutch/pull/218#discussion_r137120181

## File path: src/java/org/apache/nutch/indexer/IndexWriters.java
## @@ -16,132 +16,245 @@
  */
 package org.apache.nutch.indexer;

-import java.io.IOException;
-import java.lang.invoke.MethodHandles;
-import java.util.HashMap;
-
 import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.mapred.JobConf;
-import org.apache.nutch.indexer.NutchDocument;
 import org.apache.nutch.plugin.Extension;
 import org.apache.nutch.plugin.ExtensionPoint;
 import org.apache.nutch.plugin.PluginRepository;
 import org.apache.nutch.plugin.PluginRuntimeException;
 import org.apache.nutch.util.ObjectCache;
 import org.slf4j.Logger;
 import org.slf4j.LoggerFactory;
+import org.w3c.dom.Document;
+import org.w3c.dom.Element;
+import org.w3c.dom.NodeList;
+import org.xml.sax.InputSource;
+import org.xml.sax.SAXException;
+
+import javax.xml.parsers.DocumentBuilder;
+import javax.xml.parsers.DocumentBuilderFactory;
+import javax.xml.parsers.ParserConfigurationException;
+import java.io.IOException;
+import java.io.InputStream;
+import java.lang.invoke.MethodHandles;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;

-/** Creates and caches {@link IndexWriter} implementing plugins. */
+/**
+ * Creates and caches {@link IndexWriter} implementing plugins.
+ */
 public class IndexWriters {

   private static final Logger LOG = LoggerFactory
-      .getLogger(MethodHandles.lookup().lookupClass());
+          .getLogger(MethodHandles.lookup().lookupClass());

-  private IndexWriter[] indexWriters;
+  private HashMap indexWriters;

   public IndexWriters(Configuration conf) {
     ObjectCache objectCache = ObjectCache.get(conf);
+
     synchronized (objectCache) {
-      this.indexWriters = (IndexWriter[]) objectCache
-          .getObject(IndexWriter.class.getName());
+      this.indexWriters = (HashMap) objectCache
+              .getObject(IndexWriterWrapper.class.getName());
+
+      //It's not cached yet
       if (this.indexWriters == null) {
         try {
           ExtensionPoint point = PluginRepository.get(conf).getExtensionPoint(
-              IndexWriter.X_POINT_ID);
-          if (point == null)
+                  IndexWriter.X_POINT_ID);
+
+          if (point == null) {
             throw new RuntimeException(IndexWriter.X_POINT_ID + " not found.");
+          }
+
           Extension[] extensions = point.getExtensions();
-          HashMap indexerMap = new HashMap<>();
-          for (int i = 0; i < extensions.length; i++) {
-            Extension extension = extensions[i];
-            IndexWriter writer = (IndexWriter) extension.getExtensionInstance();
-            LOG.info("Adding " + writer.getClass().getName());
-            if (!indexerMap.containsKey(writer.getClass().getName())) {
-              indexerMap.put(writer.getClass().getName(), writer);
+
+          HashMap extensionMap = new HashMap<>();
+          for (Extension extension : extensions) {
+            LOG.info("Index writer " + extension.getClazz() + " identified.");

Review comment:
Please use parameterized logging.
[jira] [Commented] (NUTCH-1480) SolrIndexer to write to multiple servers.
[ https://issues.apache.org/jira/browse/NUTCH-1480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16154359#comment-16154359 ]
ASF GitHub Bot commented on NUTCH-1480:
---
lewismc commented on a change in pull request #218: fix for NUTCH-1480 contributed by r0ann3l
URL: https://github.com/apache/nutch/pull/218#discussion_r137120754

## File path: src/plugin/indexer-solr/src/java/org/apache/nutch/indexwriter/solr/SolrConstants.java
## @@ -19,19 +19,23 @@
 public interface SolrConstants {
   public static final String SOLR_PREFIX = "solr.";

Review comment:
Any reason to remove all of these as well?
[jira] [Commented] (NUTCH-1480) SolrIndexer to write to multiple servers.
[ https://issues.apache.org/jira/browse/NUTCH-1480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16154354#comment-16154354 ]
ASF GitHub Bot commented on NUTCH-1480:
---
lewismc commented on a change in pull request #218: fix for NUTCH-1480 contributed by r0ann3l
URL: https://github.com/apache/nutch/pull/218#discussion_r137121058

## File path: src/plugin/indexer-solr/src/java/org/apache/nutch/indexwriter/solr/SolrIndexWriter.java
## @@ -75,19 +70,61 @@
   private int totalUpdates = 0;
   private boolean delete = false;

+  @Override
   public void open(JobConf job, String name) throws IOException {
-    solrClients = SolrUtils.getSolrClients(job);
-    init(solrClients, job);
+    //Implementation not required
   }

-  // package protected for tests
-  void init(List solrClients, JobConf job) throws IOException {
-    batchSize = job.getInt(SolrConstants.COMMIT_SIZE, 1000);
-    solrMapping = SolrMappingReader.getInstance(job);
-    delete = job.getBoolean(IndexerMapReduce.INDEXER_DELETE, false);
+  /**
+   * Initializes the internal variables from a given index writer configuration.
+   *
+   * @param parameters Params from the index writer configuration.
+   * @throws IOException Some exception thrown by writer.
+   */
+  @Override
+  public void open(Map parameters) throws IOException {
+    String type = parameters.getOrDefault("type", "http");
+
+    String[] urls = StringUtils.getStrings(parameters.get("url"));
+
+    if (urls == null) {
+      String message = "Missing SOLR URL.\n" + describe();
+      LOG.error(message);
+      throw new RuntimeException(message);
+    }
+
+    this.solrClients = new ArrayList<>();
+
+    switch (type) {
+      case "http":
+        for (String url : urls) {
+          solrClients.add(SolrUtils.getHttpSolrClient(url));
+        }
+        break;
+      case "cloud":
+        for (String url : urls) {
+          CloudSolrClient sc = SolrUtils.getCloudSolrClient(url);
+          sc.setDefaultCollection(parameters.get(SolrConstants.COLLECTION));
+          solrClients.add(sc);
+        }
+        break;
+      case "concurrent":

Review comment:
Can you throw an unsupported exception at this stage? And also add a default case?
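What the review comment asks for might look like the following sketch. It is not the code that was eventually committed: the class `ClientTypeSketch`, the returned descriptions and the exception messages are placeholders, and only the three type names are taken from the diff above.

```java
/** Hypothetical illustration of a type switch that fails loudly on unsupported values. */
final class ClientTypeSketch {

  /** Maps a configured type to a description; the real code would build Solr clients here. */
  static String describe(String type) {
    switch (type) {
      case "http":
        return "one HTTP client per URL";
      case "cloud":
        return "one cloud client per URL";
      case "concurrent":
        // Recognised but not implemented: report it instead of silently doing nothing.
        throw new UnsupportedOperationException("Type 'concurrent' is not supported yet");
      default:
        // Anything else is most likely a typo in the configuration.
        throw new IllegalArgumentException("Unknown client type: " + type);
    }
  }
}
```

Separating the two failure cases lets a misconfigured type be told apart from a known type that is simply not implemented.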
[jira] [Commented] (NUTCH-1480) SolrIndexer to write to multiple servers.
[ https://issues.apache.org/jira/browse/NUTCH-1480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16154357#comment-16154357 ]

ASF GitHub Bot commented on NUTCH-1480:
---

lewismc commented on a change in pull request #218: fix for NUTCH-1480 contributed by r0ann3l
URL: https://github.com/apache/nutch/pull/218#discussion_r137120319

## File path: src/java/org/apache/nutch/indexer/MappingReader.java ##
@@ -0,0 +1,93 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.nutch.indexer;
+
+import org.w3c.dom.Element;
+import org.w3c.dom.Node;
+import org.w3c.dom.NodeList;
+
+import java.util.*;

Review comment:
Do not use wildcard imports.
[jira] [Commented] (NUTCH-1480) SolrIndexer to write to multiple servers.
[ https://issues.apache.org/jira/browse/NUTCH-1480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16154355#comment-16154355 ]

ASF GitHub Bot commented on NUTCH-1480:
---

lewismc commented on a change in pull request #218: fix for NUTCH-1480 contributed by r0ann3l
URL: https://github.com/apache/nutch/pull/218#discussion_r137119842

## File path: src/java/org/apache/nutch/indexer/IndexWriterConfig.java ##
@@ -0,0 +1,95 @@
+package org.apache.nutch.indexer;
+
+import org.w3c.dom.Element;
+import org.w3c.dom.NodeList;
+
+import java.util.*;

Review comment:
Please use explicit imports instead of wildcard imports.
[jira] [Commented] (NUTCH-1480) SolrIndexer to write to multiple servers.
[ https://issues.apache.org/jira/browse/NUTCH-1480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16154353#comment-16154353 ]

ASF GitHub Bot commented on NUTCH-1480:
---

lewismc commented on a change in pull request #218: fix for NUTCH-1480 contributed by r0ann3l
URL: https://github.com/apache/nutch/pull/218#discussion_r137120636

## File path: src/plugin/indexer-rabbit/src/java/org/apache/nutch/indexwriter/rabbit/RabbitMQConstants.java ##
@@ -17,28 +17,26 @@
 package org.apache.nutch.indexwriter.rabbit;
 interface RabbitMQConstants {
-  String RABBIT_PREFIX = "rabbitmq.indexer";

Review comment:
Why did you remove all of these?
[jira] [Commented] (NUTCH-1480) SolrIndexer to write to multiple servers.
[ https://issues.apache.org/jira/browse/NUTCH-1480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16154356#comment-16154356 ]

ASF GitHub Bot commented on NUTCH-1480:
---

lewismc commented on a change in pull request #218: fix for NUTCH-1480 contributed by r0ann3l
URL: https://github.com/apache/nutch/pull/218#discussion_r137119767

## File path: src/java/org/apache/nutch/indexer/IndexWriterConfig.java ##
@@ -0,0 +1,95 @@
+package org.apache.nutch.indexer;

Review comment:
Please add the license header.
[jira] [Commented] (NUTCH-1480) SolrIndexer to write to multiple servers.
[ https://issues.apache.org/jira/browse/NUTCH-1480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16152957#comment-16152957 ]

Jorge Luis Betancourt Gonzalez commented on NUTCH-1480:
---

[~markus17] do you mind taking a look at the linked PR? I think the PR covers more than the original intent of this issue. Since you've already worked on something similar, your input would be really valuable in this case.
[jira] [Commented] (NUTCH-1480) SolrIndexer to write to multiple servers.
[ https://issues.apache.org/jira/browse/NUTCH-1480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16152885#comment-16152885 ]

ASF GitHub Bot commented on NUTCH-1480:
---

odisleysi commented on issue #218: fix for NUTCH-1480 contributed by r0ann3l
URL: https://github.com/apache/nutch/pull/218#issuecomment-327018446

I like this idea. I work on a project that needs to save documents in Solr for searching and Elasticsearch for statistics. This solves the problem.
[jira] [Commented] (NUTCH-1480) SolrIndexer to write to multiple servers.
[ https://issues.apache.org/jira/browse/NUTCH-1480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16152883#comment-16152883 ]

ASF GitHub Bot commented on NUTCH-1480:
---

odisleysi commented on issue #218: fix for NUTCH-1480 contributed by r0ann3l
URL: https://github.com/apache/nutch/pull/218#issuecomment-327018446

I like this idea. I work on a project that needs to save documents in Solr for searching and in Elasticsearch for statistics. This solves the problem.
[jira] [Commented] (NUTCH-1480) SolrIndexer to write to multiple servers.
[ https://issues.apache.org/jira/browse/NUTCH-1480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16145516#comment-16145516 ]

ASF GitHub Bot commented on NUTCH-1480:
---

jorgelbg commented on issue #218: fix for NUTCH-1480 contributed by r0ann3l
URL: https://github.com/apache/nutch/pull/218#issuecomment-325707916

This PR includes more changes than the original ticket and breaks backward compatibility with custom indexers. @r0ann3l could you squash all the changes into a single commit? That would help the review process, since this PR has a lot of changes.
[jira] [Commented] (NUTCH-1480) SolrIndexer to write to multiple servers.
[ https://issues.apache.org/jira/browse/NUTCH-1480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16144081#comment-16144081 ]

ASF GitHub Bot commented on NUTCH-1480:
---

r0ann3l opened a new pull request #218: fix for NUTCH-1480 contributed by r0ann3l
URL: https://github.com/apache/nutch/pull/218

With this patch we can have many instances of the same IndexWriter class, each with a different configuration. We can also copy, rename, or remove document fields for every index writer individually. In addition, the parameters needed by the index writers move into separate XML files, so they will no longer be in nutch-site.xml.
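To illustrate the per-writer copy, rename, and remove operations described in the pull request, here is a small stand-alone sketch. It is not the patch's implementation: the class `FieldMappingSketch`, the method `apply`, the order in which the three operations run, and the use of a plain `Map<String, String>` as the document are all assumptions made for the example.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration: each index writer gets its own mapping,
// applied to a copy of the document before it is sent.
class FieldMappingSketch {

  static Map<String, String> apply(Map<String, String> doc,
      Map<String, String> copy, Map<String, String> rename, Set<String> remove) {
    Map<String, String> out = new LinkedHashMap<>(doc);
    // copy: the source field stays, the destination receives the same value
    for (Map.Entry<String, String> e : copy.entrySet()) {
      if (doc.containsKey(e.getKey())) {
        out.put(e.getValue(), doc.get(e.getKey()));
      }
    }
    // rename: the value moves from the old name to the new name
    for (Map.Entry<String, String> e : rename.entrySet()) {
      if (out.containsKey(e.getKey())) {
        out.put(e.getValue(), out.remove(e.getKey()));
      }
    }
    // remove: the listed fields are dropped for this writer only
    out.keySet().removeAll(remove);
    return out;
  }

  public static void main(String[] args) {
    Map<String, String> doc = new LinkedHashMap<>();
    doc.put("title", "Apache Nutch");
    doc.put("content", "crawler");
    doc.put("segment", "20170830");

    // Writer A: search index, keeps a copy of the title and hides the segment.
    System.out.println(apply(doc,
        Collections.singletonMap("title", "search"),
        Collections.singletonMap("content", "body"),
        new HashSet<>(Arrays.asList("segment"))));

    // Writer B: statistics index, same document, different mapping.
    System.out.println(apply(doc,
        Collections.<String, String>emptyMap(),
        Collections.<String, String>emptyMap(),
        new HashSet<>(Arrays.asList("content"))));
  }
}
```

Because the mapping is applied to a copy, two writers configured differently can receive different views of the same Nutch document.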
[jira] [Commented] (NUTCH-1480) SolrIndexer to write to multiple servers.
[ https://issues.apache.org/jira/browse/NUTCH-1480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16138433#comment-16138433 ]

Roannel Fernández Hernández commented on NUTCH-1480:
---

I'm testing a solution which uses this file [1] to configure the index writers. In this XML file, every "writer" tag can hold the parameters used by that writer and a mapping for every field of the Nutch documents. With this new way of using the writers in Nutch, we could have field mappings not only for the Solr index writer but for every index writer we have. We will also be able to define different configurations for index writers, even for the same IndexWriter class. This solution applies to all types of index writers, not just to the Solr index writer.

The structure of [1] is described in [2].

[1] https://github.com/r0ann3l/nutch/blob/NUTCH-1480/conf/index-writers.xml.template
[2] https://github.com/r0ann3l/nutch/blob/NUTCH-1480/conf/index-writers.xsd
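As a rough illustration of reading one configuration block per writer, the sketch below parses a small XML string with the JDK's DOM parser. The element and attribute names used here (`writers`, `writer`, `id`, `class`, `param`, `name`, `value`) are assumptions for the example only; the authoritative structure is the one in the template [1] and schema [2] linked above. The class `WriterConfigSketch` is hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical reader: collects the parameters of every <writer> element,
// keyed by the writer's id attribute.
class WriterConfigSketch {

  static Map<String, Map<String, String>> parse(String xml) throws Exception {
    Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
        .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
    Map<String, Map<String, String>> writers = new LinkedHashMap<>();
    NodeList writerNodes = doc.getElementsByTagName("writer");
    for (int i = 0; i < writerNodes.getLength(); i++) {
      Element writer = (Element) writerNodes.item(i);
      Map<String, String> params = new LinkedHashMap<>();
      NodeList paramNodes = writer.getElementsByTagName("param");
      for (int j = 0; j < paramNodes.getLength(); j++) {
        Element param = (Element) paramNodes.item(j);
        params.put(param.getAttribute("name"), param.getAttribute("value"));
      }
      writers.put(writer.getAttribute("id"), params);
    }
    return writers;
  }

  public static void main(String[] args) throws Exception {
    // Two writers of the same (illustrative) class with different URLs.
    String xml = "<writers>"
        + "<writer id=\"solr_a\" class=\"example.SolrIndexWriter\">"
        + "<param name=\"url\" value=\"http://host1:8983/solr\"/></writer>"
        + "<writer id=\"solr_b\" class=\"example.SolrIndexWriter\">"
        + "<param name=\"url\" value=\"http://host2:8983/solr\"/></writer>"
        + "</writers>";
    System.out.println(parse(xml));
  }
}
```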
[jira] [Commented] (NUTCH-1480) SolrIndexer to write to multiple servers.
[ https://issues.apache.org/jira/browse/NUTCH-1480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13556106#comment-13556106 ]

Markus Jelsma commented on NUTCH-1480:
---

I think we should update to Solr 4.0 in Nutch 1.7 (I can't find the relevant issue). 4.x can work just like 3.x in stand-alone mode, and NUTCH-1377 still allows indexing to a single node (or to multiple single nodes with this patch included). SolrJ 4.x should work fine with 3.x series servers since, IIRC, no javabin changes have been made recently.

> SolrIndexer to write to multiple servers.
> -----------------------------------------
>
>                 Key: NUTCH-1480
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1480
>             Project: Nutch
>          Issue Type: Improvement
>          Components: indexer
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>            Priority: Minor
>             Fix For: 1.7
>
>         Attachments: NUTCH-1480-1.6.1.patch
>
>
> SolrUtils should return an array of SolrServers and read the SolrUrl as a comma delimited list of URL's using Configuration.getString(). SolrWriter should be able to handle this list of SolrServers.
> This is useful if you want to send documents to multiple servers if no replication is available or if you want to send documents to multiple NOCs.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1480) SolrIndexer to write to multiple servers.
[ https://issues.apache.org/jira/browse/NUTCH-1480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13556100#comment-13556100 ]

Julien Nioche commented on NUTCH-1480:
---

That probably depends on whether we want to support both SOLR 3.x and SOLR 4.x. Got your point about indexing to multiple clouds, thanks!
[jira] [Commented] (NUTCH-1480) SolrIndexer to write to multiple servers.
[ https://issues.apache.org/jira/browse/NUTCH-1480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13556098#comment-13556098 ]

Markus Jelsma commented on NUTCH-1480:
---

We leave all the shard key hashing up to SolrCloud and the cloud-aware SolrJ client we use, see NUTCH-1377. We use both patches and provide two lists of ZooKeeper addresses to index to multiple clouds at the same time. As far as I am concerned, NUTCH-945 is obsolete because SolrCloud handles sharding automatically.
[jira] [Commented] (NUTCH-1480) SolrIndexer to write to multiple servers.
[ https://issues.apache.org/jira/browse/NUTCH-1480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13556090#comment-13556090 ]

Julien Nioche commented on NUTCH-1480:
---

I'd rather it was implemented as an extension of NUTCH-945, where we'd have a partitioner that sends to all SOLR instances, which I believe is what NUTCH-1480 is about. There are many cases where we'd want to shard according to other criteria, and NUTCH-945 would provide a more generic framework. Does this make sense?
[jira] [Commented] (NUTCH-1480) SolrIndexer to write to multiple servers.
[ https://issues.apache.org/jira/browse/NUTCH-1480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13493163#comment-13493163 ]

Markus Jelsma commented on NUTCH-1480:
---

I'm fine with that too. I don't really care as long as it's obvious enough for the user. Any more thoughts to share? Opinions?

> SolrIndexer to write to multiple servers.
> -----------------------------------------
>
>                 Key: NUTCH-1480
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1480
>             Project: Nutch
>          Issue Type: Improvement
>          Components: indexer
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>            Priority: Minor
>             Fix For: 1.6
>
>         Attachments: NUTCH-1480-1.6.1.patch
>
>
> SolrUtils should return an array of SolrServers and read the SolrUrl as a comma delimited list of URL's using Configuration.getString(). SolrWriter should be able to handle this list of SolrServers.
> This is useful if you want to send documents to multiple servers if no replication is available or if you want to send documents to multiple NOCs.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1480) SolrIndexer to write to multiple servers.
[ https://issues.apache.org/jira/browse/NUTCH-1480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13490316#comment-13490316 ]

Lewis John McGibbney commented on NUTCH-1480:
---

A suggestion for the dedup job is to read in the comma-separated Solr cores/servers and then drop everything after the first server in the list. Some simple logging could then indicate that dedup was only attempted, and subsequently executed, on the first server...
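The dedup suggestion above (keep only the first server and log that the rest were skipped) could look roughly like the following. This is a sketch under assumptions: the class `DedupTargetSketch` and the method `firstServer` are hypothetical names, and `java.util.logging` stands in for whatever logger the dedup job actually uses.

```java
import java.util.logging.Logger;

// Hypothetical helper for the dedup job: picks the first usable URL from a
// comma-separated list and logs when the remaining servers are ignored.
class DedupTargetSketch {

  private static final Logger LOG =
      Logger.getLogger(DedupTargetSketch.class.getName());

  static String firstServer(String solrUrls) {
    if (solrUrls == null) {
      throw new IllegalArgumentException("Missing Solr URL");
    }
    String first = null;
    int count = 0;
    for (String url : solrUrls.split(",")) {
      String trimmed = url.trim();
      if (trimmed.isEmpty()) {
        continue; // ignore empty entries such as "a,,b"
      }
      if (first == null) {
        first = trimmed;
      }
      count++;
    }
    if (first == null) {
      throw new IllegalArgumentException("Missing Solr URL");
    }
    if (count > 1) {
      LOG.warning("Deduplication runs on the first server only: " + first
          + " (" + (count - 1) + " other server(s) skipped)");
    }
    return first;
  }

  public static void main(String[] args) {
    System.out.println(
        firstServer("http://host1:8983/solr,http://host2:8983/solr"));
  }
}
```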
[jira] [Commented] (NUTCH-1480) SolrIndexer to write to multiple servers.
[ https://issues.apache.org/jira/browse/NUTCH-1480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13488856#comment-13488856 ]

Markus Jelsma commented on NUTCH-1480:
---

Jul, yes, doing it via SolrJ and ZooKeeper is much cheaper. We don't need to obtain hash ranges ourselves.

Lewis, I didn't think about that. That's a problem indeed. Although it is technically possible, we must not dedup with two indices at the same time; it's going to be trouble. Even with this patch it will still work until someone decides to use the crawl command with two comma-separated Solr servers. Although I don't believe users of the crawl command will write to multiple locations, we must prevent them from doing so. Perhaps a simple check for a comma in the Solr URL and an exception? Something else? Drop deduplication altogether and rely on Solr's internal deduplication (which doesn't work for clouds, though)? Other clever ideas?
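The "simple check for a comma in the Solr URL and an exception" floated above might be sketched as follows. The class `SingleSolrUrlCheck`, the method `requireSingleUrl`, and the choice of `IllegalArgumentException` are assumptions for illustration and are not taken from any patch.

```java
// Hypothetical guard for jobs that must talk to exactly one Solr server,
// such as deduplication: a comma-separated list is rejected up front.
class SingleSolrUrlCheck {

  static String requireSingleUrl(String solrUrl) {
    if (solrUrl == null || solrUrl.trim().isEmpty()) {
      throw new IllegalArgumentException("Missing Solr URL");
    }
    if (solrUrl.contains(",")) {
      throw new IllegalArgumentException(
          "Deduplication supports a single Solr URL, but a list was given: "
              + solrUrl);
    }
    return solrUrl.trim();
  }

  public static void main(String[] args) {
    System.out.println(requireSingleUrl("http://host1:8983/solr"));
  }
}
```

Compared with silently using the first server, this fails fast, so a user who passes two servers to the crawl command learns immediately that dedup cannot cover both.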
[jira] [Commented] (NUTCH-1480) SolrIndexer to write to multiple servers.
[ https://issues.apache.org/jira/browse/NUTCH-1480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13488852#comment-13488852 ]

Lewis John McGibbney commented on NUTCH-1480:
---

Mmm. Additionally, there is a problem with this when you run a continuous or large crawl with the crawl script. By default it attempts to dedup, and the job fails. Fundamentally though, the patch works with trunk (and Solr 4.0) as you describe... after I've applied the patch in NUTCH-1486.
[jira] [Commented] (NUTCH-1480) SolrIndexer to write to multiple servers.
[ https://issues.apache.org/jira/browse/NUTCH-1480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13488786#comment-13488786 ]

Julien Nioche commented on NUTCH-1480:
---

Nope. I meant implementing the distribution to the shards on the Nutch side without relying on the CloudSolrServer. Having said that, we want to move to SOLR4, and if we get that from SOLR for cheap then that's even better.
[jira] [Commented] (NUTCH-1480) SolrIndexer to write to multiple servers.
[ https://issues.apache.org/jira/browse/NUTCH-1480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13488744#comment-13488744 ] Markus Jelsma commented on NUTCH-1480: -- I think you mean the just-linked issue NUTCH-1377? I just happen to be working on that issue. Using the CloudSolrServer will send the docs to the correct shard already. The way it works now is sending millions of records to a single node which then distributes them again, a waste of IO. That issue will work with this issue so I may want to push them in together. It's just a matter of returning the correct SolrServer instance and working around the HttpClient issues. Agreed on deduplication. It would also be hard for this issue to work with dedup because not all indices may be identical.
[jira] [Commented] (NUTCH-1480) SolrIndexer to write to multiple servers.
[ https://issues.apache.org/jira/browse/NUTCH-1480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13488738#comment-13488738 ] Julien Nioche commented on NUTCH-1480: -- OK thanks. What about having a mechanism for specifying a way of distributing the docs, with replicate-to-all being one of the options? Could do consistent hashing maybe? I expect that most people would want to shard. Off topic, re deduplication: I think we've hit the limits of the current mechanism, which I assume was based on the one we had when Nutch was managing its own Lucene indices. It's not reasonable to pump ALL the docs from SOLR into Hadoop to dedup and I'd rather have map reduce jobs to find the duplicates based on the crawldb and send the deletion commands to SOLR. And this would work for ElasticSearch as well. Am pretty sure there is a JIRA for this somewhere
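Note on the consistent-hashing suggestion in the comment above: a minimal sketch of what such a distribution strategy might look like, as opposed to replicate-to-all. Nothing here comes from the attached patch. The class name `ShardRing`, the use of CRC32 as the hash, and the virtual-node count are all assumptions made for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.zip.CRC32;

// Hypothetical helper, not part of Nutch: maps a document id to one shard URL.
class ShardRing {
  // Ring position -> shard URL. Each shard owns several positions (virtual nodes).
  private final TreeMap<Long, String> ring = new TreeMap<>();

  ShardRing(List<String> shardUrls, int virtualNodes) {
    if (shardUrls.isEmpty() || virtualNodes < 1) {
      throw new IllegalArgumentException("need at least one shard and one virtual node");
    }
    for (String url : shardUrls) {
      for (int i = 0; i < virtualNodes; i++) {
        ring.put(hash(url + "#" + i), url);
      }
    }
  }

  // Walk clockwise from the document's position to the next shard position,
  // wrapping around to the first entry at the end of the ring.
  String shardFor(String docId) {
    Map.Entry<Long, String> entry = ring.ceilingEntry(hash(docId));
    return (entry != null ? entry : ring.firstEntry()).getValue();
  }

  // CRC32 is an assumption chosen for brevity; any stable hash would do.
  static long hash(String s) {
    CRC32 crc = new CRC32();
    crc.update(s.getBytes(StandardCharsets.UTF_8));
    return crc.getValue();
  }
}
```

The point of the ring over a plain `hash % n` is that removing a shard only remaps the documents that lived on that shard; documents on the surviving shards keep their assignment.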
[jira] [Commented] (NUTCH-1480) SolrIndexer to write to multiple servers.
[ https://issues.apache.org/jira/browse/NUTCH-1480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13488732#comment-13488732 ] Markus Jelsma commented on NUTCH-1480: -- Julien, yes. The Nutch tools are modified to work with a List now. One or more SolrServers are returned depending on how many you specify comma-separated. Communication to Solr is simply done by iterating over the List and repeating the commands such as add or delete or commit.
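Note on the comment above: the replicate-to-all behaviour Markus describes (split the comma-separated URL list, then repeat each add, delete, or commit on every server) can be sketched roughly as follows. This is not the code from the patch. `IndexTarget`, `RecordingTarget`, and `FanOutWriter` are hypothetical stand-ins so the sketch runs without SolrJ or a live Solr server.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a Solr client; deliberately not the SolrJ API.
interface IndexTarget {
  void add(String docId);
  void delete(String docId);
  void commit();
}

// Records commands in memory instead of talking to a server.
class RecordingTarget implements IndexTarget {
  final String url;
  final List<String> log = new ArrayList<>();

  RecordingTarget(String url) { this.url = url; }

  public void add(String docId) { log.add("add " + docId); }
  public void delete(String docId) { log.add("delete " + docId); }
  public void commit() { log.add("commit"); }
}

// Repeats every command on every configured target.
class FanOutWriter {
  private final List<RecordingTarget> targets = new ArrayList<>();

  // One target per non-empty entry in the comma-separated URL list.
  FanOutWriter(String commaSeparatedUrls) {
    for (String url : commaSeparatedUrls.split(",")) {
      String trimmed = url.trim();
      if (!trimmed.isEmpty()) {
        targets.add(new RecordingTarget(trimmed));
      }
    }
  }

  List<RecordingTarget> targets() { return targets; }

  void add(String docId) { for (IndexTarget t : targets) t.add(docId); }
  void delete(String docId) { for (IndexTarget t : targets) t.delete(docId); }
  void commit() { for (IndexTarget t : targets) t.commit(); }
}
```

With a single URL the writer behaves exactly like the single-server case, which is presumably why a list is a backwards-compatible choice here.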
[jira] [Commented] (NUTCH-1480) SolrIndexer to write to multiple servers.
[ https://issues.apache.org/jira/browse/NUTCH-1480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13488730#comment-13488730 ] Markus Jelsma commented on NUTCH-1480: -- Hi Lewis, Solr does not have a notion of pseudo distributed mode. You can simply run one server with two cores, two servers on different ports or different machines, or have a SolrCloud cluster in multiple NOCs; all of these have a unique URL. A SolrCloud cluster is roughly the same as multiple stand-alone servers working together. A core is similar to a table in a SQL server. See http://wiki.apache.org/solr/CoreAdmin#Example on how to quickly set up two cores in a single Solr server. Copy Nutch's schema and you have two Nutch indices you can write to in a single go.
[jira] [Commented] (NUTCH-1480) SolrIndexer to write to multiple servers.
[ https://issues.apache.org/jira/browse/NUTCH-1480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13488728#comment-13488728 ] Julien Nioche commented on NUTCH-1480: -- Hi Lewis bq. Can I run multiple Solr servers in pseudo distributed mode? SOLR is completely separated from Hadoop and has nothing to do with local vs distrib. You can run several instances of SOLR on the same machine if that is your question. Just specify a different port when starting it from the command line with a separate SOLR home. Markus, Just to make sure I understand - this sends ALL the documents to ALL the SOLR instances specified, right?
[jira] [Commented] (NUTCH-1480) SolrIndexer to write to multiple servers.
[ https://issues.apache.org/jira/browse/NUTCH-1480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13488721#comment-13488721 ] Lewis John McGibbney commented on NUTCH-1480: - Hi Markus. Can I run multiple Solr servers in pseudo distributed mode? If so can you provide a link to a (Solr) wiki entry and I will try this out. Also can you clarify the difference between cores and servers? I don't see the server terminology listed [0] on the Solr wiki. Thanks [0] http://wiki.apache.org/solr/SolrTerminology