[jira] [Commented] (NUTCH-1480) SolrIndexer to write to multiple servers.

ASF GitHub Bot (JIRA) Fri, 17 Nov 2017 06:17:21 -0800

    [ 
https://issues.apache.org/jira/browse/NUTCH-1480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16257012#comment-16257012
 ]


ASF GitHub Bot commented on NUTCH-1480:
---------------------------------------

sebastian-nagel commented on issue #218: fix for NUTCH-1480 contributed by 
r0ann3l
URL: https://github.com/apache/nutch/pull/218#issuecomment-345254791
 
 
   Hi @r0ann3l, thanks! I've continued testing and was able to feed two Solr 
indexes in parallel. Great! Afaics, all requested changes have been made 
(including those of @lewismc).
   
   To make the configuration work out of the box, I would suggest 3 changes:
   - use only field names defined in the default schema.xml, to avoid errors such as
   `ERROR: [doc=http://nutch.apache.org/] unknown field 'search'`
   - default Solr core name should be "nutch" as described in the 
[tutorial](https://wiki.apache.org/nutch/NutchTutorial)
   
   I've tried to fix these issues in "[a fork of 
NUTCH-1480](https://github.com/sebastian-nagel/nutch/commits/NUTCH-1480)". Feel 
free to cherry pick it from there.
   
   I've also tried to make indexer-dummy work, but without success: the file is 
created but then overwritten:
   
   - there are two instances of `IndexWriters` active, each having a separate 
instance of DummyIndexWriter.
      - the instance created from 
`IndexerOutputFormat.getRecordWriter(IndexerOutputFormat.java:39)` writes into 
the file
      - but later on the instance created from 
`IndexWriters.open(IndexWriters.java:187)` opens the file anew, so in the end 
the file is empty. Because these are two separate instances, neither can 
check whether the file writer is already instantiated.
   
   I see two potential solutions:
   1. the IndexWriter interface method `open(job, name)` was defined with file 
indexers in mind (cf. 
NUTCH-1541/[CSVIndexWriter](https://github.com/sebastian-nagel/nutch/blob/NUTCH-1541/src/plugin/indexer-csv/src/java/org/apache/nutch/indexwriter/csv/CSVIndexWriter.java#L233)),
 so an index writer can decide to do nothing when called with the name "commit".
   2. do not call the `commit()` method explicitly (and possibly also remove it 
from the interface): it does not work safely in distributed mode because it is 
not run in the reducers (see the comment in RabbitIndexWriter).
   
   I lean towards the second solution. It would also solve the problem of having 
two IndexWriters instances active. What do you think?
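
For illustration only, here is a minimal sketch of the overwrite described above and of how the first solution would avoid it. It is not Nutch code: `SketchFileWriter`, `DummyWriterSketch`, the `skipOnCommit` flag and the name "part-00000" are hypothetical, and the `job` argument of `open(job, name)` is left out.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical, simplified stand-in for a file-based index writer (not the Nutch API). */
class SketchFileWriter {
  private final Path file;
  private final boolean skipOnCommit;

  SketchFileWriter(Path file, boolean skipOnCommit) {
    this.file = file;
    this.skipOnCommit = skipOnCommit;
  }

  /** Mirrors the idea of open(job, name); the job argument is omitted in this sketch. */
  void open(String name) throws IOException {
    if (skipOnCommit && "commit".equals(name)) {
      return; // solution 1: leave the already written output alone
    }
    // Without explicit options, Files.write creates the file or truncates an existing one.
    Files.write(file, new byte[0]);
  }

  /** Appends one document line to the output file. */
  void write(String doc) throws IOException {
    Files.write(file, (doc + "\n").getBytes(StandardCharsets.UTF_8),
        StandardOpenOption.APPEND);
  }
}

public class DummyWriterSketch {
  /** Runs the two-instance sequence and returns the final file size in bytes. */
  static long run(boolean skipOnCommit) throws IOException {
    Path file = Files.createTempFile("dummy-index", ".txt");
    try {
      // First instance: writes a document into the file.
      SketchFileWriter first = new SketchFileWriter(file, skipOnCommit);
      first.open("part-00000");
      first.write("http://nutch.apache.org/");

      // Second, independent instance: opened later for the commit step.
      SketchFileWriter second = new SketchFileWriter(file, skipOnCommit);
      second.open("commit");

      return Files.size(file);
    } finally {
      Files.deleteIfExists(file);
    }
  }

  public static void main(String[] args) throws IOException {
    System.out.println("naive open:     " + run(false) + " bytes");
    System.out.println("skip on commit: " + run(true) + " bytes");
  }
}
```

With the naive `open`, the second instance truncates the file and the size ends up at zero; with the "commit" check, the data written by the first instance survives. The second solution would remove the need for this check altogether, since the second `open` call would no longer happen.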

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


> SolrIndexer to write to multiple servers.
> -----------------------------------------
>
>                 Key: NUTCH-1480
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1480
>             Project: Nutch
>          Issue Type: Improvement
>          Components: indexer
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>            Priority: Minor
>         Attachments: NUTCH-1480-1.6.1.patch, 
> adding-support-for-sharding-indexer-for-solr.patch
>
>
> SolrUtils should return an array of SolrServers and read the SolrUrl as a 
> comma-delimited list of URLs using Configuration.getStrings(). SolrWriter 
> should be able to handle this list of SolrServers.
> This is useful if you want to send documents to multiple servers if no 
> replication is available or if you want to send documents to multiple NOCs.
> edit:
> This does not replace NUTCH-1377 but complements it. With NUTCH-1377 this 
> issue allows you to index to multiple SolrCloud clusters at the same time.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
