[
https://issues.apache.org/jira/browse/NUTCH-1480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16257012#comment-16257012
]
ASF GitHub Bot commented on NUTCH-1480:
---------------------------------------
sebastian-nagel commented on issue #218: fix for NUTCH-1480 contributed by
r0ann3l
URL: https://github.com/apache/nutch/pull/218#issuecomment-345254791
Hi @r0ann3l, thanks! I've continued testing, and was able to feed two Solr
indexes in parallel. Great! Afaics, all requested changes have been made
(including those requested by @lewismc).
To make the configuration work out of the box, I would suggest 3 changes:
- use only field names defined in the default schema.xml
`ERROR: [doc=http://nutch.apache.org/] unknown field 'search'`
- default Solr core name should be "nutch" as described in the
[tutorial](https://wiki.apache.org/nutch/NutchTutorial)
I've tried to fix these issues in "[a fork of
NUTCH-1480](https://github.com/sebastian-nagel/nutch/commits/NUTCH-1480)". Feel
free to cherry pick it from there.
I've also tried to make indexer-dummy work, but without success. The file is
created but then overwritten:
- there are two instances of `IndexWriters` active, each having a separate
instance of DummyIndexWriter.
- the instance created from
`IndexerOutputFormat.getRecordWriter(IndexerOutputFormat.java:39)` writes into
the file
- but later on the instance created from
`IndexWriters.open(IndexWriters.java:187)` opens the file anew, so in the end
the file is empty. Because there are two instances, neither can check whether
the file writer is already instantiated.
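The overwrite described above can be reproduced outside Nutch. This is a minimal sketch, not the actual DummyIndexWriter: the class `FileWriterSketch` and the temp file name are hypothetical, and it assumes the writer opens its file in the default truncating mode.

```java
import java.io.IOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class OverwriteDemo {

  /** Hypothetical stand-in for a file-based index writer; not the Nutch class. */
  static class FileWriterSketch implements AutoCloseable {
    private Writer out;

    /** Opens the target; with no options given, an existing file is truncated. */
    void open(Path target) throws IOException {
      out = Files.newBufferedWriter(target, StandardCharsets.UTF_8);
    }

    void write(String doc) throws IOException {
      out.write(doc);
      out.write('\n');
    }

    @Override
    public void close() throws IOException {
      if (out != null) {
        out.close();
      }
    }
  }

  /** Returns the file size in bytes after both instances have been closed. */
  static long run() throws IOException {
    Path target = Files.createTempFile("dummy-index", ".txt");
    try {
      // First instance: writes the documents and closes the file.
      try (FileWriterSketch first = new FileWriterSketch()) {
        first.open(target);
        first.write("add\thttp://nutch.apache.org/");
      }
      // Second, independent instance: reopens the same path and writes nothing.
      // It cannot know that the first instance already wrote to the file.
      try (FileWriterSketch second = new FileWriterSketch()) {
        second.open(target);
      }
      return Files.size(target);
    } finally {
      Files.deleteIfExists(target);
    }
  }

  public static void main(String[] args) throws IOException {
    System.out.println("bytes left after second open: " + run());
  }
}
```

The second `open` truncates whatever the first instance wrote, which matches the empty file observed with indexer-dummy.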
I see two potential solutions:
1. the IndexWriter interface method `open(job, name)` was defined with file
indexers in mind (cf.
NUTCH-1541/[CSVIndexWriter](https://github.com/sebastian-nagel/nutch/blob/NUTCH-1541/src/plugin/indexer-csv/src/java/org/apache/nutch/indexwriter/csv/CSVIndexWriter.java#L233)),
so an index writer can decide to do nothing when called with the name "commit".
2. do not call the `commit()` method explicitly (and possibly also remove it
from the interface): it does not work safely in distributed mode because it is
not run in the reducers (see the comment in RabbitIndexWriter).
I lean toward the second solution. It would also solve the problem of having two
`IndexWriters` instances active. What do you think?
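For comparison, the first option could look roughly like the sketch below. It is an illustration only: `CommitAwareWriterSketch` is a hypothetical class, `open(name)` is a simplified stand-in for the interface method `open(job, name)` with the job argument left out, and the name `part-00000` is an assumed example of a non-commit name.

```java
import java.io.IOException;
import java.io.Writer;
import java.nio.file.Files;
import java.nio.file.Path;

public class CommitAwareWriterSketch {
  private final Path target;
  private Writer out;

  CommitAwareWriterSketch(Path target) {
    this.target = target;
  }

  /** Simplified stand-in for open(job, name); only the name is modeled here. */
  void open(String name) throws IOException {
    if ("commit".equals(name)) {
      // Opened only to commit: do nothing, so the existing file is not truncated.
      return;
    }
    out = Files.newBufferedWriter(target);
  }

  void write(String doc) throws IOException {
    out.write(doc);
    out.write('\n');
  }

  void close() throws IOException {
    if (out != null) {
      out.close();
    }
  }

  /** Returns the file size in bytes after a write phase and a commit phase. */
  static long run() throws IOException {
    Path target = Files.createTempFile("dummy-index", ".txt");
    try {
      // Instance used for writing (the name is an assumed example).
      CommitAwareWriterSketch writer = new CommitAwareWriterSketch(target);
      writer.open("part-00000");
      writer.write("http://nutch.apache.org/");
      writer.close();

      // Separate instance opened for the commit step: it leaves the file alone.
      CommitAwareWriterSketch committer = new CommitAwareWriterSketch(target);
      committer.open("commit");
      committer.close();

      return Files.size(target);
    } finally {
      Files.deleteIfExists(target);
    }
  }

  public static void main(String[] args) throws IOException {
    System.out.println("bytes kept after commit: " + run());
  }
}
```

This keeps the written file intact, but every file-based writer would have to repeat the same name check, which is one argument for the second option.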
> SolrIndexer to write to multiple servers.
> -----------------------------------------
>
> Key: NUTCH-1480
> URL: https://issues.apache.org/jira/browse/NUTCH-1480
> Project: Nutch
> Issue Type: Improvement
> Components: indexer
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Priority: Minor
> Attachments: NUTCH-1480-1.6.1.patch,
> adding-support-for-sharding-indexer-for-solr.patch
>
>
> SolrUtils should return an array of SolrServers and read the SolrUrl as a
> comma-delimited list of URLs using Configuration.getStrings(). SolrWriter
> should be able to handle this list of SolrServers.
> This is useful if you want to send documents to multiple servers if no
> replication is available or if you want to send documents to multiple NOCs.
> edit:
> This does not replace NUTCH-1377 but complements it. With NUTCH-1377 this
> issue allows you to index to multiple SolrCloud clusters at the same time.