[ https://issues.apache.org/jira/browse/SOLR-2356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13642626#comment-13642626 ]
Shawn Heisey commented on SOLR-2356:
------------------------------------
bq. Actually, DIH started as a standalone webapp inside AOL. We changed it
because we didn't want to duplicate the schema in two places and also because
we wanted to have it available by default in Solr installations. Another web
app means you need to procure hardware, plan capacity/failover, create firewall
holes, etc.
Even if it were a standalone app, I would still run it on the same hardware
that runs Solr, though I might run it on the secondary servers that aren't
normally seeing query load.
Why would you need to have the schema in two places? My DIH config doesn't
mention any field names, because all Solr field names match the MySQL field
names. Even if they didn't, I'm not sure why it would be any different than
any other SolrJ client, which doesn't need to have a copy of the schema.
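
For illustration, a minimal sketch of a DIH data-config.xml that names no
fields. The JDBC URL, credentials, table name, and entity name below are
hypothetical placeholders, not taken from any real setup. When an entity has
no explicit <field> mappings, DIH maps each result column to the Solr field
with the same name:

```xml
<dataConfig>
  <!-- Hypothetical MySQL connection; replace url, user, and password. -->
  <dataSource type="JdbcDataSource"
              driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost/exampledb"
              user="example_user"
              password="example_password"/>
  <document>
    <!-- No <field> elements: column names are assumed to match the
         field names already defined in the Solr schema. -->
    <entity name="item" query="SELECT * FROM items"/>
  </document>
</dataConfig>
```

Explicit <field column="..." name="..."/> mappings would only be needed for
columns whose names differ from the Solr field names.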
bq. Talking to multiple collections was never a goal for DIH – I'm not sure
what value it will bring.
I've got a sharded index; production is currently on version 3.5.0. Full
rebuilds are done with DIH because my SolrJ application can't touch it for
indexing speed. The SolrJ application does fine for keeping the index up to
date with once-a-minute updates, but for indexing millions of documents per
shard, DIH beats it handily even though it's only single-threaded.
If I could put all the DIH stuff into one central place, it might make it
easier to manage. I do intend to look at the DIH code to see how it manages to
run so fast, so I can hopefully improve my own code, but I can never find the
time.
> indexing using DataImportHandler does not use entire CPU capacities
> -------------------------------------------------------------------
>
> Key: SOLR-2356
> URL: https://issues.apache.org/jira/browse/SOLR-2356
> Project: Solr
> Issue Type: Improvement
> Components: update
> Affects Versions: 4.0-ALPHA
> Environment: intel xeon processor (4 cores), Debian Linux Lenny,
> OpenJDK 64bits server v1.6.0
> Reporter: colby
> Priority: Minor
> Labels: test
> Original Estimate: 168h
> Remaining Estimate: 168h
>
> When I use a DataImportHandler to index a large number of documents (~35M),
> CPU usage never goes above 100% (i.e. just one core).
> When I configure 4 threads for the <entity> tag, the CPU usage is split to
> 25% per core but never reaches 400% (i.e. 100% of each of the 4 cores).
> I use Solr embedded with a Jetty server.
> Is there a way to tune this feature to use all cores and improve
> indexing performance?
> For the moment, an extra script (PHP) gives better indexing
> performance than DIH.
> Thanks