[ 
https://issues.apache.org/jira/browse/SOLR-2356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13642626#comment-13642626
 ] 

Shawn Heisey commented on SOLR-2356:
------------------------------------

bq. Actually, DIH started as a standalone webapp inside AOL. We changed it 
because we didn't want to duplicate the schema in two places and also because 
we wanted to have it available by default in Solr installations.  Another web 
app means you need to procure hardware, plan capacity/failover, create firewall 
holes etc

Even if it were a standalone app, I would still run it on the same hardware 
that runs Solr, though I might run it on the secondary servers that aren't 
normally seeing query load.

Why would you need to have the schema in two places?  My DIH config doesn't 
mention any field names, because all Solr field names match the MySQL field 
names.  Even if they didn't, I'm not sure why it would be any different than 
any other SolrJ client, which doesn't need to have a copy of the schema.
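
To illustrate that last point, here is a minimal sketch in plain Java (no SolrJ 
dependency) of a client that builds a document by passing column names straight 
through as field names, so it never consults a schema.  The names 
RowToDocument and toDocument are hypothetical, and the Map is only a stand-in 
for both the database row and the Solr document; this is not the actual DIH or 
SolrJ code.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class RowToDocument {

    // Hypothetical helper: copy every non-null column into a document
    // field with the same name. No schema lookup is involved; the only
    // assumption is that column names and field names already match.
    static Map<String, Object> toDocument(Map<String, Object> row) {
        Map<String, Object> doc = new LinkedHashMap<>();
        for (Map.Entry<String, Object> column : row.entrySet()) {
            if (column.getValue() != null) {
                doc.put(column.getKey(), column.getValue());
            }
        }
        return doc;
    }

    public static void main(String[] args) {
        // Stand-in for one row read from the database.
        Map<String, Object> row = new LinkedHashMap<>();
        row.put("id", 42);
        row.put("title", "example");
        row.put("subtitle", null); // NULL column, skipped

        System.out.println(toDocument(row)); // prints {id=42, title=example}
    }
}
```

If the names did not match, the client would need a column-to-field mapping, 
but that is still a mapping and not a copy of the schema.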

bq. Talking to multiple collections was never a goal for DIH – I'm not sure 
what value it will bring. 

I've got a sharded index; currently production is on version 3.5.0.  Full 
rebuilds are done with DIH because my SolrJ application can't touch it for 
indexing speed.  The SolrJ application does fine for keeping the index up to 
date with once-a-minute updates, but for indexing millions of documents per 
shard, DIH beats it handily even though DIH is only single-threaded.

If I could put all the DIH stuff into one central place, it might make it 
easier to manage.  I do intend to look at the DIH code to see how it manages to 
run so fast, so I can hopefully improve my own code, but I can never find the 
time.

                
> indexing using DataImportHandler does not use entire CPU capacities
> -------------------------------------------------------------------
>
>                 Key: SOLR-2356
>                 URL: https://issues.apache.org/jira/browse/SOLR-2356
>             Project: Solr
>          Issue Type: Improvement
>          Components: update
>    Affects Versions: 4.0-ALPHA
>         Environment: intel xeon processor (4 cores), Debian Linux Lenny, 
> OpenJDK 64bits server v1.6.0
>            Reporter: colby
>            Priority: Minor
>              Labels: test
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> When I use a DataImportHandler to index a large number of documents (~35M), 
> CPU usage never goes above 100% (i.e. just one core).
> When I configure 4 threads for the <entity> tag, the CPU usage is split to 
> 25% per core but never reaches 400% (i.e. 100% of all 4 cores).
> I use Solr embedded with a Jetty server.
> Is there a way to tune this feature in order to use all cores and improve 
> indexing performance?
> Because for the moment, an extra script (PHP) gives better indexing 
> performance than DIH.
> Thanks
