When I run the same SQL directly on the DB it takes only 1 second, yet only 6-7 documents are getting indexed per second.
Since I have a 4-node SolrCloud setup, can I run 4 import handlers to index the same data? Won't they overwrite each other?

10-20K is a very high number. Where can I get the actual size of a document?

Rgds
AJ

> On 22-Mar-2016, at 05:32, Shawn Heisey <apa...@elyograg.org> wrote:
>
>> On 3/20/2016 6:11 PM, Amit Jha wrote:
>> In my case I am using DIH to index the data and Query is having 2 join
>> statements. To index 70K documents it is taking 3-4Hours. Document size
>> would be around 10-20KB. DB is MSSQL and using solr4.2.10 in cloud mode.
>
> My source data is in a MySQL database. I use DIH for full rebuilds and
> SolrJ for maintenance.
>
> My index is sharded, but I'm not running SolrCloud. When using DIH, all
> of my shards build at once, and each one achieves about 750 docs per
> second. With six large shards, rebuilding a 146 million document index
> takes 9-10 hours. It produces a total index size in the ballpark of 170GB.
>
> DIH has a performance limitation -- it's single-threaded. I obtain the
> speeds that I do because all of my shards import at the same time -- six
> dataimport instances running at the same time, each one with a single
> thread, importing a little more than 24 million documents. I have
> discovered that Solr is the bottleneck on my setup. The data retrieval
> from MySQL can proceed much faster than Solr can handle with a single
> indexing thread. My situation is a little bit unusual -- as Erick
> mentioned, usually the bottleneck is data retrieval, not Solr.
>
> At this point, if I want to make bulk indexing go faster, I need to
> build a SolrJ application that can index with multiple threads to each
> Solr core at the same time. This is on my roadmap, but it's not going
> to be a trivial project.
>
> At 10-20K, your documents are large, but not excessively so. If 70000
> documents takes 3-4 hours, then there's one of a few problems happening.
>
> 1) your database is VERY slow.
> 2) your analysis chain in schema.xml is running SUPER slow analysis
> components.
> 3) Your server or its configuration is not providing enough resources
> (CPU/RAM/IO) so Solr can run efficiently.
>
> #2 seems rather unlikely, so I would suspect one of the other two.
>
> ----
>
> I have seen one situation related to the Microsoft side of your setup
> that might cause a problem like this. If any of your machines are
> running on Windows Server 2012 and you have bridged NICs (usually for
> failover in the event of a switch failure), then you will need to break
> the bridge and just run one NIC.
>
> The performance improvement on the network when a bridged NIC is removed
> from Server 2012 is enough to blow your mind, especially if the access
> is over a high-latency network link, like a VPN or WAN connection. The
> same setup on Server 2003 or Server 2008 has very good performance.
> Microsoft seems to have a bug with bridged NICs in Server 2012. Last
> time I tried to figure out whether it could be fixed, I ran into this
> problem:
>
> https://xkcd.com/979/
>
> Thanks,
> Shawn
>
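
For reference, the rates quoted in this thread can be sanity-checked with a few lines of Python. This is only a back-of-the-envelope sketch: it uses the figures given above (70K documents in 3-4 hours; six shards at about 750 docs per second each for 146 million documents), the function names are made up for illustration, and it assumes evenly sized shards that all import in parallel.

```python
def docs_per_second(total_docs, hours):
    """Average indexing rate over a run lasting the given number of hours."""
    return total_docs / (hours * 3600)


def rebuild_hours(total_docs, shards, rate_per_shard):
    """Wall-clock hours when every shard imports in parallel at the same rate.

    Assumes documents are split evenly across shards (an assumption made
    here for the estimate, not something measured).
    """
    docs_per_shard = total_docs / shards
    return docs_per_shard / rate_per_shard / 3600


# Rate implied by "70K documents in 3-4 hours".
slow_low = docs_per_second(70_000, 4)
slow_high = docs_per_second(70_000, 3)
print(f"70K docs in 3-4 hours: {slow_low:.1f} to {slow_high:.1f} docs/sec")

# Rebuild time implied by six shards at about 750 docs/sec each.
hours = rebuild_hours(146_000_000, 6, 750)
print(f"146M docs over 6 shards at 750 docs/sec each: {hours:.1f} hours")
```

The first figure works out to roughly 5-6 documents per second, in line with the 6-7 per second observed, and the second to about 9 hours, in line with the 9-10 hours quoted for the six-shard rebuild. The gap between the two per-core rates is more than a factor of 100.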