Andrzej Bialecki wrote:
> On 2010-06-02 12:42, Grant Ingersoll wrote:
>> On Jun 1, 2010, at 9:54 PM, Blargy wrote:
>>>
>>> We have around 5 million items in our index, and each item has a
>>> description located in a separate physical database. These item
>>> descriptions vary in size and for the most part are quite large.
>>> Currently we are only indexing items, not their corresponding
>>> descriptions, and a full import takes around 4 hours. Ideally we want
>>> to index both our items and their descriptions, but after some quick
>>> profiling I determined that a full import would take in excess of
>>> 24 hours.
>>>
>>> - How would I profile the indexing process to determine whether the
>>> bottleneck is Solr or our database?
>>
>> As a data point, I routinely see clients index 5M items on normal
>> hardware in approx. 1 hour (give or take 30 minutes).
>>
>> When you say "quite large", what do you mean? Are we talking books
>> here, or a couple of pages of text, or just a couple of KB of data?
>>
>> How long does it take you to get that data out (and, from the sounds
>> of it, merge it with your items) without going to Solr?
>>
>>> - In either case, how would one speed up this process? Is there a way
>>> to run parallel import processes and then merge them together at the
>>> end? Possibly use some sort of distributed computing?
>>
>> DataImportHandler now supports multiple threads. The absolute fastest
>> way that I know of to index is via multiple threads sending batches of
>> documents at a time (at least 100). Often, from DBs, one can split up
>> the table via SQL statements that can then be fetched separately. You
>> may want to write your own multithreaded client to index.
>
> SOLR-1301 is also an option if you are familiar with Hadoop ...
>
> --
> Best regards,
> Andrzej Bialecki
> Information Retrieval, Semantic Web, Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
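The multithreaded-DIH advice above can be sketched as a data-config fragment. This is a hedged sketch, not a tested configuration: the `threads` attribute on the root entity was added to DIH around this time (SOLR-1352), and all table, column, and dataSource names here are hypothetical stand-ins for your schema:

```xml
<!-- data-config.xml sketch; table/column names are hypothetical -->
<dataConfig>
  <!-- Two named dataSources, since the descriptions live in a separate
       physical database. batchSize streams rows instead of buffering
       the whole result set in memory. -->
  <dataSource name="items-db" driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://items-host/items" batchSize="500"/>
  <dataSource name="desc-db" driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://desc-host/descriptions" batchSize="500"/>
  <document>
    <!-- threads="4": DIH indexes the root entity with multiple threads -->
    <entity name="item" dataSource="items-db" threads="4"
            query="SELECT id, name FROM item">
      <!-- Sub-entity pulls the large description for each item -->
      <entity name="description" dataSource="desc-db"
              query="SELECT description FROM item_description
                     WHERE item_id = '${item.id}'"/>
    </entity>
  </document>
</dataConfig>
```

Note that the per-item sub-entity issues one query per row, i.e. ~5M extra round trips, which may itself be the bottleneck; the table-splitting suggestion above (e.g. parallel imports each fetching `WHERE MOD(id, 4) = n`) or a single joined query avoids that pattern.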
I haven't worked with Hadoop before, but I'm willing to try anything to cut down this full import time. I see this currently uses the embedded Solr server for indexing... would I have to scrap my DIH importing then?

--
View this message in context: http://lucene.472066.n3.nabble.com/Importing-large-datasets-tp863447p865103.html
Sent from the Solr - User mailing list archive at Nabble.com.