Re: Large RDBMS dataset

Erick Erickson Wed, 14 Dec 2011 10:48:41 -0800

You can also consider using SolrJ to do this. I posted a small example a couple
of days ago.


Best
Erick

On Wed, Dec 14, 2011 at 10:39 AM, Gora Mohanty <g...@mimirtech.com> wrote:
> On Wed, Dec 14, 2011 at 3:48 PM, Finotti Simone <tech...@yoox.com> wrote:
>> Hello,
>> I have a very large dataset (> 1 million records) in an RDBMS which I want
>> my Solr application to pull data from.
> [...]
>
>> It works, but it takes 1 min 38 s to parse 100 records, i.e. about 1 rec/s!
>> That means that digesting the whole dataset would take about 1 million
>> seconds (roughly 12 days).
>
> Depending on the size of the data that you are pulling from
> the database, 1M records is not really that large a number.
> We were doing ~75GB of stored data from ~7 million records
> in about 9h, including quite complicated transformers. I would
> imagine that there is much room for improvement in your case
> also. Some notes on this:
> * If you have servers to throw at the problem, and a sensible
>  way to shard your RDBMS data, use parallel indexing to
>  multiple Solr cores, maybe on multiple servers, followed by
>  a merge. In our experience, given enough RAM and adequate
>  provisioning of database servers, indexing speed scales linearly
>  with the total no. of cores.
> * Replicate your database, manually if needed. Look at the load
>  on a database server during the indexing process, and provision
>  enough database servers to match the no. of Solr indexing servers.
> * This point is leading into flamewar territory, but consider switching
>  databases. From our (admittedly non-rigorous) measurements,
>  MySQL was at least a factor of 2-3 faster than MS-SQL, with the
>  same dataset.
> * Look at cloud computing. If finances permit, one should be able
>  to shrink indexing times to almost any desired level. E.g., for the
>  dataset that we used, I have little doubt that we could have shrunk
>  the time down to less than 1h, at an affordable cost on Amazon EC2.
>  Unfortunately, we have not yet had the opportunity to try this.
>
>> The problem is that for each record in "fd", Solr makes three distinct
>> SELECTs on the other three tables. Of course, this is very inefficient.
>>
>> Is there a way to have Solr load every record from the four tables and
>> join them once they are loaded in memory?
>
> For various reasons, we did not investigate this in depth,
> but you could also look at Solr's CachedSqlEntityProcessor.
>
> Regards,
> Gora
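
To make the "load everything, then join in memory" idea from the question
concrete, here is a minimal sketch in Python using the standard sqlite3 module.
It is not how Solr or CachedSqlEntityProcessor is implemented; it only
illustrates the general technique of reading each child table once and looking
rows up by key instead of issuing three SELECTs per "fd" record. Only the table
name "fd" comes from the thread; the child tables (colors, sizes, tags), their
columns, and the sample rows are hypothetical.

```python
import sqlite3
from collections import defaultdict


def build_demo_db():
    """Create a tiny in-memory database (hypothetical schema and data)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE fd (id INTEGER PRIMARY KEY, title TEXT);
        CREATE TABLE colors (fd_id INTEGER, color TEXT);
        CREATE TABLE sizes (fd_id INTEGER, size TEXT);
        CREATE TABLE tags (fd_id INTEGER, tag TEXT);
        INSERT INTO fd VALUES (1, 'shirt'), (2, 'shoe');
        INSERT INTO colors VALUES (1, 'red'), (1, 'blue'), (2, 'black');
        INSERT INTO sizes VALUES (1, 'M'), (2, '42');
        INSERT INTO tags VALUES (2, 'leather');
    """)
    return conn


def load_lookup(conn, table, column):
    """Read a whole child table with ONE query and group values by fd_id."""
    lookup = defaultdict(list)
    # table and column come from the fixed mapping below, never from user input
    query = f"SELECT fd_id, {column} FROM {table} ORDER BY rowid"
    for fd_id, value in conn.execute(query):
        lookup[fd_id].append(value)
    return lookup


def build_documents(conn):
    """Build one document per fd row using 4 queries in total, not 1 + 3*N."""
    children = {"colors": "color", "sizes": "size", "tags": "tag"}
    lookups = {col: load_lookup(conn, tbl, col) for tbl, col in children.items()}
    docs = []
    for fd_id, title in conn.execute("SELECT id, title FROM fd ORDER BY id"):
        doc = {"id": fd_id, "title": title}
        for col, lookup in lookups.items():
            # In-memory join: a dictionary lookup replaces a per-row SELECT
            doc[col] = lookup.get(fd_id, [])
        docs.append(doc)
    return docs


docs = build_documents(build_demo_db())
for doc in docs:
    print(doc)
```

The trade-off is memory: the child tables must fit in RAM on the indexing
machine. If they do not, a single SQL JOIN on the database side, or splitting
the parent rows into ranges, may be a better fit.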
