Re: mergeFactor / indexing speed

Avlesh Singh Thu, 06 Aug 2009 10:36:02 -0700

>
> Do you think it's possible to return (in the nested entity) rows
> independent of the unique id, and let the processor decide when a document
> is complete?
>
I don't think so.


In my case, I had 9 (JDBC) entities for each document. Most of these
entities returned a single column and a limited number of rows for each
document. I observed a significant improvement in performance by using an
aggregation query in my parent query. For example, in MySQL I used the
group_concat() function to aggregate all the values (separated by some
delimiter) into a single column of the parent query's result set. I would
then use a RegexTransformer to split this data on that same delimiter to
populate a multi-valued field.
This let me get rid of 5 of the 9 entities in my data-config, and it
reduced the import time significantly too.
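
A minimal sketch of the aggregate-then-split idea described above, written
in Python against an in-memory SQLite database rather than MySQL and DIH
(SQLite also provides group_concat()). The doc and doc_tag tables, the column
names and the "|" delimiter are made up for illustration; in DIH itself the
split step would be done by the RegexTransformer (its splitBy attribute), not
by Python code.

```python
import sqlite3

# Hypothetical schema: one parent row per document, several child rows each.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE doc (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE doc_tag (doc_id INTEGER, tag TEXT);
    INSERT INTO doc VALUES (1, 'first'), (2, 'second'), (3, 'third');
    INSERT INTO doc_tag VALUES (1, 'a'), (1, 'b'), (2, 'c');
""")

DELIM = "|"  # assumed delimiter; it must never occur inside a value

# One parent query replaces the per-document child query: the child values
# are aggregated into a single delimited column of the parent result set.
rows = conn.execute(
    """
    SELECT d.id, d.title, group_concat(t.tag, ?) AS tags
    FROM doc d
    LEFT JOIN doc_tag t ON t.doc_id = d.id
    GROUP BY d.id, d.title
    ORDER BY d.id
    """,
    (DELIM,),
).fetchall()

# Split the aggregated column back into a multi-valued field. Documents
# without child rows come back with NULL (None) and get an empty list.
docs = [
    {"id": doc_id, "title": title, "tags": tags.split(DELIM) if tags else []}
    for doc_id, title, tags in rows
]

for doc in docs:
    print(doc)
```

The point is that the database is queried once for the whole result set
instead of once per document and child entity. The order of the values
inside the aggregated column is not guaranteed here.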

Cheers
Avlesh

On Thu, Aug 6, 2009 at 10:15 PM, Chantal Ackermann <
chantal.ackerm...@btelligent.de> wrote:

> Hi all,
>
> to keep this thread up to date... ;-)
>
>
> d) jdbc batch size
> changed to 10. (Was default: 500, then 1000)
>
> The problem with my dih setup is that the root entity query returns a huge
> set (all ids that shall be indexed). A larger fetchsize would be good for
> that query.
> The nested entity, however, returns only up to 9 rows, ever. The constraints
> are so strict (by id) that there is no way that any additional data could be
> pre-fetched.
> (Actually, anyone using DIH with nested entities should run into that
> problem?)
>
> After changing to 10, I cannot see that this low batch size slowed the
> indexer down (significantly).
>
> As I would like to stick with DIH (instead of dumping the data into CSV and
> importing it afterwards), here is my question:
>
> Do you think it's possible to return (in the nested entity) rows
> independent of the unique id, and let the processor decide when a document
> is complete?
> The examples in the wiki always use an ID to get the data for the nested
> entity, so I'm not sure it was planned with that in mind. But as I'm already
> handling multiple db rows for one document, it might not be too difficult to
> change to handling the unique id correctly, as well?
> Of course, I would need something like a look ahead to know whether the
> next row is already part of the next document.
>
>
> Cheers,
> Chantal
>
>
>
> Concerning the other settings (just fyi):
>
> a) mergeFactor 10 (and also tried 100)
> I don't think that changed anything for the worse, rather for the better. So,
> I'll stick with 10 from now on.
>
> b) ramBufferSizeMB
> tried 512, 1024. RAM usage went up when I increased from 256 to 512. Not
> sure about 1024. I'll stick to 512.
>
>
>
