On Tue, Apr 28, 2009 at 3:43 AM, Amit Nithian <anith...@gmail.com> wrote:

> All,
> I have a few questions regarding the data import handler. We have some
> pretty gnarly SQL queries to load our indices and our current loader
> implementation is extremely fragile. I am looking to migrate over to the
> DIH; however, I am looking to use SolrJ + EmbeddedSolr + some custom stuff
> to remotely load the indices so that my index loader and main search engine
> are separated.


Currently, if you want to use DIH, the Solr master doubles as the index
loader as well.


>
> Currently, unless I am missing something, the data gathering from the
> entity and the data processing (i.e. conversion to a Solr Document) are
> done sequentially, and I was looking to make this execute in parallel so
> that I can have multiple threads processing different parts of the result
> set and loading documents into Solr. Secondly, I need to create temporary
> tables to store the results of a few queries and use them later for inner
> joins, and I was wondering how best to go about this.
>
> I am thinking to add support in DIH for the following:
> 1) Temporary tables (maybe call it temporary entities)? --Specific only to
> SQL though unless it can be generalized to other sources.


Pretty specific to DBs. However, isn't this something that can be done in
your database with views?
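
For example, a view can capture the join you would otherwise stage in a
temporary table, and the DIH entity can query it directly. A minimal sketch
(the table, column, and view names here are hypothetical, not from your
schema):

```sql
-- Hypothetical: replace tables/columns with your own schema.
-- The view does the expensive join once, on the database side.
CREATE VIEW product_summary AS
SELECT p.id, p.name, SUM(o.quantity) AS total_sold
FROM products p
JOIN order_items o ON o.product_id = p.id
GROUP BY p.id, p.name;
```

Your data-config.xml entity would then just use
query="SELECT id, name, total_sold FROM product_summary", with no DIH
changes needed.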


>
> 2) Parallel support


Parallelizing import of root-entities might be the easiest to attempt.
There's also an issue open to write to Solr (tokenization/analysis) in a
separate thread. Look at https://issues.apache.org/jira/browse/SOLR-1089

We actually wrote a multi-threaded DIH during the initial iterations, but we
discarded it because we found that the bottleneck was usually the database
(too many queries) or Lucene indexing itself (analysis, tokenization). The
improvement was ~10%, but it made the code substantially more complex.

The only scenario in which it helped a lot was importing over HTTP or from a
remote database (slow networks). But if you think it can help in your
scenario, I'd say go for it.


>
>  - Including some mechanism to get the number of records (whether it be a
> count or MAX(custom_id)-MIN(custom_id))


Not sure what you mean here.


>
> 3) Support in DIH or Solr to post documents to a remote index (i.e. create
> a
> new UpdateHandler instead of DirectUpdateHandler2).
>

SolrJ integration would be helpful to many, I think. There's an issue open.
Look at https://issues.apache.org/jira/browse/SOLR-853
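
In the meantime, note that a SolrJ-based loader can already post to a remote
Solr over HTTP without a custom UpdateHandler. A minimal sketch, assuming a
SolrJ 1.3-era classpath; the host URL and field names are made up for
illustration:

```java
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class RemoteLoader {
    public static void main(String[] args) throws Exception {
        // Assumed URL; point this at your remote Solr master.
        SolrServer server = new CommonsHttpSolrServer("http://search-host:8983/solr");

        // Build and send one document (hypothetical fields).
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "1");
        doc.addField("name", "example");
        server.add(doc);
        server.commit();
    }
}
```

That keeps the index loader and the search master on separate machines, which
sounds like what you were after.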

-- 
Regards,
Shalin Shekhar Mangar.
