Re: Data Import handler and join select

2014-08-08 Thread Alejandro Marqués Rodríguez
First of all, thank you very much for the answer, James. It is very thorough
and gives us several alternatives :)

I think we will try the cache approach first, as performance has improved
after solving this problem (https://issues.apache.org/jira/browse/SOLR-5954),
so along with the cache solution we may achieve the expected performance.

We've also tried modifying the transformers and we've got it working the
way we were looking for, though the solutions you propose seem to be much
cleaner.

Regarding indexing through solrj, it was our first idea. The problem is that
when we started the project the DIH seemed to fit our needs perfectly, until
we tried with real data and noticed the performance issues, so now it may be
a bit late for us to change everything :( If we have no other option we will
go that way, but we want to try less drastic solutions first.

Thanks!



RE: Data Import handler and join select

2014-08-07 Thread Dyer, James
Alejandro,

You can use a sub-entity with a cache using DIH.  This will solve the 
n+1-select problem and make it run quickly.  Unfortunately, the only built-in 
cache implementation is in-memory so it doesn't scale.  There is a fast, 
disk-backed cache using bdb-je, which I use in production.  See 
https://issues.apache.org/jira/browse/SOLR-2613 .  You will need to build this 
yourself and include it on the classpath, and obtain a copy of bdb-je from 
Oracle.  While bdb-je is open source, its license is incompatible with ASL so 
this will never officially be part of Solr.
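
For example, the extra jars might be placed on the classpath with lib
directives in solrconfig.xml (a sketch only; the jar names and paths here are
hypothetical):

<!-- Load the built SOLR-2613 cache jar and Oracle's bdb-je jar (hypothetical paths). -->
<lib path="../../lib/solr-dih-bdb-cache.jar" />
<lib path="../../lib/je-5.0.73.jar" />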

Once you have a disk-backed cache, you can specify it on the child entity like 
this:
<entity name="parent" query="select id, ... from parent table">
  <entity
    name="child"
    query="select foreignKey, ... from child_table"
    cacheKey="foreignKey"
    cacheLookup="parent.id"
    processor="SqlEntityProcessor"
    transformer="..."
    cacheImpl="BerkleyBackedCache"
  />
</entity>

If you don't want to go down this path, you can achieve this all with one 
query, if you include an ORDER BY to sort by whatever field is used as Solr's 
uniqueKey, and add a dummy row at the end with a UNION:

SELECT p.uniqueKey, ..., 'A' as lastInd from PRODUCTS p 
INNER JOIN DESCRIPTIONS d ON p.uniqueKey = d.productKey
UNION SELECT 0 as uniqueKey, ... , 'B' as lastInd from dual 
ORDER BY uniqueKey, lastInd

Then your transformer would need to keep the lastUniqueKey in an instance 
variable and keep a running map of everything it's seen for that key.  When the 
key changes, or if on the last row, send that map as the document.  Otherwise, 
the transformer returns null.  This will collect data from each row seen onto 
one document.
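
A sketch of such a transformer (illustrative only, not code from this thread
or from SOLR-2613; names like RowCollectingTransformer are made up, and it
assumes DIH's convention of invoking a transformRow method reflectively and
skipping rows for which it returns null):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * Collapses consecutive rows sharing a uniqueKey into one document.
 * Assumes the query is sorted by uniqueKey and ends with the dummy
 * 'B' row, so the last real document is flushed when that row arrives.
 */
public class RowCollectingTransformer {

    private Object lastUniqueKey;           // key of the group being collected
    private Map<String, Object> pending;    // fields accumulated for that key

    public Object transformRow(Map<String, Object> row) {
        Object key = row.get("uniqueKey");
        if (pending != null && key.equals(lastUniqueKey)) {
            addLanguageFields(pending, row);  // same product: merge, skip row
            return null;
        }
        Map<String, Object> finished = pending;  // document for the previous key
        pending = new HashMap<>(row);
        pending.remove("LANGUAGE");              // folded into description_<lang>
        pending.remove("DESCRIPTION");
        pending.put("languages", new ArrayList<>());
        addLanguageFields(pending, row);
        lastUniqueKey = key;
        return finished;  // null on the very first row, which is also skipped
    }

    @SuppressWarnings("unchecked")
    private void addLanguageFields(Map<String, Object> doc, Map<String, Object> row) {
        Object lang = row.get("LANGUAGE");
        ((List<Object>) doc.get("languages")).add(lang);         // multivalued field
        doc.put("description_" + lang, row.get("DESCRIPTION"));  // dynamic field
    }
}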

Keep in mind also that in a lot of cases like this, it might just be easiest 
to write a program that uses solrj to send your documents rather than trying to 
make DIH's features fit your use-case.
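
If you do go the solrj route, a minimal sketch might look like this (assuming
the Solr 4.x-era client class HttpSolrServer; the URL and field names are
placeholders taken from the example below):

import java.util.Arrays;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class ProductIndexer {
    public static void main(String[] args) throws Exception {
        HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/products");

        // Build one document per product, with all languages merged in.
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", 1);
        doc.addField("name", "Product");
        doc.addField("brand", "Brand");
        doc.addField("price", 10);
        for (String lang : Arrays.asList("es", "en")) {
            doc.addField("languages", lang);                        // multivalued field
        }
        doc.addField("description_es", "Descripción en español");   // dynamic fields,
        doc.addField("description_en", "English description");      // one per language

        server.add(doc);
        server.commit();
        server.shutdown();
    }
}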

James Dyer
Ingram Content Group
(615) 213-4311


-Original Message-
From: Alejandro Marqués Rodríguez [mailto:amarq...@paradigmatecnologico.com] 
Sent: Thursday, August 07, 2014 1:43 AM
To: solr-user@lucene.apache.org
Subject: Data Import handler and join select

Hi,

I have a problem when indexing with the data import handler while doing a
join select. I have two tables, one with products and another one with
descriptions for each product in several languages.

So it would be:

Products: ID, NAME, BRAND, PRICE, ...
Descriptions: ID, LANGUAGE, DESCRIPTION

I would like to have every product indexed as a document with a multivalued
field language, which contains every language that has an associated
description, and several dynamic fields, description_<language>, one for each
language.

So it would be for example:

Id: 1
Name: Product
Brand: Brand
Price: 10
Languages: [es,en]
Description_es: Descripción en español
Description_en: English description
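
For reference, such fields might be declared in schema.xml along these lines
(a sketch; the field types and exact names are assumptions based on the
example above):

<field name="languages" type="string" indexed="true" stored="true" multiValued="true" />
<dynamicField name="description_*" type="text_general" indexed="true" stored="true" />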

Our first approach was using sub-entities in the data import handler, and
after implementing some transformers we had everything indexed as we wanted.
The sub-entity process added the descriptions for each language to the Solr
document and then indexed them.

The problem was performance. I've read that using sub-entities affects
performance greatly, so we changed our process to use a join instead.

Performance was greatly improved this way, but now we have a problem. Each
time a row is processed, a Solr document is generated and indexed into Solr,
but the data is not added to any previously indexed document; it replaces it.

If we had the previous example the query resulting from the join would be:

Id - Name - Brand - Price - Language - Description
1 - Product - Brand - 10 - es - Descripción en español
1 - Product - Brand - 10 - en - English description

So when indexing, as both rows have the same id, the only information I end
up with is the second row.

Is there any way for the data import handler to manage this and allow the
documents to be indexed so that they update any previous data?

Thanks in advance



-- 
Alejandro Marqués Rodríguez

Paradigma Tecnológico
http://www.paradigmatecnologico.com
Avenida de Europa, 26. Ática 5. 3ª Planta
28224 Pozuelo de Alarcón
Tel.: 91 352 59 42