Re: Data Import handler and join select
First of all thank you very much for the answer, James. It is very complete and it gives us several alternatives :) I think we will try first the cache approach, as, after solving this problem https://issues.apache.org/jira/browse/SOLR-5954 the performance has been improved, so along with the cache solution we may achieve the expected performance. We've also tried modifying the transformers and we've got it working the way we were looking for, though the solutions you propose seem to be much cleaner. Regarding indexing through solrj it was our first idea, the problem is when we started the project, the DIH seemed to fit our needs perfectly, until we tried with real data and realized about the performance issues, so, now maybe it's a bit late for us trying to change everything :( If we have no other option we will go that way but we need to try less drastic solutions first. Thanks! 2014-08-07 18:11 GMT+02:00 Dyer, James : > Alejandro, > > You can use a sub-entity with a cache using DIH. This will solve the > "n+1-select" problem and make it run quickly. Unfortunately, the only > built-in cache implementation is in-memory so it doesn't scale. There is a > fast, disk-backed cache using bdb-je, which I use in production. See > https://issues.apache.org/jira/browse/SOLR-2613 . You will need to build > this youself and include it on the classpath, and obtain a copy of bdb-je > from Oracle. While bdb-je is open source, its license is incompatible with > ASL so this will never officially be part of Solr. > > Once you have a disk-backed cache, you can specify it on the child entity > like this: > > name="child" > query="select foreignKey, ... from child_table" > cacheKey="foreignKey" > cacheLookup="parent.id" > processor="SqlEntityProcessor" > transformer="..." > cacheImpl="BerkleyBackedCache" > /> > > > If you don't want to go down this path, you can achieve this all with one > query, if you include and ORDER BY to sort by whatever field is used as > Solr's uniqueKey, and add a dummy row at the end with a UNION: > > SELECT p.uniqueKey, ..., 'A' as lastInd from PRODUCTS p > INNER JOIN DESCRIPTIONS d ON p.uniqueKey = d.productKey > UNION SELECT 0 as uniqueKey, ... , 'B' as lastInd from dual > ORDER BY uniqueKey, lastInd > > Then your transformer would need to keep the "lastUniqueKey" in an > instance variable and keep a running map of everything its seen for that > key. When the key changes, or if on the last row, send that map as the > document. Otherwise, the transformer returns null. This will collect data > from each row seen onto one document. > > Keep in mind also, that in a lot of cases like this, it might just be > easiest to write a program that uses solrj to send your documents rather > than trying to make DIH's features fit your use-case. > > James Dyer > Ingram Content Group > (615) 213-4311 > > > -Original Message- > From: Alejandro Marqués Rodríguez [mailto: > amarq...@paradigmatecnologico.com] > Sent: Thursday, August 07, 2014 1:43 AM > To: solr-user@lucene.apache.org > Subject: Data Import handler and join select > > Hi, > > I have one problem while indexing with data import hadler while doing a > join select. I have two tables, one with products and another one with > descriptions for each product in several languages. > > So it would be: > > Products: ID, NAME, BRAND, PRICE, ... > Descriptions: ID, LANGUAGE, DESCRIPTION > > I would like to have every product indexed as a document with a multivalued > field "language" which contains every language that has an associated > description and several dinamic fields "description_" one for each > language. > > So it would be for example: > > Id: 1 > Name: Product > Brand: Brand > Price: 10 > Languages: [es,en] > Description_es: Descripción en español > Description_en: English description > > Our first approach was using sub-entities for the data import handler and > after implementing some transformers we had everything indexed as we > wanted. The sub-entity process added the descriptions for each language to > the solr document and then indexed them. > > The problem was performance. I've read that using sub-entities affected > performance greatly, so we changed our process in order to use a join > instead. > > Performance was greatly improved this way but now we have a problem. Each > time a row is processed a solr document is generated and indexed into so
RE: Data Import handler and join select
Alejandro, You can use a sub-entity with a cache using DIH. This will solve the "n+1-select" problem and make it run quickly. Unfortunately, the only built-in cache implementation is in-memory so it doesn't scale. There is a fast, disk-backed cache using bdb-je, which I use in production. See https://issues.apache.org/jira/browse/SOLR-2613 . You will need to build this youself and include it on the classpath, and obtain a copy of bdb-je from Oracle. While bdb-je is open source, its license is incompatible with ASL so this will never officially be part of Solr. Once you have a disk-backed cache, you can specify it on the child entity like this: If you don't want to go down this path, you can achieve this all with one query, if you include and ORDER BY to sort by whatever field is used as Solr's uniqueKey, and add a dummy row at the end with a UNION: SELECT p.uniqueKey, ..., 'A' as lastInd from PRODUCTS p INNER JOIN DESCRIPTIONS d ON p.uniqueKey = d.productKey UNION SELECT 0 as uniqueKey, ... , 'B' as lastInd from dual ORDER BY uniqueKey, lastInd Then your transformer would need to keep the "lastUniqueKey" in an instance variable and keep a running map of everything its seen for that key. When the key changes, or if on the last row, send that map as the document. Otherwise, the transformer returns null. This will collect data from each row seen onto one document. Keep in mind also, that in a lot of cases like this, it might just be easiest to write a program that uses solrj to send your documents rather than trying to make DIH's features fit your use-case. James Dyer Ingram Content Group (615) 213-4311 -Original Message- From: Alejandro Marqués Rodríguez [mailto:amarq...@paradigmatecnologico.com] Sent: Thursday, August 07, 2014 1:43 AM To: solr-user@lucene.apache.org Subject: Data Import handler and join select Hi, I have one problem while indexing with data import hadler while doing a join select. I have two tables, one with products and another one with descriptions for each product in several languages. So it would be: Products: ID, NAME, BRAND, PRICE, ... Descriptions: ID, LANGUAGE, DESCRIPTION I would like to have every product indexed as a document with a multivalued field "language" which contains every language that has an associated description and several dinamic fields "description_" one for each language. So it would be for example: Id: 1 Name: Product Brand: Brand Price: 10 Languages: [es,en] Description_es: Descripción en español Description_en: English description Our first approach was using sub-entities for the data import handler and after implementing some transformers we had everything indexed as we wanted. The sub-entity process added the descriptions for each language to the solr document and then indexed them. The problem was performance. I've read that using sub-entities affected performance greatly, so we changed our process in order to use a join instead. Performance was greatly improved this way but now we have a problem. Each time a row is processed a solr document is generated and indexed into solr, but the data is not added to any previous data, but it replaces it. If we had the previous example the query resulting from the join would be: Id - Name - Brand - Price - Language - Description 1 - Product - Brand - 10 - es - Descripción en español 1 - Product - Brand - 10 - en - English description So when indexing as both have the same id the only information I get is the second row. Is there any way for data import handler to manage this and allow the documents to be indexed updating any previous data? Thanks in advance -- Alejandro Marqués Rodríguez Paradigma Tecnológico http://www.paradigmatecnologico.com Avenida de Europa, 26. Ática 5. 3ª Planta 28224 Pozuelo de Alarcón Tel.: 91 352 59 42
Data Import handler and join select
Hi, I have one problem while indexing with data import hadler while doing a join select. I have two tables, one with products and another one with descriptions for each product in several languages. So it would be: Products: ID, NAME, BRAND, PRICE, ... Descriptions: ID, LANGUAGE, DESCRIPTION I would like to have every product indexed as a document with a multivalued field "language" which contains every language that has an associated description and several dinamic fields "description_" one for each language. So it would be for example: Id: 1 Name: Product Brand: Brand Price: 10 Languages: [es,en] Description_es: Descripción en español Description_en: English description Our first approach was using sub-entities for the data import handler and after implementing some transformers we had everything indexed as we wanted. The sub-entity process added the descriptions for each language to the solr document and then indexed them. The problem was performance. I've read that using sub-entities affected performance greatly, so we changed our process in order to use a join instead. Performance was greatly improved this way but now we have a problem. Each time a row is processed a solr document is generated and indexed into solr, but the data is not added to any previous data, but it replaces it. If we had the previous example the query resulting from the join would be: Id - Name - Brand - Price - Language - Description 1 - Product - Brand - 10 - es - Descripción en español 1 - Product - Brand - 10 - en - English description So when indexing as both have the same id the only information I get is the second row. Is there any way for data import handler to manage this and allow the documents to be indexed updating any previous data? Thanks in advance -- Alejandro Marqués Rodríguez Paradigma Tecnológico http://www.paradigmatecnologico.com Avenida de Europa, 26. Ática 5. 3ª Planta 28224 Pozuelo de Alarcón Tel.: 91 352 59 42