RE: Fast DIH with 1:M multValue entities
Search the list for my post "DIH - deleting documents, high performance (delta) imports, and passing parameters", which shows a different approach to 1:M sub-entities.

Ephraim Ofir

-----Original Message-----
From: Tim Gilbert [mailto:tim.gilb...@morningstar.com]
Sent: Thursday, April 14, 2011 6:02 PM
To: solr-user@lucene.apache.org
Subject: RE: Fast DIH with 1:M multValue entities

How did I miss that? Thanks, I will try that, as it seems to be the in-memory lookup solution I needed.

Thanks Erick,

Tim

-----Original Message-----
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Thursday, April 14, 2011 10:58 AM
To: solr-user@lucene.apache.org
Subject: Re: Fast DIH with 1:M multValue entities

I'm not sure this applies, but have you looked at
http://wiki.apache.org/solr/DataImportHandler#CachedSqlEntityProcessor

Best
Erick

On Thu, Apr 14, 2011 at 9:12 AM, Tim Gilbert tim.gilb...@morningstar.com wrote:

We are working on importing a large number of records into Solr using DIH. We have one schema with ~2000 fields declared, which map to several database schemas, so that typically each document will have ~500 fields in use. We have about 2 million rows to import, and the import takes about 20 minutes in test across 14 different entities, which really map to one virtual document. Then we added our multiValued fields and, well, it didn't work out nearly as well. :-)

We have several fields which are 1:M, so in our data-config.xml we might have something like this:

    <document name="allfund">
      <entity name="FundId" dataSource="getFundManager"
              query="{call dbo.getFundManager_Id()}">
        <field column="FundId" name="HS04C" />
        <entity name="FundData" dataSource="getFundManager"
                query="{call dbo.getFundManager_Data(${FundId.FundId})}">
          <field column="ManagerName" name="OF015" />
        </entity>
      </entity>
    </document>

That is a lot of database queries for a small result set, which is really slowing things down for us.

My question is more a request for advice, so it's a multi-parter :-)

1) Is there a way to declare in DIH an in-memory lookup, where we can fetch the entire Many side of the relationship in one database query and match up on the PK? Then we can declare that field multiValued.

2) Assuming that isn't currently available, I thought about denormalizing the 1:M into a delimited list and then using http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.WordDelimiterFilterFactory to tokenize. That would allow us to search on individual bits, and we could build something into the front-end to handle the display. That means we wouldn't use multiValued, we'd have to modify our db, and we'd lose out on some of the abilities.

3) The third option was to open up DIH and try to add the first feature into it ourselves.

Am I approaching this the right way? Are there other ways I haven't considered or don't know about?

Thanks in advance,

Tim
Re: Fast DIH with 1:M multValue entities
I'm not sure this applies, but have you looked at
http://wiki.apache.org/solr/DataImportHandler#CachedSqlEntityProcessor

Best
Erick
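
For readers following the thread, here is a minimal sketch of how the CachedSqlEntityProcessor suggestion might be applied to the data-config.xml from the original question. It is an illustration under assumptions, not a tested configuration: dbo.getFundManager_AllData() is a hypothetical stored procedure, assumed to return one (FundId, ManagerName) row per manager for every fund in a single call. The processor and where attributes follow the form shown on the DataImportHandler wiki page linked above.

```xml
<!-- Sketch only: dbo.getFundManager_AllData() is a hypothetical procedure that
     returns (FundId, ManagerName) rows for all funds in one call. -->
<document name="allfund">
  <entity name="FundId" dataSource="getFundManager"
          query="{call dbo.getFundManager_Id()}">
    <field column="FundId" name="HS04C" />
    <!-- The sub-entity query runs once; its rows are cached in memory and
         looked up by FundId for each parent row. -->
    <entity name="FundData" dataSource="getFundManager"
            processor="CachedSqlEntityProcessor"
            query="{call dbo.getFundManager_AllData()}"
            where="FundId=FundId.FundId">
      <field column="ManagerName" name="OF015" />
    </entity>
  </entity>
</document>
```

In the where attribute, the left-hand side names the column in the cached sub-entity rows and the right-hand side names the parent entity and its column. For a fund with several managers to keep all of them, OF015 would need to be declared multiValued in schema.xml. The cache holds the whole Many side in memory, so this trades heap for fewer database round trips.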
RE: Fast DIH with 1:M multValue entities
How did I miss that? Thanks, I will try that, as it seems to be the in-memory lookup solution I needed.

Thanks Erick,

Tim