RE: Fast DIH with 1:M multValue entities

2011-04-17 Thread Ephraim Ofir
Search the list for my post "DIH - deleting documents, high performance
(delta) imports, and passing parameters", which shows a different
approach to 1:M sub-entities.

Ephraim Ofir

-----Original Message-----
From: Tim Gilbert [mailto:tim.gilb...@morningstar.com] 
Sent: Thursday, April 14, 2011 6:02 PM
To: solr-user@lucene.apache.org
Subject: RE: Fast DIH with 1:M multValue entities

How did I miss that?  Thanks, I will try that, as it seems to be the
in-memory lookup solution I needed.

Thanks Erick,

Tim

-----Original Message-----
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: Thursday, April 14, 2011 10:58 AM
To: solr-user@lucene.apache.org
Subject: Re: Fast DIH with 1:M multValue entities

I'm not sure this applies, but have you looked at
http://wiki.apache.org/solr/DataImportHandler#CachedSqlEntityProcessor

Best
Erick

On Thu, Apr 14, 2011 at 9:12 AM, Tim Gilbert
tim.gilb...@morningstar.com wrote:

 We are working on importing a large number of records into Solr using
 DIH.  We have one schema with ~2000 fields declared which map off to
 several database schemas so that typically each document will have ~500
 fields in use.  We have about 2 million rows which we are importing,
 and we are seeing  20 minutes in test across 14 different entities
 which really map off to one virtual document.  Then we added our
 multiValue stuff and, well, it didn't work out nearly as well. :-)

 We have several fields which are 1:M and so in our data-config.xml we
 might have something like this:

 <document name="allfund">
   <entity name="FundId" dataSource="getFundManager"
           query="{call dbo.getFundManager_Id()}">
     <field column="FundId" name="HS04C" />
     <entity name="FundData" dataSource="getFundManager"
             query="{call dbo.getFundManager_Data(${FundId.FundId})}">
       <field column="ManagerName" name="OF015" />
     </entity>
   </entity>
 </document>

 That is a lot of database queries for a small result set, which is really
 slowing things down for us.

 My question is more to ask advice, so it's a multi-parter :-)

 1) Is there a way to declare in DIH an in-memory lookup where we can
    query for the entire Many side of the query in one database query, and
    match up on the PK?  Then we can declare that field multiValued.

 2) Assuming that isn't currently available, I thought about denormalizing
    the 1:M into a delimited list and then using
    http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.WordDelimiterFilterFactory
    to tokenize.  That would allow us to search on individual bits, and
    build something into the front-end to handle the display.  That means
    we wouldn't use multiValued and we'd have to modify our db, but we'd
    lose out on some of the abilities.

 3) The third option was to open up DIH and try to add the first feature
    into it ourselves.

 Am I approaching this the right way?  Are there other ways I haven't
 considered or don't know about?

 Thanks in advance,

 Tim


Re: Fast DIH with 1:M multValue entities

2011-04-14 Thread Erick Erickson
I'm not sure this applies, but have you looked at
http://wiki.apache.org/solr/DataImportHandler#CachedSqlEntityProcessor

Best
Erick
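
For illustration only, here is a rough sketch of how the data-config.xml
quoted in this thread might look with CachedSqlEntityProcessor. It is
not a tested configuration. It assumes a hypothetical stored procedure,
dbo.getFundManager_AllData(), that returns every (FundId, ManagerName)
row in a single result set; that name does not appear in the original
post. The processor and where attributes follow the example on the wiki
page linked above.

```xml
<document name="allfund">
  <entity name="FundId" dataSource="getFundManager"
          query="{call dbo.getFundManager_Id()}">
    <field column="FundId" name="HS04C" />
    <!-- Assumption: dbo.getFundManager_AllData() is hypothetical. It is
         meant to return all rows of the Many side at once, so the query
         runs a single time and the rows are cached in memory. -->
    <entity name="FundData" dataSource="getFundManager"
            processor="CachedSqlEntityProcessor"
            query="{call dbo.getFundManager_AllData()}"
            where="FundId=FundId.FundId">
      <!-- where="childColumn=parentEntity.parentColumn" tells DIH which
           cached rows belong to the current parent row. -->
      <field column="ManagerName" name="OF015" />
    </entity>
  </entity>
</document>
```

With this layout the sub-entity query should run once instead of once
per fund, at the cost of holding the whole Many side in memory. The
OF015 field would still need multiValued="true" in schema.xml.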



RE: Fast DIH with 1:M multValue entities

2011-04-14 Thread Tim Gilbert
How did I miss that?  Thanks, I will try that, as it seems to be the
in-memory lookup solution I needed.

Thanks Erick,

Tim
