[ https://issues.apache.org/jira/browse/SOLR-2382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13126751#comment-13126751 ]

James Dyer commented on SOLR-2382:
----------------------------------

Pulkit,

I take it you have a root entity (possibly from a SQL database) and a child 
entity from a flat text file, and you need to join the two data sources.  There 
are two ways to do this with this caching.  In either case you'll need both 
patches ("entities" & "dihwriter").  If an in-memory cache is not adequate, you 
will also need BerkleyBackedCache from SOLR-2613 (required by the second, 
two-handler approach).

The simple way uses a temporary (or ephemeral) cache.  To do this, create a 
single DIH request handler and add your cached child entity to data-config.xml. 
DIH will load and cache "child_entity" every time you do an import.  When the 
import finishes, the cache is deleted.  This lets you do joins on flat files, 
which would not be possible without caching.  The downside is that if the flat 
file changes infrequently, or if you are doing delta updates on your index, 
reloading and caching a large flat file on every import is inefficient.  Here's 
a sample data-config.xml:

{noformat}
<dataConfig>
 <dataSource name="SQL" ... />
 <dataSource name="URL" baseUrl="path_to_flat_file" type="URLDataSource" />
 <document name="my_doc">
   <entity name="root_entity" rootEntity="true" dataSource="SQL" pk="root_id" 
query="select root_id, more_data, etc..." >
    <entity
     name="child_entity"
     processor="LineEntityProcessor"
     dataSource="URL"
     transformer="BreakIntoFieldsTransformerIWrote"
     cacheImpl="BerkleyBackedCache"
     cacheBaseDir="temp_location_for_cache"
     cachePk="root_id"
     cacheLookup="root_entity.root_id"
     fieldNames="root_id, flatfiledata1, etc"
     fieldTypes="BIGDECIMAL, STRING, etc"
    />
   </entity>
 </document>
</dataConfig>
{noformat}
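
In the config above, "BreakIntoFieldsTransformerIWrote" stands in for whatever 
custom transformer you write to split each raw line into fields.  Purely for 
illustration, here is a minimal sketch of such a transformer, assuming a 
pipe-delimited flat file (LineEntityProcessor emits each line in a "rawLine" 
field; the delimiter and field names here are placeholders):

{noformat}
import java.util.Map;

import org.apache.solr.handler.dataimport.Context;
import org.apache.solr.handler.dataimport.Transformer;

// Hypothetical sketch only; reference it by its fully-qualified class name
// in the transformer="..." attribute.
public class BreakIntoFieldsTransformerIWrote extends Transformer {
  @Override
  public Object transformRow(Map<String, Object> row, Context context) {
    // LineEntityProcessor puts each raw line of the flat file in "rawLine".
    String rawLine = (String) row.get("rawLine");
    if (rawLine == null) {
      return row;
    }
    // Assumption: the file is pipe-delimited (root_id|flatfiledata1|...).
    String[] cols = rawLine.split("\\|", -1);
    row.put("root_id", cols[0]);
    if (cols.length > 1) {
      row.put("flatfiledata1", cols[1]);
    }
    return row;
  }
}
{noformat}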

The second approach is to create a second DIH request handler in your 
solrconfig.xml for the child entity.  This request handler has its own 
data-config file (named dih-flatfile.xml here).  You would run this second 
request handler to build a persistent cache for the flat file before running 
the main DIH request handler.  Here's an example of this second request handler 
configured in solrconfig.xml:

{noformat}
<requestHandler name="/dih-flatfile" 
class="org.apache.solr.handler.dataimport.DataImportHandler">
 <lst name="defaults">
  <str name="config">dih-flatfile.xml</str>
  <str name="cacheDeletePriorData">true</str>
  <str name="fieldNames">root_id,  flatfiledata1, etc</str>
  <str name="fieldTypes">BIGDECIMAL, STRING, etc</str>
  <str name="writerImpl">org.apache.solr.handler.dataimport.DIHCacheWriter</str>
  <str name="cacheImpl">BerkleyBackedCache</str>
  <str name="cacheBaseDir">location_of_persistent_caches</str>
  <str name="cacheName">flatfile_cache_name</str>
  <str name="cachePk">root_id</str>
 </lst>
</requestHandler>
{noformat}
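
With this in place, you would build (or rebuild) the persistent cache by 
issuing a full-import against this handler, e.g. 
http://localhost:8983/solr/dih-flatfile?command=full-import (substitute your 
own host, port, and core path).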

And here is what "dih-flatfile.xml" would look like:

{noformat}
<dataConfig>
 <dataSource name="URL" baseUrl="path_to_flat_file" type="URLDataSource" />
 <document name="my_doc_child">
   <entity name="child_entity" processor="LineEntityProcessor" dataSource="URL" 
transformer="BreakIntoFieldsTransformerIWrote" />
 </document>
</dataConfig>
{noformat}

Your main data-config.xml would look like this:

{noformat}
<dataConfig>
 <dataSource name="SQL" ... />
 <dataSource name="URL" baseUrl="path_to_flat_file" type="URLDataSource" />
 <document name="my_doc">
   <entity name="root_entity" rootEntity="true" dataSource="SQL" pk="root_id" 
query="select root_id, more_data, etc..." >
    <entity
     name="child_entity"
     processor="org.apache.solr.handler.dataimport.DIHCacheProcessor"
     cacheImpl="BerkleyBackedCache"
     cachePk="root_id"
     cacheLookup="root_entity.root_id"
     cacheBaseDir="location_of_persistent_caches"
     cacheName="flatfile_cache_name"
    />
   </entity>
 </document>
</dataConfig>
{noformat}
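
As in the first approach, cacheLookup="root_entity.root_id" tells DIH which 
parent field's value to look up against the cache's cachePk, so each 
root_entity row gets joined to the cached flat-file rows that share its 
root_id.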

This second approach offers more flexibility (you can load the persistent cache 
off-hours, re-use it, do delta updates on it, etc.) but it is significantly 
more complex.  The hardest part is creating a scheduler that runs the child 
entity's DIH request handler, waits until it finishes, and then runs the main 
DIH request handler (sketched below).  This is moot, however, if you only need 
to load the child cache once, or only once in a great while.
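
For what it's worth, here is a minimal sketch of that kind of scheduler, 
assuming a default Solr URL and the handler names from the examples above (the 
main handler name "/dataimport" is also an assumption; DIH's status response 
reports "busy" while an import is running):

{noformat}
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

// Hypothetical two-step scheduler: build the child-entity cache, wait for
// DIH to go idle, then kick off the main import.
public class DihScheduler {
  private static final String BASE = "http://localhost:8983/solr";

  public static void main(String[] args) throws Exception {
    // Build the persistent flat-file cache first.
    get(BASE + "/dih-flatfile?command=full-import");
    // Poll the child handler until DIH is no longer busy.
    while (get(BASE + "/dih-flatfile?command=status").contains("busy")) {
      Thread.sleep(10000);
    }
    // Now run the main import, which joins against the cache.
    get(BASE + "/dataimport?command=full-import");
  }

  private static String get(String url) throws Exception {
    HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
    BufferedReader in =
        new BufferedReader(new InputStreamReader(conn.getInputStream()));
    try {
      StringBuilder sb = new StringBuilder();
      for (String line; (line = in.readLine()) != null;) {
        sb.append(line);
      }
      return sb.toString();
    } finally {
      in.close();
    }
  }
}
{noformat}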

Should this all get committed, I will eventually create something on the wiki.  
In the meantime, I hope you find all of this helpful.  For more examples, see 
the xml files these patches add to the 
"solr/contrib/dataimporthandler/src/test-files/dih/solr/conf" folder, and also 
the new unit tests that use them.
                
> DIH Cache Improvements
> ----------------------
>
>                 Key: SOLR-2382
>                 URL: https://issues.apache.org/jira/browse/SOLR-2382
>             Project: Solr
>          Issue Type: New Feature
>          Components: contrib - DataImportHandler
>            Reporter: James Dyer
>            Priority: Minor
>         Attachments: SOLR-2382-dihwriter.patch, SOLR-2382-dihwriter.patch, 
> SOLR-2382-entities.patch, SOLR-2382-entities.patch, SOLR-2382-entities.patch, 
> SOLR-2382-entities.patch, SOLR-2382-properties.patch, 
> SOLR-2382-properties.patch, SOLR-2382-solrwriter-verbose-fix.patch, 
> SOLR-2382-solrwriter.patch, SOLR-2382-solrwriter.patch, 
> SOLR-2382-solrwriter.patch, SOLR-2382.patch, SOLR-2382.patch, 
> SOLR-2382.patch, SOLR-2382.patch, SOLR-2382.patch, SOLR-2382.patch, 
> SOLR-2382.patch, SOLR-2382.patch
>
>
> Functionality:
>  1. Provide a pluggable caching framework for DIH so that users can choose a 
> cache implementation that best suits their data and application.
>  
>  2. Provide a means to temporarily cache a child Entity's data without 
> needing to create a special cached implementation of the Entity Processor 
> (such as CachedSqlEntityProcessor).
>  
>  3. Provide a means to write the final (root entity) DIH output to a cache 
> rather than to Solr.  Then provide a way for a subsequent DIH call to use the 
> cache as an Entity input.  Also provide the ability to do delta updates on 
> such persistent caches.
>  
>  4. Provide the ability to partition data across multiple caches that can 
> then be fed back into DIH and indexed either to varying Solr Shards, or to 
> the same Core in parallel.
>
> Use Cases:
>  1. We needed a flexible & scalable way to temporarily cache child-entity 
> data prior to joining to parent entities.
>   - Using SqlEntityProcessor with Child Entities can cause an "n+1 select" 
> problem.
>   - CachedSqlEntityProcessor only supports an in-memory HashMap as a caching 
> mechanism and does not scale.
>   - There is no way to cache non-SQL inputs (ex: flat files, xml, etc).
>  
>  2. We needed the ability to gather data from long-running entities by a 
> process that runs separate from our main indexing process.
>   
>  3. We wanted the ability to do a delta import of only the entities that 
> changed.
>   - Lucene/Solr requires entire documents to be re-indexed, even if only a 
> few fields changed.
>   - Our data comes from 50+ complex sql queries and/or flat files.
>   - We do not want to incur overhead re-gathering all of this data if only 1 
> entity's data changed.
>   - Persistent DIH caches solve this problem.
>   
>  4. We want the ability to index several documents in parallel (using 1.4.1, 
> which did not have the "threads" parameter).
>  
>  5. In the future, we may need to use Shards, creating a need to easily 
> partition our source data into Shards.
>
> Implementation Details:
>  1. De-couple EntityProcessorBase from caching.  
>   - Created a new interface, DIHCache & two implementations:  
>     - SortedMapBackedCache - An in-memory cache, used as default with 
> CachedSqlEntityProcessor (now deprecated).
>     - BerkleyBackedCache - A disk-backed cache, dependent on bdb-je, tested 
> with je-4.1.6.jar
>        - NOTE: the existing Lucene Contrib "db" project uses je-3.3.93.jar.  
> I believe this may be incompatible due to its use of generics.
>        - NOTE: I did not modify the ant script to automatically get this jar, 
> so to use or evaluate this patch, download bdb-je from 
> http://www.oracle.com/technetwork/database/berkeleydb/downloads/index.html 
>  
>  2. Allow Entity Processors to take a "cacheImpl" parameter to cause the 
> entity data to be cached (see EntityProcessorBase & DIHCacheProperties).
>  
>  3. Partially De-couple SolrWriter from DocBuilder
>   - Created a new interface DIHWriter, & two implementations:
>    - SolrWriter (refactored)
>    - DIHCacheWriter (allows DIH to write ultimately to a Cache).
>    
>  4. Create a new Entity Processor, DIHCacheProcessor, which reads a 
> persistent Cache as DIH Entity Input.
>  
>  5. Support a "partition" parameter with both DIHCacheWriter and 
> DIHCacheProcessor to allow for easy partitioning of source entity data.
>  
>  6. Change the semantics of entity.destroy()
>   - Previously, it was being called on each iteration of 
> DocBuilder.buildDocument().
>   - Now it does one-time cleanup tasks (like closing or deleting a 
> disk-backed cache) once the entity processor has completed.
>   - The only out-of-the-box entity processor that previously implemented 
> destroy() was LineEntityProcessor, so this is not a very invasive change.
>
> General Notes:
> We are nearly finished converting our search functionality from a legacy 
> search engine to Solr.  However, I found that DIH did not support caching to 
> the level of our prior product's data import utility.  In order to get our 
> data into Solr, I created these caching enhancements.  Because I believe this 
> has broad application, and because we would like this feature to be supported 
> by the community, I have front-ported it, with enhancements, to trunk.  I 
> have also added unit tests and verified that all existing test cases pass.  I 
> believe this patch maintains backwards compatibility and would be a welcome 
> addition to a future version of Solr.
