[ 
https://issues.apache.org/jira/browse/SOLR-2382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13078384#comment-13078384
 ] 

James Dyer commented on SOLR-2382:
----------------------------------

Lance,

I do not have any scientific benchmarks, but I can tell you how we use 
BerkleyBackedCache and how it performs for us.  

In our main app, we fully re-index all our data every night (13+ million 
records).  It's basically a 2-step process.  First we run ~50 DIH handlers, each 
of which builds a cache from databases, flat files, etc.  The caches partition 
the data 8-ways.  Then a "master" DIH script does all the joining, runs 
transformers on the data, etc.  We have all 8 invocations of this same "master" 
DIH config running simultaneously indexing to the same Solr core, so each DIH 
invocation is processing 1.6 million records directly out of caches, doing all 
the 1-many joins, running transformer code, indexing, etc.  This takes 1-1/2 
hours, so each invocation adds maybe 250-300 Solr records per second.  We're 
using fast local disks configured with RAID-0 on an 8-core, 64 GB server.  This 
app is running Solr 1.4, using the original version of this patch, prior to my 
front-porting it to trunk.  No doubt some of the time is spent contending for 
the Lucene index, as all 8 DIH invocations are indexing at the same time.
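
As a rough check on those numbers, the arithmetic can be sketched as follows. 
The record count, partition count, and run time are taken from the figures 
above (13 million is used as a floor for "13+ million"); the variable names are 
illustrative only.

```python
# Back-of-the-envelope throughput check using the figures quoted above.
# Variable names are illustrative; 13 million is a floor for "13+ million".
total_records = 13_000_000
partitions = 8                    # parallel "master" DIH invocations
elapsed_seconds = 1.5 * 60 * 60   # "1-1/2 hours"

records_per_invocation = total_records / partitions
rate_per_invocation = records_per_invocation / elapsed_seconds
overall_rate = total_records / elapsed_seconds

print(f"records per invocation: {records_per_invocation:,.0f}")
print(f"records/sec per invocation: {rate_per_invocation:.0f}")
print(f"records/sec overall: {overall_rate:.0f}")
```

On these assumptions the 250-300 figure is a per-invocation rate; across all 8 
invocations the core absorbs on the order of 2,400 records per second.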

We also have another app that uses Solr 4.0 with the patch I originally posted 
back in February, sharing hardware with the main app.  This one has about 10 
entities and uses a simple 1-dih-handler configuration.  The parent entity 
drives directly off the database while all the child entities use 
SqlEntityProcessor with BerkleyBackedCache.  There are only 25,000 fairly 
narrow records and we can re-index everything in about 10 minutes.  This 
includes database time, indexing, running transformers, etc., in addition to the 
cache overhead.
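
The benefit of caching child entities can be pictured with a small sketch. 
This is plain Python, not DIH or BerkleyBackedCache code; the function names 
and sample rows are hypothetical, and an in-memory dict stands in for the 
disk-backed cache. The idea is that each child source is read once into a 
cache keyed by the join column, and every parent row is then joined by lookup 
rather than by issuing a separate query per parent.

```python
from collections import defaultdict


def build_cache(child_rows, key):
    """Group child rows by join key in a single pass over the child data."""
    cache = defaultdict(list)
    for row in child_rows:
        cache[row[key]].append(row)
    return cache


def join_children(parent_rows, cache, parent_key, target_field, child_field):
    """Attach cached child values to each parent row without extra queries."""
    for parent in parent_rows:
        doc = dict(parent)
        # .get() avoids creating empty entries for parents with no children.
        children = cache.get(parent[parent_key], [])
        doc[target_field] = [child[child_field] for child in children]
        yield doc


# Hypothetical sample data standing in for database rows.
parents = [{"id": 1, "title": "widget"}, {"id": 2, "title": "gadget"}]
children = [
    {"parent_id": 1, "tag": "red"},
    {"parent_id": 1, "tag": "small"},
    {"parent_id": 2, "tag": "blue"},
]

cache = build_cache(children, "parent_id")
docs = list(join_children(parents, cache, "id", "tags", "tag"))
for doc in docs:
    print(doc)
```

With this shape the child data costs one read regardless of how many parent 
rows there are, which is the trade the cache overhead mentioned above buys.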

The inspiration for this was that we were converting off of Endeca and we were 
relying on Endeca's "Forge" program to join & denormalize all of the data.  
Forge has a very fast disk-backed caching mechanism and I needed to match that 
performance with DIH.  I'm pretty sure what we have here surpasses Forge.  And 
we also get a big bonus in that it lets you persist caches and use them as a 
subsequent input.  With Forge, we had to output the data into huge delimited 
text files and then use that as input for the next step...

Hope this information gives you some idea of whether this will work for your use case.

> DIH Cache Improvements
> ----------------------
>
>                 Key: SOLR-2382
>                 URL: https://issues.apache.org/jira/browse/SOLR-2382
>             Project: Solr
>          Issue Type: New Feature
>          Components: contrib - DataImportHandler
>            Reporter: James Dyer
>            Priority: Minor
>         Attachments: SOLR-2382-dihwriter.patch, SOLR-2382-entities.patch, 
> SOLR-2382-entities.patch, SOLR-2382-entities.patch, 
> SOLR-2382-properties.patch, SOLR-2382-properties.patch, 
> SOLR-2382-solrwriter-verbose-fix.patch, SOLR-2382-solrwriter.patch, 
> SOLR-2382-solrwriter.patch, SOLR-2382-solrwriter.patch, SOLR-2382.patch, 
> SOLR-2382.patch, SOLR-2382.patch, SOLR-2382.patch, SOLR-2382.patch, 
> SOLR-2382.patch, SOLR-2382.patch, SOLR-2382.patch
>
>
> Functionality:
>  1. Provide a pluggable caching framework for DIH so that users can choose a 
> cache implementation that best suits their data and application.
>  
>  2. Provide a means to temporarily cache a child Entity's data without 
> needing to create a special cached implementation of the Entity Processor 
> (such as CachedSqlEntityProcessor).
>  
>  3. Provide a means to write the final (root entity) DIH output to a cache 
> rather than to Solr.  Then provide a way for a subsequent DIH call to use the 
> cache as an Entity input.  Also provide the ability to do delta updates on 
> such persistent caches.
>  
>  4. Provide the ability to partition data across multiple caches that can 
> then be fed back into DIH and indexed either to varying Solr Shards, or to 
> the same Core in parallel.
>
> Use Cases:
>  1. We needed a flexible & scalable way to temporarily cache child-entity 
> data prior to joining to parent entities.
>   - Using SqlEntityProcessor with Child Entities can cause an "n+1 select" 
> problem.
>   - CachedSqlEntityProcessor only supports an in-memory HashMap as a Caching 
> mechanism and does not scale.
>   - There is no way to cache non-SQL inputs (e.g., flat files, XML, etc.).
>  
>  2. We needed the ability to gather data from long-running entities by a 
> process that runs separate from our main indexing process.
>   
>  3. We wanted the ability to do a delta import of only the entities that 
> changed.
>   - Lucene/Solr requires entire documents to be re-indexed, even if only a 
> few fields changed.
>   - Our data comes from 50+ complex sql queries and/or flat files.
>   - We do not want to incur overhead re-gathering all of this data if only 1 
> entity's data changed.
>   - Persistent DIH caches solve this problem.
>   
>  4. We want the ability to index several documents in parallel (using 1.4.1, 
> which did not have the "threads" parameter).
>  
>  5. In the future, we may need to use Shards, creating a need to easily 
> partition our source data into Shards.
>
> Implementation Details:
>  1. De-couple EntityProcessorBase from caching.  
>   - Created a new interface, DIHCache & two implementations:  
>     - SortedMapBackedCache - An in-memory cache, used as default with 
> CachedSqlEntityProcessor (now deprecated).
>     - BerkleyBackedCache - A disk-backed cache, dependent on bdb-je, tested 
> with je-4.1.6.jar
>        - NOTE: the existing Lucene Contrib "db" project uses je-3.3.93.jar.  
> I believe this may be incompatible due to Generic Usage.
>        - NOTE: I did not modify the ant script to automatically get this jar, 
> so to use or evaluate this patch, download bdb-je from 
> http://www.oracle.com/technetwork/database/berkeleydb/downloads/index.html 
>  
>  2. Allow Entity Processors to take a "cacheImpl" parameter to cause the 
> entity data to be cached (see EntityProcessorBase & DIHCacheProperties).
>  
>  3. Partially De-couple SolrWriter from DocBuilder
>   - Created a new interface DIHWriter, & two implementations:
>    - SolrWriter (refactored)
>    - DIHCacheWriter (allows DIH to write ultimately to a Cache).
>    
>  4. Create a new Entity Processor, DIHCacheProcessor, which reads a 
> persistent Cache as DIH Entity Input.
>  
>  5. Support a "partition" parameter with both DIHCacheWriter and 
> DIHCacheProcessor to allow for easy partitioning of source entity data.
>  
>  6. Change the semantics of entity.destroy()
>   - Previously, it was being called on each iteration of 
> DocBuilder.buildDocument().
>   - Now it does one-time cleanup tasks (like closing or deleting a 
> disk-backed cache) once the entity processor is completed.
>   - The only out-of-the-box entity processor that previously implemented 
> destroy() was LineEntityProcessor, so this is not a very invasive change.
>
> General Notes:
> We are near completion in converting our search functionality from a legacy 
> search engine to Solr.  However, I found that DIH did not support caching to 
> the level of our prior product's data import utility.  In order to get our 
> data into Solr, I created these caching enhancements.  Because I believe this 
> has broad application, and because we would like this feature to be supported 
> by the Community, I have front-ported this, enhanced, to Trunk.  I have also 
> added unit tests and verified that all existing test cases pass.  I believe 
> this patch maintains backwards-compatibility and would be a welcome addition 
> to a future version of Solr.
