[ https://issues.apache.org/jira/browse/SOLR-2382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
James Dyer updated SOLR-2382:
-----------------------------

    Attachment: SOLR-2382_3x.patch

Patch for 3.x includes everything already committed to Trunk as well as various bug fixes (also in Trunk already). (Patch is here for reference only; changes were actually moved using svn merge.)

I will commit to 3.x shortly.

> DIH Cache Improvements
> ----------------------
>
>                 Key: SOLR-2382
>                 URL: https://issues.apache.org/jira/browse/SOLR-2382
>             Project: Solr
>          Issue Type: New Feature
>          Components: contrib - DataImportHandler
>            Reporter: James Dyer
>            Assignee: James Dyer
>            Priority: Minor
>             Fix For: 3.6, 4.0
>
>         Attachments: SOLR-2382-dihwriter.patch, SOLR-2382-dihwriter.patch,
> SOLR-2382-dihwriter.patch, SOLR-2382-dihwriter.patch,
> SOLR-2382-dihwriter.patch, SOLR-2382-dihwriter_standalone.patch,
> SOLR-2382-entities.patch, SOLR-2382-entities.patch, SOLR-2382-entities.patch,
> SOLR-2382-entities.patch, SOLR-2382-entities.patch, SOLR-2382-entities.patch,
> SOLR-2382-entities.patch, SOLR-2382-entities.patch,
> SOLR-2382-properties.patch, SOLR-2382-properties.patch,
> SOLR-2382-solrwriter-verbose-fix.patch, SOLR-2382-solrwriter.patch,
> SOLR-2382-solrwriter.patch, SOLR-2382-solrwriter.patch, SOLR-2382.patch,
> SOLR-2382.patch, SOLR-2382.patch, SOLR-2382.patch, SOLR-2382.patch,
> SOLR-2382.patch, SOLR-2382.patch, SOLR-2382.patch, SOLR-2382_3x.patch,
> TestCachedSqlEntityProcessor.java-break-where-clause.patch,
> TestCachedSqlEntityProcessor.java-fix-where-clause-by-adding-cachePk-and-lookup.patch,
> TestCachedSqlEntityProcessor.java-wrong-pk-detected-due-to-lack-of-where-support.patch,
> TestThreaded.java.patch
>
>
> Functionality:
>  1. Provide a pluggable caching framework for DIH so that users can choose a
> cache implementation that best suits their data and application.
>
>  2. Provide a means to temporarily cache a child Entity's data without
> needing to create a special cached implementation of the Entity Processor
> (such as CachedSqlEntityProcessor).
>
>  3. Provide a means to write the final (root entity) DIH output to a cache
> rather than to Solr. Then provide a way for a subsequent DIH call to use the
> cache as an Entity input. Also provide the ability to do delta updates on
> such persistent caches.
>
>  4. Provide the ability to partition data across multiple caches that can
> then be fed back into DIH and indexed either to varying Solr Shards, or to
> the same Core in parallel.
>
> Use Cases:
>  1. We needed a flexible & scalable way to temporarily cache child-entity
> data prior to joining to parent entities.
>   - Using SqlEntityProcessor with Child Entities can cause an "n+1 select"
> problem.
>   - CachedSqlEntityProcessor only supports an in-memory HashMap as a Caching
> mechanism and does not scale.
>   - There is no way to cache non-SQL inputs (e.g. flat files, xml, etc).
>
>  2. We needed the ability to gather data from long-running entities by a
> process that runs separately from our main indexing process.
>
>  3. We wanted the ability to do a delta import of only the entities that
> changed.
>   - Lucene/Solr requires entire documents to be re-indexed, even if only a
> few fields changed.
>   - Our data comes from 50+ complex sql queries and/or flat files.
>   - We do not want to incur overhead re-gathering all of this data if only 1
> entity's data changed.
>   - Persistent DIH caches solve this problem.
>
>  4. We want the ability to index several documents in parallel (using 1.4.1,
> which did not have the "threads" parameter).
>
>  5. In the future, we may need to use Shards, creating a need to easily
> partition our source data into Shards.
>
> Implementation Details:
>  1. De-couple EntityProcessorBase from caching.
>   - Created a new interface, DIHCache & two implementations:
>     - SortedMapBackedCache - An in-memory cache, used as default with
> CachedSqlEntityProcessor (now deprecated).
>     - BerkleyBackedCache - A disk-backed cache, dependent on bdb-je, tested
> with je-4.1.6.jar
>        - NOTE: the existing Lucene Contrib "db" project uses je-3.3.93.jar.
> I believe this may be incompatible due to Generic Usage.
>        - NOTE: I did not modify the ant script to automatically get this jar,
> so to use or evaluate this patch, download bdb-je from
> http://www.oracle.com/technetwork/database/berkeleydb/downloads/index.html
>
>  2. Allow Entity Processors to take a "cacheImpl" parameter to cause the
> entity data to be cached (see EntityProcessorBase & DIHCacheProperties).
>
>  3. Partially De-couple SolrWriter from DocBuilder
>   - Created a new interface DIHWriter, & two implementations:
>    - SolrWriter (refactored)
>    - DIHCacheWriter (allows DIH to write ultimately to a Cache).
>
>  4. Create a new Entity Processor, DIHCacheProcessor, which reads a
> persistent Cache as DIH Entity Input.
>
>  5. Support a "partition" parameter with both DIHCacheWriter and
> DIHCacheProcessor to allow for easy partitioning of source entity data.
>
>  6. Change the semantics of entity.destroy()
>   - Previously, it was being called on each iteration of
> DocBuilder.buildDocument().
>   - Now it does one-time cleanup tasks (like closing or deleting a
> disk-backed cache) once the entity processor is completed.
>   - The only out-of-the-box entity processor that previously implemented
> destroy() was LineEntityProcessor, so this is not a very invasive change.
>
> General Notes:
> We are near completion in converting our search functionality from a legacy
> search engine to Solr. However, I found that DIH did not support caching to
> the level of our prior product's data import utility. In order to get our
> data into Solr, I created these caching enhancements. Because I believe this
> has broad application, and because we would like this feature to be supported
> by the Community, I have front-ported this, enhanced, to Trunk. I have also
> added unit tests and verified that all existing test cases pass. I believe
> this patch maintains backwards-compatibility and would be a welcome addition
> to a future version of Solr.
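
For readers unfamiliar with the "n+1 select" problem named in the use cases, the following is a minimal Java sketch of the general idea behind caching child-entity rows once and joining them to parent rows by key. It is an illustration under stated assumptions only: RowCache, SortedMapRowCache, joinChildren, and the sample field names are hypothetical and are not the DIHCache interface or any other class from the patch.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class CacheSketch {

    /** Hypothetical, simplified cache contract; NOT the DIHCache interface from the patch. */
    interface RowCache<K> {
        void add(K key, Map<String, Object> row);
        List<Map<String, Object>> lookup(Object key);
    }

    /** In-memory implementation backed by a sorted map (all names are assumptions). */
    static class SortedMapRowCache<K extends Comparable<K>> implements RowCache<K> {
        private final TreeMap<K, List<Map<String, Object>>> rows = new TreeMap<>();

        @Override
        public void add(K key, Map<String, Object> row) {
            // Several child rows may share one key, so each key maps to a list.
            rows.computeIfAbsent(key, k -> new ArrayList<>()).add(row);
        }

        @Override
        public List<Map<String, Object>> lookup(Object key) {
            if (key == null) {
                return Collections.emptyList();
            }
            List<Map<String, Object>> hit = rows.get(key);
            if (hit == null) {
                return Collections.emptyList();
            }
            return hit;
        }
    }

    /** Joins cached child rows to each parent by key, instead of one query per parent. */
    static List<Map<String, Object>> joinChildren(List<Map<String, Object>> parents,
                                                  String keyField,
                                                  String childField,
                                                  RowCache<?> children) {
        List<Map<String, Object>> docs = new ArrayList<>();
        for (Map<String, Object> parent : parents) {
            Map<String, Object> doc = new LinkedHashMap<>(parent);
            doc.put(childField, children.lookup(parent.get(keyField)));
            docs.add(doc);
        }
        return docs;
    }

    /** Small helper that builds a two-column row. */
    static Map<String, Object> row(String k1, Object v1, String k2, Object v2) {
        Map<String, Object> r = new LinkedHashMap<>();
        r.put(k1, v1);
        r.put(k2, v2);
        return r;
    }

    /** Builds made-up sample data: the child rows are loaded once, then joined. */
    public static List<Map<String, Object>> demo() {
        SortedMapRowCache<Integer> features = new SortedMapRowCache<>();
        features.add(1, row("item_id", 1, "feature", "red"));
        features.add(1, row("item_id", 1, "feature", "large"));
        features.add(2, row("item_id", 2, "feature", "blue"));

        List<Map<String, Object>> parents = new ArrayList<>();
        parents.add(row("id", 1, "name", "widget"));
        parents.add(row("id", 2, "name", "gadget"));
        parents.add(row("id", 3, "name", "gizmo"));

        return joinChildren(parents, "id", "features", features);
    }

    public static void main(String[] args) {
        for (Map<String, Object> doc : demo()) {
            System.out.println(doc);
        }
    }
}
```

Keeping the cache behind a small interface is what would let a disk-backed implementation be swapped in for the in-memory one without changing the join code, which is the kind of pluggability the issue describes.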