[ https://issues.apache.org/jira/browse/SOLR-2382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
James Dyer updated SOLR-2382:
-----------------------------

    Attachment: SOLR-2382.patch

Fix for two bugs in BerkleyBackedCache:
 - If the passed-in fieldNames & fieldTypes had leading or trailing spaces, opening the cache would fail.
 - If the cache was set up for Delta updates, then closed & re-opened, adding documents would cause an NPE.

> DIH Cache Improvements
> ----------------------
>
>                 Key: SOLR-2382
>                 URL: https://issues.apache.org/jira/browse/SOLR-2382
>             Project: Solr
>          Issue Type: New Feature
>          Components: contrib - DataImportHandler
>            Reporter: James Dyer
>            Priority: Minor
>         Attachments: SOLR-2382.patch, SOLR-2382.patch, SOLR-2382.patch,
> SOLR-2382.patch
>
>
> Functionality:
>  1. Provide a pluggable caching framework for DIH so that users can choose a
> cache implementation that best suits their data and application.
>
>  2. Provide a means to temporarily cache a child Entity's data without
> needing to create a special cached implementation of the Entity Processor
> (such as CachedSqlEntityProcessor).
>
>  3. Provide a means to write the final (root entity) DIH output to a cache
> rather than to Solr. Then provide a way for a subsequent DIH call to use the
> cache as an Entity input. Also provide the ability to do delta updates on
> such persistent caches.
>
>  4. Provide the ability to partition data across multiple caches that can
> then be fed back into DIH and indexed either to varying Solr Shards, or to
> the same Core in parallel.
>
> Use Cases:
>  1. We needed a flexible & scalable way to temporarily cache child-entity
> data prior to joining to parent entities.
>   - Using SqlEntityProcessor with Child Entities can cause an "n+1 select"
> problem.
>   - CachedSqlEntityProcessor only supports an in-memory HashMap as a Caching
> mechanism and does not scale.
>   - There is no way to cache non-SQL inputs (ex: flat files, XML, etc).
>
>  2. We needed the ability to gather data from long-running entities by a
> process that runs separately from our main indexing process.
>
>  3. We wanted the ability to do a delta import of only the entities that
> changed.
>   - Lucene/Solr requires entire documents to be re-indexed, even if only a
> few fields changed.
>   - Our data comes from 50+ complex SQL queries and/or flat files.
>   - We do not want to incur the overhead of re-gathering all of this data if
> only 1 entity's data changed.
>   - Persistent DIH caches solve this problem.
>
>  4. We want the ability to index several documents in parallel (using 1.4.1,
> which did not have the "threads" parameter).
>
>  5. In the future, we may need to use Shards, creating a need to easily
> partition our source data into Shards.
>
> Implementation Details:
>  1. De-couple EntityProcessorBase from caching.
>   - Created a new interface, DIHCache & two implementations:
>     - SortedMapBackedCache - An in-memory cache, used as default with
> CachedSqlEntityProcessor (now deprecated).
>     - BerkleyBackedCache - A disk-backed cache, dependent on bdb-je, tested
> with je-4.1.6.jar
>        - NOTE: the existing Lucene Contrib "db" project uses je-3.3.93.jar.
> I believe this may be incompatible due to Generic Usage.
>        - NOTE: I did not modify the ant script to automatically get this jar,
> so to use or evaluate this patch, download bdb-je from
> http://www.oracle.com/technetwork/database/berkeleydb/downloads/index.html
>
>  2. Allow Entity Processors to take a "cacheImpl" parameter to cause the
> entity data to be cached (see EntityProcessorBase & DIHCacheProperties).
>
>  3. Partially De-couple SolrWriter from DocBuilder
>   - Created a new interface DIHWriter, & two implementations:
>    - SolrWriter (refactored)
>    - DIHCacheWriter (allows DIH to write ultimately to a Cache).
>
>  4. Create a new Entity Processor, DIHCacheProcessor, which reads a
> persistent Cache as DIH Entity Input.
>
>  5. Support a "partition" parameter with both DIHCacheWriter and
> DIHCacheProcessor to allow for easy partitioning of source entity data.
>
>  6. Change the semantics of entity.destroy()
>   - Previously, it was being called on each iteration of
> DocBuilder.buildDocument().
>   - Now it does one-time cleanup tasks (like closing or deleting a
> disk-backed cache) once the entity processor is completed.
>   - The only out-of-the-box entity processor that previously implemented
> destroy() was LineEntityProcessor, so this is not a very invasive change.
>
> General Notes:
> We are near completion in converting our search functionality from a legacy
> search engine to Solr. However, I found that DIH did not support caching to
> the level of our prior product's data import utility. In order to get our
> data into Solr, I created these caching enhancements. Because I believe this
> has broad application, and because we would like this feature to be supported
> by the Community, I have front-ported this, enhanced, to Trunk. I have also
> added unit tests and verified that all existing test cases pass. I believe
> this patch maintains backwards-compatibility and would be a welcome addition
> to a future version of Solr.
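
For readers unfamiliar with the "n+1 select" problem mentioned in the use cases, the idea of a pluggable cache with a sorted-map-backed implementation can be sketched roughly as follows. This is only an illustration under stated assumptions: RowCache, SortedMapRowCache and CacheJoinDemo are hypothetical names invented here, and they are not the DIHCache interface or the SortedMapBackedCache class from the attached patch. The point is simply that child rows are loaded once, keyed on the join column, and each parent then looks up its children in the cache instead of issuing its own query.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical interface for illustration only; NOT the DIHCache API from the patch.
interface RowCache {
    void add(String key, Map<String, Object> row);
    List<Map<String, Object>> lookup(String key);
}

// In-memory implementation backed by a sorted map, keyed on the join column.
// A disk-backed implementation could sit behind the same interface.
class SortedMapRowCache implements RowCache {
    private final SortedMap<String, List<Map<String, Object>>> rows = new TreeMap<>();

    @Override
    public void add(String key, Map<String, Object> row) {
        rows.computeIfAbsent(key, k -> new ArrayList<>()).add(row);
    }

    @Override
    public List<Map<String, Object>> lookup(String key) {
        // Unknown keys yield an empty list rather than null.
        return rows.getOrDefault(key, Collections.emptyList());
    }
}

class CacheJoinDemo {
    // Joins each parent id against the cache and counts the matching child rows,
    // so no per-parent child query is needed.
    static int countChildren(RowCache cache, List<String> parentIds) {
        int total = 0;
        for (String id : parentIds) {
            total += cache.lookup(id).size();
        }
        return total;
    }

    public static void main(String[] args) {
        RowCache cache = new SortedMapRowCache();
        // Child rows are gathered once, up front (sample data, made up for this sketch).
        cache.add("1", Map.of("parent_id", "1", "tag", "red"));
        cache.add("1", Map.of("parent_id", "1", "tag", "blue"));
        cache.add("2", Map.of("parent_id", "2", "tag", "green"));
        System.out.println(countChildren(cache, List.of("1", "2", "3")));
    }
}
```

Because callers only see the interface, swapping the in-memory map for a persistent store would not change the join code, which is the kind of flexibility the "pluggable caching framework" item describes.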