[ https://issues.apache.org/jira/browse/SOLR-2382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13154257#comment-13154257 ]
James Dyer commented on SOLR-2382:
----------------------------------

Noble,

I can't speak for every use case, but these were necessary for one of our applications. The whole idea is that it lets you load your caches in advance of indexing (DIHCacheWriter), then read them back later when you're ready to index (DIHCacheProcessor).

- This is especially helpful if you have a lot of different data sources that each contribute a few data elements to each Solr record (we have at least 40 data sources).
- If you have slow data sources, you can run multiple DIH scripts at the same time and build your caches simultaneously (my app builds 12 DIH caches at a time, as we have some slow legacy databases to contend with).
- If you have some data sources that change infrequently and others that change all the time, you can build caches for the infrequently-changing data sources, making it unnecessary to re-acquire this data every time you do a delta update. This is actually a very common case. Imagine having Solr loaded with product metadata: most of the data would seldom change, but things like prices, availability flags, stock numbers, etc., might change all the time.
- The fact that you can do delta imports on caches allows users to optimize the indexing process further. If you have multiple child-entity caches with data that mostly stays the same, but each has churn on a small percentage of the data, being able to delta update the cache lets you re-acquire only what changed. Otherwise, you have to take every record that had a change in even 1 data source and re-acquire all of the data sources for that record.
- These last two points relate to the fact that Lucene cannot do an "update" but only a "replace". Being able to store your system-of-record data in caches alleviates the need to re-acquire all of your data sources every time you need to do an "update" on a few fields.
- Some systems do not have a separate system-of-record as the data being indexed to Solr is ephemeral or changes frequently. Having the data in caches gives you the freedom to delta update the information or easily re-index all data at system upgrades, etc. I could see for some users these caches factoring into their disaster recovery strategy.
- There is also a feature to partition the data into multiple caches, which would make it easier to subsequently index the data to separate shards. We use this feature to index the data in parallel to the same core (we're using Solr 1.4, which did not have a "threads" parameter), but this would apply to using multiple shards also.

Is this convincing enough to go ahead and work towards commit?

> DIH Cache Improvements
> ----------------------
>
>                 Key: SOLR-2382
>                 URL: https://issues.apache.org/jira/browse/SOLR-2382
>             Project: Solr
>          Issue Type: New Feature
>          Components: contrib - DataImportHandler
>            Reporter: James Dyer
>            Priority: Minor
>         Attachments: SOLR-2382-dihwriter.patch, SOLR-2382-dihwriter.patch,
> SOLR-2382-dihwriter.patch, SOLR-2382-dihwriter.patch,
> SOLR-2382-entities.patch, SOLR-2382-entities.patch, SOLR-2382-entities.patch,
> SOLR-2382-entities.patch, SOLR-2382-entities.patch, SOLR-2382-entities.patch,
> SOLR-2382-entities.patch, SOLR-2382-entities.patch,
> SOLR-2382-properties.patch, SOLR-2382-properties.patch,
> SOLR-2382-solrwriter-verbose-fix.patch, SOLR-2382-solrwriter.patch,
> SOLR-2382-solrwriter.patch, SOLR-2382-solrwriter.patch, SOLR-2382.patch,
> SOLR-2382.patch, SOLR-2382.patch, SOLR-2382.patch, SOLR-2382.patch,
> SOLR-2382.patch, SOLR-2382.patch, SOLR-2382.patch
>
>
> Functionality:
>  1. Provide a pluggable caching framework for DIH so that users can choose a
> cache implementation that best suits their data and application.
>
>  2. Provide a means to temporarily cache a child Entity's data without
> needing to create a special cached implementation of the Entity Processor
> (such as CachedSqlEntityProcessor).
>
>  3. Provide a means to write the final (root entity) DIH output to a cache
> rather than to Solr. Then provide a way for a subsequent DIH call to use the
> cache as an Entity input. Also provide the ability to do delta updates on
> such persistent caches.
>
>  4. Provide the ability to partition data across multiple caches that can
> then be fed back into DIH and indexed either to varying Solr Shards, or to
> the same Core in parallel.
>
> Use Cases:
>  1. We needed a flexible & scalable way to temporarily cache child-entity
> data prior to joining to parent entities.
>   - Using SqlEntityProcessor with Child Entities can cause an "n+1 select"
> problem.
>   - CachedSqlEntityProcessor only supports an in-memory HashMap as a Caching
> mechanism and does not scale.
>   - There is no way to cache non-SQL inputs (ex: flat files, xml, etc).
>
>  2. We needed the ability to gather data from long-running entities by a
> process that runs separate from our main indexing process.
>
>  3. We wanted the ability to do a delta import of only the entities that
> changed.
>   - Lucene/Solr requires entire documents to be re-indexed, even if only a
> few fields changed.
>   - Our data comes from 50+ complex sql queries and/or flat files.
>   - We do not want to incur overhead re-gathering all of this data if only 1
> entity's data changed.
>   - Persistent DIH caches solve this problem.
>
>  4. We want the ability to index several documents in parallel (using 1.4.1,
> which did not have the "threads" parameter).
>
>  5. In the future, we may need to use Shards, creating a need to easily
> partition our source data into Shards.
>
> Implementation Details:
>  1. De-couple EntityProcessorBase from caching.
>   - Created a new interface, DIHCache & two implementations:
>     - SortedMapBackedCache - An in-memory cache, used as default with
> CachedSqlEntityProcessor (now deprecated).
>     - BerkleyBackedCache - A disk-backed cache, dependent on bdb-je, tested
> with je-4.1.6.jar
>        - NOTE: the existing Lucene Contrib "db" project uses je-3.3.93.jar.
> I believe this may be incompatible due to Generic Usage.
>        - NOTE: I did not modify the ant script to automatically get this jar,
> so to use or evaluate this patch, download bdb-je from
> http://www.oracle.com/technetwork/database/berkeleydb/downloads/index.html
>
>  2. Allow Entity Processors to take a "cacheImpl" parameter to cause the
> entity data to be cached (see EntityProcessorBase & DIHCacheProperties).
>
>  3. Partially De-couple SolrWriter from DocBuilder
>   - Created a new interface DIHWriter, & two implementations:
>    - SolrWriter (refactored)
>    - DIHCacheWriter (allows DIH to write ultimately to a Cache).
>
>  4. Create a new Entity Processor, DIHCacheProcessor, which reads a
> persistent Cache as DIH Entity Input.
>
>  5. Support a "partition" parameter with both DIHCacheWriter and
> DIHCacheProcessor to allow for easy partitioning of source entity data.
>
>  6. Change the semantics of entity.destroy()
>   - Previously, it was being called on each iteration of
> DocBuilder.buildDocument().
>   - Now it does one-time cleanup tasks (like closing or deleting a
> disk-backed cache) once the entity processor is completed.
>   - The only out-of-the-box entity processor that previously implemented
> destroy() was LineEntityProcessor, so this is not a very invasive change.
>
> General Notes:
> We are near completion in converting our search functionality from a legacy
> search engine to Solr. However, I found that DIH did not support caching to
> the level of our prior product's data import utility. In order to get our
> data into Solr, I created these caching enhancements.
> Because I believe this has broad application, and because we would like
> this feature to be supported by the Community, I have front-ported this,
> enhanced, to Trunk. I have also added unit tests and verified that all
> existing test cases pass. I believe this patch maintains
> backwards-compatibility and would be a welcome addition to a future version
> of Solr.
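
To make the delta-update argument concrete, here is a minimal Python sketch of the idea, not code from the patch. It assumes each data source is cached as a map from record id to that source's fields. The names `build_cache`, `delta_update`, `build_documents` and `partition_of` are hypothetical, and the modulo partitioning rule is only an illustration, not necessarily what the "partition" parameter does.

```python
# Illustrative only: plain dicts stand in for persistent DIH caches.
# All function names here are hypothetical and not part of SOLR-2382.

def build_cache(rows, key="id"):
    """Full import of one data source: map each record id to its fields."""
    return {row[key]: dict(row) for row in rows}


def delta_update(cache, changed_rows, key="id"):
    """Overwrite only the cached entries that changed in this source."""
    for row in changed_rows:
        cache[row[key]] = dict(row)


def build_documents(caches, ids=None):
    """Join all caches by id into whole documents.

    Whole documents are rebuilt because Lucene replaces rather than
    updates, but unchanged sources are read from their caches instead
    of being queried again.
    """
    keys = sorted(set().union(*caches)) if ids is None else sorted(ids)
    docs = []
    for k in keys:
        doc = {}
        for cache in caches:
            doc.update(cache.get(k, {}))
        docs.append(doc)
    return docs


def partition_of(record_id, partitions):
    """Assumed partitioning rule: integer id modulo the partition count."""
    return record_id % partitions


# Slow, rarely-changing source: cached once.
products = build_cache([{"id": 1, "name": "lamp"}, {"id": 2, "name": "desk"}])
# Fast-changing source: cached, then delta-updated.
prices = build_cache([{"id": 1, "price": 20}, {"id": 2, "price": 150}])

delta_update(prices, [{"id": 2, "price": 140}])

# Only record 2 changed, so only record 2 is rebuilt and re-indexed.
changed_docs = build_documents([products, prices], ids=[2])
```

In this sketch the product source is never queried a second time: the rebuilt document for record 2 takes its name from the existing cache and its price from the delta-updated one.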