[jira] [Updated] (SOLR-2382) DIH Cache Improvements

Mikhail Khludnev (Updated) (JIRA) Mon, 28 Nov 2011 11:52:04 -0800

     [ https://issues.apache.org/jira/browse/SOLR-2382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Mikhail Khludnev updated SOLR-2382:
-----------------------------------

    Attachment: TestCachedSqlEntityProcessor.java-fix-where-clause-by-adding-cachePk-and-lookup.patch
                TestCachedSqlEntityProcessor.java-break-where-clause.patch

James,

Please find my proof that where="xid=x.id" is not supported in 
TestCachedSqlEntityProcessor.java-break-where-clause.patch. It looks puzzling, 
and I'm sorry for that. The test was green only because it relied on the order 
of keys in the map. Wrapping the map in a sorted map changes that order and 
leads to picking the wrong primary key column. Please find the explanation below.

From my point of view, the most harmful thing is 
[lines 27-28|http://svn.apache.org/viewvc/lucene/dev/trunk/solr/contrib/dataimporthandler/src/java/org/apache/solr/handler/dataimport/SortedMapBackedCache.java?view=markup]:
it picks just the first key from the map as the primary key when the key was 
not properly detected from the attributes. This condition hides the problem 
until you actually run into it and have to address it.
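
A minimal sketch of the failure mode described above, assuming a fallback that simply takes the first key the row map iterates. FirstKeyFallbackDemo and guessPrimaryKey are hypothetical names for illustration and are not the SortedMapBackedCache code; the column names id and desc are chosen only as an example.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

class FirstKeyFallbackDemo {
  // Hypothetical stand-in for the fallback: when no primary key was
  // configured, take whatever key the row map happens to iterate first.
  static String guessPrimaryKey(Map<String, Object> row) {
    return row.keySet().iterator().next();
  }

  public static void main(String[] args) {
    // Insertion-ordered map: the join column was put first, so it is "guessed".
    Map<String, Object> inserted = new LinkedHashMap<>();
    inserted.put("id", 1);
    inserted.put("desc", "one");

    // The same row in a sorted map: "desc" sorts before "id".
    Map<String, Object> sorted = new TreeMap<>(inserted);

    System.out.println(guessPrimaryKey(inserted)); // prints "id"
    System.out.println(guessPrimaryKey(sorted));   // prints "desc"
  }
}
```

The row content is identical in both cases; only the iteration order of the concrete map type decides which column is treated as the key.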

The left part of the where clause isn't used 
[here at lines 45-48|http://svn.apache.org/viewvc/lucene/dev/trunk/solr/contrib/dataimporthandler/src/java/org/apache/solr/handler/dataimport/DIHCacheSupport.java?view=markup],
and where="" is ignored again 
[at lines 185-190|http://svn.apache.org/viewvc/lucene/dev/trunk/solr/contrib/dataimporthandler/src/java/org/apache/solr/handler/dataimport/SortedMapBackedCache.java?view=markup].

You can see that the second attachment, 
TestCachedSqlEntityProcessor.java-fix-where-clause-by-adding-cachePk-and-lookup.patch,
fixes the test by adding cachePk and lookup to the attributes.
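
To illustrate the relationship between the two styles of configuration, here is a sketch of how a simple where value such as xid=x.id could be read as a cache key column (left side) and a lookup expression (right side). WhereClauseSplit and its split method are hypothetical and are not part of DIH; the sketch assumes the clause is a single equality.

```java
class WhereClauseSplit {
  // Splits "xid=x.id" into {"xid", "x.id"}.
  // Returns null when the value is not a simple "left=right" equality.
  static String[] split(String where) {
    if (where == null) {
      return null;
    }
    int eq = where.indexOf('=');
    if (eq < 0) {
      return null;
    }
    String left = where.substring(0, eq).trim();
    String right = where.substring(eq + 1).trim();
    if (left.isEmpty() || right.isEmpty()) {
      return null;
    }
    return new String[] { left, right };
  }

  public static void main(String[] args) {
    String[] parts = split("xid=x.id");
    // The left side names the key column, the right side the lookup value.
    System.out.println("key=" + parts[0] + " lookup=" + parts[1]);
  }
}
```

If the left side is dropped, as described above, nothing in the clause names the key column any more and the code has to guess it.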

My proposals are:
* Fix it. It's not a big deal to bring the where attribute back.
* But why are the new attributes cachePk and cacheLookup better than the old 
where attribute? Depending on the answer, I vote for either
** decommissioning where="", or
** rolling the new cachePk/cacheLookup attributes back.
* Can't we add more randomization to 
AbstractDataImportHandlerTestCase.createMap(Object...) to find more hidden 
issues like this one? I propose choosing the concrete map behaviour randomly: 
hash, sorted, or sorted-reverse. WDYT?
* The names withWhereClause() and withKeyAndLookup() should be swapped; their 
contents contradict 
[the names|http://svn.apache.org/viewvc/lucene/dev/trunk/solr/contrib/dataimporthandler/src/test/org/apache/solr/handler/dataimport/TestCachedSqlEntityProcessor.java?view=markup]:
{code}
  public void withWhereClause() {
...
        "query", q, DIHCacheSupport.CACHE_PRIMARY_KEY,"id", 
DIHCacheSupport.CACHE_FOREIGN_KEY ,"
...
  public void withKeyAndLookup() {
...
    Map<String, String> entityAttrs = createMap("query", q, "where", "id=x.id",
...
{code}  
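
The randomization proposed above for createMap(Object...) could be sketched roughly as follows. This is an illustration under assumptions and not the existing test helper: RandomizedMapFactory is a hypothetical class, and the explicit Random parameter is an assumption made to keep the example self-contained and reproducible from a seed.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Random;
import java.util.TreeMap;

class RandomizedMapFactory {
  // Builds a map from alternating key/value arguments and picks the concrete
  // map type at random (hash, sorted, sorted-reverse), so that a test cannot
  // silently depend on one particular iteration order.
  static Map<String, Object> createMap(Random random, Object... args) {
    if (args.length % 2 != 0) {
      throw new IllegalArgumentException("expected key/value pairs");
    }
    Map<String, Object> result;
    switch (random.nextInt(3)) {
      case 0:
        result = new HashMap<String, Object>();
        break;
      case 1:
        result = new TreeMap<String, Object>();
        break;
      default:
        result = new TreeMap<String, Object>(Collections.<String>reverseOrder());
        break;
    }
    for (int i = 0; i < args.length; i += 2) {
      result.put((String) args[i], args[i + 1]);
    }
    return result;
  }

  public static void main(String[] args) {
    // The same arguments may come back in a different iteration order per seed.
    for (long seed = 0; seed < 5; seed++) {
      Map<String, Object> row = createMap(new Random(seed), "id", 1, "desc", "one");
      System.out.println(row.getClass().getSimpleName() + " " + row.keySet());
    }
  }
}
```

With a seeded Random, a failure caused by map ordering can be replayed with the same seed.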

                
> DIH Cache Improvements
> ----------------------
>
>                 Key: SOLR-2382
>                 URL: https://issues.apache.org/jira/browse/SOLR-2382
>             Project: Solr
>          Issue Type: New Feature
>          Components: contrib - DataImportHandler
>            Reporter: James Dyer
>            Priority: Minor
>         Attachments: SOLR-2382-dihwriter.patch, SOLR-2382-dihwriter.patch, 
> SOLR-2382-dihwriter.patch, SOLR-2382-dihwriter.patch, 
> SOLR-2382-dihwriter.patch, SOLR-2382-dihwriter_standalone.patch, 
> SOLR-2382-entities.patch, SOLR-2382-entities.patch, SOLR-2382-entities.patch, 
> SOLR-2382-entities.patch, SOLR-2382-entities.patch, SOLR-2382-entities.patch, 
> SOLR-2382-entities.patch, SOLR-2382-entities.patch, 
> SOLR-2382-properties.patch, SOLR-2382-properties.patch, 
> SOLR-2382-solrwriter-verbose-fix.patch, SOLR-2382-solrwriter.patch, 
> SOLR-2382-solrwriter.patch, SOLR-2382-solrwriter.patch, SOLR-2382.patch, 
> SOLR-2382.patch, SOLR-2382.patch, SOLR-2382.patch, SOLR-2382.patch, 
> SOLR-2382.patch, SOLR-2382.patch, SOLR-2382.patch, 
> TestCachedSqlEntityProcessor.java-break-where-clause.patch, 
> TestCachedSqlEntityProcessor.java-fix-where-clause-by-adding-cachePk-and-lookup.patch,
>  TestThreaded.java.patch
>
>
> Functionality:
>  1. Provide a pluggable caching framework for DIH so that users can choose a 
> cache implementation that best suits their data and application.
>  
>  2. Provide a means to temporarily cache a child Entity's data without 
> needing to create a special cached implementation of the Entity Processor 
> (such as CachedSqlEntityProcessor).
>  
>  3. Provide a means to write the final (root entity) DIH output to a cache 
> rather than to Solr.  Then provide a way for a subsequent DIH call to use the 
> cache as an Entity input.  Also provide the ability to do delta updates on 
> such persistent caches.
>  
>  4. Provide the ability to partition data across multiple caches that can 
> then be fed back into DIH and indexed either to varying Solr Shards, or to 
> the same Core in parallel.
> Use Cases:
>  1. We needed a flexible & scalable way to temporarily cache child-entity 
> data prior to joining to parent entities.
>   - Using SqlEntityProcessor with Child Entities can cause an "n+1 select" 
> problem.
>   - CachedSqlEntityProcessor only supports an in-memory HashMap as a Caching 
> mechanism and does not scale.
>   - There is no way to cache non-SQL inputs (ex: flat files, xml, etc).
>  
>  2. We needed the ability to gather data from long-running entities by a 
> process that runs separate from our main indexing process.
>   
>  3. We wanted the ability to do a delta import of only the entities that 
> changed.
>   - Lucene/Solr requires entire documents to be re-indexed, even if only a 
> few fields changed.
>   - Our data comes from 50+ complex sql queries and/or flat files.
>   - We do not want to incur overhead re-gathering all of this data if only 1 
> entity's data changed.
>   - Persistent DIH caches solve this problem.
>   
>  4. We want the ability to index several documents in parallel (using 1.4.1, 
> which did not have the "threads" parameter).
>  
>  5. In the future, we may need to use Shards, creating a need to easily 
> partition our source data into Shards.
> Implementation Details:
>  1. De-couple EntityProcessorBase from caching.  
>   - Created a new interface, DIHCache & two implementations:  
>     - SortedMapBackedCache - An in-memory cache, used as default with 
> CachedSqlEntityProcessor (now deprecated).
>     - BerkleyBackedCache - A disk-backed cache, dependent on bdb-je, tested 
> with je-4.1.6.jar
>        - NOTE: the existing Lucene Contrib "db" project uses je-3.3.93.jar.  
> I believe this may be incompatible due to Generic Usage.
>        - NOTE: I did not modify the ant script to automatically get this jar, 
> so to use or evaluate this patch, download bdb-je from 
> http://www.oracle.com/technetwork/database/berkeleydb/downloads/index.html 
>  
>  2. Allow Entity Processors to take a "cacheImpl" parameter to cause the 
> entity data to be cached (see EntityProcessorBase & DIHCacheProperties).
>  
>  3. Partially De-couple SolrWriter from DocBuilder
>   - Created a new interface DIHWriter, & two implementations:
>    - SolrWriter (refactored)
>    - DIHCacheWriter (allows DIH to write ultimately to a Cache).
>    
>  4. Create a new Entity Processor, DIHCacheProcessor, which reads a 
> persistent Cache as DIH Entity Input.
>  
>  5. Support a "partition" parameter with both DIHCacheWriter and 
> DIHCacheProcessor to allow for easy partitioning of source entity data.
>  
>  6. Change the semantics of entity.destroy()
>   - Previously, it was being called on each iteration of 
> DocBuilder.buildDocument().
>   - Now it does one-time cleanup tasks (like closing or deleting a 
> disk-backed cache) once the entity processor is completed.
>   - The only out-of-the-box entity processor that previously implemented 
> destroy() was LineEntityProcessor, so this is not a very invasive change.
> General Notes:
> We are near completion in converting our search functionality from a legacy 
> search engine to Solr.  However, I found that DIH did not support caching to 
> the level of our prior product's data import utility.  In order to get our 
> data into Solr, I created these caching enhancements.  Because I believe this 
> has broad application, and because we would like this feature to be supported 
> by the Community, I have front-ported this, enhanced, to Trunk.  I have also 
> added unit tests and verified that all existing test cases pass.  I believe 
> this patch maintains backwards-compatibility and would be a welcome addition 
> to a future version of Solr.

        
