We have data coming from several long-running queries against multiple 
relational databases.  Other data arrives as fixed-width text file feeds, etc.  
All of this has to be joined, denormalized, and turned into nice Solr 
documents.  I've been wanting to use DIH, as it already seems to provide 90% 
of what we need.  The rest can come in the form of custom transformers and 
Entity Processors that I can write.

One big need is to have disk-backed caches.  For instance, a child entity that 
pulls back millions of rows will beat up the db using a regular 
SqlEntityProcessor, whereas the CachedSqlEntityProcessor puts everything in 
memory in a HashMap, so it will only scale to a point.  For fixed-width text 
files, there don't seem to be any cached implementations at all.

So I've written a custom Entity Processor that creates a temporary Lucene index 
to use as a disk cache.  Initial tests are promising but with one little 
problem.  I need a place to close the Lucene index reader and then delete the 
temporary index.  It seemed easy enough to override the "destroy()" method from 
EntityProcessorBase.  But to my surprise, it seems that both destroy() and 
init() get called every time a new primary key is looked up in the cache 
(see DocBuilder.buildDocument()).  Just to be sure I wasn't crazy, I added a 
destroy() method to CachedSqlEntityProcessor and found that it does indeed get 
called every time a new primary key is looked up in the cache.  In fact, the first 
couple of lines in cacheInit() in EntityProcessorBase seem to be there to cope 
with the fact that both destroy() and init() get called over and over again 
during the lifecycle of the object.
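
To make the lifecycle problem concrete, here is a minimal standalone sketch in 
plain Java.  It uses no Solr or Lucene classes: TempDirCache, buildCount, 
exists() and close() are hypothetical names, and a temp directory stands in 
for the temporary Lucene index.  It assumes init() and destroy() are each 
called once per parent row, as described above, and that something else can 
call close() exactly once at the end of the import, which is precisely the 
hook I am missing.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;

// Hypothetical stand-in for a disk-backed cache; not a Solr or Lucene class.
class TempDirCache {
    private Path dir;      // null until the cache has been built
    int buildCount = 0;    // how many times the expensive build actually ran

    // Safe to call repeatedly: the cache is built only on the first call.
    void init() {
        if (dir != null) {
            return;
        }
        try {
            dir = Files.createTempDirectory("dih-cache-");
            // Placeholder rows; a real cache would be filled from the data source.
            Files.write(dir.resolve("rows.txt"),
                    "1,foo\n2,bar\n".getBytes(StandardCharsets.UTF_8));
            buildCount++;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Assumed to run once per parent row, so it deliberately keeps the cache.
    void destroy() {
        // no-op: deleting here would force a rebuild for every parent row
    }

    // Real cleanup: delete the temp directory. Meant to run once per import.
    void close() {
        if (dir == null) {
            return;
        }
        try (Stream<Path> paths = Files.walk(dir)) {
            // Reverse order so files are deleted before their parent directory.
            paths.sorted(Comparator.reverseOrder()).forEach(p -> {
                try {
                    Files.delete(p);
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        dir = null;
    }

    // True while the temp directory is still on disk.
    boolean exists() {
        return dir != null && Files.exists(dir);
    }
}
```

Calling init() and destroy() in a loop leaves buildCount at 1 and the 
directory in place; only close() removes it.  The guard in init() plays the 
same role as the check at the top of cacheInit(), but it still leaves open 
who is supposed to call close().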

I've also noticed that destroy() isn't actually implemented anywhere in the 
prepackaged Entity Processors.  This makes me wonder whether calling it 
repeatedly is a mistake.  Should DocBuilder be changed to call destroy() only 
once per lifecycle for each EntityProcessor object?  If so, I think I can have 
a patch in JIRA in short order.

Otherwise, how do I best accomplish my clean-up tasks?  Advice is greatly 
appreciated.

James Dyer
E-Commerce Systems
Ingram Content Group
(615) 213-4311
