[ 
https://issues.apache.org/jira/browse/BLUR-5?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13478526#comment-13478526
 ] 

Aaron McCurry commented on BLUR-5:
----------------------------------

-The BlockDirectory/BlockCache is using getFileCacheName to uniquely identify 
the particular file for a set of cached blocks. The cache name includes not 
only the file name but also the "last modified" date of the file. For 
CachedIndexInput this makes sense - the file shouldn't change, but if it did, 
the cached data would be invalidated.

Background: the last-modified date is part of the key because when you 
1. Create / write a table.
2. Drop that table.
3. Recreate that table.
the file names in each shard will potentially match a file that logically no 
longer exists.  The result, most of the time, is a corrupt index.
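
As a rough illustration of the point above, this Java sketch shows how 
including the last-modified time in the cache name keeps a recreated file from 
matching stale cached blocks. The class name, method name, and name format are 
hypothetical and are not taken from the Blur code.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; not the actual BlockDirectory/BlockCache code.
final class CacheNameSketch {

    // Assumed format: file name plus last-modified time in milliseconds.
    static String cacheName(String fileName, long lastModifiedMillis) {
        return fileName + ":" + lastModifiedMillis;
    }

    // Simulates create -> drop -> recreate of a table whose shard reuses a file name.
    // Returns true if the recreated file would wrongly hit the old cached entry.
    static boolean recreatedFileHitsStaleEntry() {
        Map<String, byte[]> blocksByCacheName = new HashMap<>();

        // 1. Create / write: blocks cached under the name of the original file.
        blocksByCacheName.put(cacheName("_0.fdt", 1000L), new byte[] {1, 2, 3});

        // 2. Drop and 3. recreate: same file name, later last-modified time.
        String recreated = cacheName("_0.fdt", 2000L);

        // The lookup misses, so the old blocks are never served for the new file.
        return blocksByCacheName.containsKey(recreated);
    }
}
```

With the file name alone as the key, the lookup in the last step would hit the 
old entry, which is the corrupt-index case described above.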

-On write this is a problem, as we don't know the last modified date and it's 
changing on every write.

Yes.

-Given we can rely on HDFS being append only it seems that we don't have to 
worry about the written parts of a file changing. Therefore we can use the file 
name only as the cache name during write, and on close of the CachedIndexOutput 
we can close the HDFS file, get the last modified date, and use that to update 
the Cache filename to file id mapping to include the last modified, which will 
then be used by the CachedIndexInput.

This sounds like a good solution.
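
A minimal sketch of that idea in Java, assuming the cache keeps a map from 
cache name to a numeric file id: blocks are keyed by file name only while the 
file is being written, and on close the same id is rebound to the name that 
includes the last-modified time. All names here (FileIdMappingSketch, 
idForWrite, finishWrite, lookup) are hypothetical and not the Blur API.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch; not the actual BlockCache implementation.
final class FileIdMappingSketch {

    private final ConcurrentMap<String, Integer> nameToFileId = new ConcurrentHashMap<>();
    private final AtomicInteger nextFileId = new AtomicInteger();

    // During write: the last-modified time is unknown, so key by file name only.
    int idForWrite(String fileName) {
        return nameToFileId.computeIfAbsent(fileName, k -> nextFileId.getAndIncrement());
    }

    // On close: the caller has closed the HDFS file and read its last-modified time.
    // Rebind the existing file id to the name that readers will use.
    int finishWrite(String fileName, long lastModifiedMillis) {
        Integer fileId = nameToFileId.remove(fileName);
        if (fileId == null) {
            throw new IllegalStateException("No write in progress for " + fileName);
        }
        nameToFileId.put(fileName + ":" + lastModifiedMillis, fileId);
        return fileId;
    }

    // Readers look up by the full cache name; returns null if nothing is mapped.
    Integer lookup(String cacheName) {
        return nameToFileId.get(cacheName);
    }
}
```

Because the file id itself does not change, blocks cached during the write 
remain reachable for the reader without being copied or re-read from HDFS.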

-One concern is that if someone were to start reading the file before it were 
closed that might be a problem, however I don't think that case is possible 
here, but I'm not sure.

Lucene prevents this from occurring.  In Lucene, you write a file and close it, 
after which it cannot be modified, and you may not read that file until it is 
closed.  When you read a file, open is called once (typically) and the 
IndexInput is cloned for reuse while the file is live.  All files have unique 
names, and the names are never reused, with one exception: segments.gen (though 
this may have changed in 4.0, meaning that segments.gen may not exist anymore).

-Does this sound like the right approach?

Yes :)
                
> Write through caching for the BlockCache
> ----------------------------------------
>
>                 Key: BLUR-5
>                 URL: https://issues.apache.org/jira/browse/BLUR-5
>             Project: Apache Blur
>          Issue Type: Improvement
>            Reporter: Aaron McCurry
>
> This will allow for better NRT update performance because the writer will not 
> have to read the NRT segments from HDFS.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
