[
https://issues.apache.org/jira/browse/BLUR-5?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13478526#comment-13478526
]
Aaron McCurry commented on BLUR-5:
----------------------------------
-The BlockDirectory/BlockCache is using getFileCacheName to uniquely identify
the particular file for a set of cached blocks. The cache name includes not
only the file name but also the "last modified" date of the file. For
CachedIndexInput this makes sense: the file shouldn't change, but if it did,
the cached data would be invalidated.
Background: the last modified date is part of the key because of what happens
when you:
1. Create / write a table.
2. Drop that table.
3. Recreate that table.
The filenames in each shard will potentially match a file that logically no
longer exists. The result, most of the time, is a corrupt index.
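
As a rough illustration of the scenario above, here is a minimal sketch of why
the last modified date belongs in the key. The class, the cacheName helper, the
"@" separator, the file name and the timestamps are all made up for this
example; this is not the actual getFileCacheName code.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch only; none of these names come from the Blur code base.
final class CacheKeySketch {

    // Builds a cache name from the file name plus the last modified time.
    static String cacheName(String fileName, long lastModified) {
        return fileName + "@" + lastModified;
    }

    public static void main(String[] args) {
        Map<String, byte[]> blockCache = new HashMap<>();

        // 1. Create / write a table: a shard gets a file named "_0.fdt".
        blockCache.put(cacheName("_0.fdt", 1000L), new byte[] {1, 2, 3});

        // 2. Drop that table. 3. Recreate it: same file name, newer timestamp.
        String recreated = cacheName("_0.fdt", 2000L);

        // The recreated file does not see the blocks cached for the old file.
        System.out.println(blockCache.containsKey(recreated));
    }
}
```

With a key made of the file name alone, the lookup for the recreated file would
hit the blocks of the dropped file.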
-On write this is a problem, as we don't know the last modified date and it's
changing on every write.
Yes.
-Given we can rely on HDFS being append only, it seems that we don't have to
worry about the written parts of a file changing. Therefore we can use the file
name only as the cache name during write, and on close of the CachedIndexOutput
we can close the HDFS file, get the last modified date, and use that to update
the Cache filename to file id mapping to include the last modified, which will
then be used by the CachedIndexInput.
This sounds like a good solution.
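
A minimal sketch of how that rename on close might look, assuming the cache
keeps a map from cache name to a numeric file id. The class, its method names
and the "@" separator are hypothetical and only meant to illustrate the idea,
not to describe the existing BlockCache.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch only; none of these names come from the Blur code base.
final class WriteThroughNamesSketch {
    private final Map<String, Integer> fileIds = new ConcurrentHashMap<>();
    private final AtomicInteger nextId = new AtomicInteger();

    // During write the last modified date is unknown, so key by file name only.
    int idForWrite(String fileName) {
        return fileIds.computeIfAbsent(fileName, k -> nextId.getAndIncrement());
    }

    // On close the caller passes the final last modified date of the file;
    // the entry is moved to the key that readers will use.
    void onClose(String fileName, long lastModified) {
        Integer id = fileIds.remove(fileName);
        if (id != null) {
            fileIds.put(fileName + "@" + lastModified, id);
        }
    }

    // Readers look up by file name plus last modified; null means not cached.
    Integer idForRead(String fileName, long lastModified) {
        return fileIds.get(fileName + "@" + lastModified);
    }
}
```

The blocks cached under the file id during the write stay where they are; only
the name that points at the id changes, so the reader finds them without going
back to HDFS.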
-One concern is that if someone were to start reading the file before it was
closed, that might be a problem. However, I don't think that case is possible
here, but I'm not sure.
Lucene prevents this from occurring. In Lucene, you write a file and close it;
after that it cannot be modified, and you may not read the file until it is
closed. When you read a file, open is called once (typically) and the IndexInput
is cloned for reuse while the file lives. All files have unique names, and the
names are never reused, with one exception: segments.gen (though this may have
changed in 4.0, meaning that segments.gen may not exist anymore).
-Does this sound like the right approach?
Yes :)
> Write through caching for the BlockCache
> ----------------------------------------
>
> Key: BLUR-5
> URL: https://issues.apache.org/jira/browse/BLUR-5
> Project: Apache Blur
> Issue Type: Improvement
> Reporter: Aaron McCurry
>
> This will allow for better NRT update performance because the writer will not
> have to read the NRT segments from HDFS.