[
https://issues.apache.org/jira/browse/HADOOP-3315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12668640#action_12668640
]
Hong Tang commented on HADOOP-3315:
-----------------------------------
bq. Any advantage to our making a scanner around a start and end key for
random accessing, or, if I read things properly, is there none since we only
fetch actual blocks when seekTo is called?
There is no performance advantage, but there is a semantic difference: if you
create a range scanner and call seekTo with a key outside the scan range, it
will return false even if the key exists in the TFile.
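For example, a minimal sketch (the scanner-factory and seekTo signatures here
follow the attached patch's API and may differ in the final version; the file
path is made up):
{code:java}
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.file.tfile.TFile;

public class RangeScannerDemo {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path path = new Path("/tmp/example.tfile");   // hypothetical file
    FSDataInputStream in = fs.open(path);
    TFile.Reader reader =
        new TFile.Reader(in, fs.getFileStatus(path).getLen(), conf);

    // Whole-file scanner: seekTo succeeds for any key present in the file.
    TFile.Reader.Scanner whole = reader.createScanner();
    boolean foundInFile = whole.seekTo("keyZ".getBytes()); // true if present
    whole.close();

    // Range scanner over [keyA, keyM): seekTo("keyZ") returns false even
    // when "keyZ" exists in the file, because it is outside the scan range.
    TFile.Reader.Scanner range =
        reader.createScanner("keyA".getBytes(), "keyM".getBytes());
    boolean foundInRange = range.seekTo("keyZ".getBytes()); // false
    range.close();

    System.out.println(foundInFile + " / " + foundInRange);
    reader.close();
    in.close();
  }
}
{code}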
bq. And on concurrent access: if we have, say, random accesses concurrent with
a couple of whole-file scans, my reading has it that each scanner fetches a
block just as it needs it and then works against this fetched copy. The fetch
is 'synchronized', which means lots of seeking around in the file, but
otherwise it looks like there is no need for the application to synchronize
access to the TFile.
Yes, there is no need to synchronize threads accessing the same TFile, as
long as each has its own scanner. However, concurrent access is not as
performant as it could be, due to the current design of HDFS: if multiple
threads scan different regions of the same TFile, the actual IO calls to
FSDataInputStream are synchronized. I tried positioned read (which would avoid
synchronizing on reads), but the overhead of frequent connection establishment
made the single-threaded case much worse. Connection caching for positioned
reads may help (HADOOP-3672).
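A sketch of the per-thread-scanner pattern (hypothetical file path; error
handling mostly elided):
{code:java}
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.file.tfile.TFile;

public class SharedReaderDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path path = new Path("/tmp/example.tfile");   // hypothetical file
    FSDataInputStream in = fs.open(path);
    final TFile.Reader reader =
        new TFile.Reader(in, fs.getFileStatus(path).getLen(), conf);

    // Threads share one Reader but each owns a private Scanner, so the
    // application needs no locking of its own; the underlying
    // FSDataInputStream still serializes the actual block reads.
    Thread[] workers = new Thread[2];
    for (int i = 0; i < workers.length; i++) {
      workers[i] = new Thread(new Runnable() {
        public void run() {
          try {
            TFile.Reader.Scanner scanner = reader.createScanner();
            while (!scanner.atEnd()) {
              // scanner.entry() would be consumed here ...
              scanner.advance();
            }
            scanner.close();
          } catch (IOException e) {
            e.printStackTrace();
          }
        }
      });
      workers[i].start();
    }
    for (Thread t : workers) {
      t.join();
    }
    reader.close();
    in.close();
  }
}
{code}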
bq. Hmm. Looking at doing random accesses, it seems like a bunch of time is
spent in inBlockAdvance advancing sequentially through blocks rather than
doing something like a binary search to find the desired block location.
Also, as we advance, we create and destroy a bunch of objects, such as the
stream that holds the value. Can you comment on why this is (compression
should be on TFile block boundaries, right, so nothing stops hopping into the
midst of a TFile)? Thanks.
inBlockAdvance() goes sequentially through the key-value pairs INSIDE one
compressed block; the binary search for the desired block is done through
Reader.getBlockContainsKey(). Additionally, the code handles the case where
you want to seek to a key in a later part of the same block. No objects are
created while advancing. The value stream is closed to force the code to skip
the remaining bytes of the value in case the application consumed only part
of them.
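Schematically, a seek is two phases: a binary search over the block index
picks the compressed block, then a linear advance runs inside that block. A
toy sketch of the idea, with in-memory arrays standing in for the real index
and block contents (not the patch's actual code):
{code:java}
/** Toy sketch of TFile's two-phase seek: binary search across the block
 *  index (cf. Reader.getBlockContainsKey()), then sequential advance
 *  inside one block (cf. inBlockAdvance()). */
public class SeekSketch {
  // Phase 1: find the last block whose first key is <= the search key.
  static int blockContainingKey(byte[][] blockFirstKeys, byte[] key) {
    int lo = 0, hi = blockFirstKeys.length - 1, found = 0;
    while (lo <= hi) {
      int mid = (lo + hi) >>> 1;
      if (compare(blockFirstKeys[mid], key) <= 0) {
        found = mid;
        lo = mid + 1;
      } else {
        hi = mid - 1;
      }
    }
    return found;
  }

  // Phase 2: compressed blocks allow no internal random access, so the
  // scan inside the chosen block is necessarily sequential.
  static int positionInBlock(byte[][] blockKeys, byte[] key) {
    for (int i = 0; i < blockKeys.length; i++) {
      if (compare(blockKeys[i], key) >= 0) {
        return i;               // first entry with key >= search key
      }
    }
    return blockKeys.length;    // key sorts past the end of the block
  }

  // Lexicographic comparison of raw byte keys.
  static int compare(byte[] a, byte[] b) {
    int n = Math.min(a.length, b.length);
    for (int i = 0; i < n; i++) {
      int d = (a[i] & 0xff) - (b[i] & 0xff);
      if (d != 0) return d;
    }
    return a.length - b.length;
  }
}
{code}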
bq. Are you going to upload another patch? If so, I'll keep my +1 for that.
Will do shortly.
> New binary file format
> ----------------------
>
> Key: HADOOP-3315
> URL: https://issues.apache.org/jira/browse/HADOOP-3315
> Project: Hadoop Core
> Issue Type: New Feature
> Components: io
> Reporter: Owen O'Malley
> Assignee: Amir Youssefi
> Fix For: 0.21.0
>
> Attachments: HADOOP-3315_20080908_TFILE_PREVIEW_WITH_LZO_TESTS.patch,
> HADOOP-3315_20080915_TFILE.patch, hadoop-trunk-tfile.patch,
> hadoop-trunk-tfile.patch, TFile Specification 20081217.pdf
>
>
> SequenceFile's block compression format is too complex and requires 4 codecs
> to compress or decompress. It would be good to have a file format that only
> needs