Doug Cutting wrote:
Johan Oskarsson wrote:
I'm considering using the sequence file output of hadoop jobs to
serve data from, as it would mean I could skip the conversion step
from sequence file -> other file format.
To do this efficiently I would need the data to be in one file.
I think it should be more efficient to keep things in separate files.
If you use MapFileOutputFormat, there are methods to randomly access
entries from job output:
http://lucene.apache.org/hadoop/api/org/apache/hadoop/mapred/MapFileOutputFormat.html
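For example, something like this (an untested sketch against the old
org.apache.hadoop.mapred API, with Text keys and values assumed; the
output path and key come from the command line here):

  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.MapFile;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.io.Writable;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.MapFileOutputFormat;
  import org.apache.hadoop.mapred.Partitioner;
  import org.apache.hadoop.mapred.lib.HashPartitioner;

  public class MapFileLookup {
    public static void main(String[] args) throws Exception {
      JobConf conf = new JobConf(MapFileLookup.class);
      FileSystem fs = FileSystem.get(conf);
      Path outputDir = new Path(args[0]);  // the job's output directory

      // One reader per part-NNNNN MapFile in the output directory.
      MapFile.Reader[] readers =
          MapFileOutputFormat.getReaders(fs, outputDir, conf);

      // Use the same partitioner the job used, so the key is routed
      // to the partition whose reducer wrote it.
      Partitioner partitioner = new HashPartitioner();

      Text key = new Text(args[1]);
      Text value = new Text();

      // Picks the right reader, then seeks via that MapFile's index.
      Writable found =
          MapFileOutputFormat.getEntry(readers, partitioner, key, value);
      System.out.println(found == null ? "not found" : value.toString());

      for (int i = 0; i < readers.length; i++) {
        readers[i].close();
      }
    }
  }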
SequenceFileOutputFormat will also let you open all readers, but
there's no random access, since a SequenceFile has no index.
http://lucene.apache.org/hadoop/api/org/apache/hadoop/mapred/SequenceFileOutputFormat.html
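E.g., to scan all the output sequentially (again untested, with Text
keys and values assumed):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.SequenceFile;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.SequenceFileOutputFormat;

  public class SequenceFileScan {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();

      // One reader per part-NNNNN file; the only access is a full scan.
      SequenceFile.Reader[] readers =
          SequenceFileOutputFormat.getReaders(conf, new Path(args[0]));

      Text key = new Text();
      Text value = new Text();
      for (int i = 0; i < readers.length; i++) {
        while (readers[i].next(key, value)) {
          // process (key, value)
        }
        readers[i].close();
      }
    }
  }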
Will these suffice?
Doug
You're probably right that the best way would be to just leave the files
as they are. I was mostly worried about hitting limits on the number of
open files, but I did a quick calculation now and found I had
overestimated how many files we would have. I think we'd run into other
problems before open files became an issue.
I have considered using MapFiles, but the key I'd do lookups on would
often be different from the key needed when calculating the data and
when using it as input to other hadoop programs. For example, if the
key writable is called UserResource, I might have to do lookups on just
the user id when serving.
I was planning on building something similar to a MapFile, but with the
addition that I can specify which parts of the key to index on. Just
like a MapFile, it could still be read as a SequenceFile when used as
input to other hadoop programs.
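As a rough sketch of the kind of key I mean (hypothetical and untested,
not what we run): a composite key with a comparator that only considers
the user id part, which is one way a plain MapFile could be made to
index on part of a key:

  import java.io.DataInput;
  import java.io.DataOutput;
  import java.io.IOException;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.io.WritableComparable;
  import org.apache.hadoop.io.WritableComparator;

  // Hypothetical composite key: a user id plus a resource name.
  public class UserResource implements WritableComparable {
    public Text userId = new Text();
    public Text resource = new Text();

    public void write(DataOutput out) throws IOException {
      userId.write(out);
      resource.write(out);
    }

    public void readFields(DataInput in) throws IOException {
      userId.readFields(in);
      resource.readFields(in);
    }

    // Full ordering: by user id, then by resource.
    public int compareTo(Object o) {
      UserResource that = (UserResource) o;
      int cmp = userId.compareTo(that.userId);
      return cmp != 0 ? cmp : resource.compareTo(that.resource);
    }

    // Orders entries by the user id part alone, so a probe key that
    // only carries a user id lands on that user's first entry.
    public static class UserIdComparator extends WritableComparator {
      public UserIdComparator() {
        super(UserResource.class);
      }

      public int compare(WritableComparable a, WritableComparable b) {
        return ((UserResource) a).userId.compareTo(((UserResource) b).userId);
      }
    }
  }

Both MapFile.Writer and MapFile.Reader have constructor overloads that
take a WritableComparator, so entries could be written in this order and
MapFile.Reader.getClosest() would then find the first entry for a user id.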
Currently we just output everything as text in one big file and index
that for serving. It's a simple fixed-width index that we use to look up
the start position of the data for a given user id.
This is of course a big waste of disk space and bandwidth/time.
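The lookup itself is essentially just this (a simplified sketch; the
real layout differs, and the 8-byte offset entries and dense integer
user ids here are made up):

  import java.io.IOException;
  import java.io.RandomAccessFile;

  public class FixedWidthIndexLookup {
    // Assumed index layout: one 8-byte offset per user id.
    private static final int ENTRY_WIDTH = 8;

    // Finds where a user's data starts in the big text file.
    public static long startPosition(RandomAccessFile index, long userId)
        throws IOException {
      // Fixed-width entries mean no search is needed, just a seek.
      index.seek(userId * ENTRY_WIDTH);
      return index.readLong();
    }
  }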
Thanks for taking the time to answer my questions.
/Johan