Hi Scott,
I added https://issues.apache.org/jira/browse/AVRO-991 to track this feature
request.
Thanks,
Frank Grimes
On 2012-01-12, at 10:31 PM, Scott Carey wrote:
On 1/12/12 5:52 PM, Frank Grimes frankgrime...@gmail.com wrote:
Hi Scott,
I've looked into this some more and I now see what you mean about appending
to HDFS directly not being possible with the current DataFileWriter API.
That's unfortunate because we really would like to avoid needing to hit disk
just to write temporary files (and the associated cleanup).
I kinda like the notion of not requiring HDFS APIs to achieve this merging
of Avro files/streams.
Assuming we wanted to be able to stream multiple files as in my example...
could DataFileStream easily be changed to support that use case?
i.e., allow it to skip/ignore subsequent headers and metadata in the stream
rather than erroring out with "Invalid sync!"?
That may be possible; open a JIRA to discuss further. It should be modified
to 'reset' to the start of a new file or stream and continue from there,
since it needs to read the header and find the new sync value and validate
that the schemas match and the codec is compatible. It may be possible to
detect the end of one file and the start of another if the files are streamed
back to back, but perhaps not reliably.
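To make the 'reset' idea concrete, here is a minimal sketch on a toy container layout. Everything in it is an assumption made for illustration: the layout (a magic int, a 16-byte sync marker, then length-prefixed blocks each followed by the marker), the class name ToyContainer and its methods. It is not Avro's file format or the DataFileStream API. The magic value is deliberately negative so that it can never be confused with a block length; a real format may not offer that guarantee, which is the reliability concern mentioned above.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;

// Toy stand-in for a container file (hypothetical layout, NOT Avro's):
//   [int MAGIC][16-byte sync] then blocks of [int length][payload][16-byte sync]
class ToyContainer {
    static final int MAGIC = 0xCAFEF00D; // negative as an int, so never a valid length
    static final int SYNC_SIZE = 16;

    // Writes one "file" whose sync marker is derived from the given seed.
    static byte[] write(long seed, String... records) {
        byte[] sync = new byte[SYNC_SIZE];
        new Random(seed).nextBytes(sync);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        put(out, intBytes(MAGIC));
        put(out, sync);
        for (String record : records) {
            byte[] payload = record.getBytes(StandardCharsets.UTF_8);
            put(out, intBytes(payload.length));
            put(out, payload);
            put(out, sync);
        }
        return out.toByteArray();
    }

    // Reads all records. With resetOnHeader=true, a second header found in the
    // stream makes the reader adopt the new sync marker and carry on; with
    // false, the reader rejects anything that is not a block of the first file.
    static List<String> read(byte[] data, boolean resetOnHeader) {
        ByteBuffer in = ByteBuffer.wrap(data);
        if (in.getInt() != MAGIC) {
            throw new IllegalStateException("Not a toy container");
        }
        byte[] sync = new byte[SYNC_SIZE];
        in.get(sync);
        List<String> records = new ArrayList<>();
        while (in.hasRemaining()) {
            int length = in.getInt();
            if (length == MAGIC && resetOnHeader) {
                in.get(sync); // start of a new file: switch to its sync marker
                continue;
            }
            if (length < 0 || length > in.remaining() - SYNC_SIZE) {
                throw new IllegalStateException("Invalid block length");
            }
            byte[] payload = new byte[length];
            in.get(payload);
            byte[] seen = new byte[SYNC_SIZE];
            in.get(seen);
            if (!Arrays.equals(seen, sync)) {
                throw new IllegalStateException("Invalid sync!");
            }
            records.add(new String(payload, StandardCharsets.UTF_8));
        }
        return records;
    }

    // Simulates `cat file1 file2`.
    static byte[] concat(byte[] first, byte[] second) {
        byte[] joined = Arrays.copyOf(first, first.length + second.length);
        System.arraycopy(second, 0, joined, first.length, second.length);
        return joined;
    }

    private static byte[] intBytes(int value) {
        return ByteBuffer.allocate(4).putInt(value).array();
    }

    private static void put(ByteArrayOutputStream out, byte[] bytes) {
        out.write(bytes, 0, bytes.length);
    }

    public static void main(String[] args) {
        byte[] joined = concat(write(1L, "m1-a", "m1-b"), write(2L, "m2-a"));
        System.out.println(read(joined, true));
    }
}
```

Calling read(joined, false) on the same bytes throws, which mirrors the failure seen when two files are piped back to back into a single reader. A real implementation would also have to compare schemas and codecs at each reset, which this sketch does not attempt.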
The avro-tools could be extended to have a command line tool that takes a
list of files (HDFS or local) and writes a new file (HDFS or local)
concatenated and possibly recodec'd.
Thanks,
Frank Grimes
On 2012-01-12, at 3:53 PM, Scott Carey wrote:
On 1/12/12 12:35 PM, Frank Grimes frankgrime...@gmail.com wrote:
So I decided to try writing my own AvroStreamCombiner utility and it seems
to choke when passing multiple input files:
hadoop dfs -cat hdfs://hadoop/machine1.log.avro \
    hdfs://hadoop/machine2.log.avro | ./deliveryLogAvroStreamCombiner.sh \
    > combined.log.avro
Exception in thread "main" java.io.IOException: Invalid sync!
    at org.apache.avro.file.DataFileStream.nextRawBlock(DataFileStream.java:293)
    at org.apache.avro.file.DataFileWriter.appendAllFrom(DataFileWriter.java:329)
    at DeliveryLogAvroStreamCombiner.main(Unknown Source)
Here's the code in question:
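import java.io.OutputStream;
import org.apache.avro.file.CodecFactory;
import org.apache.avro.file.DataFileStream;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.specific.SpecificDatumReader;
import org.apache.avro.specific.SpecificDatumWriter;

public class DeliveryLogAvroStreamCombiner {
  /**
   * Reads an Avro data file stream from stdin and rewrites it to stdout.
   */
  public static void main(String[] args) throws Exception {
    DataFileStream<DeliveryLogEvent> dfs = null;
    DataFileWriter<DeliveryLogEvent> dfw = null;
    try {
      dfs = new DataFileStream<DeliveryLogEvent>(System.in,
          new SpecificDatumReader<DeliveryLogEvent>());
      OutputStream stdout = System.out;
      dfw = new DataFileWriter<DeliveryLogEvent>(
          new SpecificDatumWriter<DeliveryLogEvent>());
      dfw.setCodec(CodecFactory.deflateCodec(9));
      dfw.setSyncInterval(1024 * 256);
      dfw.create(DeliveryLogEvent.SCHEMA$, stdout);
      // Copy every block from the input stream without recompressing.
      dfw.appendAllFrom(dfs, false);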
dfs reads from System.in, which here carries multiple files one after the
other. Each file needs its own DataFileStream, because each file has its own
header, metadata and sync marker.
In Java you could get the list of files, and for each file use HDFS's API
to open the file stream, and append that to your one file.
In bash you could loop over all the source files and append one at a time
(the above fails on the second file). However, in order to append to the
end of a pre-existing file the only API now takes a File, not a seekable
stream, so Avro would need a patch to allow that in HDFS (also, only an
HDFS version that supports appends would work).
Other things of note:
You will probably get better total file size compression by using a larger
sync interval (1 MB to 4 MB) than by using deflate level 9. Deflate 9 is VERY
slow and almost never compresses more than 1% better than deflate 6, which is
much faster. I suggest experimenting with the 'recodec' option on some of your
files to see what the best size / performance tradeoff is. I doubt that
256K (pre-compression) blocks with level 9 compression are the way to go.
For reference: http://tukaani.org/lzma/benchmarks.html
(gzip uses deflate compression)
-Scott
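One rough way to run that kind of experiment before touching real Avro files is sketched below. It is an approximation under stated assumptions: it uses java.util.zip.Deflater directly rather than Avro's codec, it compresses each block independently to mimic per-block compression, and the class BlockCompressionProbe, its methods and the synthetic log lines are all hypothetical. Results on real data may differ, so the numbers it prints are only a starting point.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;

// Hypothetical helper: compares block size / deflate level combinations
// on sample data. Not part of Avro.
class BlockCompressionProbe {

    // Total compressed bytes when `data` is split into blocks of `blockSize`
    // bytes and each block is deflated on its own at the given level.
    static long compressedSize(byte[] data, int blockSize, int level) {
        if (blockSize <= 0) {
            throw new IllegalArgumentException("blockSize must be positive");
        }
        byte[] scratch = new byte[64 * 1024];
        long total = 0;
        for (int offset = 0; offset < data.length; offset += blockSize) {
            int length = Math.min(blockSize, data.length - offset);
            Deflater deflater = new Deflater(level);
            deflater.setInput(data, offset, length);
            deflater.finish();
            while (!deflater.finished()) {
                total += deflater.deflate(scratch);
            }
            deflater.end(); // release native resources
        }
        return total;
    }

    // Synthetic, repetitive log-like data standing in for real records.
    static byte[] sampleLog(int lines) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < lines; i++) {
            sb.append("2012-01-12 host=machine").append(i % 7)
              .append(" event=delivery id=").append(i * 31 % 10007)
              .append('\n');
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] data = sampleLog(200_000);
        int[] blockSizes = {256 * 1024, 1024 * 1024, 4 * 1024 * 1024};
        int[] levels = {6, 9};
        for (int blockSize : blockSizes) {
            for (int level : levels) {
                long start = System.nanoTime();
                long size = compressedSize(data, blockSize, level);
                long millis = (System.nanoTime() - start) / 1_000_000;
                System.out.printf("block=%dK level=%d -> %d bytes in %d ms%n",
                        blockSize / 1024, level, size, millis);
            }
        }
    }
}
```

Swapping sampleLog for bytes read from a representative input file gives a more realistic picture of the size versus time tradeoff for each combination.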
    }
    finally {
      if (dfs != null) try { dfs.close(); } catch (Exception e) { e.printStackTrace(); }
      if (dfw != null) try { dfw.close(); } catch (Exception e) { e.printStackTrace(); }
    }
  }
}
Is there any way this could be made to work without needing to download
the individual files to disk and calling append for each of them?
Thanks,
Frank Grimes
On 2012-01-12, at 2:24 PM, Frank Grimes wrote:
Hi Scott,
If I have a map-only job, would I want only one mapper running to pull
all the records from the source input files and stream/append them to the