Re: Can spill to disk be in compressed Avro format to reduce I/O?

2012-01-13 Thread Frank Grimes
Hi Scott,

I added https://issues.apache.org/jira/browse/AVRO-991 to track this feature 
request.

Thanks,

Frank Grimes


On 2012-01-12, at 10:31 PM, Scott Carey wrote:

 
 
 On 1/12/12 5:52 PM, Frank Grimes frankgrime...@gmail.com wrote:
 
 Hi Scott,
 
 I've looked into this some more and I now see what you mean about appending 
 to HDFS directly not being possible with the current DataFileWriter API.
 
 That's unfortunate because we really would like to avoid hitting disk just to 
 write temporary files (and the associated cleanup).
 
 I kinda like the notion of not requiring HDFS APIs to achieve this merging 
 of Avro files/streams.
 
 Assuming we wanted to be able to stream multiple files as in my example... 
 could DataFileStream easily be changed to support that use case?
 i.e. allow it to skip/ignore the subsequent headers and metadata in the stream 
 rather than erroring out with "Invalid sync!"?
 
 
 That may be possible; open a JIRA to discuss further.  It should be modified 
 to 'reset' to the start of a new file or stream and continue from there, 
 since it needs to read the header and find the new sync value and validate 
 that the schemas match and the codec is compatible.  It may be possible to 
 detect the end of one file and the start of another if the files are streamed 
 back to back, but perhaps not reliably.
 The avro-tools could be extended to have a command line tool that takes a 
 list of files (HDFS or local) and writes a new file (HDFS or local) 
 concatenated and possibly recodec'd.
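 As a rough sketch (untested; the class name and argument layout are 
 illustrative) of what such a tool might look like built on the existing 
 appendAllFrom API, reading generic records so no generated classes are needed:
 
 import java.io.FileInputStream;
 import java.io.FileOutputStream;
 import java.io.IOException;
 import org.apache.avro.file.CodecFactory;
 import org.apache.avro.file.DataFileStream;
 import org.apache.avro.file.DataFileWriter;
 import org.apache.avro.generic.GenericDatumReader;
 import org.apache.avro.generic.GenericDatumWriter;
 import org.apache.avro.generic.GenericRecord;
 
 public class AvroConcatTool {
     // Usage (hypothetical): AvroConcatTool <output> <input1> [<input2> ...]
     public static void main(String[] args) throws IOException {
         if (args.length < 2) {
             System.err.println("usage: AvroConcatTool <output> <input...>");
             return;
         }
         DataFileWriter<GenericRecord> writer =
             new DataFileWriter<GenericRecord>(new GenericDatumWriter<GenericRecord>());
         writer.setCodec(CodecFactory.deflateCodec(6));   // re-codec while concatenating
         boolean created = false;
         for (int i = 1; i < args.length; i++) {
             DataFileStream<GenericRecord> in = new DataFileStream<GenericRecord>(
                 new FileInputStream(args[i]), new GenericDatumReader<GenericRecord>());
             if (!created) {
                 // take the schema from the first input; every input must match it
                 writer.create(in.getSchema(), new FileOutputStream(args[0]));
                 created = true;
             }
             writer.appendAllFrom(in, true);   // true = recompress blocks with our codec
             in.close();
         }
         writer.close();
     }
 }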
 
 
 Thanks,
 
 Frank Grimes
 
 
 On 2012-01-12, at 3:53 PM, Scott Carey wrote:
 
 
 
 On 1/12/12 12:35 PM, Frank Grimes frankgrime...@gmail.com wrote:
 
 So I decided to try writing my own AvroStreamCombiner utility and it seems 
 to choke when passing multiple input files:
 
 hadoop dfs -cat hdfs://hadoop/machine1.log.avro hdfs://hadoop/machine2.log.avro \
   | ./deliveryLogAvroStreamCombiner.sh > combined.log.avro
 
 Exception in thread "main" java.io.IOException: Invalid sync!
     at org.apache.avro.file.DataFileStream.nextRawBlock(DataFileStream.java:293)
     at org.apache.avro.file.DataFileWriter.appendAllFrom(DataFileWriter.java:329)
     at DeliveryLogAvroStreamCombiner.main(Unknown Source)
 
 
 Here's the code in question:
 
public class DeliveryLogAvroStreamCombiner {

    /**
     * @param args
     */
    public static void main(String[] args) throws Exception {
        DataFileStream<DeliveryLogEvent> dfs = null;
        DataFileWriter<DeliveryLogEvent> dfw = null;

        try {
            dfs = new DataFileStream<DeliveryLogEvent>(System.in,
                    new SpecificDatumReader<DeliveryLogEvent>());

            OutputStream stdout = System.out;

            dfw = new DataFileWriter<DeliveryLogEvent>(
                    new SpecificDatumWriter<DeliveryLogEvent>());
            dfw.setCodec(CodecFactory.deflateCodec(9));
            dfw.setSyncInterval(1024 * 256);
            dfw.create(DeliveryLogEvent.SCHEMA$, stdout);

            dfw.appendAllFrom(dfs, false);
 
  dfs is from System.in, which has multiple files one after the other.  Each 
  file will need its own DataFileStream (each has its own header and metadata).
 
  In Java you could get the list of files and, for each one, use the HDFS API 
  to open the file stream and append it to your one output file.
  In bash you could loop over all the source files and append them one at a time 
  (the above fails on the second file).  However, the only API for appending to 
  the end of a pre-existing file currently takes a File, not a seekable stream, 
  so Avro would need a patch to allow that on HDFS (and only an HDFS version 
  that supports appends would work).
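  
  A rough sketch of that Java loop (assuming the Hadoop FileSystem classes are 
  on the classpath; the paths and the deflate level are illustrative, not tested):
  
  // Sketch only: one DataFileStream per HDFS file, all appended to one writer.
  // Uses org.apache.hadoop.conf.Configuration and org.apache.hadoop.fs.{FileSystem, Path}.
  FileSystem fs = FileSystem.get(new Configuration());
  DataFileWriter<DeliveryLogEvent> dfw =
      new DataFileWriter<DeliveryLogEvent>(new SpecificDatumWriter<DeliveryLogEvent>());
  dfw.setCodec(CodecFactory.deflateCodec(6));
  dfw.create(DeliveryLogEvent.SCHEMA$, System.out);   // or an HDFS output stream
  
  for (String name : new String[] { "/logs/machine1.log.avro", "/logs/machine2.log.avro" }) {
      InputStream in = fs.open(new Path(name));        // a fresh stream per file
      DataFileStream<DeliveryLogEvent> dfs =
          new DataFileStream<DeliveryLogEvent>(in, new SpecificDatumReader<DeliveryLogEvent>());
      dfw.appendAllFrom(dfs, false);   // pass true to recompress with this writer's codec
      dfs.close();
  }
  dfw.close();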
 
 Other things of note:
  You will probably get better total file size compression by using a larger 
  sync interval (1MB to 4MB) than deflate level 9.  Deflate 9 is VERY slow and 
 almost never compresses more than 1% better than deflate 6, which is much 
 faster.  I suggest experimenting with the 'recodec' option on some of your 
 files to see what the best size / performance tradeoff is.  I doubt that 
 256K (pre-compression) blocks with level 9 compression is the way to go.
 
 For reference: http://tukaani.org/lzma/benchmarks.html
 (gzip uses deflate compression)
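  
  For example (illustrative numbers only, worth benchmarking against your own 
  data), the writer settings above might become:
  
  dfw.setCodec(CodecFactory.deflateCodec(6));   // much faster, usually within ~1% of level 9
  dfw.setSyncInterval(2 * 1024 * 1024);         // ~2MB pre-compression blocks instead of 256K
  dfw.create(DeliveryLogEvent.SCHEMA$, stdout);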
 
 -Scott
 
 
        }
        finally {
            if (dfs != null) try { dfs.close(); } catch (Exception e) { e.printStackTrace(); }
            if (dfw != null) try { dfw.close(); } catch (Exception e) { e.printStackTrace(); }
        }
    }
}
 
  Is there any way this could be made to work without needing to download 
  the individual files to disk and call append on each of them?
 
 Thanks,
 
 Frank Grimes
 
 
 On 2012-01-12, at 2:24 PM, Frank Grimes wrote:
 
 Hi Scott,
 
 If I have a map-only job, would I want only one mapper running to pull 
 all the records from the source input files and stream/append them to the 
 

AVRO Schema Validator

2012-01-13 Thread Jason Rutherglen
Is there a command line way to validate an AVRO schema?