Snappy is not splittable either; combined with sequence files it gives the same result - it bulk-dumps the whole file into HDFS.
I feel a bit uneasy keeping a 120MB (almost 1GB uncompressed) file open for one hour...

On Thu, Jan 30, 2014 at 1:59 PM, Jeff Lord <[email protected]> wrote:
> You are using gzip, so the files won't be splittable.
> You may be better off using snappy and sequence files.
>
>
> On Thu, Jan 30, 2014 at 10:51 AM, Jimmy <[email protected]> wrote:
>
>> I am running a few tests and would like to confirm whether this is true...
>>
>> hdfs.codeC = gzip
>> hdfs.fileType = CompressedStream
>> hdfs.writeFormat = Text
>> hdfs.batchSize = 100
>>
>> Now let's assume I have a large number of transactions and I roll the file every 10
>> minutes.
>>
>> It seems the tmp file stays at 0 bytes and then flushes all at once after 10 minutes,
>> whereas if I don't use compression, the file grows as data is written to HDFS.
>>
>> Is this correct?
>>
>> Do you see any drawback in using CompressedStream with very large
>> files? In my case a 120MB compressed file (one HDFS block) is 10x that size
>> uncompressed.
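For anyone following along, here is a minimal sketch of the snappy + SequenceFile sink Jeff suggests, using standard Flume HDFS sink properties. The agent/channel/sink names (a1, c1, k1) and the HDFS path are placeholders, not taken from the thread:

a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = hdfs://namenode/flume/events/%Y-%m-%d
a1.sinks.k1.hdfs.fileType = SequenceFile
a1.sinks.k1.hdfs.codeC = snappy
a1.sinks.k1.hdfs.writeFormat = Text
a1.sinks.k1.hdfs.batchSize = 100
# Roll every 10 minutes, matching the original test; zero disables
# size-based and event-count-based rolling so only the timer applies
a1.sinks.k1.hdfs.rollInterval = 600
a1.sinks.k1.hdfs.rollSize = 0
a1.sinks.k1.hdfs.rollCount = 0

The rollInterval of 600 seconds reproduces the 10-minute roll from the original setup; whether the .tmp file grows incrementally under this configuration is exactly the behavior being debated above.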
