Snappy is not splittable either; combined with sequence files it gives the same result - it bulk-dumps the whole file into HDFS.
I feel a bit uneasy keeping a 120MB (almost 1GB uncompressed) file open for one hour...

On Thu, Jan 30, 2014 at 1:59 PM, Jeff Lord <[email protected]> wrote:
> You are using gzip, so the files won't be splittable.
> You may be better off using snappy and sequence files.
>
>
> On Thu, Jan 30, 2014 at 10:51 AM, Jimmy <[email protected]> wrote:
>
>> I am running a few tests and would like to confirm whether this is true...
>>
>> hdfs.codeC = gzip
>> hdfs.fileType = CompressedStream
>> hdfs.writeFormat = Text
>> hdfs.batchSize = 100
>>
>> Now let's assume I have a large number of transactions and I roll the file every 10
>> minutes.
>>
>> It seems the tmp file stays at 0 bytes and then flushes all at once after 10 minutes,
>> whereas if I don't use compression, the file grows as data is written to HDFS.
>>
>> Is this correct?
>>
>> Do you see any drawback in using CompressedStream with very large
>> files? In my case a 120MB compressed file (one HDFS block) is 10x that size
>> uncompressed.
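For anyone following along, here is a minimal sketch of the snappy + SequenceFile sink Jeff suggests, using standard Flume HDFS sink properties. The agent/channel/sink names (a1, c1, k1) and the HDFS path are placeholders, not taken from the thread:

a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = hdfs://namenode/flume/events/%Y-%m-%d
a1.sinks.k1.hdfs.fileType = SequenceFile
a1.sinks.k1.hdfs.codeC = snappy
a1.sinks.k1.hdfs.writeFormat = Text
a1.sinks.k1.hdfs.batchSize = 100
# Roll every 10 minutes, matching the original test; zero disables
# size-based and event-count-based rolling so only the timer applies
a1.sinks.k1.hdfs.rollInterval = 600
a1.sinks.k1.hdfs.rollSize = 0
a1.sinks.k1.hdfs.rollCount = 0

The rollInterval of 600 seconds reproduces the 10-minute roll from the original setup; whether the .tmp file grows incrementally under this configuration is exactly the behavior being debated above.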
