Hi
AFAIK it is not possible to append to a compressed file.

If you have files in a dir in hdfs and you need to compress them (say, the files for an hour), you can use MapReduce to do that by setting mapred.output.compress=true and mapred.output.compression.codec='theCodecYouPrefer'. You'd get the blocks compressed in the output dir.

You can also use the API to read from standard input:
- get the hadoop conf
- register the required compression codec
- write to a CompressionOutputStream

You should find a well detailed explanation of the same in the book 'Hadoop - The Definitive Guide' by Tom White.

Regards
Bejoy K S

From handheld, Please excuse typos.

-----Original Message-----
From: Xiaobin She <xiaobin...@gmail.com>
Date: Tue, 7 Feb 2012 14:24:01
To: <common-user@hadoop.apache.org>; <bejoy.had...@gmail.com>; David Sinclair <dsincl...@chariotsolutions.com>
Subject: Re: Can I write to an compressed file which is located in hdfs?

hi Bejoy and David,

thank you for your help.

So I can't directly write or append logs to a compressed file in hdfs, right?

Can I compress a file which is already in hdfs and has not been compressed? If I can, how can I do that?

Thanks!

2012/2/6 <bejoy.had...@gmail.com>

> Hi
>      I agree with David on the point: you can achieve step 1 of my
> previous response with flume, i.e. load a real time inflow of data in
> compressed format into hdfs. You can specify a time interval or data size
> in the flume collector that determines when to flush data on to hdfs.
>
> Regards
> Bejoy K S
>
> From handheld, Please excuse typos.
>
> -----Original Message-----
> From: David Sinclair <dsincl...@chariotsolutions.com>
> Date: Mon, 6 Feb 2012 09:06:00
> To: <common-user@hadoop.apache.org>
> Cc: <bejoy.had...@gmail.com>
> Subject: Re: Can I write to an compressed file which is located in hdfs?
>
> Hi,
>
> You may want to have a look at the Flume project from Cloudera. I use it
> for writing data into HDFS.
>
> https://ccp.cloudera.com/display/SUPPORT/Downloads
>
> dave
>
> 2012/2/6 Xiaobin She <xiaobin...@gmail.com>
>
> > hi Bejoy,
> >
> > thank you for your reply.
> >
> > actually I have set up a test cluster which has one namenode/jobtracker
> > and two datanode/tasktrackers, and I have run a test on this cluster.
> >
> > I fetch the log file of one of our modules from the log collector
> > machines by rsync, and then I use the hive command line tool to load
> > this log file into the hive warehouse, which simply copies the file
> > from the local filesystem to hdfs.
> >
> > And I have run some analysis on these data with hive; all of this runs
> > well.
> >
> > But now I want to avoid the fetch step which uses rsync, and write the
> > logs into hdfs files directly from the servers which generate these logs.
> >
> > And it seems easy to do this job if the file located in hdfs is not
> > compressed.
> >
> > But how to write or append logs to a file that is compressed and located
> > in hdfs?
> >
> > Is this possible?
> >
> > Or is this a bad practice?
> >
> > Thanks!
> >
> > 2012/2/6 <bejoy.had...@gmail.com>
> >
> > > Hi
> > >      If you have log files enough to become at least one block size
> > > in an hour, you can go ahead as follows:
> > > - run a scheduled job every hour that compresses the log files for that
> > > hour and stores them on to hdfs (you can use LZO or even Snappy to
> > > compress)
> > > - if your hive does more frequent analysis on this data, store it as
> > > PARTITIONED BY (Date, Hour). While loading into hdfs also follow a
> > > directory - sub dir structure. Once data is in hdfs, issue an Alter
> > > Table Add Partition statement on the corresponding hive table.
> > > - in the Hive DDL use the appropriate Input Format (Hive has an
> > > ApacheLog Input Format already)
> > >
> > > Regards
> > > Bejoy K S
> > >
> > > From handheld, Please excuse typos.
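[The partition scheme Bejoy describes above can be sketched in Hive DDL as follows. This is a minimal sketch only: the table name, column, paths, and partition values are made up for illustration, and the real DDL depends on your log layout and input format.]

```sql
-- Hourly-partitioned external table over compressed logs already in hdfs.
CREATE EXTERNAL TABLE access_logs (
  line STRING          -- parse further with a SerDe / input format as needed
)
PARTITIONED BY (dt STRING, hour STRING)
LOCATION '/logs/access';

-- After an hour's compressed file lands in its dt=/hour= sub dir,
-- register the new partition so Hive can see it:
ALTER TABLE access_logs ADD PARTITION (dt = '2012-02-07', hour = '13')
LOCATION '/logs/access/dt=2012-02-07/hour=13';
```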
> > >
> > > -----Original Message-----
> > > From: Xiaobin She <xiaobin...@gmail.com>
> > > Date: Mon, 6 Feb 2012 16:41:50
> > > To: <common-user@hadoop.apache.org>; 佘晓彬 <xiaobin...@gmail.com>
> > > Reply-To: common-user@hadoop.apache.org
> > > Subject: Re: Can I write to an compressed file which is located in hdfs?
> > >
> > > sorry, this sentence is wrong:
> > >
> > > "I can't compress these logs every hour and then put them into hdfs."
> > >
> > > it should be:
> > >
> > > "I can compress these logs every hour and then put them into hdfs."
> > >
> > > 2012/2/6 Xiaobin She <xiaobin...@gmail.com>
> > >
> > > > hi all,
> > > >
> > > > I'm testing hadoop and hive, and I want to use them in log analysis.
> > > >
> > > > Here I have a question: can I write/append logs to a compressed file
> > > > which is located in hdfs?
> > > >
> > > > Our system generates lots of log files every day. I can't compress
> > > > these logs every hour and then put them into hdfs.
> > > >
> > > > But what if I want to write logs into files that are already in hdfs
> > > > and are compressed?
> > > >
> > > > If these files were not compressed, then this job would seem easy,
> > > > but how to write or append logs into a compressed file?
> > > >
> > > > Can I do that?
> > > >
> > > > Can anyone give me some advice or give me some examples?
> > > >
> > > > Thank you very much!
> > > >
> > > > xiaobin
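[The API route from the top of the thread — get the hadoop conf, obtain the required compression codec, write through a CompressionOutputStream — can be sketched roughly as below. A minimal sketch only, assuming a Hadoop client jar on the classpath and a reachable HDFS; the output path and the choice of GzipCodec are illustrative. Note that once the stream is closed the codec writes its trailer, which is exactly why such a file cannot be appended to afterwards.]

```java
import java.io.IOException;
import java.io.OutputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionOutputStream;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.util.ReflectionUtils;

public class CompressedHdfsWriter {
    public static void main(String[] args) throws IOException {
        // 1. Get the Hadoop configuration (reads core-site.xml etc. from the classpath).
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // 2. Instantiate the codec you prefer; GzipCodec here, LZO or Snappy
        //    work the same way (ReflectionUtils injects the conf into the codec).
        CompressionCodec codec =
                ReflectionUtils.newInstance(GzipCodec.class, conf);

        // 3. Wrap the raw HDFS output stream in a CompressionOutputStream.
        Path out = new Path("/logs/2012-02-07/13.gz");   // illustrative path
        OutputStream raw = fs.create(out);
        CompressionOutputStream cos = codec.createOutputStream(raw);
        try {
            // Compress standard input straight into the HDFS file.
            IOUtils.copyBytes(System.in, cos, 4096, false);
        } finally {
            cos.close();   // flushes the codec trailer; the file is now final
        }
    }
}
```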