The Hadoop wiki gives a brief explanation:
http://wiki.apache.org/hadoop/HowManyMapsAndReduces

The logic is indeed the same for Pig because, under the hood, Pig will
generate and optimize a workflow of MapReduce jobs.

With a splittable 500MB file, if the provided number of maps (10) is
accepted, then yes, each map task will read roughly 50MB.
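
As a minimal sketch of how you could hint at that from a Pig script
(hypothetical file names, assuming the Hadoop 1.x property names of that
era; the framework may still adjust the actual number of map tasks):

-- aim for ~50MB per map so a splittable 500MB input yields about 10 maps
SET mapred.max.split.size '52428800';
SET pig.maxCombinedSplitSize '52428800';
A = LOAD 'myfile.txt' USING PigStorage('\t') AS (line);
STORE A INTO 'output' USING PigStorage('\t');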

Regards

Bertrand




On Mon, Jun 10, 2013 at 1:34 PM, Ruslan Al-Fakikh <metarus...@gmail.com> wrote:

> Hi Pedro,
>
> Yes, Pig Latin is always compiled to MapReduce.
> Usually you don't have to specify the number of mappers (I am not sure
> whether you really can). If you have a file of 500MB and it is splittable
> then the number of mappers is automatically equal to 500MB / 64MB (block
> size) which is around 8. Here I assume that you have the default block size
> of 64MB. If your file is not splittable then the whole file will go to one
> mapper :(
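>
> As an illustrative sketch (hypothetical file names; plain text is
> splittable while gzip is not):
>
> a = LOAD 'myfile.txt' USING PigStorage('\t') AS (line);    -- ~8 map tasks for 500MB
> b = LOAD 'myfile.txt.gz' USING PigStorage('\t') AS (line); -- 1 map task, whole file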
>
> Let me know if you have further questions.
>
> Thanks
>
>
> On Mon, Jun 10, 2013 at 1:36 PM, Pedro Sá da Costa <psdc1...@gmail.com> wrote:
>
> > Yes, I understand the previous answers now. The reason for my question
> > is that I was trying to "split" a file with Pig Latin by loading the
> > file and writing portions of it back to HDFS. From both replies, it
> > seems that Pig Latin uses MapReduce to compute the scripts, correct?
> >
> > And in MapReduce, if I have one file of 500MB and I run an example with
> > 10 maps (let's forget the reducers for now), does it mean that each map
> > will read more or less 50MB?
> >
> >
> >
> > On 10 June 2013 11:21, Bertrand Dechoux <decho...@gmail.com> wrote:
> >
> > > I wasn't clear. Specifying the size of the files is not your real aim,
> > > I guess. But you think that's what is needed in order to solve your
> > > problem, which we don't know about. 500MB is not a really big file in
> > > itself and is not an issue for HDFS and MapReduce.
> > >
> > > There is no absolute way to know how much data a reducer will produce
> > > given its input because it depends on the implementation. In order to
> > > have a simple life cycle, each Reducer writes its own file. So if you
> > > want smaller files, you will need to increase the number of Reducers.
> > > (Same size + more files -> smaller files.) However, there is no way to
> > > have files with an exact size. One obvious reason is that you would
> > > need to break a record (key/value) across two files, and there is no
> > > reconciliation strategy for that. It does happen between blocks of a
> > > file, but the blocks of a file are ordered, so the RecordReader knows
> > > how to deal with it.
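> > >
> > > A rough sketch of that idea (hypothetical paths; PARALLEL only affects
> > > the reduce side, so the statement has to trigger a reduce phase):
> > >
> > > a = LOAD 'myfile.txt' USING PigStorage('\t') AS (line);
> > > b = ORDER a BY line PARALLEL 10;  -- 10 reducers -> 10 output part files
> > > STORE b INTO 'smaller-files' USING PigStorage('\t');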
> > >
> > > Bertrand
> > >
> > > On Mon, Jun 10, 2013 at 8:58 AM, Johnny Zhang <xiao...@cloudera.com> wrote:
> > >
> > > > Hi, Pedro:
> > > > Basically, how many output files you get depends on how many reducers
> > > > you have in your Pig job. So if the total result data size is 100MB
> > > > and you have 10 reducers, you will get 10 files of roughly 10MB each.
> > > > Bertrand's pointer is about specifying the number of reducers for your
> > > > Pig job.
> > > >
> > > > Johnny
> > > >
> > > >
> > > > On Sun, Jun 9, 2013 at 10:42 PM, Pedro Sá da Costa <psdc1...@gmail.com> wrote:
> > > >
> > > > > I don't understand why my purpose is not clear. The previous e-mails
> > > > > explain it very clearly. I want to split a single 500MB txt file in
> > > > > HDFS into multiple files using Pig Latin. Is it possible? E.g.,
> > > > >
> > > > > A = LOAD 'myfile.txt' USING PigStorage() AS (t);
> > > > > STORE A INTO 'multiplefiles' USING PigStorage(); -- and here create multiple files with a specific size
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > On 10 June 2013 07:29, Bertrand Dechoux <decho...@gmail.com> wrote:
> > > > >
> > > > > > The purpose is not really clear. But if you are looking for how to
> > > > > > specify multiple Reducer tasks, it is well explained in the
> > > > > > documentation:
> > > > > > http://pig.apache.org/docs/r0.11.1/perf.html#parallel
> > > > > >
> > > > > > You will get one file per reducer. It is up to you to specify the
> > > > > > right number, but be careful not to fall into the small files
> > > > > > problem in the end:
> > > > > > http://blog.cloudera.com/blog/2009/02/the-small-files-problem/
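> > > > > >
> > > > > > For example, a minimal sketch using the script-wide setting
> > > > > > (hypothetical paths; default_parallel only applies to operators
> > > > > > that run in the reduce phase):
> > > > > >
> > > > > > SET default_parallel 10;
> > > > > > A = LOAD 'myfile.txt' USING PigStorage() AS (t);
> > > > > > B = GROUP A BY t;                    -- reduce phase, 10 reducers
> > > > > > C = FOREACH B GENERATE FLATTEN(A);   -- back to the original rows
> > > > > > STORE C INTO 'multiplefiles' USING PigStorage(); -- ~10 part files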
> > > > > >
> > > > > > If you have a specific question about HDFS itself or Pig
> > > > > > optimisation, you should provide more explanation.
> > > > > > (64MB is the default block size for HDFS.)
> > > > > >
> > > > > > Regards
> > > > > >
> > > > > > Bertrand
> > > > > >
> > > > > >
> > > > > > On Mon, Jun 10, 2013 at 6:53 AM, Pedro Sá da Costa <psdc1...@gmail.com> wrote:
> > > > > >
> > > > > > > I said 64MB, but it can be 128MB, or 5KB. The number doesn't
> > > > > > > matter. I just want to extract data and put it into several files
> > > > > > > with a specific size. Basically, I am doing a cat on a big txt
> > > > > > > file, and I want to split the content into multiple files of a
> > > > > > > fixed size.
> > > > > > >
> > > > > > >
> > > > > > > On 7 June 2013 10:14, Johnny Zhang <xiao...@cloudera.com> wrote:
> > > > > > >
> > > > > > > > Pedro, you can try Piggybank MultiStorage, which splits results
> > > > > > > > into different dirs/files by a specific index attribute. But I
> > > > > > > > am not sure how it can make sure the file size is 64MB. Why
> > > > > > > > 64MB specifically? What's the connection between your data and
> > > > > > > > 64MB?
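> > > > > > > >
> > > > > > > > A rough sketch of that approach (hypothetical paths; records
> > > > > > > > are split into sub-directories by the value of field 0, not
> > > > > > > > by size):
> > > > > > > >
> > > > > > > > REGISTER piggybank.jar;
> > > > > > > > A = LOAD 'myfile.txt' USING PigStorage('\t') AS (key, value);
> > > > > > > > STORE A INTO 'bykey' USING org.apache.pig.piggybank.storage.MultiStorage('bykey', '0');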
> > > > > > > >
> > > > > > > > Johnny
> > > > > > > >
> > > > > > > >
> > > > > > > > On Fri, Jun 7, 2013 at 12:56 AM, Pedro Sá da Costa <psdc1...@gmail.com> wrote:
> > > > > > > >
> > > > > > > > > I am using the instruction:
> > > > > > > > >
> > > > > > > > > store A into 'result-australia-0' using PigStorage('\t');
> > > > > > > > >
> > > > > > > > > to store the data in HDFS. But the problem is that this
> > > > > > > > > creates one file of 500MB. Instead, I want to save several
> > > > > > > > > 64MB files. How do I do this?
> > > > > > > > >
> > > > > > > > > --
> > > > > > > > > Best regards,
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > Best regards,
> > > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Bertrand Dechoux
> > > > > >
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Best regards,
> > > > >
> > > >
> > >
> > >
> > >
> > > --
> > > Bertrand Dechoux
> > >
> >
> >
> >
> > --
> > Best regards,
> >
>



-- 
Bertrand Dechoux
