Hello Keren,

There is nothing wrong with this. One dataset in Hadoop is usually one folder,
not one file. Pig is doing what it is supposed to do and performing a
union of the two files. You would have seen the content of both files
together when you did dump C.

Since this is a map-only job and two mappers are generated, you get
two separate output files, which together form one complete dataset. If you
want just one file, you need to force a reduce phase so that all
the results land in a single output file.
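As a minimal sketch of forcing that reduce (reusing your relations, with the
output path 'res_single' as a placeholder name):

```pig
A = LOAD '1.txt'     USING PigStorage(' ') AS (x:int, y:chararray, z:chararray);
B = LOAD '1_ext.txt' USING PigStorage(' ') AS (x:int, y:chararray, z:chararray);
C = UNION A, B;
-- ORDER BY introduces a reduce phase; PARALLEL 1 forces a single reducer,
-- so all rows end up in one part-r-00000 file.
D = ORDER C BY x PARALLEL 1;
STORE D INTO 'res_single';
```

Note this also sorts the data, which you may or may not want. If you only need
one file on the client side, hadoop fs -getmerge res merged.txt will
concatenate the part files for you without changing the job.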

HTH

Warm Regards,
Tariq
cloudfront.blogspot.com


On Thu, Jul 25, 2013 at 11:31 AM, Keren Ouaknine <ker...@gmail.com> wrote:

> Hi,
>
> According to Pig's documentation on union, two relations with compatible
> schemas (same length, and types that can be implicitly cast) can be
> concatenated (see http://pig.apache.org/docs/r0.11.1/basic.html#union)
>
> However, when I try with:
> A = load '1.txt'          using PigStorage(' ')  as (x:int, y:chararray,
> z:chararray);
> B = load '1_ext.txt'  using PigStorage(' ')  as (a:int, b:chararray,
> c:chararray);
> C = union A, B;
> describe C;
> DUMP C;
> store C into '/home/kereno/Documents/pig-0.11.1/workspace/res';
>
> with:
> ~/Documents/pig-0.11.1/workspace 130$ more 1.txt 1_ext.txt
> ::::::::::::::
> 1.txt
> ::::::::::::::
> 1 a aleph
> 2 b bet
> 3 g gimel
> ::::::::::::::
> 1_ext.txt
> ::::::::::::::
> 0 a alpha
> 0 b beta
> 0 g gimel
>
>
> I get in result:~/Documents/pig-0.11.1/workspace 0$ more res/part-m-0000*
> ::::::::::::::
> res/part-m-00000
> ::::::::::::::
> 0 a alpha
> 0 b beta
> 0 g gimel
> ::::::::::::::
> res/part-m-00001
> ::::::::::::::
> 1 a aleph
> 2 b bet
> 3 g gimel
>
> Whereas I was expecting something like
> 0 a alpha
> 0 b beta
> 0 g gimel
> 1 a aleph
> 2 b bet
> 3 g gimel
>
> [all together]
>
> I understand that two files would be generated for non-matching schemas, but
> why for a union with a matching schema?
>
> Thanks,
> Keren
>
> --
> Keren Ouaknine
> Web: www.kereno.com
>
