Re: > and < Comparison Operators not working

2015-02-02 Thread Pradeep Gollakota
Explicit casting will work, though you shouldn't need to use it. You should
specify an input schema using the AS keyword. This will ensure that
PigStorage will load your data using the appropriate types.
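
For example (a minimal sketch, assuming f1 is the first column and should be
an int):

A = LOAD 'data' USING PigStorage(',') AS (f1:int, f2:chararray);
B = FOREACH A GENERATE f1;
C = FILTER B BY f1 > 20;
DUMP C;

(If you keep an untyped load instead, the explicit cast would look like
FILTER B BY (int)f1 > 20.)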

On Mon, Feb 2, 2015 at 7:22 AM, Arvind S  wrote:

> Use explicit casting during comparison
>
> Cheers !!!
> Arvind
> On 02-Feb-2015 8:39 pm, "Amit"  wrote:
>
> > Thanks for the response. The Pig script as such does not fail, it runs
> > successfully (trying in local mode), however when the run is finished it
> > does not dump any tuples. Has it something to do with the CSV where f1
> > is stored as a string? The CSV data would look like this:
> >
> > "10","abc"
> > "20","xyz"
> > "30","lmn"
> > ... etc.
> >
> > Thanks, Amit
> >
> >  On Monday, February 2, 2015 3:37 AM, Pradeep Gollakota <
> > pradeep...@gmail.com> wrote:
> >
> >
> >  Just to clarify, do you have a semicolon after f1 > 20?
> >
> > A = LOAD 'data' USING PigStorage(',');
> > B = FOREACH A GENERATE f1;
> > C = FILTER B BY f1 > 20;
> > DUMP C;
> >
> > This should be correct.
> > ​
> >
> > On Sun, Feb 1, 2015 at 4:50 PM, Amit  wrote:
> >
> > > Hello, I am trying to run an ad-hoc Pig script on the IBM Bluemix
> > > platform that has an arithmetic comparison. Suppose the data in f1 is
> > > 10, 20, 30, 40, ...
> > > Let us say I would like to select the records where f1 > 20. It is a
> > > pretty easy operation, however I am not sure why I cannot see the
> > > expected results there. The data is initially loaded from a CSV file.
> > > Here is my Pig script:
> > >
> > > A = << Load from CSV file >>
> > > B = FOREACH A generate f1;
> > > C = FILTER B by f1 > 20
> > > DUMP C;
> > >
> > > Appreciate it if someone points out what I am doing wrong here.
> > > I also tried to run this in local mode just to make sure I am doing
> > > this right.
> > > Regards, Amit
> >
> >
>


Re: > and < Comparison Operators not working

2015-02-02 Thread Pradeep Gollakota
Just to clarify, do you have a semicolon after f1 > 20?

A = LOAD 'data' USING PigStorage(',');
B = FOREACH A GENERATE f1;
C = FILTER B BY f1 > 20;
DUMP C;

This should be correct.
​

On Sun, Feb 1, 2015 at 4:50 PM, Amit  wrote:

> Hello, I am trying to run an ad-hoc Pig script on the IBM Bluemix platform
> that has an arithmetic comparison. Suppose the data in f1 is 10, 20, 30, 40, ...
> Let us say I would like to select the records where f1 > 20. It is a pretty
> easy operation, however I am not sure why I cannot see the expected results
> there. The data is initially loaded from a CSV file. Here is my Pig script:
>
> A = << Load from CSV file >>
> B = FOREACH A generate f1;
> C = FILTER B by f1 > 20
> DUMP C;
>
> Appreciate it if someone points out what I am doing wrong here.
> I also tried to run this in local mode just to make sure I am doing this
> right.
> Regards, Amit


Re: solr indexing using pig script

2015-01-16 Thread Pradeep Gollakota
Actually, there is one more option. You could copy the code of the
LoadStoreFunc and modify it to push the collection name from a config
property into the location URL. But this is more involved, engineering-wise,
than splitting the job into two scripts.

It's up to you.

On Thu, Jan 15, 2015 at 11:59 PM, Pradeep Gollakota 
wrote:

> It looks like your only option then is to use two separate scripts. It's
> not ideal because you have twice the I/O, but it should work.
>
> P.S. Make sure to reply all so the list is kept in the loop.
> On Jan 15, 2015 11:41 PM, "Vishnu Viswanath" 
> wrote:
>
>> Thanks Pradeep for the suggestion.
>>
>> I am using zookeeper to store into SOLR. So my location is the zookeeper
>> server. I followed this link for doing the same:
>> https://docs.lucidworks.com/plugins/servlet/mobile#content/view/24380610
>>
>> Is there a better way of doing it if I am using zookeeper?
>>
>> Regards,
>> Vishnu Viswanath
>>
>>
>> > On 16-Jan-2015, at 12:34, Pradeep Gollakota 
>> wrote:
>> >
>> > Just out of curiosity, why are you using SET to set the solr collection?
>> > I'm not sure if you're using an out of the box Load/Store Func, but if I
>> > were to design it, I would use the "location" of a Load/Store Func to
>> > specify which solr collection to write to.
>> >
>> > Is it possible for you to redesign this way?
>> >
>> > On Thu, Jan 15, 2015 at 9:41 PM, Vishnu Viswanath <
>> > vishnu.viswanat...@gmail.com> wrote:
>> >
>> >> Thanks
>> >>
>> >> SET sets the SOLR collection name. When the STORE is invoked, the data
>> >> will be ingested into the collection name set before.
>> >>
>> >> So, the problem must be because  the second set is overriding the
>> >> collection name and the STORE is failing.
>> >>
>> >> Is there any way to overcome this? Because most of the processing time
>> is
>> >> taken in the load and I don't want to do it twice.
>> >>
>> >> Regards,
>> >> Vishnu Viswanath
>> >>
>> >>> On 16-Jan-2015, at 09:29, Cheolsoo Park  wrote:
>> >>>
>> >>> What does "SET" do for Solr? Pig pre-processes all the set commands in
>> >> the
>> >>> entire script before executing any query, and values are overwritten
>> if
>> >> the
>> >>> same key is set more than once. In your example, you have two set
>> >> commands.
>> >>> If you're thinking that different values will be applied in each
>> section,
>> >>> that's not the case. e) will overwrite a).
>> >>>
>> >>>
>> >>> On Thu, Jan 15, 2015 at 7:46 PM, Vishnu Viswanath <
>> >>> vishnu.viswanat...@gmail.com> wrote:
>> >>>
>> >>>> Hi All,
>> >>>>
> >>>> I am indexing data into solr using a pig script.
>> >>>> I have two such scripts, and I tried combining these two scripts
>> into a
>> >>>> single one.
>> >>>>
>> >>>> i.e., i have script 1 that does
>> >>>> 
>> >>>> a)SET solr collection info for collection 1
>> >>>> b)LOAD data
>> >>>> c)FILTER data for SOLR collection number 1
>> >>>> d)STORE data to solr
>> >>>>
>> >>>>
>> >>>> and script 2 that does
>> >>>> ---
>> >>>> a)SET solr collection info for collection 2
>> >>>> b)LOAD data
>> >>>> c)FILTER data for SOLR collection number 2
>> >>>> d)STORE data to solr
>> >>>>
>> >>>>
>> >>>> combined script looks something like
>> >>>> --
>> >>>> a)SET solr collection info for collection 1
>> >>>> b)LOAD data
>> >>>> c)FILTER data from (b) for SOLR collection number 1
>> >>>> d)STORE data to solr
>> >>>> e)SET solr collection info for collection 2
>> >>>> f)FILTER data from (b) for SOLR collection number 2
>> >>>> g)STORE data to solr
>> >>>>
>> >>>> But the store function fails when I run the combined script where as
>> it
>> >>>> runs fine if I run scripts 1 and 2 separately.
>> >>>>
>> >>>> Any idea?
>> >>>>
>> >>>> Regards,
>> >>>> Vishnu
>> >
>>
>


Re: solr indexing using pig script

2015-01-16 Thread Pradeep Gollakota
It looks like your only option then is to use two separate scripts. It's
not ideal because you have twice the I/O, but it should work.

P.S. Make sure to reply all so the list is kept in the loop.
On Jan 15, 2015 11:41 PM, "Vishnu Viswanath" 
wrote:

> Thanks Pradeep for the suggestion.
>
> I am using zookeeper to store into SOLR. So my location is the zookeeper
> server. I followed this link for doing the same:
> https://docs.lucidworks.com/plugins/servlet/mobile#content/view/24380610
>
> Is there a better way of doing it if I am using zookeeper?
>
> Regards,
> Vishnu Viswanath
>
>
> > On 16-Jan-2015, at 12:34, Pradeep Gollakota 
> wrote:
> >
> > Just out of curiosity, why are you using SET to set the solr collection?
> > I'm not sure if you're using an out of the box Load/Store Func, but if I
> > were to design it, I would use the "location" of a Load/Store Func to
> > specify which solr collection to write to.
> >
> > Is it possible for you to redesign this way?
> >
> > On Thu, Jan 15, 2015 at 9:41 PM, Vishnu Viswanath <
> > vishnu.viswanat...@gmail.com> wrote:
> >
> >> Thanks
> >>
> >> SET sets the SOLR collection name. When the STORE is invoked, the data
> >> will be ingested into the collection name set before.
> >>
> >> So, the problem must be because  the second set is overriding the
> >> collection name and the STORE is failing.
> >>
> >> Is there any way to overcome this? Because most of the processing time
> is
> >> taken in the load and I don't want to do it twice.
> >>
> >> Regards,
> >> Vishnu Viswanath
> >>
> >>> On 16-Jan-2015, at 09:29, Cheolsoo Park  wrote:
> >>>
> >>> What does "SET" do for Solr? Pig pre-processes all the set commands in
> >> the
> >>> entire script before executing any query, and values are overwritten if
> >> the
> >>> same key is set more than once. In your example, you have two set
> >> commands.
> >>> If you're thinking that different values will be applied in each
> section,
> >>> that's not the case. e) will overwrite a).
> >>>
> >>>
> >>> On Thu, Jan 15, 2015 at 7:46 PM, Vishnu Viswanath <
> >>> vishnu.viswanat...@gmail.com> wrote:
> >>>
> >>>> Hi All,
> >>>>
> >>>> I am indexing data into solr using a pig script.
> >>>> I have two such scripts, and I tried combining these two scripts into
> a
> >>>> single one.
> >>>>
> >>>> i.e., i have script 1 that does
> >>>> 
> >>>> a)SET solr collection info for collection 1
> >>>> b)LOAD data
> >>>> c)FILTER data for SOLR collection number 1
> >>>> d)STORE data to solr
> >>>>
> >>>>
> >>>> and script 2 that does
> >>>> ---
> >>>> a)SET solr collection info for collection 2
> >>>> b)LOAD data
> >>>> c)FILTER data for SOLR collection number 2
> >>>> d)STORE data to solr
> >>>>
> >>>>
> >>>> combined script looks something like
> >>>> --
> >>>> a)SET solr collection info for collection 1
> >>>> b)LOAD data
> >>>> c)FILTER data from (b) for SOLR collection number 1
> >>>> d)STORE data to solr
> >>>> e)SET solr collection info for collection 2
> >>>> f)FILTER data from (b) for SOLR collection number 2
> >>>> g)STORE data to solr
> >>>>
> >>>> But the store function fails when I run the combined script where as
> it
> >>>> runs fine if I run scripts 1 and 2 separately.
> >>>>
> >>>> Any idea?
> >>>>
> >>>> Regards,
> >>>> Vishnu
> >
>


Re: Is there a way to run one test of a specific unit test?

2015-01-15 Thread Pradeep Gollakota
If you're using Maven AND the Surefire plugin 2.7.3+ AND JUnit 4, then you
can do this by specifying -Dtest=TestClass#methodName

ref:
http://maven.apache.org/surefire/maven-surefire-plugin/examples/single-test.html
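
For example (the test class and method names here are hypothetical):

mvn test -Dtest=TestBuiltinFunctions#testUpper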

On Thu, Jan 15, 2015 at 8:02 PM, Cheolsoo Park  wrote:

> I don't think you can disable test cases on the fly in JUnit. You will need
> to add @Ignore annotation and recompile the test file. Correct me if I am
> wrong.
>
> On Thu, Jan 15, 2015 at 6:55 PM, lulynn_2008  wrote:
>
> > Hi All,
> >
> > There are multiple tests in one Test* file. Is there a way to just run
> > only one pointed test?
> >
> > Thanks
> >
>


Re: solr indexing using pig script

2015-01-15 Thread Pradeep Gollakota
Just out of curiosity, why are you using SET to set the solr collection?
I'm not sure if you're using an out of the box Load/Store Func, but if I
were to design it, I would use the "location" of a Load/Store Func to
specify which solr collection to write to.

Is it possible for you to redesign this way?
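
For example, something along these lines (sketch only -- the loader/storer
classes, the filter field, and the location string format are all made up
here for illustration; your actual Load/Store Func and ZooKeeper connect
string will differ):

raw   = LOAD 'input' USING SomeLoader();
coll1 = FILTER raw BY collection_field == 'collection1';
coll2 = FILTER raw BY collection_field == 'collection2';
STORE coll1 INTO 'zkhost:2181/collection1' USING com.example.SolrStorer();
STORE coll2 INTO 'zkhost:2181/collection2' USING com.example.SolrStorer();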

On Thu, Jan 15, 2015 at 9:41 PM, Vishnu Viswanath <
vishnu.viswanat...@gmail.com> wrote:

> Thanks
>
> SET sets the SOLR collection name. When the STORE is invoked, the data
> will be ingested into the collection name set before.
>
> So, the problem must be because  the second set is overriding the
> collection name and the STORE is failing.
>
> Is there any way to overcome this? Because most of the processing time is
> taken in the load and I don't want to do it twice.
>
> Regards,
> Vishnu Viswanath
>
> > On 16-Jan-2015, at 09:29, Cheolsoo Park  wrote:
> >
> > What does "SET" do for Solr? Pig pre-processes all the set commands in
> the
> > entire script before executing any query, and values are overwritten if
> the
> > same key is set more than once. In your example, you have two set
> commands.
> > If you're thinking that different values will be applied in each section,
> > that's not the case. e) will overwrite a).
> >
> >
> > On Thu, Jan 15, 2015 at 7:46 PM, Vishnu Viswanath <
> > vishnu.viswanat...@gmail.com> wrote:
> >
> >> Hi All,
> >>
> >> I am indexing data into solr using a pig script.
> >> I have two such scripts, and I tried combining these two scripts into a
> >> single one.
> >>
> >> i.e., i have script 1 that does
> >> 
> >> a)SET solr collection info for collection 1
> >> b)LOAD data
> >> c)FILTER data for SOLR collection number 1
> >> d)STORE data to solr
> >>
> >>
> >> and script 2 that does
> >> ---
> >> a)SET solr collection info for collection 2
> >> b)LOAD data
> >> c)FILTER data for SOLR collection number 2
> >> d)STORE data to solr
> >>
> >>
> >> combined script looks something like
> >> --
> >> a)SET solr collection info for collection 1
> >> b)LOAD data
> >> c)FILTER data from (b) for SOLR collection number 1
> >> d)STORE data to solr
> >> e)SET solr collection info for collection 2
> >> f)FILTER data from (b) for SOLR collection number 2
> >> g)STORE data to solr
> >>
> >> But the store function fails when I run the combined script where as it
> >> runs fine if I run scripts 1 and 2 separately.
> >>
> >> Any idea?
> >>
> >> Regards,
> >> Vishnu
> >>
>


Re: Help with Pig UDF?

2014-12-05 Thread Pradeep Gollakota
A static variable is not necessary... a simple instance variable is just
fine.
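
Something along these lines, for example (a rough, untested sketch that keeps
the same Pig/Tika classes the original UDF already uses, and folds in the
StringBuilder change from the earlier reply):

import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.DataByteArray;
import org.apache.pig.data.Tuple;
import org.apache.tika.detect.DefaultDetector;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

public class ExtractTextFromPDFs extends EvalFunc<String> {

  // created once per UDF instance and reused for every input tuple
  private AutoDetectParser pdfParser;

  @Override
  public String exec(Tuple input) throws IOException {
    if (input == null || input.size() == 0 || input.get(0) == null) {
      return "N/A";
    }
    if (pdfParser == null) {
      pdfParser = new AutoDetectParser(new DefaultDetector());
    }

    DataByteArray dba = (DataByteArray) input.get(0);
    StringBuilder pdfText = new StringBuilder();
    pdfText.append(dba.size()).append(" : ");

    try (InputStream is = new ByteArrayInputStream(dba.get())) {
      ContentHandler contentHandler = new BodyContentHandler();
      pdfParser.parse(is, contentHandler, new Metadata(), new ParseContext());
      pdfText.append(contentHandler.toString());
    } catch (SAXException | TikaException e) {
      throw new IOException("Failed to extract text from PDF", e);
    }
    return pdfText.toString();
  }
}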

On Fri Dec 05 2014 at 2:27:53 PM Ryan  wrote:

> After running it with updated code, it seems like the problem has to do
> with something related to Tika since my output says that my input is the
> correct number of bytes (i.e. it's actually being sent in correctly). Going
> to test further to narrow down the problem.
>
> Pradeep, would you recommend using a static variable inside the
> ExtractTextFromPDFs function to store the PdfParser once it has been
> initialized once? I'm still learning how to best do things within the
> Pig/MapReduce/Hadoop framework
>
> Ryan
>
> On Fri, Dec 5, 2014 at 1:35 PM, Ryan 
> wrote:
>
> > Thanks Pradeep! I'll give it a try and report back
> >
> > Ryan
> >
> > On Fri, Dec 5, 2014 at 12:30 PM, Pradeep Gollakota  >
> > wrote:
> >
> >> I forgot to mention earlier that you should probably move the PdfParser
> >> initialization code out of the evaluate method. This will probably
> cause a
> >> significant overhead both in terms of gc and runtime performance. You'll
> >> want to initialize your parser once and evaluate all your docs against
> it.
> >>
> >> - Pradeep
> >>
> >> On Fri Dec 05 2014 at 9:18:16 AM Pradeep Gollakota <
> pradeep...@gmail.com>
> >> wrote:
> >>
> >> > Java string's are immutable. So "pdfText.concat()" returns a new
> string
> >> > and the original string is left unmolested. So at the end, all you're
> >> doing
> >> > is returning an empty string. Instead, you can do "pdfText =
> >> > pdfText.concat(...)". But the better way to write it is to use a
> >> > StringBuilder.
> >> >
> >> > StringBuilder pdfText = ...;
> >> > pdfText.append(...);
> >> > pdfText.append(...);
> >> > ...
> >> > return pdfText.toString();
> >> >
> >> > On Fri Dec 05 2014 at 9:12:37 AM Ryan 
> >> > wrote:
> >> >
> >> >> Hi,
> >> >>
> >> >> I'm working on an open source project attempting to convert raw
> content
> >> >> from a pdf (stored as a databytearray) into plain text using a Pig
> UDF
> >> and
> >> >> Apache Tika. I could use your help. For some reason, the UDF I'm
> using
> >> >> isn't working. The script succeeds but no output is written. *This is
> >> the
> >> >> Pig script I'm following:*
> >> >>
> >> >> register 'target/warcbase-0.1.0-SNAPSHOT-fatjar.jar';
> >> >> DEFINE ExtractTextFromPDFs
> >> >>  org.warcbase.pig.piggybank.ExtractTextFromPDFs();
> >> >> DEFINE ArcLoader org.warcbase.pig.ArcLoader();
> >> >>
> >> >> raw = load '/data/arc/sample.arc' using ArcLoader as (url: chararray,
> >> >> date:
> >> >> chararray, mime: chararray, content: bytearray); --load the data
> >> >>
> >> >> a = FILTER raw BY (url matches '.*\\.pdf$');  --gets all PDF pages
> from
> >> >> the
> >> >> arc file
> >> >> b = LIMIT a 2; --limit to 2 pages to speed up testing time
> >> >> c = foreach b generate url, ExtractTextFromPDFs(content);
> >> >> store c into 'output/pdf_test';
> >> >>
> >> >>
> >> >> *This is the UDF I wrote:*
> >> >>
> >> >> public class ExtractTextFromPDFs extends EvalFunc {
> >> >>
> >> >>   @Override
> >> >>   public String exec(Tuple input) throws IOException {
> >> >>   String pdfText = "";
> >> >>
> >> >>   if (input == null || input.size() == 0 || input.get(0) ==
> null) {
> >> >>   return "N/A";
> >> >>   }
> >> >>
> >> >>   DataByteArray dba = (DataByteArray)input.get(0);
> >> >>   pdfText.concat(String.valueOf(dba.size())); //my attempt at
> >> >> debugging. Nothing written
> >> >>
> >> >>   InputStream is = new ByteArrayInputStream(dba.get());
> >> >>
> >> >>   ContentHandler contenthandler = new BodyContentHandler();
> >> >>   Metadata metadata = new Metadata();
> >> >>   DefaultDetector detector = new DefaultDetector();
> >> >>   AutoDetectParser pdfparser = new AutoDetectParser(detector);
> >> >>
> >> >>   try {
> >> >> pdfparser.parse(is, contenthandler, metadata, new
> >> ParseContext());
> >> >>   } catch (SAXException | TikaException e) {
> >> >> // TODO Auto-generated catch block
> >> >> e.printStackTrace();
> >> >>   }
> >> >>   pdfText.concat(" : "); //another attempt at debugging. Still
> >> nothing
> >> >> written
> >> >>   pdfText.concat(contenthandler.toString());
> >> >>
> >> >>   //close the input stream
> >> >>   if(is != null){
> >> >> is.close();
> >> >>   }
> >> >>   return pdfText;
> >> >>   }
> >> >>
> >> >> }
> >> >>
> >> >> Thank you for your assistance,
> >> >> Ryan
> >> >>
> >> >
> >>
> >
> >
>


Re: Help with Pig UDF?

2014-12-05 Thread Pradeep Gollakota
I forgot to mention earlier that you should probably move the PdfParser
initialization code out of the exec method. Leaving it there will probably
cause significant overhead, both in terms of GC and runtime performance.
You'll want to initialize your parser once and evaluate all your docs against
it.

- Pradeep

On Fri Dec 05 2014 at 9:18:16 AM Pradeep Gollakota 
wrote:

> Java string's are immutable. So "pdfText.concat()" returns a new string
> and the original string is left unmolested. So at the end, all you're doing
> is returning an empty string. Instead, you can do "pdfText =
> pdfText.concat(...)". But the better way to write it is to use a
> StringBuilder.
>
> StringBuilder pdfText = ...;
> pdfText.append(...);
> pdfText.append(...);
> ...
> return pdfText.toString();
>
> On Fri Dec 05 2014 at 9:12:37 AM Ryan 
> wrote:
>
>> Hi,
>>
>> I'm working on an open source project attempting to convert raw content
>> from a pdf (stored as a databytearray) into plain text using a Pig UDF and
>> Apache Tika. I could use your help. For some reason, the UDF I'm using
>> isn't working. The script succeeds but no output is written. *This is the
>> Pig script I'm following:*
>>
>> register 'target/warcbase-0.1.0-SNAPSHOT-fatjar.jar';
>> DEFINE ExtractTextFromPDFs
>>  org.warcbase.pig.piggybank.ExtractTextFromPDFs();
>> DEFINE ArcLoader org.warcbase.pig.ArcLoader();
>>
>> raw = load '/data/arc/sample.arc' using ArcLoader as (url: chararray,
>> date:
>> chararray, mime: chararray, content: bytearray); --load the data
>>
>> a = FILTER raw BY (url matches '.*\\.pdf$');  --gets all PDF pages from
>> the
>> arc file
>> b = LIMIT a 2; --limit to 2 pages to speed up testing time
>> c = foreach b generate url, ExtractTextFromPDFs(content);
>> store c into 'output/pdf_test';
>>
>>
>> *This is the UDF I wrote:*
>>
>> public class ExtractTextFromPDFs extends EvalFunc {
>>
>>   @Override
>>   public String exec(Tuple input) throws IOException {
>>   String pdfText = "";
>>
>>   if (input == null || input.size() == 0 || input.get(0) == null) {
>>   return "N/A";
>>   }
>>
>>   DataByteArray dba = (DataByteArray)input.get(0);
>>   pdfText.concat(String.valueOf(dba.size())); //my attempt at
>> debugging. Nothing written
>>
>>   InputStream is = new ByteArrayInputStream(dba.get());
>>
>>   ContentHandler contenthandler = new BodyContentHandler();
>>   Metadata metadata = new Metadata();
>>   DefaultDetector detector = new DefaultDetector();
>>   AutoDetectParser pdfparser = new AutoDetectParser(detector);
>>
>>   try {
>> pdfparser.parse(is, contenthandler, metadata, new ParseContext());
>>   } catch (SAXException | TikaException e) {
>> // TODO Auto-generated catch block
>> e.printStackTrace();
>>   }
>>   pdfText.concat(" : "); //another attempt at debugging. Still nothing
>> written
>>   pdfText.concat(contenthandler.toString());
>>
>>   //close the input stream
>>   if(is != null){
>> is.close();
>>   }
>>   return pdfText;
>>   }
>>
>> }
>>
>> Thank you for your assistance,
>> Ryan
>>
>


Re: Help with Pig UDF?

2014-12-05 Thread Pradeep Gollakota
Java strings are immutable. So "pdfText.concat()" returns a new string and
the original string is left unmolested. So at the end, all you're doing is
returning an empty string. Instead, you can do "pdfText =
pdfText.concat(...)". But the better way to write it is to use a
StringBuilder.

StringBuilder pdfText = ...;
pdfText.append(...);
pdfText.append(...);
...
return pdfText.toString();

On Fri Dec 05 2014 at 9:12:37 AM Ryan  wrote:

> Hi,
>
> I'm working on an open source project attempting to convert raw content
> from a pdf (stored as a databytearray) into plain text using a Pig UDF and
> Apache Tika. I could use your help. For some reason, the UDF I'm using
> isn't working. The script succeeds but no output is written. *This is the
> Pig script I'm following:*
>
> register 'target/warcbase-0.1.0-SNAPSHOT-fatjar.jar';
> DEFINE ExtractTextFromPDFs
>  org.warcbase.pig.piggybank.ExtractTextFromPDFs();
> DEFINE ArcLoader org.warcbase.pig.ArcLoader();
>
> raw = load '/data/arc/sample.arc' using ArcLoader as (url: chararray, date:
> chararray, mime: chararray, content: bytearray); --load the data
>
> a = FILTER raw BY (url matches '.*\\.pdf$');  --gets all PDF pages from the
> arc file
> b = LIMIT a 2; --limit to 2 pages to speed up testing time
> c = foreach b generate url, ExtractTextFromPDFs(content);
> store c into 'output/pdf_test';
>
>
> *This is the UDF I wrote:*
>
> public class ExtractTextFromPDFs extends EvalFunc {
>
>   @Override
>   public String exec(Tuple input) throws IOException {
>   String pdfText = "";
>
>   if (input == null || input.size() == 0 || input.get(0) == null) {
>   return "N/A";
>   }
>
>   DataByteArray dba = (DataByteArray)input.get(0);
>   pdfText.concat(String.valueOf(dba.size())); //my attempt at
> debugging. Nothing written
>
>   InputStream is = new ByteArrayInputStream(dba.get());
>
>   ContentHandler contenthandler = new BodyContentHandler();
>   Metadata metadata = new Metadata();
>   DefaultDetector detector = new DefaultDetector();
>   AutoDetectParser pdfparser = new AutoDetectParser(detector);
>
>   try {
> pdfparser.parse(is, contenthandler, metadata, new ParseContext());
>   } catch (SAXException | TikaException e) {
> // TODO Auto-generated catch block
> e.printStackTrace();
>   }
>   pdfText.concat(" : "); //another attempt at debugging. Still nothing
> written
>   pdfText.concat(contenthandler.toString());
>
>   //close the input stream
>   if(is != null){
> is.close();
>   }
>   return pdfText;
>   }
>
> }
>
> Thank you for your assistance,
> Ryan
>


Re: Error during filter - difference between . and ::

2014-10-14 Thread Pradeep Gollakota
The "." is a dereference operator. This is used for look into complex data
types. See http://pig.apache.org/docs/r0.12.0/basic.html#deref

The "::" is a disambiguation operator. When performing a join, you may have
fields that are named the same in the relations that were joined. In order
to tell pig which relation to get the field from, you need to use the
disambiguation operator. This is not necessary if the field is only in one
relation. See http://pig.apache.org/docs/r0.12.0/basic.html#disambiguate
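
A quick illustration with made-up relations:

users  = LOAD 'users'  AS (id:int, name:chararray);
orders = LOAD 'orders' AS (user_id:int, id:int, amount:double);

j  = JOIN users BY id, orders BY user_id;
-- '::' says which input relation the duplicated field name 'id' came from
j2 = FOREACH j GENERATE users::id, orders::amount;

g  = GROUP orders BY (user_id, id);
-- '.' reaches inside complex types: the 'group' tuple and the 'orders' bag
g2 = FOREACH g GENERATE group.user_id, SUM(orders.amount);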

Hope this helps.

- Pradeep

On Tue, Oct 14, 2014 at 3:30 AM, Jakub Stransky 
wrote:

> Hello experienced users,
>
> I am relatively new to pig and I came across to one thing I do not fully
> understand. I have following script:
>
> dirtydata = load '/data/120422' using AvroStorage();
>
> sodtr = filter dirtydata by TransactionBlockNumber == 1;
> sto   = foreach sodtr generate Dob.Value as Dob,StoreId,
> Created.UnixUtcTime;
> g = GROUP sto BY  (Dob,StoreId);
> sodtime = FOREACH g GENERATE group.Dob AS Dob, group.StoreId as StoreId,
> MAX(sto.UnixUtcTime) AS latestStartOfDayTime;
>
> joined = join dirtydata by (Dob.Value, StoreId) LEFT OUTER, sodtime by
> (Dob, StoreId);
>
> cleandata = filter joined by dirtydata.Created.UnixUtcTime >=
> sodtime.latestStartOfDayTime;
> dump cleandata
>
> I am getting the following error:
>
>
>  ERROR 0: Exception while executing (Name: joined: Local
> Rearrange[tuple]{tuple}(false) - scope-166 Operator Key: scope-166):
> org.apache.pig.backend.executionengine.ExecException: ERROR 0: Exception
> while executing [POProject (Name: Project[long][0] - scope-152 Operator
> Key: scope-152) children: null at []]:
> org.apache.pig.backend.executionengine.ExecException: *ERROR 0: Scalar has
> more than one row in the output.* 1st :
>
> (1,(20120422),64619,2164,{(((20120422),64619,2164,(1335120734,-300),2,),{},(false,840),{},{(00200079----,((1,LUNCH),(2097271,(2097271,WL
>
> 119),false),{(,(1335120734,-300),CheckPrint)},{},((0),PerGroup),20121,(3,Coffee
> Bar),),((34.57),(36.02)),{},{},{},{},{},{},{})},{})},(1412864847,-300)),
> 2nd
>
> :(1,(20120422),64619,1,{(((20120422),64619,1,(1335088853,-300),3,),{},(false,840),{},{},{({(ClockedIn,(1335088800,-300),(-62135596800,0),(1),{(0,(11),false)},0,(4,Baker),{},false)},(511,Roger
> Baeza-Vasquez))})},(1412864846,-300))
> 2014-10-14 05:28:25,165 [main] ERROR
> org.apache.pig.tools.pigstats.PigStatsUtil - 1 map reduce job(s) failed!
>
> When I change following relation:
> cleandata = filter joined by dirtydata*::*Created.UnixUtcTime >=
> sodtime.latestStartOfDayTime;
>
> Then all works fine. It seems like a mystery to me, because I would expect
> that I need to do the same for the sodtime relation. I am missing something
> here. Could someone please shed some light on it?
>
> Thanks a lot
> Jakub
>


Re: Is there a way to indicate that the data is sorted for a group-by operation?

2014-10-13 Thread Pradeep Gollakota
This is a great question!

I could be wrong, but I don't believe there is a way to indicate this for a
group-by. It definitely does matter for performance if your input is globally
sorted. Currently, a group-by happens on the reduce side, but if the input is
globally sorted it could happen on the map side for a significant performance
boost.

I did see a CollectableLoadFunc interface that's used in the MergeJoin
algorithm... I don't see why it couldn't be used for a map-side group-by as
well.
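
For what it's worth, Pig does expose that path for loaders that implement
CollectableLoadFunc: the 'collected' hint on GROUP (analogous to the 'merge'
join hint). A sketch, assuming a hypothetical loader that implements the
interface and data laid out so that all records for a key arrive in one split:

A = LOAD 'input' USING com.example.MyCollectableLoader() AS (k:chararray, v:int);
G = GROUP A BY k USING 'collected';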

On Sun, Oct 12, 2014 at 11:48 PM, Sunil S Nandihalli <
sunil.nandiha...@gmail.com> wrote:

> Hi Everybody,
>  Is there a way to indicate that the data is sorted by the key using which
> the relations are being grouped? Or does it even matter for performance
> whether we indicate it or not?
> Thanks,
> Sunil.
>


Re: Optimizing Pig script

2014-10-06 Thread Pradeep Gollakota
It looks like the best option at this point is to write a custom UDF that
loads a set of regular expressions from a file and runs the data through all
of them.
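
One way the pieces could fit together (the UDF name, its constructor argument,
and the regex-id convention are all hypothetical here -- the UDF itself still
has to be written):

DEFINE FirstMatchingRegex com.example.pig.FirstMatchingRegex('hdfs:///path/to/regexes.txt');

Tagged = FOREACH BagName GENERATE *, FirstMatchingRegex($0) AS regex_id;
SPLIT Tagged INTO Filtered_Data_1 IF regex_id == 1,
                  Filtered_Data_2 IF regex_id == 2,
                  Filtered_Data_3 IF regex_id == 3;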

On Mon, Oct 6, 2014 at 1:44 PM, Ankur Kasliwal 
wrote:

> Thanks for replying everyone. Few comments to everyone's suggestion.
>
> 1>  I am processing sequence file which consist of many CSV files. I need
> to extract only few among all CSV'S. So that is the reason I am doing 
> 'SelectFieldByValue'
> which is file name in my case not by field directly.
>
> 2>  All selected files ( different RegEx ) are stored in HDFS separately.
> So one STORE statement for each extracted file in a bag.
>
> 3>  Cannot  do cross join as all files input will get combined, do not
> want to do that.
>
> 4>  Cannot do AND/OR operator as i need different bags for each selected
> file ( RegEx).
>
>
>
> Let me know if any one has any other suggestions.
> Sorry for not being clear with specification at first place.
>
> Thanks.
>
> On Mon, Oct 6, 2014 at 4:12 PM, Pradeep Gollakota 
> wrote:
>
>> In case you haven't seen this already, take a look at
>> http://pig.apache.org/docs/r0.13.0/perf.html for some basic strategies on
>> optimizing your pig scripts.
>>
>> On Mon, Oct 6, 2014 at 1:08 PM, Russell Jurney 
>> wrote:
>>
>> > Actually, I don't think you need SelectFieldByValue. Just use the name
>> of
>> > the field directly.
>> >
>> > On Monday, October 6, 2014, Prashant Kommireddi 
>> > wrote:
>> >
>> > > Are these regex static? If yes, this is easily achieved with embedding
>> > your
>> > > script in Java or any other language that Pig supports
>> > > http://pig.apache.org/docs/r0.13.0/cont.html
>> > >
>> > > You could also possibly write a UDF that loops through all the regex
>> and
>> > > returns result.
>> > >
>> > >
>> > >
>> > > On Mon, Oct 6, 2014 at 12:44 PM, Ankur Kasliwal <
>> > > ankur.kasliwal...@gmail.com 
>> > > > wrote:
>> > >
>> > > > Hi,
>> > > >
>> > > >
>> > > >
>> > > > I have written a ‘Pig Script’ which is processing Sequence files
>> given
>> > as
>> > > > input.
>> > > >
>> > > > It is working fine but there is one problem mentioned below.
>> > > >
>> > > >
>> > > >
>> > > > I have repetitive statements in my pig script,  as shown below:
>> > > >
>> > > >
>> > > >
>> > > >
>> > > >
>> > > >-  Filtered_Data _1= FILTER BagName BY ($0 matches 'RegEx-1');
>> > > >-  Filtered_Data_2 = FILTER BagName BY ($0 matches 'RegEx-2');
>> > > >-  Filtered_Data_3 = FILTER BagName BY ($0 matches 'RegEx-3');
>> > > >- So on…
>> > > >
>> > > >
>> > > >
>> > > > Question :
>> > > >
>> > > > So is there any way by which I can have above statement written once
>> > and
>> > > >
>> > > > then loop through all possible “RegEx” and substitute in Pig script.
>> > > >
>> > > >
>> > > >
>> > > > For Example:
>> > > >
>> > > >
>> > > > Filtered_Data _X  =   FILTER BagName BY ($0 matches 'RegEx');  (
>> have
>> > > this
>> > > > statement once )
>> > > >
>> > > > ( loop through all possible RegEx and substitute value in the
>> > statement )
>> > > >
>> > > >
>> > > >
>> > > > Right now I am calling Pig script from a shell script, so any way
>> from
>> > > > shell script will be also be welcome..
>> > > >
>> > > >
>> > > >
>> > > > Thanks in advance.
>> > > >
>> > > > Happy Pigging
>> > > >
>> > >
>> >
>> >
>> > --
>> > Russell Jurney twitter.com/rjurney russell.jur...@gmail.com
>> > datasyndrome.com
>> >
>>
>
>


Re: Optimizing Pig script

2014-10-06 Thread Pradeep Gollakota
In case you haven't seen this already, take a look at
http://pig.apache.org/docs/r0.13.0/perf.html for some basic strategies on
optimizing your pig scripts.

On Mon, Oct 6, 2014 at 1:08 PM, Russell Jurney 
wrote:

> Actually, I don't think you need SelectFieldByValue. Just use the name of
> the field directly.
>
> On Monday, October 6, 2014, Prashant Kommireddi 
> wrote:
>
> > Are these regex static? If yes, this is easily achieved with embedding
> your
> > script in Java or any other language that Pig supports
> > http://pig.apache.org/docs/r0.13.0/cont.html
> >
> > You could also possibly write a UDF that loops through all the regex and
> > returns result.
> >
> >
> >
> > On Mon, Oct 6, 2014 at 12:44 PM, Ankur Kasliwal <
> > ankur.kasliwal...@gmail.com 
> > > wrote:
> >
> > > Hi,
> > >
> > >
> > >
> > > I have written a ‘Pig Script’ which is processing Sequence files given
> as
> > > input.
> > >
> > > It is working fine but there is one problem mentioned below.
> > >
> > >
> > >
> > > I have repetitive statements in my pig script,  as shown below:
> > >
> > >
> > >
> > >
> > >
> > >-  Filtered_Data _1= FILTER BagName BY ($0 matches 'RegEx-1');
> > >-  Filtered_Data_2 = FILTER BagName BY ($0 matches 'RegEx-2');
> > >-  Filtered_Data_3 = FILTER BagName BY ($0 matches 'RegEx-3');
> > >- So on…
> > >
> > >
> > >
> > > Question :
> > >
> > > So is there any way by which I can have above statement written once
> and
> > >
> > > then loop through all possible “RegEx” and substitute in Pig script.
> > >
> > >
> > >
> > > For Example:
> > >
> > >
> > > Filtered_Data _X  =   FILTER BagName BY ($0 matches 'RegEx');  ( have
> > this
> > > statement once )
> > >
> > > ( loop through all possible RegEx and substitute value in the
> statement )
> > >
> > >
> > >
> > > Right now I am calling Pig script from a shell script, so any way from
> > > shell script will be also be welcome..
> > >
> > >
> > >
> > > Thanks in advance.
> > >
> > > Happy Pigging
> > >
> >
>
>
> --
> Russell Jurney twitter.com/rjurney russell.jur...@gmail.com
> datasyndrome.com
>


Re: Optimizing Pig script

2014-10-06 Thread Pradeep Gollakota
Hi Ankur,

Is the list of regular expressions static or dynamic? If it's a static
list, you can collapse all the filter operators into a single operator and
use the AND keyword to combine them.

E.g.

 Filtered_Data = FILTER BagName BY ($0 matches 'RegEx-1') AND ($0 matches
'RegEx-2') AND ($0 matches 'RegEx-3');

If it's dynamic, you can use the option that Russell and Prashant
suggested. Write a UDF that loads a list of regular expressions and
processes them in sequence.

On Mon, Oct 6, 2014 at 12:44 PM, Ankur Kasliwal  wrote:

> Hi,
>
>
>
> I have written a ‘Pig Script’ which is processing Sequence files given as
> input.
>
> It is working fine but there is one problem mentioned below.
>
>
>
> I have repetitive statements in my pig script,  as shown below:
>
>
>
>
>
>-  Filtered_Data _1= FILTER BagName BY ($0 matches 'RegEx-1');
>-  Filtered_Data_2 = FILTER BagName BY ($0 matches 'RegEx-2');
>-  Filtered_Data_3 = FILTER BagName BY ($0 matches 'RegEx-3');
>- So on…
>
>
>
> Question :
>
> So is there any way by which I can have above statement written once and
>
> then loop through all possible “RegEx” and substitute in Pig script.
>
>
>
> For Example:
>
>
> Filtered_Data _X  =   FILTER BagName BY ($0 matches 'RegEx');  ( have this
> statement once )
>
> ( loop through all possible RegEx and substitute value in the statement )
>
>
>
> Right now I am calling Pig script from a shell script, so any way from
> shell script will be also be welcome..
>
>
>
> Thanks in advance.
>
> Happy Pigging
>


Re: Json Loader - Array of objects - Loading results in empty data set

2014-08-08 Thread Pradeep Gollakota
I haven't worked with JsonLoader much, so I'm not sure what the problem is.
But your schema looks correct for your JSON structure now.

DataBSets is an Array (or Bag) of Objects (or Tuples). Each Object (or
Tuple) inside the Array has one key which maps to an Object(or Tuple) with
two keys. This is exactly what you would want the structure to look like in
pig.

```
Grunt > describe b;
b: {DataASet: (A1: int,A2: int,DataBSets: {tuple_0: (DataBSet: (B1:
chararray,B2: chararray))})}
grunt> dump b;
()
grunt>
```

I know that lots of people have been having problems with JsonLoader in the
past. I can recall off-hand several emails over the past year on this
mailing list complaining about the loader. Most of the recommendations,
remembering off the top of my head, have been to use the Elephant bird
version of the Loader.
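
For reference, the elephant-bird JSON loader is typically used along these
lines (sketch only -- double-check the class name and options against the
elephant-bird version you end up building):

a = LOAD 'b.json'
    USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad')
    AS (json: map[]);
b = FOREACH a GENERATE json#'DataASet' AS dataASet;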

I'm not sure what version conflict you're seeing with CDH + elephant-bird,
but I'd recommend compiling elephant-bird against the versions of Hadoop and
Pig that you're using and deploying it to your Maven repo. I do this myself
so that I know all the code is compiled against the exact versions we're
running in house.

I'm going to look into this problem a little bit more and see if I can get
it to work without elephant-bird.


On Fri, Aug 8, 2014 at 8:44 AM, Klüber, Ralf 
wrote:

> Hello,
>
> Much appreciated you taking your time to answer.
>
> > should probably look like
> >
> >  {DataASet: (A1: int,A2: int,DataBSets: {(DataBSet: (B1: chararray,B2:
> > chararray))})}
>
> How to achieve this? I tried:
> ```
> b = load 'b.json' using JsonLoader('
>  DataASet: (
>A1:int,
>A2:int,
>DataBSets: {
> (
> (DataBSet: (
>B1:chararray,
>B2:chararray
>  )
> ))
>}
>  )
>  ');
> ```
>
> Which gives this schema which does not look right.
> Dump fails (empty bag)
>
> ```
> Grunt > describe b;
> b: {DataASet: (A1: int,A2: int,DataBSets: {tuple_0: (DataBSet: (B1:
> chararray,B2: chararray))})}
> grunt> dump b;
> ()
> grunt>
> ```
>
> Kind regards.
> Ralf
>
> > -Ursprüngliche Nachricht-
> > Von: Pradeep Gollakota [mailto:pradeep...@gmail.com]
> > Gesendet: Friday, August 08, 2014 2:21 PM
> > An: user@pig.apache.org
> > Betreff: Re: Json Loader - Array of objects - Loading results in empty
> data set
> >
> > I think there's a problem with your schema.
> >
> >  {DataASet: (A1: int,A2: int,DataBSets: {DataBSet: (B1: chararray,B2:
> > chararray)})}
> >
> > should probably look like
> >
> >  {DataASet: (A1: int,A2: int,DataBSets: {(DataBSet: (B1: chararray,B2:
> > chararray))})}
> >
> >
> > On Thu, Aug 7, 2014 at 11:22 AM, Klüber, Ralf  >
> > wrote:
>


Re: Json Loader - Array of objects - Loading results in empty data set

2014-08-08 Thread Pradeep Gollakota
I think there's a problem with your schema.

 {DataASet: (A1: int,A2: int,DataBSets: {DataBSet: (B1: chararray,B2:
chararray)})}

should probably look like

 {DataASet: (A1: int,A2: int,DataBSets: {(DataBSet: (B1: chararray,B2:
chararray))})}


On Thu, Aug 7, 2014 at 11:22 AM, Klüber, Ralf 
wrote:

> Hello,
>
>
>
> I am new to this list. I tried to solve this problem for the last 48h but
> I am stuck. I hope someone here can hint me in the right direction.
>
>
>
> I have problems using the Pig JsonLoader and wondering if I do something
> wrong or I encounter another problem.
>
>
>
> The 1st half of this post is to show I know a at least something about
> what I am talking and that I did my homework. During research I found a lot
> about elephant-bird but there seems to be a conflict with cloudera. This
> way I am stuck as well. If you trust me already you can directly jump to
> the 2nd half of my post ,-).
>
>
>
> The desired solution should work both, in Cloudera and on Amazon EMR.
>
>
>
> To proof something works.
>
> --
>
>
>
> I have this data file:
>
> ```
>
> $ cat a.json
>
>
> {"DataASet":{"A1":1,"A2":4,"DataBSets":[{"B1":"1","B2":"1"},{"B1":"2","B2":"2"}]}}
>
> $ ./jq '.' a.json
>
> {
>
>   "DataASet": {
>
> "A1": 1,
>
> "A2": 4,
>
> "DataBSets": [
>
>   {
>
> "B1": "1",
>
> "B2": "1"
>
>   },
>
>   {
>
> "B1": "2",
>
> "B2": "2"
>
>   }
>
> ]
>
>   }
>
> }
>
> $
>
> ```
>
>
>
> I am using this Pig Script to load it.
>
>
>
> ``` Pig
>
> a = load 'a.json' using JsonLoader('
>
>  DataASet: (
>
>A1:int,
>
>A2:int,
>
>DataBSets: {
>
> (
>
>B1:chararray,
>
>B2:chararray
>
>  )
>
>}
>
>  )
>
> ');
>
> ```
>
>
>
> In grunt everything seems ok.
>
>
>
> ```
>
> grunt> describe a;
>
> a: {DataASet: (A1: int,A2: int,DataBSets: {(B1: chararray,B2: chararray)})}
>
> grunt> dump a;
>
> ((1,4,{(1,1),(2,2)}))
>
> grunt>
>
> ```
>
>
>
> So far so good.
>
>
>
> Real Problem
>
> 
>
>
>
> In fact my real data (Gigabytes) looks a little bit different. The array
> is in fact an array of an object.
>
>
>
> ```
>
> $ ./jq '.' b.json
>
> {
>
>   "DataASet": {
>
> "A1": 1,
>
> "A2": 4,
>
> "DataBSets": [
>
>   {
>
> "DataBSet": {
>
>   "B1": "1",
>
>   "B2": "1"
>
> }
>
>   },
>
>   {
>
> "DataBSet": {
>
>   "B1": "2",
>
>   "B2": "2"
>
> }
>
>   }
>
> ]
>
>   }
>
> }
>
> $ cat b.json
>
>
> {"DataASet":{"A1":1,"A2":4,"DataBSets":[{"DataBSet":{"B1":"1","B2":"1"}},{"DataBSet":{"B1":"2","B2":"2"}}]}}
>
> $
>
> ```
>
>
>
> I trying to load this json with the following schema:
>
>
>
> ``` Pig
>
> b = load 'b.json' using JsonLoader('
>
>  DataASet: (
>
>A1:int,
>
>A2:int,
>
>DataBSets: {
>
> DataBSet: (
>
>B1:chararray,
>
>B2:chararray
>
>  )
>
>}
>
>  )
>
> ');
>
> ```
>
>
>
> Again it looks good so far in grunt.
>
>
>
> ```
>
> grunt> describe b;
>
> b: {DataASet: (A1: int,A2: int,DataBSets: {DataBSet: (B1: chararray,B2:
> chararray)})} ```
>
>
>
> I expect someting like this when dumping b:
>
>
>
> ```
>
> ((1,4,{((1,1)),((2,2))}))
>
> ```
>
>
>
> But I get this:
>
>
>
> ```
>
> grunt> dump b;
>
> ()
>
> grunt>
>
> ```
>
>
>
> Obviously I am doing something wrong. An empty set hints in the direction
> that the schema does not match on the input line.
>
>
>
> Any hints? Thanks in advance.
>
>
>
> Kind regards.
>
> Ralf
>


Re: Query On Pig

2014-07-01 Thread Pradeep Gollakota
i. That's correct.
ii. If the partial match is a prefix of the row key, then what you're looking
for is the -gte and -lt/-lte flags. If you want keys that start with 123, you
just specify -gte 123 -lt 124. This has the same effect as a starts-with
match. If what you're looking for is more complex, trunk has a new
HBaseStorage feature which adds the -regex flag. You can use this to do more
complex matching.
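
For example, adapting your LOAD statement (assuming plain string row keys, so
that every key starting with 123 sorts at or above 123 and below 124):

B = LOAD 'hbase://sample'
    USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
      'details:* details1:* details2:*',
      '-loadKey true -gte 123 -lt 124')
    AS (id:chararray, m1:map[], m2:map[], m3:map[]);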


On Mon, Jun 30, 2014 at 11:43 PM, Nivetha K  wrote:

> (i) There is no direct way to take exact match
>
> (ii) Partial row key means
>
>  consider my rowkeys are
> 123456,123678,123678,124789,124789.. i need to take the rowkeys
> starts with 123
>
> On 1 July 2014 11:36, Pradeep Gollakota  wrote:
>
> > i. Equals can be mimicked by specifying both >= and <= (i.e. -lte=123
> > -gte=123)
> > ii. What do you mean by taking a partial rowkey? the lte and gte are
> > "partial" matches.
> >
> >
> > On Mon, Jun 30, 2014 at 10:24 PM, Nivetha K 
> > wrote:
> >
> > > Hi,
> > >
> > >
> > >  Iam working with Pig.
> > >
> > >
> > > I need to know some information on HBaseStorage.
> > >
> > >
> > >
> > > B = LOAD 'hbase://sample' using
> > > org.apache.pig.backend.hadoop.hbase.HBaseStorage('details:* details1:*
> > > details2:*','-loadKey true -lte=123') as
> > > (id:chararray,m1:map[],m2:map[],m3:map[]);
> > >
> > >
> > > (i)   like lte ((ie) Less than Equalto) is there any option like
> equalto
> > >
> > > (ii) Is there any possible to take the partial rowkey.
> > >
> >
>


Re: Query On Pig

2014-06-30 Thread Pradeep Gollakota
i. Equals can be mimicked by specifying both >= and <= (i.e. -lte=123
-gte=123)
ii. What do you mean by taking a partial rowkey? the lte and gte are
"partial" matches.


On Mon, Jun 30, 2014 at 10:24 PM, Nivetha K  wrote:

> Hi,
>
>
>  Iam working with Pig.
>
>
> I need to know some information on HBaseStorage.
>
>
>
> B = LOAD 'hbase://sample' using
> org.apache.pig.backend.hadoop.hbase.HBaseStorage('details:* details1:*
> details2:*','-loadKey true -lte=123') as
> (id:chararray,m1:map[],m2:map[],m3:map[]);
>
>
> (i)   like lte ((ie) Less than Equalto) is there any option like equalto
>
> (ii) Is there any possible to take the partial rowkey.
>


Re: How to FLATTEN hive column in Pig with ARRAY data type

2014-06-02 Thread Pradeep Gollakota
Awesome... that's the way I would have done it as well.


On Mon, Jun 2, 2014 at 10:14 AM, Rahul Channe 
wrote:

> I tried changing the hive column datatype from ARRAY to STRUCT for
> cust_address, then i imported the table in pig.
>
> Now I am able to separate the fields, as below
>
> grunt> Z = load 'cust_info' using org.apache.hcatalog.pig.HCatLoader();
> grunt> describe Z;
> Z: {cust_id: int,cust_name: chararray,cust_address: (house_no: int,street:
> chararray,city: chararray)}
>
>
> grunt> Y = foreach Z generate cust_address.house_no as
> house_no,cust_address.street as street,UPPER(cust_address.city) as city;
> grunt> describe Y;
> Y: {house_no: int,street: chararray,city: chararray}
>
> grunt> dump Y;
> (2200,benjamin franklin,PHILADELPHIA)
> (44,atlanta franklin,FLORIDA)
>
>
> On Mon, Jun 2, 2014 at 1:09 PM, Rahul Channe 
> wrote:
>
> > grunt> B = foreach A generate BagToTuple(cust_address);
> >
> > grunt> describe B;
> > B: {org.apache.pig.builtin.bagtotuple_cust_address_24: (innerfield:
> > chararray)}
> >
> > grunt> dump B;
> > ((2200,benjamin franklin,philadelphia))
> > ((44,atlanta franklin,florida))
> >
> >
> >
> >
> > On Mon, Jun 2, 2014 at 12:59 PM, Pradeep Gollakota  >
> > wrote:
> >
> >> If you're using the built-in BagToTuple UDF, then you probably don't
> need
> >> the FLATTEN operator.
> >>
> >> I suspect that your output looks as follows:
> >>
> >> 2200
> >> benjamin avenue
> >> philadelphia
> >> ...
> >>
> >> Can you confirm that this is what you're seeing?
> >>
> >>
> >> On Mon, Jun 2, 2014 at 9:52 AM, Rahul Channe 
> >> wrote:
> >>
> >> > Thank You Pradeep, it worked to a certain extend but having following
> >> > difficulty in separating fields as $0,$1 for the customer_address.
> >> >
> >> >
> >> > Example -
> >> >
> >> > grunt> describe A;
> >> > A: {cust_id: int,cust_name: chararray,cust_address: {innertuple:
> >> > (innerfield: chararray)},cust_email: chararray}
> >> >
> >> > grunt> dump A;
> >> >
> >> > (123,phil abc,{(2200),(benjamin avenue),(philadelphia)},
> t...@gmail.com)
> >> > (124,diego arty,{(44),(atlanta franklin),(florida)},o...@gmail.com)
> >> >
> >> > grunt> B = foreach A generate FLATTEN(BagToTuple(cust_address));
> >> > grunt> dump B;
> >> > (2200,benjamin franklin,philadelphia)
> >> > (44,atlanta franklin,florida)
> >> >
> >> > grunt> describe B;
> >> > B: {org.apache.pig.builtin.bagtotuple_cust_address_34::innerfield:
> >> > chararray}
> >> >
> >> >
> >> >
> >> > I am not able to seperate the fields in B as $0,$1 and $3 ,tried using
> >> > STRSPLIT but didnt work.
> >> >
> >> >
> >> >
> >> > On Mon, Jun 2, 2014 at 11:50 AM, Pradeep Gollakota <
> >> pradeep...@gmail.com>
> >> > wrote:
> >> >
> >> > > There was a similar question as this on StackOverflow a while back.
> >> The
> >> > > suggestion was to write a custom BagToTuple UDF.
> >> > >
> >> > >
> >> > >
> >> >
> >>
> http://stackoverflow.com/questions/18544602/how-to-flatten-a-group-into-a-single-tuple-in-pig
> >> > >
> >> > >
> >> > > On Mon, Jun 2, 2014 at 8:46 AM, Pradeep Gollakota <
> >> pradeep...@gmail.com>
> >> > > wrote:
> >> > >
> >> > > > Disregard last email.
> >> > > >
> >> > > > Sorry... didn't fully understand the question.
> >> > > >
> >> > > >
> >> > > > On Mon, Jun 2, 2014 at 8:44 AM, Pradeep Gollakota <
> >> > pradeep...@gmail.com>
> >> > > > wrote:
> >> > > >
> >> > > >> FOREACH A GENERATE cust_id, cust_name, FLATTEN(cust_address),
> >> > > cust_email;
> >> > > >>
> >> > > >> ​
> >> > > >>
> >> > > >>
> >> > > >> On Sun, Jun 1, 2014 at 5:54 PM, Rahul Channe <
> >> drah...@googlemail.com>
> >> > > >> wrote:
> >> > > >>
> >> > > >>> Hi All,
> >> > > >>>
> >> > > >>> I have imported hive table into pig having a complex data type
> >> > > >>> (ARRAY). The alias in pig looks as below
> >> > > >>>
> >> > > >>> grunt> describe A;
> >> > > >>> A: {cust_id: int,cust_name: chararray,cust_address: {innertuple:
> >> > > >>> (innerfield: chararray)},cust_email: chararray}
> >> > > >>>
> >> > > >>> grunt> dump A;
> >> > > >>>
> >> > > >>> (123,phil abc,{(2200),(benjamin avenue),(philadelphia)},
> >> > t...@gmail.com
> >> > > )
> >> > > >>> (124,diego arty,{(44),(atlanta franklin),(florida)},
> >> o...@gmail.com)
> >> > > >>>
> >> > > >>> The cust_address is the ARRAY field from hive. I want to FLATTEN
> >> the
> >> > > >>> cust_address into different fields.
> >> > > >>>
> >> > > >>>
> >> > > >>> Expected output
> >> > > >>> (2200,benjamin avenue,philadelphia)
> >> > > >>> (44,atlanta franklin,florida)
> >> > > >>>
> >> > > >>> please help
> >> > > >>>
> >> > > >>> Regards,
> >> > > >>> Rahul
> >> > > >>>
> >> > > >>
> >> > > >>
> >> > > >
> >> > >
> >> >
> >>
> >
> >
>


Re: How to FLATTEN hive column in Pig with ARRAY data type

2014-06-02 Thread Pradeep Gollakota
If you're using the built-in BagToTuple UDF, then you probably don't need
the FLATTEN operator.

I suspect that your output looks as follows:

2200
benjamin avenue
philadelphia
...

Can you confirm that this is what you're seeing?


On Mon, Jun 2, 2014 at 9:52 AM, Rahul Channe  wrote:

> Thank You Pradeep, it worked to a certain extend but having following
> difficulty in separating fields as $0,$1 for the customer_address.
>
>
> Example -
>
> grunt> describe A;
> A: {cust_id: int,cust_name: chararray,cust_address: {innertuple:
> (innerfield: chararray)},cust_email: chararray}
>
> grunt> dump A;
>
> (123,phil abc,{(2200),(benjamin avenue),(philadelphia)},t...@gmail.com)
> (124,diego arty,{(44),(atlanta franklin),(florida)},o...@gmail.com)
>
> grunt> B = foreach A generate FLATTEN(BagToTuple(cust_address));
> grunt> dump B;
> (2200,benjamin franklin,philadelphia)
> (44,atlanta franklin,florida)
>
> grunt> describe B;
> B: {org.apache.pig.builtin.bagtotuple_cust_address_34::innerfield:
> chararray}
>
>
>
> I am not able to seperate the fields in B as $0,$1 and $3 ,tried using
> STRSPLIT but didnt work.
>
>
>
> On Mon, Jun 2, 2014 at 11:50 AM, Pradeep Gollakota 
> wrote:
>
> > There was a similar question as this on StackOverflow a while back. The
> > suggestion was to write a custom BagToTuple UDF.
> >
> >
> >
> http://stackoverflow.com/questions/18544602/how-to-flatten-a-group-into-a-single-tuple-in-pig
> >
> >
> > On Mon, Jun 2, 2014 at 8:46 AM, Pradeep Gollakota 
> > wrote:
> >
> > > Disregard last email.
> > >
> > > Sorry... didn't fully understand the question.
> > >
> > >
> > > On Mon, Jun 2, 2014 at 8:44 AM, Pradeep Gollakota <
> pradeep...@gmail.com>
> > > wrote:
> > >
> > >> FOREACH A GENERATE cust_id, cust_name, FLATTEN(cust_address),
> > cust_email;
> > >>
> > >> ​
> > >>
> > >>
> > >> On Sun, Jun 1, 2014 at 5:54 PM, Rahul Channe 
> > >> wrote:
> > >>
> > >>> Hi All,
> > >>>
> > >>> I have imported hive table into pig having a complex data type
> > >>> (ARRAY). The alias in pig looks as below
> > >>>
> > >>> grunt> describe A;
> > >>> A: {cust_id: int,cust_name: chararray,cust_address: {innertuple:
> > >>> (innerfield: chararray)},cust_email: chararray}
> > >>>
> > >>> grunt> dump A;
> > >>>
> > >>> (123,phil abc,{(2200),(benjamin avenue),(philadelphia)},
> t...@gmail.com
> > )
> > >>> (124,diego arty,{(44),(atlanta franklin),(florida)},o...@gmail.com)
> > >>>
> > >>> The cust_address is the ARRAY field from hive. I want to FLATTEN the
> > >>> cust_address into different fields.
> > >>>
> > >>>
> > >>> Expected output
> > >>> (2200,benjamin avenue,philadelphia)
> > >>> (44,atlanta franklin,florida)
> > >>>
> > >>> please help
> > >>>
> > >>> Regards,
> > >>> Rahul
> > >>>
> > >>
> > >>
> > >
> >
>


Re: How to FLATTEN hive column in Pig with ARRAY data type

2014-06-02 Thread Pradeep Gollakota
There was a similar question as this on StackOverflow a while back. The
suggestion was to write a custom BagToTuple UDF.

http://stackoverflow.com/questions/18544602/how-to-flatten-a-group-into-a-single-tuple-in-pig


On Mon, Jun 2, 2014 at 8:46 AM, Pradeep Gollakota 
wrote:

> Disregard last email.
>
> Sorry... didn't fully understand the question.
>
>
> On Mon, Jun 2, 2014 at 8:44 AM, Pradeep Gollakota 
> wrote:
>
>> FOREACH A GENERATE cust_id, cust_name, FLATTEN(cust_address), cust_email;
>>
>> ​
>>
>>
>> On Sun, Jun 1, 2014 at 5:54 PM, Rahul Channe 
>> wrote:
>>
>>> Hi All,
>>>
>>> I have imported hive table into pig having a complex data type
>>> (ARRAY). The alias in pig looks as below
>>>
>>> grunt> describe A;
>>> A: {cust_id: int,cust_name: chararray,cust_address: {innertuple:
>>> (innerfield: chararray)},cust_email: chararray}
>>>
>>> grunt> dump A;
>>>
>>> (123,phil abc,{(2200),(benjamin avenue),(philadelphia)},t...@gmail.com)
>>> (124,diego arty,{(44),(atlanta franklin),(florida)},o...@gmail.com)
>>>
>>> The cust_address is the ARRAY field from hive. I want to FLATTEN the
>>> cust_address into different fields.
>>>
>>>
>>> Expected output
>>> (2200,benjamin avenue,philadelphia)
>>> (44,atlanta franklin,florida)
>>>
>>> please help
>>>
>>> Regards,
>>> Rahul
>>>
>>
>>
>


Re: How to FLATTEN hive column in Pig with ARRAY data type

2014-06-02 Thread Pradeep Gollakota
Disregard last email.

Sorry... didn't fully understand the question.


On Mon, Jun 2, 2014 at 8:44 AM, Pradeep Gollakota 
wrote:

> FOREACH A GENERATE cust_id, cust_name, FLATTEN(cust_address), cust_email;
>
> ​
>
>
> On Sun, Jun 1, 2014 at 5:54 PM, Rahul Channe 
> wrote:
>
>> Hi All,
>>
>> I have imported hive table into pig having a complex data type
>> (ARRAY). The alias in pig looks as below
>>
>> grunt> describe A;
>> A: {cust_id: int,cust_name: chararray,cust_address: {innertuple:
>> (innerfield: chararray)},cust_email: chararray}
>>
>> grunt> dump A;
>>
>> (123,phil abc,{(2200),(benjamin avenue),(philadelphia)},t...@gmail.com)
>> (124,diego arty,{(44),(atlanta franklin),(florida)},o...@gmail.com)
>>
>> The cust_address is the ARRAY field from hive. I want to FLATTEN the
>> cust_address into different fields.
>>
>>
>> Expected output
>> (2200,benjamin avenue,philadelphia)
>> (44,atlanta franklin,florida)
>>
>> please help
>>
>> Regards,
>> Rahul
>>
>
>


Re: How to FLATTEN hive column in Pig with ARRAY data type

2014-06-02 Thread Pradeep Gollakota
FOREACH A GENERATE cust_id, cust_name, FLATTEN(cust_address), cust_email;

​


On Sun, Jun 1, 2014 at 5:54 PM, Rahul Channe  wrote:

> Hi All,
>
> I have imported hive table into pig having a complex data type
> (ARRAY). The alias in pig looks as below
>
> grunt> describe A;
> A: {cust_id: int,cust_name: chararray,cust_address: {innertuple:
> (innerfield: chararray)},cust_email: chararray}
>
> grunt> dump A;
>
> (123,phil abc,{(2200),(benjamin avenue),(philadelphia)},t...@gmail.com)
> (124,diego arty,{(44),(atlanta franklin),(florida)},o...@gmail.com)
>
> The cust_address is the ARRAY field from hive. I want to FLATTEN the
> cust_address into different fields.
>
>
> Expected output
> (2200,benjamin avenue,philadelphia)
> (44,atlanta franklin,florida)
>
> please help
>
> Regards,
> Rahul
>


Re: How to sample an inner bag?

2014-05-27 Thread Pradeep Gollakota
@Mehmet... great hack! I like it :-P


On Tue, May 27, 2014 at 5:08 PM, Mehmet Tepedelenlioglu <
mehmets...@yahoo.com> wrote:

> If you know how many items you want from each inner bag exactly, you can
> hack it like this:
>
> x = foreach x {
> y = foreach x generate RANDOM() as rnd, *;
> y = order y by rnd;
> y = limit y $SAMPLE_NUM;
> y = foreach y generate $1 ..;
> generate group, y;
> }
>
> Basically randomize the inner bag, sort it wrt the random number and limit
> it to the sample size you want. No reducers needed.
> If the inner bags are huge, ordering will obviously be expensive. If you
> don’t like this, you might have to write your own udf.
>
> Mehmet
>
> On May 27, 2014, at 10:03 AM,  <
> william.dowl...@thomsonreuters.com> wrote:
>
> > Hi Pig users,
> >
> > Is there an easy/efficient way to sample an inner bag? For example, with
> input in a relation like
> >
> > (id1,att1,{(a,0.01),(b,0.02),(x,0.999749968742)})
> > (id1,att2,{(a,0.03),(b,0.04),(x,0.998749217772)})
> > (id2,att1,{(b,0.05),(c,0.06),(x,0.996945334509)})
> >
> > I’d like to sample 1/3 the elements of the bags, and get something like
> (ignoring the non-determinism)
> > (id1,att1,{(x,0.999749968742)})
> > (id1,att2,{(b,0.04)})
> > (id2,att1,{(b,0.05)})
> >
> > I have a circumlocution that seems to work using flatten+ group but that
> looks ugly to me:
> >
> > tfidf1 = load '$tfidf' as (id: chararray,
> >  att: chararray,
> >  pairs: {pair: (word: chararray, value:
> double)});
> >
> > flat_tfidf = foreach tfidf1 generate id, att, FLATTEN(pairs);
> > sample_flat_tfidf = sample flat_tfidf 0.33;
> > tfidf2 = group sample_flat_tfidf by (id, att);
> >
> > tfidf = foreach tfidf2 {
> >   pairs = foreach sample_flat_tfidf generate pairs::word, pairs::value;
> >   generate group.id, group.att, pairs;
> > };
> >
> > Can someone suggest a better way to do this?  Many thanks!
> >
> > William F Dowling
> > Senior Technologist
> >
> > Thomson Reuters
> >
> >
> >
>
>


Re: Reading sequence file in pig

2014-05-21 Thread Pradeep Gollakota
That is because null is not a datatype in Pig.
http://pig.apache.org/docs/r0.12.1/basic.html#data-types

In fact, you don't need to specify a type at all for aliases.

Try, (key, value: chararray).
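
i.e., taking your second attempt and just dropping the declared type on the
key:

A = LOAD '/etl/table=04'
    USING com.twitter.elephantbird.pig.load.SequenceFileLoader(
      '-c com.twitter.elephantbird.pig.util.NullWritableConverter',
      '-c com.twitter.elephantbird.pig.util.TextConverter')
    AS (key, value: chararray);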


On Wed, May 21, 2014 at 2:21 PM, abhishek dodda
wrote:

> Hi,
>
> REGISTER /home/xyz/elephant-bird-pig-4.5.jar;
> REGISTER /home/xyz/elephant-bird-pig-4.5-sources.jar;
> REGISTER /home/xyz/elephant-bird-pig-4.5-tests.jar;
>
>
> A = load '/etl/table=04' using
> com.twitter.elephantbird.pig.load.SequenceFileLoader
>  ('-c com.twitter.elephantbird.pig.util.TextConverter','-c
> com.twitter.elephantbird.pig.util.TextConverter')
>  AS (key:chararray,value:chararray);
>
>
> 2014-05-21 18:10:53,391 [main] ERROR org.apache.pig.tools.grunt.Grunt -
> ERROR 2998: Unhandled internal error.
> com/twitter/elephantbird/mapreduce/input/RawSequenceFileInputFormat
> Details at logfile: /home/xyz/pig_1400694772994.log
>
>   A = load '/etl/table=04' using
> com.twitter.elephantbird.pig.load.SequenceFileLoader
>  ('-c com.twitter.elephantbird.pig.util.NullWritableConverter','-c
> com.twitter.elephantbird.pig.util.TextConverter')
>  AS (key:null,value:chararray);
>
> Also tried NullWritable as key
>
> 2014-05-21 18:11:58,554 [main] ERROR org.apache.pig.tools.grunt.Grunt -
> ERROR 1200:   Syntax error, unexpected symbol at or near
> 'null'
> Details at logfile: /home/xyz/pig_1400694772994.log
>
> None of them worked. I am something missing here ?
>
>
>
>
> On Tue, May 20, 2014 at 9:12 PM, Pradeep Gollakota 
> wrote:
>
>> Sorry,
>>
>> Missed the part about loading custom types from SequenceFiles. The
>> LoadFunc from piggybank will only load pig types. However, (as you already
>> know), you can use elephant-bird. Not sure why you need to build it. The
>> artifact exists in maven central.
>>
>>
>> http://search.maven.org/#artifactdetails%7Ccom.twitter.elephantbird%7Celephant-bird-pig%7C4.5%7Cjar
>>
>> Hope this helps.
>>
>>
>> On Tue, May 20, 2014 at 1:44 PM, abhishek dodda <
>> abhishek.dod...@gmail.com> wrote:
>>
>>> Iam getting this error
>>>
>>> A = load '/a/part-m-' using 
>>> org.apache.pig.piggybank.storage.SequenceFileLoader();
>>>
>>>
>>>
>>>
>>> org.apache.pig.backend.BackendException: ERROR 0: Unable to translate
>>> class org.apache.hadoop.io.NullWritable to a Pig datatype
>>>
>>>
>>>
>>> at 
>>> org.apache.pig.piggybank.storage.SequenceFileLoader.setKeyType(SequenceFileLoader.java:81)
>>> at 
>>> org.apache.pig.piggybank.storage.SequenceFileLoader.getNext(SequenceFileLoader.java:138)
>>> at 
>>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader.nextKeyValue(PigRecordReader.java:211)
>>> at 
>>> org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:484)
>>> at 
>>> org.apache.hadoop.mapreduce.task.MapContextImpl.nextKeyValue(MapContextImpl.java:76)
>>> at 
>>> org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.nextKeyValue(WrappedMapper.java:85)
>>> at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:139)
>>> at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:673)
>>> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:331)
>>> at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
>>> at java.security.AccessController.doPrivileged(Native Method)
>>>
>>>
>>>
>>> On Tue, May 20, 2014 at 5:41 AM, Pradeep Gollakota >> > wrote:
>>>
>>>> You can use the SequenceFileLoader from the piggybank.
>>>>
>>>>
>>>> http://pig.apache.org/docs/r0.12.0/api/org/apache/pig/piggybank/storage/SequenceFileLoader.html
>>>>
>>>>
>>>> On Tue, May 20, 2014 at 2:46 AM, abhishek dodda
>>>> wrote:
>>>>
>>>> > Hi All,
>>>> >
>>>> > I have trouble building code for this project.
>>>> >
>>>> > https://github.com/kevinweil/elephant-bird
>>>> >
>>>> > can some one tell how to read sequence files in pig.
>>>> >
>>>> > --
>>>> > Thanks,
>>>> > Abhishek
>>>> >
>>>>
>>>
>>>
>>>
>>> --
>>> Thanks,
>>> Abhishek
>>> 2018509769
>>>
>>
>>
>
>
> --
> Thanks,
> Abhishek
> 2018509769
>


Re: Reading sequence file in pig

2014-05-20 Thread Pradeep Gollakota
Sorry,

Missed the part about loading custom types from SequenceFiles. The LoadFunc
from piggybank will only load pig types. However, (as you already know),
you can use elephant-bird. Not sure why you need to build it. The artifact
exists in maven central.

http://search.maven.org/#artifactdetails%7Ccom.twitter.elephantbird%7Celephant-bird-pig%7C4.5%7Cjar

Hope this helps.


On Tue, May 20, 2014 at 1:44 PM, abhishek dodda
wrote:

> Iam getting this error
>
> A = load '/a/part-m-' using 
> org.apache.pig.piggybank.storage.SequenceFileLoader();
>
> org.apache.pig.backend.BackendException: ERROR 0: Unable to translate
> class org.apache.hadoop.io.NullWritable to a Pig datatype
>
>
>   at 
> org.apache.pig.piggybank.storage.SequenceFileLoader.setKeyType(SequenceFileLoader.java:81)
>   at 
> org.apache.pig.piggybank.storage.SequenceFileLoader.getNext(SequenceFileLoader.java:138)
>   at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader.nextKeyValue(PigRecordReader.java:211)
>   at 
> org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:484)
>   at 
> org.apache.hadoop.mapreduce.task.MapContextImpl.nextKeyValue(MapContextImpl.java:76)
>   at 
> org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.nextKeyValue(WrappedMapper.java:85)
>   at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:139)
>   at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:673)
>   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:331)
>   at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
>   at java.security.AccessController.doPrivileged(Native Method)
>
>
>
> On Tue, May 20, 2014 at 5:41 AM, Pradeep Gollakota 
> wrote:
>
>> You can use the SequenceFileLoader from the piggybank.
>>
>>
>> http://pig.apache.org/docs/r0.12.0/api/org/apache/pig/piggybank/storage/SequenceFileLoader.html
>>
>>
>> On Tue, May 20, 2014 at 2:46 AM, abhishek dodda
>> wrote:
>>
>> > Hi All,
>> >
>> > I have trouble building code for this project.
>> >
>> > https://github.com/kevinweil/elephant-bird
>> >
>> > can some one tell how to read sequence files in pig.
>> >
>> > --
>> > Thanks,
>> > Abhishek
>> >
>>
>
>
>
> --
> Thanks,
> Abhishek
> 2018509769
>


Re: Reading sequence file in pig

2014-05-20 Thread Pradeep Gollakota
You can use the SequenceFileLoader from the piggybank.

http://pig.apache.org/docs/r0.12.0/api/org/apache/pig/piggybank/storage/SequenceFileLoader.html


On Tue, May 20, 2014 at 2:46 AM, abhishek dodda
wrote:

> Hi All,
>
> I have trouble building code for this project.
>
> https://github.com/kevinweil/elephant-bird
>
> can some one tell how to read sequence files in pig.
>
> --
> Thanks,
> Abhishek
>


Re: Query : Filtering out string from a field

2014-05-12 Thread Pradeep Gollakota
Kartik,

Looks like you're facing this issue:
https://issues.apache.org/jira/browse/PIG-2507
What version of Pig are you using? The issue is fixed in 0.11.2 and 0.12.
So if you upgrade to these versions, your problem should go away.

If you're unable to upgrade for some reason, your best bet is to write a
custom UDF. But the general idea remains the same, write a regex to extract
the appropriate substring and project that from the UDF.
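
Once you are on one of those fixed versions, something along these lines should parse cleanly
(the alias and field names are taken from your snippet, so treat them as assumptions):

B = foreach D generate REGEX_EXTRACT(test, '(B75[^;]*);', 1);
-- group 1 is everything from B75 up to, but not including, the first semicolon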


Unmesha,

Start a new thread with your question so we don't pollute this thread for
Kartik. Can you give some samples as well? I'm not sure I understood your
question.


On Mon, May 12, 2014 at 3:05 AM, kartik manocha wrote:

> Pradeep,
>
> Thanks for the pointers, but as i mentioned that I need to extract that
> string till semicolon, so facing issues with that.
>
> I need to print it before semiclon that's causing pain as when I mention
> semicolon in regex it treats it as end of statement & produces error.
>
> However without mentioning semicolon it works fine but produces complete
> stuff starting with B75.
> eg .
> B=foreach D generate REGEX_EXTRACT(test,'(B75.*)',1);
>
> Is there any way by which I can mention semicolon in my above regex, so
> that it prints the string before that.
>
>
> Thanks,
> Kartik
>
>
>
> On Mon, May 12, 2014 at 2:03 PM, Pradeep Gollakota  >wrote:
>
> > Check out
> > http://archive.cloudera.com/cdh/3/pig/piglatin_ref2.html#REGEX_EXTRACT
> >
> > This may suit your needs
> >
> >
> > On Mon, May 12, 2014 at 12:16 AM, kartik manocha  > >wrote:
> >
> > > Hi,
> > >
> > > I am new to pig & facing an issue in filtering out a string from a
> field,
> > > mentioned is the scenario.
> > >
> > > - > I am loading data with several fields, among those fields there is
> > > field name called 'test_data'
> > > - > There are lot of things in this field, I wanted to filter out a
> > string
> > > from this field which starts from B75 & ends with semi colon.
> > > - > After taking this string out, wanted to add this as a new field to
> > the
> > > existing bag which was loaded
> > >
> > > I tried using INDEXOF UDF, but that works for a single character only,
> > > however when I tried using that for single character, it returns ()
> only
> > > instead of index number. I was just testing, & by manually providing
> > > indexes in SUBSTRING UDF, it was generating string.
> > >
> > > But unable to get the position using indexof UDF, or may be there could
> > be
> > > a better of doing this.
> > >
> > > If you have any pointers / suggestions, please share.
> > >
> > > Thanks in advance.
> > >
> > >
> > > Best,
> > > Kartik
> > >
> >
>


Re: Query : Filtering out string from a field

2014-05-12 Thread Pradeep Gollakota
Check out
http://archive.cloudera.com/cdh/3/pig/piglatin_ref2.html#REGEX_EXTRACT

This may suit your needs


On Mon, May 12, 2014 at 12:16 AM, kartik manocha wrote:

> Hi,
>
> I am new to pig & facing an issue in filtering out a string from a field,
> mentioned is the scenario.
>
> - > I am loading data with several fields, among those fields there is
> field name called 'test_data'
> - > There are lot of things in this field, I wanted to filter out a string
> from this field which starts from B75 & ends with semi colon.
> - > After taking this string out, wanted to add this as a new field to the
> existing bag which was loaded
>
> I tried using INDEXOF UDF, but that works for a single character only,
> however when I tried using that for single character, it returns () only
> instead of index number. I was just testing, & by manually providing
> indexes in SUBSTRING UDF, it was generating string.
>
> But unable to get the position using indexof UDF, or may be there could be
> a better of doing this.
>
> If you have any pointers / suggestions, please share.
>
> Thanks in advance.
>
>
> Best,
> Kartik
>


Re: ClassCastException: org.apache.pig.data.DataByteArray cannot be cast to java.lang.Number

2014-04-24 Thread Pradeep Gollakota
One possibility off the top of my head is that the delimiter might be
wrong. Can you try specifying the correct delimiter to PigStorage?

E.g. For CSV files

A = LOAD 'file_A' USING PigStorage(',') AS (colA1 : double, colA2 : double);



On Thu, Apr 24, 2014 at 12:48 PM, Steven E. Waldren wrote:

> Swapnil, sorry I partially saw the tile and thought Darpan/Pradeep were
> responding to my earlier post. My problem was not the same as yours.
>
> Best,
> Steven
>
> On Apr 24, 2014, at 2:11 PM, Swapnil Shinde 
> wrote:
>
> > Thanks for reply..
> > @ Pradeep - I am using PigStorage load function.
> > @ Darpan - I forgot to mention but I made sure that all values in columns
> > are numeric and can be cast to double.
> > @ Steven - Could you please explain more what resolved your error?
> >
> > Thanks
> >
> >
> >
> > On Thu, Apr 24, 2014 at 2:59 PM, Steven E. Waldren  >wrote:
> >
> >> Thanks I made a last ditch effort and bounced my cluster. The error went
> >> away must be Cloudera gremlin.
> >>
> >> Thanks for the suggestions and help.
> >>
> >> Best,
> >> Steven
> >>
> >> On Apr 24, 2014, at 12:25 PM, Darpan R  wrote:
> >>
> >>> Please do a sanity of the datacheck : colA2  might not be cast-able to
> >>> numeric for one or more records.
> >>>
> >>>
> >>>
> >>>
> >>> On 24 April 2014 22:24, Pradeep Gollakota 
> wrote:
> >>>
> >>>> Whats the LoadFunc you're using?
> >>>>
> >>>>
> >>>> On Thu, Apr 24, 2014 at 9:28 AM, Swapnil Shinde <
> >> swapnilushi...@gmail.com
> >>>>> wrote:
> >>>>
> >>>>> I am facing very weird problem while multiplication.
> >>>>> Pig simplified code snippet-
> >>>>> A = LOAD 'file_A' AS (colA1 : double, colA2 : double);
> >>>>> describe A;
> >>>>>*A: {colA1: double,colA2: double}*
> >>>>> B = LOAD 'file_B' AS (colB1 : double, colB2 : double);
> >>>>> describe B;
> >>>>>*B: {colB1: double,colB2: double}*
> >>>>>
> >>>>> joined = JOIN A BY (colA1) LEFT OUTER, B BY (colB1) USING
> 'replicated';
> >>>>> SPLIT joined INTO  split1 IF A::colB1 IS NOT NULL,
> >>>>>   split2 IF (A::colB1 IS NULL AND A;:colA2 ==
> >>>> 2),
> >>>>>   split3 IF (A::colB1 IS NULL AND A;:colA2 !=
> >>>> 2);
> >>>>> describe split1;
> >>>>> *   split1: {A::colA1: double,A::colA2: double,B::colB1:
> >>>>> double,B::colB2: double}*
> >>>>>
> >>>>>
> >>>>> D = FOREACH split1 GENERATE (A::colA1 * B::colB1) AS newCol;
> >>>>>
> >>>>> *Error-*
> >>>>> 2014-04-24 10:02:30,458 [main] ERROR
> >>>>> org.apache.pig.tools.pigstats.SimplePigStats - ERROR 0: Exception
> while
> >>>>> executing [Multiply (Name: Multiply[double] - scope-6 Operator Key:
> >>>>> scope-6) children: [[POProject (Name: Project[double][1] - scope-3
> >>>> Operator
> >>>>> Key: scope-3) children: null at []], [POCast (Name: Cast[double] -
> >>>> scope-5
> >>>>> Operator Key: scope-5) children: [[ConstantExpression (Name:
> >> Constant(3)
> >>>> -
> >>>>> scope-4 Operator Key: scope-4) children: null at []]] at []]] at []]:
> >>>>> java.lang.ClassCastException: org.apache.pig.data.DataByteArray
> cannot
> >> be
> >>>>> cast to java.lang.Number
> >>>>>
> >>>>> Stack tarce-
> >>>>> org.apache.pig.backend.executionengine.ExecException: ERROR 0:
> >> Exception
> >>>>> while executing [Multiply (Name: Multiply[double] - scope-6 Operator
> >> Key:
> >>>>> scope-6) children: [[POProject (Name: Project[double][1] - scope-3
> >>>> Operator
> >>>>> Key: scope-3) children: null at []], [POCast (Name: Cast[double] -
> >>>> scope-5
> >>>>> Operator Key: scope-5) children: [[ConstantExpression (Name:
> >> Constant(3)
> >>>> -
> >>>>> scope-4 Operator Key: scope-4) children: null at []]] at []]] at []]:
> >>>>

Re: ClassCastException: org.apache.pig.data.DataByteArray cannot be cast to java.lang.Number

2014-04-24 Thread Pradeep Gollakota
What's the LoadFunc you're using?


On Thu, Apr 24, 2014 at 9:28 AM, Swapnil Shinde wrote:

> I am facing very weird problem while multiplication.
> Pig simplified code snippet-
> A = LOAD 'file_A' AS (colA1 : double, colA2 : double);
> describe A;
>  *A: {colA1: double,colA2: double}*
> B = LOAD 'file_B' AS (colB1 : double, colB2 : double);
> describe B;
>  *B: {colB1: double,colB2: double}*
>
> joined = JOIN A BY (colA1) LEFT OUTER, B BY (colB1) USING 'replicated';
> SPLIT joined INTO  split1 IF A::colB1 IS NOT NULL,
> split2 IF (A::colB1 IS NULL AND A;:colA2 == 2),
> split3 IF (A::colB1 IS NULL AND A;:colA2 != 2);
> describe split1;
> *   split1: {A::colA1: double,A::colA2: double,B::colB1:
> double,B::colB2: double}*
>
>
> D = FOREACH split1 GENERATE (A::colA1 * B::colB1) AS newCol;
>
> *Error-*
> 2014-04-24 10:02:30,458 [main] ERROR
> org.apache.pig.tools.pigstats.SimplePigStats - ERROR 0: Exception while
> executing [Multiply (Name: Multiply[double] - scope-6 Operator Key:
> scope-6) children: [[POProject (Name: Project[double][1] - scope-3 Operator
> Key: scope-3) children: null at []], [POCast (Name: Cast[double] - scope-5
> Operator Key: scope-5) children: [[ConstantExpression (Name: Constant(3) -
> scope-4 Operator Key: scope-4) children: null at []]] at []]] at []]:
> java.lang.ClassCastException: org.apache.pig.data.DataByteArray cannot be
> cast to java.lang.Number
>
> Stack tarce-
> org.apache.pig.backend.executionengine.ExecException: ERROR 0: Exception
> while executing [Multiply (Name: Multiply[double] - scope-6 Operator Key:
> scope-6) children: [[POProject (Name: Project[double][1] - scope-3 Operator
> Key: scope-3) children: null at []], [POCast (Name: Cast[double] - scope-5
> Operator Key: scope-5) children: [[ConstantExpression (Name: Constant(3) -
> scope-4 Operator Key: scope-4) children: null at []]] at []]] at []]:
> java.lang.ClassCastException: org.apache.pig.data.DataByteArray cannot be
> cast to java.lang.Number at
>
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:338)
> at
>
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:378)
> at
>
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNextTuple(POForEach.java:298)
> at
>
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.runPipeline(PigGenericMapBase.java:282)
> at
>
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:277)
> at
>
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:64)
> at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144) at
> org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:681) at
> org.apache.hadoop.mapred.MapTask.run(MapTask.java:347) at
> org.apache.hadoop.mapred.Child$4.run(Child.java:270) at
> java.security.AccessController.doPrivileged(Native Method) at
> javax.security.auth.Subject.doAs(Subject.java:396) at
>
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1127)
> at org.apache.hadoop.mapred.Child.main(Child.java:264) Caused by:
> java.lang.ClassCastException: org.apache.pig.data.DataByteArray cannot be
> cast to java.lang.Number at
>
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.Multiply.genericGetNext(Multiply.java:89)
> at
>
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.Multiply.getNextDouble(Multiply.java:104)
> at
>
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:317)
> ... 13 more
>
>
> I tried below options but no luck-
> 1) Doing addition instead of multiplication and I get similar error.
> 2) I verified multiplication for double works with few sample files.
> 3) I tried casting it again to double before multiplication too.
> 4) I tried storing result before multiplication and loading it back. still
> same error.
>
> I am not sure why it's throwing classCastException when schema has double
> as data type.
> Please let me know if need any further information or missing something in
> above simplified snippet.
> Any help is very much appreciated.
>
> Thanks
>


Re: Number of map task

2014-04-22 Thread Pradeep Gollakota
Pig is a little too smart when dealing with data. It has a feature called
split combination. If you set it to false, you should see more mappers.

SET pig.noSplitCombination true;
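
If turning combination off entirely gives you too many tiny map tasks, there is also a property
to cap the combined split size instead (the 128 MB value below is only an example):

SET pig.maxCombinedSplitSize 134217728;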



On Tue, Apr 22, 2014 at 12:14 PM, Patcharee Thongtra <
patcharee.thong...@uni.no> wrote:

> Hi,
>
> I wrote a custom InputFormat. When I ran the pig script Load function
> using this InputFormat, the number of InputSplit = 16, but there was only 2
> map tasks handling these splits. Apparently the no. of map tasks = the no.
> of input files.
>
> Does the number of Map task not correspond to the number of splits?
>
> I think the job will be done quicker if there are more Map tasks?
>
> Patcharee
>


Re: Strange CROSS behavior

2014-04-18 Thread Pradeep Gollakota
What is the storage func you're using? My guess is that there is some
shared state in the storage func. Take a look at this SO answer, which deals
with shared state in store funcs.
http://stackoverflow.com/questions/20225842/apache-pig-append-one-dataset-to-another-one/20235592#20235592.
The reason why this doesn't occur is because PigStorage doesn't have shared
state. So, in T3, you're loading from text files instead of your original
store func.

CROSS is pretty expensive by nature. If one of your datasets is small
enough to load into memory, you can use a fragment-replicate join instead.
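
A minimal sketch of that alternative, assuming T1 fits in memory and that your Pig version lets
you mix a constant with * in a GENERATE (the constant key exists only to give the join something
to match on):

T1k = FOREACH T1 GENERATE 1 AS k, *;
T2k = FOREACH T2 GENERATE 1 AS k, *;
T3  = JOIN T2k BY k, T1k BY k USING 'replicated';   -- the replicated (in-memory) side goes last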


On Fri, Apr 18, 2014 at 11:43 AM, Alex Rasmussen wrote:

> I'm noticing some really strange behavior with a CROSS operation in one of
> my scripts.
>
> I'm CROSSing a table T1 with another table T2 to produce T3. T1 has one
> row, and T2 has 2,982,035 rows.
>
> If I STORE both T1 and T2 before CROSSing them together to get T3, like so:
>
> -- ... Long script that, among other things, creates T1 and T2 ...
> STORE T1 INTO 'hdfs://namenode/x/T1' USING PigStorage(',');
> STORE T2 INTO 'hdfs://namenode/x/T2' USING PigStorage(',');
> T3 = CROSS T2, T1;
>
> then I get what I expect; T3 has 2,982,035 records.
>
> However, if I omit the STOREs and run the CROSS directly, T3 only has
> 1,492,977
> records.
>
> I've run EXPLAIN on both the script with the STOREs and the script without,
> and their query plans are identical.
>
> I'm going to end up refactoring the script to get rid of the CROSS anyway
> since it's expensive, but am curious as to whether I'm doing something
> wrong or if there may be a subtle bug in CROSS.
>
> I'm using Pig version 0.11.0-cdh4.5.0
>
> Any insight you could give me here would be greatly appreciated.
>
> Thanks,
> --Alex
>


Re: Pig script : Need help

2014-04-07 Thread Pradeep Gollakota
That is because you're calling REPLACE on a bag of tuples and not a string.
What you would want to do is write a UDF (suggested name JOIN_ON) that takes
a join character as an argument and joins all the tuples in the bag with it.
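
If your Pig version ships the BagToString builtin you may not even need a custom UDF; a rough
sketch, with relation and field names made up to match your description:

grouped = GROUP data BY label;                               -- label holds 'year'
joined  = FOREACH grouped GENERATE group, BagToString(data.value, '|');
-- gives (year, 2000|2001|2002)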


On Mon, Apr 7, 2014 at 12:31 PM, Krishnan Narayanan <
krishnan.sm...@gmail.com> wrote:

> Thanks Ankit,
>
> I am getting the desired result. What i did was I group by year but did not
> use any aggregate function so that the output is a bag(tuples).
> Now when I try to apply replace function for replacing the curly braces and
> brackets so that it will look 2014,2013 but it gives me error can't cast
> tuple to chararray.
> Cannot cast tuple with schema
>
> {(2014),(2013)}
>
>
> On Tue, Apr 1, 2014 at 5:47 PM, Ankit Bhatnagar  >wrote:
>
> > Run a group by year field
> >
> > Then do whatever u want to do
> >
> > On 4/1/14 5:45 PM, "Krishnan Narayanan" 
> wrote:
> >
> > >Hi All,
> > >
> > >Using Pig how can I achieve the below mentioned output from the input.
> > >
> > >Input
> > >
> > >year , 2000
> > >year , 2001
> > >year , 2002
> > >
> > >OutPut
> > >
> > >year 2000|2001|2002 ( in one row).
> > >
> > >Thanks
> > >Krishnan
> >
> >
>


Re: 回复:Re: Any way to join two aliases without using CROSS

2014-03-25 Thread Pradeep Gollakota
Unfortunately, the Enumerate UDF from DataFu would not work in this case.
The UDF works on Bags and in this case, we want to enumerate a relation.
Implementing RANK is a very tricky thing to do correctly. I'm not even sure
if it's doable just by using Pig operators, UDFs or macros. Best option is
probably to request a Pig upgrade.


On Tue, Mar 25, 2014 at 6:21 PM, James  wrote:

> Hello,
>
> There is a similar UDF in DataFu named Enumerate.
>
> http://datafu.incubator.apache.org/docs/datafu/1.2.0/datafu/pig/bags/Enumerate.html
>
> I wish it may help.
>
> James


Re: Any way to join two aliases without using CROSS

2014-03-25 Thread Pradeep Gollakota
CROSS is by definition a very very expensive operation. Regardless, CROSS
is the wrong operator for what you're trying to do.

As was suggested by others, you want to RANK the relations then do a JOIN
by the rank.


On Tue, Mar 25, 2014 at 1:27 PM,  wrote:

> Here is how to use rank and join for this problem:
>
> sh cat xxx
> 1,2,3,4,5
> 1,2,4,5,7
> 1,5,7,8,9
>
> sh cat yyy
> 10,11
> 10,12
> 10,13
>
>
> a= load 'xxx' using PigStorage(',');
> b= load 'yyy' using PigStorage(',');
>
> a2 = rank a;
> b2 = rank b;
>
> c = join a2 by $0, b2 by $0;
> c2 = order c by $6;
> c3 = foreach c2 generate $1 .. $5, $7 ..;
>
> dump c3
> (1,2,3,4,5,10,11)
> (1,2,4,5,7,10,12)
> (1,5,7,8,9,10,13)
>
>
> William F Dowling
> Senior Technologist
> Thomson Reuters
>
>
> -Original Message-
> From: Christopher Surage [mailto:csur...@gmail.com]
> Sent: Tuesday, March 25, 2014 4:03 PM
> To: user@pig.apache.org
> Subject: Re: Any way to join two aliases without using CROSS
>
> The output I would like to see is
>
> (1,2,3,4,5,10,11)
> (1,2,4,5,7,10,12)
> (1,5,7,8,9,10,13)
>
>
> On Tue, Mar 25, 2014 at 3:58 PM, Pradeep Gollakota  >wrote:
>
> > I don't understand what you're trying to do from your example.
> >
> > If you perform a cross on the data you have, the output will be the
> > following:
> >
> > (1,2,3,4,5,10,11)
> > (1,2,3,4,5,10,11)
> > (1,2,3,4,5,10,11)
> > (1,2,4,5,7,10,11)
> > (1,2,4,5,7,10,11)
> > (1,2,4,5,7,10,11)
> > (1,5,7,8,9,10,11)
> > (1,5,7,8,9,10,11)
> > (1,5,7,8,9,10,11)
> >
> > On this, you'll have to do a distinct to get what you're looking for.
> >
> > Let's change the example a little bit so we get a more clear
> understanding
> > of your problem. What would be the output if your two relations looked as
> > follows:
> >
> > (1,2,3,4,5)  (10,11)
> > (1,2,4,5,7)  (10,12)
> > (1,5,7,8,9)  (10,13)
> >
> >
> > On Tue, Mar 25, 2014 at 12:18 PM, Shahab Yunus  > >wrote:
> >
> > > Have you tried iterating over the first relation and in the nested
> > > *generate* clause, always appending the second relation? Your top level
> > > looping is on first relation but in the nested block you are sort of
> > > hardcoding appending of second relation.
> > >
> > > I am referring to the examples like in  "Example: Nested Blocks"
> section
> > > http://pig.apache.org/docs/r0.10.0/basic.html#foreach
> > >
> > >
> > > On Tue, Mar 25, 2014 at 3:01 PM, Christopher Surage  > > >wrote:
> > >
> > > > I am trying to perform the following action, but the only solution I
> > have
> > > > been able to come up with is using a CROSS, but I don't want to use
> > that
> > > > statement as it is a very expensive process.
> > > >
> > > > (1,2,3,4,5)  (10,11)
> > > > (1,2,4,5,7)  (10,11)
> > > > (1,5,7,8,9)  (10,11)
> > > >
> > > >
> > > > I want to make it
> > > > (1,2,3,4,5,10,11)
> > > > (1,2,4,5,7,10,11)
> > > > (1,5,7,8,9,10,11)
> > > >
> > > > any help would be much appreciated,
> > > >
> > > > Chris
> > > >
> > >
> >
>


Re: Any way to join two aliases without using CROSS

2014-03-25 Thread Pradeep Gollakota
I don't understand what you're trying to do from your example.

If you perform a cross on the data you have, the output will be the
following:

(1,2,3,4,5,10,11)
(1,2,3,4,5,10,11)
(1,2,3,4,5,10,11)
(1,2,4,5,7,10,11)
(1,2,4,5,7,10,11)
(1,2,4,5,7,10,11)
(1,5,7,8,9,10,11)
(1,5,7,8,9,10,11)
(1,5,7,8,9,10,11)

On this, you'll have to do a distinct to get what you're looking for.

Let's change the example a little bit so we get a more clear understanding
of your problem. What would be the output if your two relations looked as
follows:

(1,2,3,4,5)  (10,11)
(1,2,4,5,7)  (10,12)
(1,5,7,8,9)  (10,13)


On Tue, Mar 25, 2014 at 12:18 PM, Shahab Yunus wrote:

> Have you tried iterating over the first relation and in the nested
> *generate* clause, always appending the second relation? Your top level
> looping is on first relation but in the nested block you are sort of
> hardcoding appending of second relation.
>
> I am referring to the examples like in  "Example: Nested Blocks" section
> http://pig.apache.org/docs/r0.10.0/basic.html#foreach
>
>
> On Tue, Mar 25, 2014 at 3:01 PM, Christopher Surage  >wrote:
>
> > I am trying to perform the following action, but the only solution I have
> > been able to come up with is using a CROSS, but I don't want to use that
> > statement as it is a very expensive process.
> >
> > (1,2,3,4,5)  (10,11)
> > (1,2,4,5,7)  (10,11)
> > (1,5,7,8,9)  (10,11)
> >
> >
> > I want to make it
> > (1,2,3,4,5,10,11)
> > (1,2,4,5,7,10,11)
> > (1,5,7,8,9,10,11)
> >
> > any help would be much appreciated,
> >
> > Chris
> >
>


Re: Unable to add file paths when registering a UDF

2014-03-12 Thread Pradeep Gollakota
According to the docs, It should work.
http://pig.apache.org/docs/r0.12.0/basic.html#register

Stupid question, but is the path correct? Is it on HDFS or local disk?
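
One quick sanity check, using the path from your mail: REGISTER resolves a bare path against the
local filesystem of the machine running Pig, so if the jar actually lives on HDFS you would need
to pull it down first (or use a full URI, if your Pig version accepts one in REGISTER):

fs -copyToLocal /class/s14419x/lab7/MovingAverage.jar .
REGISTER MovingAverage.jar;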


On Tue, Mar 11, 2014 at 8:36 PM, Anthony Alleven wrote:

> Hello,
>
> I am trying to use a User Defined Function and I am unable to get Pig
> to recognize my compiled jar file when I have a path to the file. The
> file works when I put the jar file into the working directory for Pig,
> but I am a teaching assistant for a big-data course and we need
> students to put their work in their respective directories.
>
> Below is my pig script for my application:
> REGISTER MovingAverage.jar;
> --REGISTER /class/s14419x/lab7/MovingAverage.jar;
>
> The first line works, but the second gives me:
>
> ERROR org.apache.pig.tools.grunt.Grunt - ERROR 101: file
> '/class/s14419x/lab7/MovingAverage.jar' does not exist.
>
> I am teaching UDFs for a lab tomorrow so I would really appreciate any
> help or insight you could offer.
>
> Also, big fan of Pig. Keep it up!
>
> Tony
> --
> Anthony Alleven
> (952)-250-7166
> Senior, Computer Engineering
> IEEE | President
> Microsoft Ambassador
>


Re: one MR job for group-bys and cube-bys

2014-03-11 Thread Pradeep Gollakota
I forgot to mention that there are also other 3rd party libraries that make
examining the physical plan easier. For example take a look at
Lipstick <https://github.com/Netflix/Lipstick> from Netflix.


On Tue, Mar 11, 2014 at 11:41 AM, Pradeep Gollakota wrote:

> Best way to examine this is to use the EXPLAIN operator. It will show you
> the physical MapReduce plan and what features are being executed in each
> phase.
>
>
> On Tue, Mar 11, 2014 at 11:29 AM, ey-chih Chow  wrote:
>
>> Hi,
>>
>> I got a question on a pig script that has a single input with multiple
>> group-bys and cube-bys.  Will Pig use one or multiple MR job(s) to process
>> this script?  Thanks.
>>
>> Best regards,
>>
>> Ey-Chih Chow
>>
>
>


Re: one MR job for group-bys and cube-bys

2014-03-11 Thread Pradeep Gollakota
Best way to examine this is to use the EXPLAIN operator. It will show you
the physical MapReduce plan and what features are being executed in each
phase.
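
For reference, the alias name below is just an example:

EXPLAIN C;   -- prints the logical, physical and MapReduce plans for alias C and everything it depends on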


On Tue, Mar 11, 2014 at 11:29 AM, ey-chih Chow  wrote:

> Hi,
>
> I got a question on a pig script that has a single input with multiple
> group-bys and cube-bys.  Will Pig use one or multiple MR job(s) to process
> this script?  Thanks.
>
> Best regards,
>
> Ey-Chih Chow
>


Re: Nested foreach with order by

2014-02-27 Thread Pradeep Gollakota
No... that wouldn't be related since you're not doing a GROUP ALL.

The `FLATTEN(MY_UDF(t))` has me a little wary. Something is possibly going
wrong in your UDF. The output of your UDF is going to be a string that is
some generic status right? My uneducated guess is that there's a bug in
your UDF. To confirm, do you get the correct result if you replace your UDF
with an out of the box one e.g. COUNT?
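
A quick way to run that test, with everything except the COUNT copied from your script:

service_flavors = FOREACH logs_g {
    t = ORDER logs BY status;
    GENERATE group.date as dates, group.site as site, group.profile as profile, COUNT(t);
};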


On Thu, Feb 27, 2014 at 2:21 PM, Anastasis Andronidis <
andronat_...@hotmail.com> wrote:

> BTW, is this some how related[1] ?
>
>
> [1]:
> http://mail-archives.apache.org/mod_mbox/pig-user/201102.mbox/%3c5528d537-d05c-47d9-8bc8-cc68e236a...@yahoo-inc.com%3E
>
> On 27 Φεβ 2014, at 11:20 μ.μ., Anastasis Andronidis <
> andronat_...@hotmail.com> wrote:
>
> > Yes, of course, my output is like that:
> >
> > (20131209,AEGIS04-KG,ch.cern.sam.ROC_CRITICAL,0.0,CREAM-CE)
> > (20131209,AEGIS04-KG,ch.cern.sam.ROC_CRITICAL,0.0,CREAM-CE)
> > (20131209,AEGIS04-KG,ch.cern.sam.ROC_CRITICAL,0.0,SRMv2)
> > (20131209,AEGIS04-KG,ch.cern.sam.ROC_CRITICAL,0.0,SRMv2)
> > (20131209,AM-02-SEUA,ch.cern.sam.ROC_CRITICAL,0.0,CREAM-CE)
> > .
> > .
> > .
> >
> > and when I put PARALLEL 1 in GROUP BY I get:
> >
> > (20131209,AEGIS04-KG,ch.cern.sam.ROC_CRITICAL,0.0,CREAM-CE)
> > (20131209,AEGIS04-KG,ch.cern.sam.ROC_CRITICAL,0.0,SRMv2)
> > (20131209,AM-02-SEUA,ch.cern.sam.ROC_CRITICAL,0.0,CREAM-CE)
> > .
> > .
> > .
> >
> >
> > On 27 Φεβ 2014, at 10:20 μ.μ., Pradeep Gollakota 
> wrote:
> >
> >> Where exactly are you getting duplicates? I'm not sure I understand your
> >> question. Can you give an example please?
> >>
> >>
> >> On Thu, Feb 27, 2014 at 11:15 AM, Anastasis Andronidis <
> >> andronat_...@hotmail.com> wrote:
> >>
> >>> Hello everyone,
> >>>
> >>> I have a foreach statement and inside of it, I use an order by. After
> the
> >>> order by, I have a UDF. Example like this:
> >>>
> >>>
> >>> logs = LOAD 'raw_data' USING org.apache.hcatalog.pig.HCatLoader();
> >>>
> >>> logs_g = GROUP logs BY (date, site, profile) PARALLEL 2;
> >>>
> >>> service_flavors = FOREACH logs_g {
> >>>   t = ORDER logs BY status;
> >>>   GENERATE group.date as dates, group.site as site, group.profile
> as
> >>> profile,
> >>>   FLATTEN(MY_UDF(t)) as
> >>> (generic_status);
> >>> };
> >>>
> >>> The problem is that I get duplicate results.. I know that MY_UDF is
> >>> running on mappers, but shouldn't each mapper take 1 group from the
> logs_g?
> >>> Is something wrong with order by? I tried to add  order by parallel
> but I
> >>> get syntax errors...
> >>>
> >>> My problem is resolved if I put  GROUP logs BY (date, site, profile)
> >>> PARALLEL 1; But this is not a scalable solution. Can someone help me
> pls? I
> >>> am using pig 0.11
> >>>
> >>> Cheers,
> >>> Anastasis
> >
>
>


Re: Nested foreach with order by

2014-02-27 Thread Pradeep Gollakota
Where exactly are you getting duplicates? I'm not sure I understand your
question. Can you give an example please?


On Thu, Feb 27, 2014 at 11:15 AM, Anastasis Andronidis <
andronat_...@hotmail.com> wrote:

> Hello everyone,
>
> I have a foreach statement and inside of it, I use an order by. After the
> order by, I have a UDF. Example like this:
>
>
> logs = LOAD 'raw_data' USING org.apache.hcatalog.pig.HCatLoader();
>
> logs_g = GROUP logs BY (date, site, profile) PARALLEL 2;
>
> service_flavors = FOREACH logs_g {
> t = ORDER logs BY status;
> GENERATE group.date as dates, group.site as site, group.profile as
> profile,
> FLATTEN(MY_UDF(t)) as
> (generic_status);
> };
>
> The problem is that I get duplicate results.. I know that MY_UDF is
> running on mappers, but shouldn't each mapper take 1 group from the logs_g?
> Is something wrong with order by? I tried to add  order by parallel but I
> get syntax errors...
>
> My problem is resolved if I put  GROUP logs BY (date, site, profile)
> PARALLEL 1; But this is not a scalable solution. Can someone help me pls? I
> am using pig 0.11
>
> Cheers,
> Anastasis


Re: how to control nested CROSS parallelism?

2014-01-20 Thread Pradeep Gollakota
It's strange that it's being executed on the Map-side. The group is a
reduce side operation (I'm assuming) and it seems that the nested foreach
would happen on Reduce-side after grouping. Have you looked at the MR plan
to verify that it is being executed Map-side?

One thing to try might be to CROSS first before grouping... although that
might be 2 reduce steps.


On Mon, Jan 20, 2014 at 1:27 AM, Serega Sheypak wrote:

> Hi, I'm in trouble
> Here a part of code:
>
> itemGrp = GROUP itemProj1 BY sale_id PARALLEL 12;
> notFiltered = FOREACH itemGrp{
> itemProj2 = FOREACH itemProj1
>GENERATE FLATTEN(
> TOTUPLE(id, other_id)) as
>(id, other_id);
>
> crossed = CROSS itemProj1, itemProj2;
> filtered =  FILTER crossed by (
> --some cond
>);
> projected = FOREACH filtered GENERATE f1, f2, f3;
> GENERATE FLATTEN(projected) as (f1, f2,f3);
> }
>
> The problem is that all this stuff is executed on map phase. But i want it
> to be executed on reduce phase to get parallelism benfit.
> Now only two mappers (not to much data before CROSS explosion) perform
> cross inside groups and complicated filtering.
>
> I can't find a way to make it run on reduce-phase...
> What do I do wrong?
>


Re: Spilling issue - Optimize "GROUP BY"

2014-01-10 Thread Pradeep Gollakota
Did you mean to say "timeout" instead of "spill"? Spills don't cause task
failures (unless a spill fails). Default timeout for a task is 10 min. It
would be very helpful to have a stack trace to look at, at the very least.


On Fri, Jan 10, 2014 at 7:53 AM, Zebeljan, Nebojsa <
nebojsa.zebel...@adtech.com> wrote:

> Hi Serega,
> Default task attempts = 4
> --> Yes, 4 task attempts
>
> Do you use any "balancing" properties, for eaxmple
> pig.exec.reducers.bytes.per.reducer
> --> No
>
> I suppose you have unbalanced data
> --> I guess so
>
> It's better to provide logs
> --> Unfortunately not possible any more "May be cleaned up by Task
> Tracker, if older logs"
>
> Regards,
> Nebo
> 
> From: Serega Sheypak [serega.shey...@gmail.com]
> Sent: Friday, January 10, 2014 2:32 PM
> To: user@pig.apache.org
> Subject: Re: Spilling issue - Optimize "GROUP BY"
>
> "and after trying it on several datanodes in the end it failes"
> Default task attempts = 4?
>
> 1. It's better to provde logs
> 2. Do you use any "balancing" properties, for eaxmple
> pig.exec.reducers.bytes.per.reducer ?
>
> I suppose you have unbalanced data
>
>
> 2014/1/10 Zebeljan, Nebojsa 
>
> > Hi,
> > I'm encountering for a "simple" pig script, spilling issues. All map
> tasks
> > and reducers succeed pretty fast except the last reducer!
> > The last reducer always starts spilling after ~10mins and after trying it
> > on several datanodes in the end it failes.
> >
> > Do you have any idea, how I could optimize the GROUP BY, so I don't run
> > into spilling issues.
> >
> > Thanks in advance!
> >
> > Below the pig script:
> > ###
> > dataImport = LOAD ;
> > generatedData = FOREACH dataImport GENERATE Field_A, Field_B, Field_C;
> > groupedData = GROUP generatedData BY (Field_B, Field_C);
> >
> > result = FOREACH groupedData {
> > counter_1 = FILTER generatedData BY ;
> > counter_2 = FILTER generatedData BY ;
> > GENERATE
> > group.Field_B,
> > group.Field_C,
> > COUNT(counter_1),
> > COUNT(counter_2);
> > }
> >
> > STORE result INTO  USING PigStorage();
> > ###
> >
> > Regards,
> > Nebo
> >
>
>
>


Re: Log File Versioning and Pig

2013-12-12 Thread Pradeep Gollakota
It seems like what you're asking for is Versioned Schema management. Pig is
not designed for that. Pig is only a scripting language to manipulate
datasets.

I'd recommend you look into Thrift, Protocol Buffers and Avro. They are
compact serialization libraries that do versioned schema management.
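
For example, if the pixel logs were written as Avro, the Pig side stays almost the same while the
schema evolves underneath; a rough sketch, assuming the piggybank AvroStorage loader (the path
below is a placeholder):

REGISTER piggybank.jar;
-- plus the Avro and Jackson jars AvroStorage depends on, registered the same way
pixels = LOAD '/logs/pixels/2013-12-12' USING org.apache.pig.piggybank.storage.avro.AvroStorage();
-- fields added later with default values simply show up as new columns; older scripts keep
-- projecting only the columns they already know about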


On Thu, Dec 12, 2013 at 2:06 PM, Mike Sukmanowsky  wrote:

> We're playing around with options to what I'm sure is a common problem -
> changing schemas in our log data.
>
> Specifically we collect pixel data via nginx servers.  These pixels
> currently have a pretty static list of parameters in the query string.  We
> have eventual plans to change this and support many different types of
> parameters in the query string.
>
> Our current logs have a static number of fields separated by a \u0001
> delimiter.  So to support "dynamic fields" we have two options:
>
>1. Store data using a Java/Pig Map of key:chararray and val:chararray
>2. Stick with static fields, and version the log format so that we know
>exactly how many fields to expect and what the schema is per line
>
> *Option 1 Pros:*
> No versioning needed.  If we add a new param, it's automatically picked up
> in the map and is available for all scripts to use.  Old scripts don't have
> to worry about new params being added.
>
> *Option 1 Cons:*
> Adds significantly to our file sizes.  Compression will help big time as
> many of the keys in the map are repeated string values which will benefit
> largely from compression.   But eventually when logs are decompressed for
> analysis, they'll eat up significantly more disk space.  Also, we're not
> sure about this but dealing with a ton of Map objects in Pig could be way
> more inefficient and have more overhead than just a bunch of
> chararrays/Strings.  Anyone know if this is true?
>
> *Option 2 Pros:*
> Basically smaller file size is the big one here since we don't have to
> store the field name in our raw logs only the value and probably a version
> number also.
>
> *Option 2 Cons:*
> Becomes harder for scripts to work with different versions and we need to
> explicitly state which log file version the script depends on somewhere.
>
> Was hoping to get a few opinions on this, what are people doing to solve
> this in the wild?
>
> --
> Mike Sukmanowsky
>
> Product Lead, http://parse.ly
> 989 Avenue of the Americas, 3rd Floor
> New York, NY  10018
> p: +1 (416) 953-4248
> e: m...@parsely.com
>


Re: CROSS/Self-Join Bug - Please Help :(

2013-12-04 Thread Pradeep Gollakota
I tried the following script (not exactly the same) and it worked correctly
for me.

businesses = LOAD 'dataset' using PigStorage(',') AS (a, b, c,
business_id: chararray, lat: double, lng: double);
locations = FOREACH businesses GENERATE business_id, lat, lng;
STORE locations INTO 'locations.tsv';
locations2 = LOAD 'locations.tsv' AS (business_id, lat, long);
loc_com = CROSS locations2, locations;
dump loc_com;

I’m wondering if your problem has something to do with the way that the
JsonStorage works. Another thing you can try is to load ‘locations.tsv’
twice and do a self-cross on that.
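
A minimal sketch of the load-it-twice variant (schema and path taken from your script):

locations_a = LOAD 'yelp_phoenix_academic_dataset/locations.tsv' AS (business_id: chararray, longitude: double, latitude: double);
locations_b = LOAD 'yelp_phoenix_academic_dataset/locations.tsv' AS (business_id: chararray, longitude: double, latitude: double);
location_comparisons = CROSS locations_a, locations_b;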


On Wed, Dec 4, 2013 at 1:21 PM, Russell Jurney wrote:

> I have this bug that is killing me, where I can't self-join/cross a dataset
> with itself. Its blocking my work :(
>
> The script is like this:
>
> businesses = LOAD
> 'yelp_phoenix_academic_dataset/yelp_academic_dataset_business.json' using
> com.twitter.elephantbird.pig.load.JsonLoader() as json:map[];
>
> /* {open=true, neighborhoods={}, review_count=14, stars=4.0, name=Drybar,
> business_id=LcAamvosJu0bcPgEVF-9sQ, state=AZ, full_address=3172 E Camelback
> Rd
> Phoenix, AZ85018, categories={(Hair Salons),(Hair Stylists),(Beauty &
> Spas)}, longitude=-112.0131927, latitude=33.5107772, type=business,
> city=Phoenix} */
> locations = FOREACH businesses GENERATE $0#'business_id' AS business_id,
>   $0#'longitude' AS longitude,
>   $0#'latitude' AS latitude;
> STORE locations INTO 'yelp_phoenix_academic_dataset/locations.tsv';
> locations_2 = LOAD 'yelp_phoenix_academic_dataset/locations.tsv' AS
> (business_id:chararray, longitude:double, latitude:double);
> location_comparisons = CROSS locations_2, locations;
>
> distances = FOREACH businesses GENERATE locations.business_id AS
> business_id_1,
> locations_2.business_id AS
> business_id_2,
> udfs.haversine(locations.longitude,
>locations.latitude,
>
>  locations_2.longitude,
>
>  locations_2.latitude) AS distance;
> STORE distances INTO 'yelp_phoenix_academic_dataset/distances.tsv';
>
>
> I have also tried converting this to a self-join using JOIN BY '1', and
> also locations_2 = locations, and I get the same error:
>
> *org.apache.pig.backend.executionengine.ExecException: ERROR 0: Scalar has
> more than one row in the output. 1st :
> (rncjoVoEFUJGCUoC1JgnUA,-112.241596,33.581867), 2nd
> :(0FNFSzCFP_rGUoJx8W7tJg,-112.105933,33.604054)*
>
> at org.apache.pig.impl.builtin.ReadScalars.exec(ReadScalars.java:111)
>
> at
>
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:336)
>
> at
>
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:438)
>
> at
>
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:347)
>
> at
>
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:378)
>
> at
>
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:298)
>
> at
>
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.runPipeline(PigGenericMapBase.java:283)
>
> at
>
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:278)
>
> at
>
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:64)
>
> at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
>
> at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
>
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
>
> at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
>
> This makes no sense! What am I to do? I can't self-join :(
> --
> Russell Jurney twitter.com/rjurney russell.jur...@gmail.com
> datasyndrome.com
>


Re: Trouble with REGEX in PIG

2013-12-04 Thread Pradeep Gollakota
It's not valid PigLatin...

The Grunt shell doesn't let you try out functions and UDFs the way you're
trying to use them.

A = LOAD 'data' USING PigStorage() as (ip: chararray);
B = FOREACH A GENERATE REGEX_EXTRACT(ip, '(.*):(.*)', 1);
DUMP B;

You always have to load a dataset and work with said dataset(s).
You can create a file called 'data' (per the above script) and put "
192.168.1.5:8020" in the file and try the above set of commands in the
grunt shell.


On Wed, Dec 4, 2013 at 10:15 AM, Ankit Bhatnagar wrote:

> R u planning to use
>
> org.apache.pig.builtin.REGEX_EXTRACT
>
>
> ?
>
> On 12/4/13 9:28 AM, "Watrous, Daniel"  wrote:
>
> >Hi,
> >
> >I'm trying to use regular expressions in PIG, but it's failing. Based on
> >the documentation
> >http://pig.apache.org/docs/r0.12.0/func.html#regex-extract I am trying
> >this:
> >
> >[watrous@c0003913 ~]$ pig -x local
> >which: no hadoop in
> >(/opt/krb5/sbin/64:/usr/lib64/qt-3.3/bin:/usr/local/bin:/bin:/usr/bin:/usr
> >/local/sbin:/usr/sbin:/sbin:/usr/X11R6/bin:/sbin:/usr/sbin:/usr/bin:/opt/p
> >b/bin:/opt/perf/bin:/bin:/usr/local/bin:/home/watrous/bin:/home/watrous/pi
> >g-0.12.0/bin)
> >2013-12-04 17:15:15,398 [main] INFO  org.apache.pig.Main - Apache Pig
> >version 0.12.0 (r1529718) compiled Oct 07 2013, 12:20:14
> >2013-12-04 17:15:15,398 [main] INFO  org.apache.pig.Main - Logging error
> >messages to: /home/watrous/pig_1386177315394.log
> >2013-12-04 17:15:15,425 [main] INFO  org.apache.pig.impl.util.Utils -
> >Default bootup file /home/watrous/.pigbootup not found
> >2013-12-04 17:15:15,599 [main] INFO
> >org.apache.pig.backend.hadoop.executionengine.HExecutionEngine -
> >Connecting to hadoop file system at: file:///
> >grunt> REGEX_EXTRACT('192.168.1.5:8020', '(.*):(.*)', 1);
> >2013-12-04 17:16:59,753 [main] ERROR org.apache.pig.tools.grunt.Grunt -
> >ERROR 1200:  Cannot expand macro 'REGEX_EXTRACT'. Reason: Macro
> >must be defined before expansion.
> >Details at logfile: /home/watrous/pig_1386177315394.log
> >
> >Here's the relevant bit from the log file:
> >Pig Stack Trace
> >---
> >ERROR 1200:  Cannot expand macro 'REGEX_EXTRACT'. Reason: Macro
> >must be defined before expansion.
> >
> >Failed to parse:  Cannot expand macro 'REGEX_EXTRACT'. Reason:
> >Macro must be defined before expansion.
> >at org.apache.pig.parser.PigMacro.macroInline(PigMacro.java:455)
> >at
> >org.apache.pig.parser.QueryParserDriver.inlineMacro(QueryParserDriver.java
> >:298)
> >at
> >org.apache.pig.parser.QueryParserDriver.expandMacro(QueryParserDriver.java
> >:287)
> >at
> >org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:180)
> >at
> >org.apache.pig.PigServer$Graph.validateQuery(PigServer.java:1648)
> >at
> >org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1621)
> >at org.apache.pig.PigServer.registerQuery(PigServer.java:575)
> >at
> >org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:1093)
> >at
> >org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParse
> >r.java:501)
> >at
> >org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:1
> >98)
> >at
> >org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:1
> >73)
> >at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:69)
> >at org.apache.pig.Main.run(Main.java:541)
> >at org.apache.pig.Main.main(Main.java:156)
> >
> >I attempted to define the macro (following this tutorial
> >http://aws.amazon.com/articles/2729). However, piggybank.jar doesn't
> >define org.apache.pig.piggybank.evaluation.string.EXTRACT, so I located
> >the most likely file in the current version of the jar.
> >
> >grunt> register
> >/home/watrous/pig-0.12.0/contrib/piggybank/java/piggybank.jar
> >grunt> DEFINE REGEX_EXTRACT
> >org.apache.pig.piggybank.evaluation.string.RegexExtract;
> >grunt> REGEX_EXTRACT('192.168.1.5:8020', '(.*):(.*)', 1);
> >2013-12-04 17:23:20,383 [main] ERROR org.apache.pig.tools.grunt.Grunt -
> >ERROR 1200:  Cannot expand macro 'REGEX_EXTRACT'. Reason: Macro
> >must be defined before expansion.
> >Details at logfile: /home/watrous/pig_1386177315394.log
> >
> >I get the same stack trace with the only change being a reference to
> > instead of .
> >
> >Any idea how I can get this working?
> >
> >Daniel
>
>


Re: Apache Pig + Storm integration

2013-12-02 Thread Pradeep Gollakota
Jacob Perkins submitted a POC patch. However, my guess is that this will
not be included in the 0.13 release. There's still quite a bit of work to
be done and we'll be working on it. You can track the progress at
https://issues.apache.org/jira/browse/PIG-3453




On Mon, Dec 2, 2013 at 9:51 AM, Sameer Tilak  wrote:

>
>
>
> Hi everyone,I remember Pradeep had started discussion/proposal about Pig +
> Storm integration:
> https://cwiki.apache.org/confluence/display/PIG/Pig+on+Storm+Proposal
> Will this be released as a part of Pig 0.13? We are quite interested in
> this effort.
>


Re: Need help

2013-11-27 Thread Pradeep Gollakota
This question belongs on the user list. The dev list is meant for Pig
developers to discuss issues related to the development of Pig. I’ve
forwarded this to the user list. It also helps tremendously if you format
your data and scripts nicely as they’re much easier to read and understand.
I use a chrome extension called MarkdownHere to get the proper HTML (see
below).

Data:



  
ClinicalTrials.gov processed this data on November
07, 2013
Link to the current ClinicalTrials.gov record.
http://clinicaltrials.gov/show/NCT0611
  
  
114
NCT0611
  
  Women's Health Initiative (WHI)
  

  National Heart, Lung, and Blood Institute (NHLBI)
  NIH


  National Institute of Arthritis and Musculoskeletal and
Skin Diseases (NIAMS)
  NIH


  National Cancer Institute (NCI)
  NIH


  National Institute on Aging (NIA)
  NIH

  


Script:

register piggybank.jar;
A = load 'piglab/NCT0611.xml' using org.apache.pig.piggybank.storage.XMLLoader('id_info') as (x: chararray);
B = foreach A GENERATE FLATTEN(REGEX_EXTRACT_ALL(x, '\\n\\s*(.*)\\n\\s*(.*)\\n\\s*')) as (org_study_id: chararray, nct_id: chararray);
C = foreach B GENERATE CONCAT('1$', CONCAT(CONCAT(org_study_id, '$'), nct_id));
STORE C into 'piglab/result1';
data = load 'piglab/result1' USING PigStorage('$') as (a1: int, a2: chararray, a3: chararray);
A1 = load 'piglab/NCT0611.xml' using org.apache.pig.piggybank.storage.XMLLoader('lead_sponsor') as (y: chararray);
B1 = foreach A1 GENERATE FLATTEN(REGEX_EXTRACT_ALL(y, '\\n\\s*(.*)\\n\\s*(.*)\\n\\s*')) as (agency: chararray, agency_class: chararray);
D = foreach B1 GENERATE CONCAT('1$', CONCAT(CONCAT(agency, '$'), agency_class));
STORE D into 'piglab/result2';
data1 = load 'piglab/result2' USING PigStorage('$') as (b1: int, b2: chararray, b3: chararray);
result = JOIN data by a1, data1 by b1;
store result into 'piglab/result' USING PigStorage('$');



On Wed, Nov 27, 2013 at 6:03 PM, Haider  wrote:

> Hi Daniel
>
>  I need help so badly , I hope you would understand my situation
>
>  The use case is, I have one folder which has multiple XML files and I need
> to write a PIG script which recursively parse all the files and generate
> one flat file.
>
> The XML looks like this and each XML file has different clinical_study_rank
> such as <*clinical_study rank="687"*
> **
> **
> *  *
> *  *
> *ClinicalTrials.gov processed this data on November 07,
> 2013*
> *Link to the current ClinicalTrials.gov record.*
> *http://clinicaltrials.gov/show/NCT0611
> *
> *  *
> *  *
> *114*
> *NCT0611*
> *  *
> *  Women's Health Initiative (WHI)*
> *  *
> **
> *  National Heart, Lung, and Blood Institute (NHLBI)*
> *  NIH*
> **
> **
> *  National Institute of Arthritis and Musculoskeletal and Skin
> Diseases (NIAMS)*
> *  NIH*
> **
> **
> *  National Cancer Institute (NCI)*
> *  NIH*
> **
> **
> *  National Institute on Aging (NIA)*
> *  NIH*
> **
> *  *
> *<**clinical_study>*
>
> *I have written the below script by considering  one XML file but this is
> not working as per requirement since it generating many small file and I
> dont know how merge them to make one.*
> *Below is my Pig script.*
>
>
>
>
>
>
>
> *register piggybank.jar;A = load 'piglab/NCT0611.xml' using
> org.apache.pig.piggybank.storage.XMLLoader('id_info')as (x: chararray);B =
> foreach A GENERATE FLATTEN(REGEX_EXTRACT_ALL(x,
>
>  
> '\\n\\s*(.*)\\n\\s*(.*)\\n\\s*'))
>as (org_study_id: chararray,nct_id : chararray);C = foreach B GENERATE
> CONCAT('1$',CONCAT(CONCAT(org_study_id,'$'),nct_id));STORE C into
> 'piglab/result1';data = load 'piglab/result1' USING PigStorage('$') as (a1:
> int,a2: chararray,a3: chararray);A1 = load 'piglab/NCT0611.xml' using
> org.apache.pig.piggybank.storage.XMLLoader('lead_sponsor')as (y:
> chararray);B1 = foreach A1 GENERATE FLATTEN(REGEX_EXTRACT_ALL(y,
>
>  
> '\\n\\s*(.*)\\n\\s*(.*)\\n\\s*'))
>as (agency: chararray,agency_class: chararray);D = foreach B1 GENERATE
> CONCAT('1$',CONCAT(CONCAT(agency,'$'),agency_class));STORE D into
> 'piglab/result2';data1 = load 'piglab/result2' USING PigStorage('$') as
> (b1: int,b2: chararray,b3: chararray);result= JOIN data by a1,data1 by b1
> ;store result into 'piglab/result' USING PigStorage('$');If you can give me
> one sample PIG script which parses such nested XML file then I can go
> forward with that.*
>


Re: add a key value pair in map

2013-11-15 Thread Pradeep Gollakota
I don't think there's an out of the box solution for it. But it's fairly
trivial to do with a UDF
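
A sketch of what the call site could look like, assuming a hypothetical UDF
com.example.AddToMap(map, key, value) that returns a copy of the map with the extra pair added:

DEFINE AddToMap com.example.AddToMap();
B = FOREACH A GENERATE AddToMap(document, 'newkey', 'newvalue') AS document;
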
On Nov 15, 2013 3:19 PM, "Jerry Lam"  wrote:

> Hi Pig users,
>
> Do you know how to add a key value pair into a map?
>
> For instance, a relation of A contains a document:map[] for each tuple;
>
> B = foreach A generate document, ['newkey'#'newvalue'];
>
> I want to add the 'newkey' and 'newvalue' inside the document map. Is it
> possible?
>
> Best Regards,
>
> Jerry
>


Re: replicated join gets extra job

2013-11-11 Thread Pradeep Gollakota
Use the ILLUSTRATE or EXPLAIN keywords to look at the details of the
physical execution plan... at first glance it doesn't look like you'd
need a 2nd job to do the joins, but if you can post the output of
ILLUSTRATE/EXPLAIN, we can look into it.


On Mon, Nov 11, 2013 at 4:36 PM, Dexin Wang  wrote:

> Hi,
>
> I'm running a job like this:
>
> raw_large = LOAD 'lots_of_files' AS (...);
> raw_filtered = FILTER raw_large BY ...;
> large_table = FOREACH raw_filtered GENERATE f1, f2, f3,;
>
> joined_1 = JOIN large_table BY (key1) LEFT, config_table_1  BY (key2) USING
> 'replicated';
> joined_2 = JOIN join1  BY (key3) LEFT, config_table_2  BY (key4)
> USING 'replicated';
> joined_3 = JOIN join2  BY (key5) LEFT, config_table_3  BY (key6)
> USING 'replicated';
> joined_4 = JOIN join4  BY (key7) LEFT, config_table_3  BY (key8)
> USING 'replicated';
>
> basically left join a large table with 4 relatively small tables using the
> replicated join.
>
> I see a first load job has 120 mapper tasks and no reducer, and this job
> seems to be doing the load and filtering. And there is another job
> following that has 26 mapper tasks that seem to be doing the joins.
>
> Shouldn't there be only one job and the joins being done in the mapper
> phase of the first job?
>
> The 4 config tables (files) have these sizes respectively:
>
> 3MB
> 220kB
> 2kB
> 100kB
>
> these are running on AWS EMR Pig 0.92 on xlarge instances which has 15GB
> memory.
>
> Thanks!
>


Re: Reading from local and writing to HDFS?

2013-11-07 Thread Pradeep Gollakota
You're pretty much stuck with options 1 and 2, with option 1 being the accepted
solution. The whole idea of MapReduce is that you're not able to use a
single machine to compute your answers. You can put an 'fs -put' command in
your script to stage the logs on HDFS first, before running the rest of your
script in MR mode.

Local mode is mainly there for testing purposes. Not for production use.
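
A minimal sketch of that staging step (paths are placeholders; the LOAD is whatever you already use):

fs -copyFromLocal /var/log/squid/access.log /staging/squid/access.log
-- then point your existing MyRegexLoader LOAD at '/staging/squid/access.log' and run in MR mode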


On Thu, Nov 7, 2013 at 5:47 AM, Carl-Daniel Hailfinger <
c-d.hailfinger.devel.2...@gmx.net> wrote:

> Hi,
>
> I'm processing squid log files with Pig courtesy of MyRegexLoader. After
> a first processing step (saving with PigStorage) there's quite a lot of
> data processing to do.
>
> There's a catch, though. A superfluous copy operation:
> 1. variant: Copy the original Squid logs manually to HDFS with "hdfs dfs
> -copyFromLocal", then read them in Pig (distributed mode) from HDFS with
> MyRegexLoader, then store them in HDFS with PigStorage.
> 2. variant: Read the original Logs from local filesystem in Pig (local
> mode) with MyRegexLoader, store the on the local filesystem with
> PigStorage, then copy the result to HDFS with "hdfs dfs -copyFromLocal".
>
> Is there a way to have Pig read files from local fs, but store the
> result in HDFS? Given that reading files from local fs can't be done in
> distributed mode, I'd be totally happy to have that operation only run
> on the local node as long as the stored file is accessible via HDFS
> afterwards.
> I tried various ways to specify file locations as hdfs:// and file://,
> but that didn't work out. AFAICS the documentation is pretty silent on
> this.
>
> Any ideas or hints about what to do?
>
> Regards,
> Carl-Daniel
> --
> http://www.hailfinger.org/
>


Re: Bag of tuples

2013-11-06 Thread Pradeep Gollakota
Each element in A is not a Bag. A relation is a collection of tuples (just
like a bag). So each element in A is a tuple whose first element is a Bag.

If you want to order the tuples by id, you have to extract them from the
bag first.

A = LOAD 'data' ...;
B = FOREACH A GENERATE FLATTEN($0);
C = ORDER B BY $0;
DUMP C;

The error about “expression is not a project expression” is because you
started a nested FOREACH block that is not terminated by a GENERATE statement.

If you want to find the top n tuples in a Bag you can use the TOP UDF.

A = LOAD 'data' AS (info: bag{t: (id, f1, f2, f3)});
B = FOREACH A GENERATE TOP(5, 1, info);
DUMP B;

Note that TOP takes the ordering field as a column index (here 1, i.e. f1)
rather than a name. The above strategy will only work if all the tuples in
your bag have the same schema.

I also strongly recommend that you buy the Programming Pig book from
O’Reilly, written by Alan Gates, and read it cover to cover (it’s a pretty
small book at about 200 pages). It explains the basics of Pig, advanced
techniques and optimization strategies. Not to mention it’s a fun read.


On Wed, Nov 6, 2013 at 2:38 PM, Sameer Tilak  wrote:

> Hi Alan,
> Thanks for your reply.
>
>
> I am trying to understand how Pig processes these relations. As I
> mentioned, my UDF returns the result in the following format;
>
>  {(id1,x,y,z), (id2, a, b, c), (id3,x,a)}  /* User 1 info */
>  {(id10,x,y,z), (id9, a, b, c), (id1,x,a)} /* User 2 info */
>  {(id8,x,y,z), (id4, a, b, c), (id2,x,a)} /* User 3 info */
>  {(id6,x,y,z), (id6, a, b, c), (id9,x,a)} /* User 4 info */
>
> B = foreach A { /* Each element in A is a bag. This will apply the
> following on each element within A that is each bag. */ Is this correct?
> B1 = order A by $0; -- order on the id /*What does this A refer to? Does
> it refer to it to each Bag of relationship A ? I get the following error:
> expression is not a project expression:
> /* rest of the code */
> }
>
> Thanks for your help.
>
>
> > Subject: Re: Bag of tuples
> > From: ga...@hortonworks.com
> > Date: Wed, 6 Nov 2013 09:36:04 -0800
> > To: user@pig.apache.org
> >
> > Do you mean you want to find the top 5 per input record?  Also, what is
> your ordering criteria?  Just sort by id?  Something like this should order
> all tuples in each bag by id and then produce the top 5.  My syntax may be
> a little off as I'm working offline and don't have the manual in front of
> me, but this should be the general idea.
> >
> > A = load 'yourinput' as (b:bag);
> > B = foreach A {
> >   B1 = order A by $0; -- order on the id
> >   B2 = limit B1 5;
> >   generate flatten(B2);
> > }
> >
> > Alan.
> >
> > On Nov 5, 2013, at 9:52 AM, Sameer Tilak wrote:
> >
> > > Hi Pig experts,
> > > Sorry to post so many questions, I have one more question on doing
> some analytics on bag of tuples.
> > >
> > > My input has the following format:
> > >
> > > {(id1,x,y,z), (id2, a, b, c), (id3,x,a)}  /* User 1 info */
> > > {(id10,x,y,z), (id9, a, b, c), (id1,x,a)} /* User 2 info */
> > > {(id8,x,y,z), (id4, a, b, c), (id2,x,a)} /* User 3 info */
> > > {(id6,x,y,z), (id6, a, b, c), (id9,x,a)} /* User 4 info */
> > >
> > > I can change my UDF to give more simple output. However, I want to
> find out if something like this can be done easily:
> > > I would like to find out top 5 ids (field 1 in a tuple) among all the
> users. Note that each user has a bag and the first field of each tuple in
> that bag is id.
> > >
> > > How difficult will it be to filter based on fields of tuples and do
> analytics across the entire user base.
> > >
> >
> >
> > --
> > CONFIDENTIALITY NOTICE
> > NOTICE: This message is intended for the use of the individual or entity
> to
> > which it is addressed and may contain information that is confidential,
> > privileged and exempt from disclosure under applicable law. If the reader
> > of this message is not the intended recipient, you are hereby notified
> that
> > any printing, copying, dissemination, distribution, disclosure or
> > forwarding of this communication is strictly prohibited. If you have
> > received this communication in error, please contact the sender
> immediately
> > and delete it from your system. Thank You.
>
>


Re: Pig Distributed Cache

2013-11-05 Thread Pradeep Gollakota
I see... do you have to do a full cross product or are you able to do a
join?


On Tue, Nov 5, 2013 at 11:07 AM, burakkk  wrote:

> There are some small different lookup files so that I need to process each
> single lookup files. From your example it can be that way:
>
> a = LOAD 'small1'; --for example taking source_id=1 --> then find
> source_name
> d = LOAD 'small2'; --for example taking campaign_id=2 --> then find
> campaign_name
> e = LOAD 'small3'; --for example taking offer_id=3 --> then find offer_name
> B = LOAD 'big';
> C = JOIN B BY 1, A BY 1 USING 'replicated';
> f = JOIN c BY 1, d BY 1 USING 'replicated';
> g = JOIN f BY 1, e BY 1 USING 'replicated';
> dump g;
>
> small1, small2 and small3 is different files so they store different rows.
> At the end of the process I need to attach to all rows in my big file.
> I know HDFS doesn't perform well with the small files but originally it
> stores in different environment. I pull the data from there and load into
> HDFS. Anyway because of our architecture I can't change it right now.
>
>
> Thanks
> Best regards...
>
>
> On Tue, Nov 5, 2013 at 7:43 PM, Pradeep Gollakota  >wrote:
>
> > CROSS is grossly expensive to compute so I’m not surprised that the
> > performance is good enough. Are you repeating your LOAD and FILTER op’s
> for
> > every one of your small files? At the end of the day, what is it that
> > you’re trying to accomplish? Find the 1 row you’re after and attach to
> all
> > rows in your big file?
> >
> > In terms of using DistributedCache, if you’re computing the cross product
> > of two (and no more than two) relations, AND one of the relations is
> small
> > enough to fit in memory, you can use a replicated JOIN instead which
> would
> > be much more performant.
> >
> > A = LOAD 'small';
> > B = LOAD 'big';
> > C = JOIN B BY 1, A BY 1 USING 'replicated';
> > dump C;
> >
> > Note that the smaller relation that will be loaded into memory needs to
> be
> > specified second in the JOIN statement.
> >
> > Also keep in mind that HDFS doesn't perform well with lots of small
> files.
> > If you're design has (lots of) small files, you might benefit from
> loading
> > that data into some database (e.g. HBase).
> >
> >
> > On Tue, Nov 5, 2013 at 7:29 AM, burakkk  wrote:
> >
> > > Hi,
> > > I'm using Pig 0.8.1-cdh3u5. Is there any method to use distributed
> cache
> > > inside Pig?
> > >
> > > My problem is that: I have lots of small files in hdfs. Let's say 10
> > files.
> > > Each files contain more than one rows but I need only one row. But
> there
> > > isn't any relationship between each other. So I filter them what I need
> > and
> > > then join them without any relationship(cross join) This is my
> workaround
> > > solution:
> > >
> > > a = load(smallFile1) --ex: rows count: 1000
> > > b = FILTER a BY myrow=='filter by exp1'
> > > c = load(smallFile2) --ex: rows count: 3
> > > d = FILTER c BY myrow2=='filter by exp2'
> > > e = CROSS b,d
> > > ...
> > > f = load(bigFile) --ex:rows count: 50mio
> > > g = CROSS e, f
> > >
> > > But it's performance isn't good enough. So if I can use distributed
> cache
> > > inside pig script, I can lookup the files which I first read and filter
> > in
> > > the memory. What is your suggestion? Is there any other performance
> > > efficient way to do it?
> > >
> > > Thanks
> > > Best regards...
> > >
> > >
> > > --
> > >
> > > *BURAK ISIKLI* | *http://burakisikli.wordpress.com
> > > <http://burakisikli.wordpress.com>*
> > >
> >
>
>
>
> --
>
> *BURAK ISIKLI* | *http://burakisikli.wordpress.com
> <http://burakisikli.wordpress.com>*
>


Re: Pig Distributed Cache

2013-11-05 Thread Pradeep Gollakota
CROSS is grossly expensive to compute so I’m not surprised that the
performance isn’t good enough. Are you repeating your LOAD and FILTER ops for
every one of your small files? At the end of the day, what is it that
you’re trying to accomplish? Finding the one row you’re after and attaching
it to all rows in your big file?

In terms of using DistributedCache, if you’re computing the cross product
of two (and no more than two) relations, AND one of the relations is small
enough to fit in memory, you can use a replicated JOIN instead which would
be much more performant.

A = LOAD 'small';
B = LOAD 'big';
C = JOIN B BY 1, A BY 1 USING 'replicated';
dump C;

Note that the smaller relation that will be loaded into memory needs to be
specified second in the JOIN statement.

Also keep in mind that HDFS doesn't perform well with lots of small files.
If your design has (lots of) small files, you might benefit from loading
that data into some database (e.g. HBase).


On Tue, Nov 5, 2013 at 7:29 AM, burakkk  wrote:

> Hi,
> I'm using Pig 0.8.1-cdh3u5. Is there any method to use distributed cache
> inside Pig?
>
> My problem is that: I have lots of small files in hdfs. Let's say 10 files.
> Each files contain more than one rows but I need only one row. But there
> isn't any relationship between each other. So I filter them what I need and
> then join them without any relationship(cross join) This is my workaround
> solution:
>
> a = load(smallFile1) --ex: rows count: 1000
> b = FILTER a BY myrow=='filter by exp1'
> c = load(smallFile2) --ex: rows count: 3
> d = FILTER c BY myrow2=='filter by exp2'
> e = CROSS b,d
> ...
> f = load(bigFile) --ex:rows count: 50mio
> g = CROSS e, f
>
> But it's performance isn't good enough. So if I can use distributed cache
> inside pig script, I can lookup the files which I first read and filter in
> the memory. What is your suggestion? Is there any other performance
> efficient way to do it?
>
> Thanks
> Best regards...
>
>
> --
>
> *BURAK ISIKLI* | *http://burakisikli.wordpress.com
> *
>


Re: Local vs mapreduce mode

2013-11-05 Thread Pradeep Gollakota
Really dumb question but... when running in MapReduce mode, is your input
file on HDFS?


On Tue, Nov 5, 2013 at 9:17 AM, Sameer Tilak  wrote:

>
> Dear Pig experts,
>
> I have the following Pig script that works perfectly in local mode.
> However, in the mapreduce mode I get AU as :
>
> $HADOOP_CONF_DIR fs -cat /scratch/AU/part-m-0
> Warning: $HADOOP_HOME is deprecated.
>
> {}
> {}
> {}
> {}
>
> Both the local mode and the mapreduce mode relation A is set correctly.
>
> Can anyone please tell me what are the recommended ways for debugging the
> script in mapreduce mode -- logging utilities etc.
>
> REGISTER
> /users/p529444/software/pig-0.11.1/contrib/piggybank/java/piggybank.jar;
> REGISTER /users/p529444/software/pig-0.11.1/parser.jar
>
> DEFINE SequenceFileLoader
> org.apache.pig.piggybank.storage.SequenceFileLoader();
>
> A = LOAD '/scratch/file.seq' USING SequenceFileLoader AS (key: chararray,
> value: chararray);
> DESCRIBE A;
> STORE A into '/scratch/A';
>
> AU
>  = FOREACH A GENERATE parser.Parser(key) AS {(id: int, class: chararray,
>  name: chararray, begin: int, end: int, probone: chararray, probtwo:
> chararray)};
> STORE AU into '/scratch/AU';
>
>
>
>


Re: Java UDF and incompatible schema

2013-11-04 Thread Pradeep Gollakota
This is most likely because you haven't defined the outputSchema method of
the UDF. The AS keyword merges the schema generated by the UDF with the
user specified schema. If the UDF does not override the method and specify
the output schema, it is considered null and you will not be able to use AS
to override the schema.
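
As a rough sketch (field names taken from your AS clause; adjust them to
whatever your exec method actually returns), the override inside your Parser
class could look something like this, with the corresponding imports at the
top of the file:

import org.apache.pig.data.DataType;
import org.apache.pig.impl.logicalLayer.FrontendException;
import org.apache.pig.impl.logicalLayer.schema.Schema;

@Override
public Schema outputSchema(Schema input) {
    try {
        // Schema of one tuple inside the bag.
        Schema tupleSchema = new Schema();
        tupleSchema.add(new Schema.FieldSchema("id", DataType.INTEGER));
        tupleSchema.add(new Schema.FieldSchema("class", DataType.CHARARRAY));
        tupleSchema.add(new Schema.FieldSchema("name", DataType.CHARARRAY));
        tupleSchema.add(new Schema.FieldSchema("begin", DataType.INTEGER));
        tupleSchema.add(new Schema.FieldSchema("end", DataType.INTEGER));
        tupleSchema.add(new Schema.FieldSchema("probone", DataType.CHARARRAY));
        tupleSchema.add(new Schema.FieldSchema("probtwo", DataType.CHARARRAY));
        // Wrap the tuple schema in a bag schema, since exec returns a DataBag.
        Schema bagSchema = new Schema(new Schema.FieldSchema("t", tupleSchema, DataType.TUPLE));
        return new Schema(new Schema.FieldSchema("items", bagSchema, DataType.BAG));
    } catch (FrontendException e) {
        throw new RuntimeException(e);
    }
}

Once the UDF reports its schema, the AS clause in your script should merge
cleanly with it.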

Out of curiosity, if each one of your small files describes a user, is
there any reason why you can't use a database (e.g. HBase) to store this
information? It seems like any file based storage may not be the best
solution given my extremely limited knowledge of your problem domain.


On Mon, Nov 4, 2013 at 4:26 PM, Sameer Tilak  wrote:

> Hi everyone,
>
> I have written my custom parser and since my files are sm,all I am using
> sequence file for efficiency. Each file in the equence file has info about
> one user and I am parsing that file and I would like to get a bag of tuples
> for every user/file/.  In my Parser class I have implemented exec function
> that will be called for each file/user.  I then gather the info and package
> it as tuples. Each user will generate multiple tuples sine the file is
> quite rich and complex. Is it correct to assume that the  the relation AU
> will contact one bag per user?
>
> When I execute the following script, I get the following error. Any help
> with this would be great!
> ERROR 1031: Incompatable field schema: declared is
>
> "bag_0:bag{:tuple(id:int,class:chararray,name:chararray,begin:int,end:int,probone:chararray,probtwo:chararray)}",
>  infered is ":Unknown"
>
>
> Java UDF code snippet
>
> PopulateBag
> {
>
> for (MyItems item : items)
> {
>
>
> Tuple output = TupleFactory.getInstance().newTuple(7);
>
>
> output.set(0, item.getId());
>
> output.set(1, item.getClass());
>
> output.set(2,item.getName());
>
> output.set(3,item.Begin());
>
> output.set(4,item.End());
>
> output.set(5,item.Probabilityone());
>
> output.set(6,item.Probtwo());
>
> m_defaultDataBag.add(output);
>
>
> }
>  }
>
>  public DefaultDataBag exec(Tuple input) throws IOException {
>
>  try
>{
>
>this.ParseFile((String)input.get(0));
>this.PopulateBag();
>return m_defaultDataBag;
>} catch (Exception e) {
>System.err.println("Failed to process th i/p \n");
>return null;
>}
> }
>
>
> Pig Script
>
> REGISTER
> /users/p529444/software/pig-0.11.1/contrib/piggybank/java/piggybank.jar;
> REGISTER /users/p529444/software/pig-0.11.1/parser.jar
>
> DEFINE SequenceFileLoader
> org.apache.pig.piggybank.storage.SequenceFileLoader();
>
> A = LOAD '/scratch/file.seq' USING SequenceFileLoader AS (key: chararray,
> value: chararray);
> DESCRIBE A;
> STORE A into '/scratch/A';
>
> AU = FOREACH A GENERATE parser.Parser(key) AS {(id: int, class: chararray,
> name: chararray, begin: int, end: int, probone: chararray, probtwo:
> chararray)};
>
>
>
>
>


Re: limit map tasks for load function

2013-11-03 Thread Pradeep Gollakota
You would only be able to set it for the script... which means it will
apply to all 8 jobs. However, my guess is that you don't need to control
the number of map tasks per machine.


On Sun, Nov 3, 2013 at 4:21 PM, John  wrote:

> Thanks for your answer! How can I set the mapred.tasktracker.map.tasks.
> maxiumum value only for this speficic job? For example the pig script is
> creating 8 jobs, and I only want to modify this value for the first job? I
> think there is no option in PigLatin to influence this value?
>
> kind regards
>
>
>
>
> 2013/11/4 Pradeep Gollakota 
>
> > I think you’re misunderstanding how HBaseStorage works. HBaseStorage uses
> > the HBaseInputFormat underneath the hood. The number of map tasks that
> are
> > spawned is dependent on the number of regions you have. The map tasks are
> > spawned such that the tasks are local to the regions they’re reading
> from.
> > You will typically not have to worry about problems such as this with
> > MapReduce. If you do have some performance concerns, you can set the
> > mapred.tasktracker.map.tasks.maxiumum setting in the job conf and it will
> > not affect all the other jobs.
> >
> >
> > On Sun, Nov 3, 2013 at 3:04 PM, John  wrote:
> >
> > > Hi,
> > >
> > > is it possible to limit the number of map slots used for the load
> > function?
> > > For example I have 5 nodes with 10 map slots (each node has 2 slots for
> > > every cpu) I want only one map task for every node. Is there a way to
> set
> > > this only for the load function? I know there is a option called
> > > "mapred.tasktracker.map.tasks.maximum",
> > > but this would influence every MapReduce job. I want to influence the
> > > number only for this specific job.
> > >
> > > My use case is the following: I'm using a modified version of the
> > > HBaseStorage function. I try to load for example 10 different rowkeys
> > with
> > > very big column sizes and join them afterwords. Since the columns all
> > have
> > > the same column family every row can be stored to a different server.
> For
> > > example rowkey rowkey 1-5 is stored on node1 and the other rowkeys on
> the
> > > other nodes. So If I create a Pig script to load the 10 keys and join
> > them
> > > afterwards this will end up in 1 MapReduce Job with 10 map task and
> some
> > > reduce tasks (depends on the parallel factor).  The problem is that
> there
> > > will be created 2 map tasks on node1, because there are 2 slots
> > available.
> > > This means every task is reading simultaneously a large number of
> columns
> > > from the local hard drive. Maybe I'm wrong, but this should be a
> > > performance issue?! It should be faster if to read each rowkey one
> after
> > > another!?
> > >
> > > kind regards
> > >
> >
>


Re: limit map tasks for load function

2013-11-03 Thread Pradeep Gollakota
I think you’re misunderstanding how HBaseStorage works. HBaseStorage uses
the HBaseInputFormat underneath the hood. The number of map tasks that are
spawned is dependent on the number of regions you have. The map tasks are
spawned such that the tasks are local to the regions they’re reading from.
You will typically not have to worry about problems such as this with
MapReduce. If you do have some performance concerns, you can set the
mapred.tasktracker.map.tasks.maximum setting in the job conf and it will
not affect all the other jobs.


On Sun, Nov 3, 2013 at 3:04 PM, John  wrote:

> Hi,
>
> is it possible to limit the number of map slots used for the load function?
> For example I have 5 nodes with 10 map slots (each node has 2 slots for
> every cpu) I want only one map task for every node. Is there a way to set
> this only for the load function? I know there is a option called
> "mapred.tasktracker.map.tasks.maximum",
> but this would influence every MapReduce job. I want to influence the
> number only for this specific job.
>
> My use case is the following: I'm using a modified version of the
> HBaseStorage function. I try to load for example 10 different rowkeys with
> very big column sizes and join them afterwords. Since the columns all have
> the same column family every row can be stored to a different server. For
> example rowkey rowkey 1-5 is stored on node1 and the other rowkeys on the
> other nodes. So If I create a Pig script to load the 10 keys and join them
> afterwards this will end up in 1 MapReduce Job with 10 map task and some
> reduce tasks (depends on the parallel factor).  The problem is that there
> will be created 2 map tasks on node1, because there are 2 slots available.
> This means every task is reading simultaneously a large number of columns
> from the local hard drive. Maybe I'm wrong, but this should be a
> performance issue?! It should be faster if to read each rowkey one after
> another!?
>
> kind regards
>


Re: simple pig logic

2013-10-31 Thread Pradeep Gollakota
If I understood your question correctly, given the following input:

main_data.txt
{"id": "foo", "some_field": 12354, "score": 0}
{"id": "foobar", "some_field": 12354, "score": 0}
{"id": "baz", "some_field": 12345, "score": 0}

score_data.txt
{"id": "foo", "score": 1}
{"id": "foobar", "score": 20}

you want the following output

{"id": "foo", "some_field": 12354, "score": 1}
{"id": "foobar", "some_field": 12354, "score": 20}
{"id": "baz", "some_field": 12345, "score": 0}

If that is correct, you can do a LEFT OUTER join on the two relations.

main = LOAD 'main_data.txt' AS (id: chararray, some_field: int, score: int);
scores = LOAD 'score_data.txt' AS (id: chararray, score: int);
both = JOIN main BY id LEFT, scores BY id;
final = FOREACH both GENERATE main::id AS id, main::some_field AS some_field,
    (scores::score is null ? main::score : scores::score) AS score;
dump final;

After the join, check to see if the scores::score is null… if it is, choose
the default of main::score… if not choose scores::score.

Hope this helps!


Re: UDFContext NULL JobConf

2013-10-30 Thread Pradeep Gollakota
Are you able to post your UDF (or at least a sanitized version)?


On Wed, Oct 30, 2013 at 10:46 AM, Henning Kropp wrote:

> Hi,
>
> thanks for your reply. I read about the expected behavior on the front-end
> and I am getting the NPE on the back-end. The Mappers log the Exception
> during Execution.
>
> I am currently digging through debug messages. What to look out for? There
> are bunch of
>
> [main] DEBUG org.apache.hadoop.conf.Configuration  - java.io.IOException:
> config()
>
> log messages. But I recall them as being "normal" for reasons I don't
> remember.
>
> Regards
>
>
> 2013/10/30 Cheolsoo Park 
>
> > Hi,
> >
> > Are you getting NPE on the front-end or the back-end? Sounds like jobConf
> > is not added to UDFContext, which is expected on the front-end. Please
> see
> > the comments in getJobConf() and addJobConf() in the source code:
> >
> >
> >
> https://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/impl/util/UDFContext.java
> >
> > Thanks,
> > Cheolsoo
> >
> >
> > On Wed, Oct 30, 2013 at 9:57 AM, Henning Kropp  > >wrote:
> >
> > > Hi,
> > >
> > > I am stuck. In my UDF (Java) extends EvalFunc the following code throws
> > and
> > > NPE in exec(), when executed in -x mapred mode:
> > >
> > > Configuration jobConf = UDFContext.getUDFContext().getJobConf();
> > > System.err.println(jobConf.toString());
> > >
> > > I did not find any useful information as why my JobConf is always null.
> > All
> > > I find is that this is the right way to get the JobConf in a UDF and
> that
> > > the behavior of what is returned when running locally (jira issue).
> > >
> > > Any ideas? I am running it on a very old Hadoop version 0.20.2 Are
> there
> > > some known issues? I use Pig 0.11.1
> > >
> > > Many thanks in advanced
> > >
> > > PS: Just found someone with the same issue
> > >
> http://stackoverflow.com/questions/18795008/accessing-hdfs-from-pig-udf
> > >
> >
>


Re: count distinct on multiple columns

2013-10-29 Thread Pradeep Gollakota
Great question. There seems to be some confusion about how DISTINCT
operates. I remembered (and thankfully found) this message that
explains the behavior.

As per the other post, it looks like what you've documented is expected
behavior.
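
For what it's worth, a sketch of how I'd usually count distinct (jid, mid)
pairs (aliases taken from your script) is to project the two columns,
DISTINCT the relation itself, and then count:

jv = FOREACH sjv GENERATE jid, mid;
uniq = DISTINCT jv;
grouped = GROUP uniq ALL;
countv = FOREACH grouped GENERATE COUNT(uniq);
dump countv;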


On Mon, Oct 28, 2013 at 4:15 PM, Min Zhou  wrote:

> Hi all,
>
> Below script is how we count distinct on columns jid and mid
>
> sjv =  LOAD '/path/of/the/data' USING AvroStorage();
> jv = FOREACH sjv GENERATE TOTUPLE(jid, mid) AS jid_mid, time;
> groupv = GROUP jv ALL;
> countv = FOREACH groupv {
> unique = DISTINCT jv.jid_mid;
> GENERATE COUNT(unique);
> };
> dump countv;
> The result is 2302351.
>
> If I use code below, got another result
> sjv =  LOAD '/path/of/the/data' USING AvroStorage();
> groupv = GROUP sjv ALL;
> countv = FOREACH groupv {
> jid_mid = sjv.(jid, mid);
> unique = DISTINCT jid_mid;
> GENERATE COUNT(unique);
> };
> dump countv;
> The result is 2290003.
>
> If I concat the two columns with a delimiter never exists in jid and mid, I
> got another result which I think is the correct answer of this aggregation.
> sjv =  LOAD '/path/of/the/data' USING AvroStorage();
> jv = FOREACH sjv GENERATE CONCAT(jid, CONCAT(':', mid)) AS jid_mid, time;
> groupv = GROUP jv ALL;
> countv = FOREACH groupv {
> unique = DISTINCT jv.jid_mid;
> GENERATE COUNT(unique);
> };
> dump countv;
> The result is 2386385.
>
> I did a test with below script
> sjv =  LOAD '/path/of/the/data' USING AvroStorage();
> groupv = GROUP jv BY (jid, mid);
> unique = FOREACH groupv GENERATE FLATTEN(group), MIN(time);
> store unique ;
> The hadoop counters showed that there are 2386385 records written into
> HDFS.  The number is as same as the 3rd pig script I list above.
>
> Can anyone explain the difference among those three?  Whey they lead to
> different results?
>
> Regards,
> Min
> --
> My research interests are distributed systems, parallel computing and
> bytecode based virtual machine.
>
> My profile:
> http://www.linkedin.com/in/coderplay
> My blog:
> http://coderplay.javaeye.com
>


Re: Parent Child Relationships in Pig

2013-10-24 Thread Pradeep Gollakota
Not really...

In my experience, Pig is only good at dealing with tabular data. The kind
of graph data you have is not workable in Pig. Have you considered
using a Graph database (such as Neo4j)? These databases are highly
optimized for doing the type of path queries you're looking for.


On Thu, Oct 24, 2013 at 10:09 PM, Something Something <
mailinglist...@gmail.com> wrote:

> Hello,
>
> Is there a way in Pig to go thru a parent-child hierarchy?  For example,
> let's say I've following data:
>
> ChildParent   Value
> 1  10
> 10 20
> 20 30
> 30 40v30
> 40 50v40
>
>
> Now let's say, I look up Child=10, it has no 'Value', so I go to its parent
> (20), it has no value either, so now I go to its parent 30 & get v30.  This
> hierarchy could be up to 15 levels deep.
>
> I need to do this type of search for every line in a file containing
> millions of lines.
>
> Can I do this in Pig?  Please let me know.  Thanks.
>


Re: Elephant-Bird: Building error

2013-10-17 Thread Pradeep Gollakota
Repo: Read the docs at https://github.com/kevinweil/elephant-bird


On Thu, Oct 17, 2013 at 4:17 PM, Sameer Tilak  wrote:

> It has a number of utilities for Pig. I was interested in using them for
> Pig so I posted it to this mailing list.
>
> > Date: Thu, 17 Oct 2013 16:08:42 -0700
> > Subject: Re: Elephant-Bird: Building error
> > From: pradeep...@gmail.com
> > To: user@pig.apache.org
> >
> > This question does not belong in the Pig mailing list. Please ask on the
> > elephant bird mailing list at
> > https://groups.google.com/forum/?fromgroups#!forum/elephantbird-dev
> >
> >
> > On Thu, Oct 17, 2013 at 4:02 PM, Zhu Wayne 
> wrote:
> >
> > > Why build? Get from maven repo.
> > >
>
>


Re: Elephant-Bird: Building error

2013-10-17 Thread Pradeep Gollakota
This question does not belong in the Pig mailing list. Please ask on the
elephant bird mailing list at
https://groups.google.com/forum/?fromgroups#!forum/elephantbird-dev


On Thu, Oct 17, 2013 at 4:02 PM, Zhu Wayne  wrote:

> Why build? Get from maven repo.
>


Re: number of M/R jobs for a Pig Script

2013-10-15 Thread Pradeep Gollakota
Can you describe what your input data looks like and what you want your
output data to look like?

I don’t understand your question. A group by is really straightforward to
do on a dataset.

A = LOAD 'mydata' using MyStorage();
B = GROUP A BY group_key;
dump B;

Is that what you’re looking for?


On Tue, Oct 15, 2013 at 12:12 PM, ey-chih chow  wrote:

> What I really want to know is,in Pig, how can I read an input data set only
> once and generate multiple instances with distinct keys for each data point
> and do a group-by?
>
> Best regards,
>
> Ey-Chih Chow
>
>
> On Tue, Oct 15, 2013 at 10:16 AM, Pradeep Gollakota  >wrote:
>
> > I'm not aware of anyway to do that. I think you're also missing the
> spirit
> > of Pig. Pig is meant to be a data workflow language. Describe a workflow
> > for your data using PigLatin and Pig will then compile your script to
> > MapReduce jobs. The number of MapReduce jobs that it generates is the
> > smallest number of jobs (based on the optimizers) that Pig thinks it
> needs
> > to complete the workflow.
> >
> > Why do you want to control the number of MR jobs?
> >
> >
> > On Tue, Oct 15, 2013 at 10:07 AM, ey-chih chow  wrote:
> >
> > > Thanks everybody.  Is there anyway we can programmatically control the
> > > number of M-R jobs that a Pig script will generate, similar to write
> M-R
> > > jobs in Java?
> > >
> > > Best regards,
> > >
> > > Ey-Chih Chow
> > >
> > >
> > > On Tue, Oct 15, 2013 at 6:14 AM, Shahab Yunus  > > >wrote:
> > >
> > > > And Geert's comment about using external-to-Pig approach reminds me
> > that,
> > > > then you have Netflix's PigLipstick too. Nice visual tool for actual
> > > > execution and stores job history as well.
> > > >
> > > > Regards,
> > > > Shahab
> > > >
> > > >
> > > > On Tue, Oct 15, 2013 at 8:51 AM, Geert Van Landeghem <
> > g...@foundation.be
> > > > >wrote:
> > > >
> > > > > You can also use ambrose to monitor execution of your pig script at
> > > > > runtime. Remark: from pig-0.11 on.
> > > > >
> > > > > It show you the DAG of MR jobs and which are currently being
> > executed.
> > > As
> > > > > long as pig-ambrose is connected to the execution of your script
> > > > (workflow)
> > > > > you can replay the workflow.
> > > > >
> > > > > --
> > > > > kind regards,
> > > > >  Geert
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > On 15-okt.-2013, at 14:43, Shahab Yunus 
> > > wrote:
> > > > >
> > > > > > Have you tried using ILLUSTRATE and EXPLAIN command? As far as I
> > > know,
> > > > I
> > > > > > don't think they give you the exact number as it depends on the
> > > actual
> > > > > data
> > > > > > but I believe you can interpret it/extrapolate it from the
> > > information
> > > > > > provided by these commands.
> > > > > >
> > > > > > Regards,
> > > > > > Shahab
> > > > > >
> > > > > >
> > > > > > On Tue, Oct 15, 2013 at 3:57 AM, ey-chih chow 
> > > > wrote:
> > > > > >
> > > > > >> Hi,
> > > > > >>
> > > > > >> I have a Pig script that has two group-by statements on the the
> > > input
> > > > > data
> > > > > >> set.  Is there anybody knows how many M-R jobs the script will
> > > > generate?
> > > > > >> Thanks.
> > > > > >>
> > > > > >> Best regards,
> > > > > >>
> > > > > >> Ey-Chih Chow
> > > > > >>
> > > > >
> > > > >
> > > >
> > >
> >
>


Re: number of M/R jobs for a Pig Script

2013-10-15 Thread Pradeep Gollakota
I'm not aware of any way to do that. I think you're also missing the spirit
of Pig. Pig is meant to be a data workflow language. Describe a workflow
for your data using PigLatin and Pig will then compile your script to
MapReduce jobs. The number of MapReduce jobs that it generates is the
smallest number of jobs (based on the optimizers) that Pig thinks it needs
to complete the workflow.

Why do you want to control the number of MR jobs?


On Tue, Oct 15, 2013 at 10:07 AM, ey-chih chow  wrote:

> Thanks everybody.  Is there anyway we can programmatically control the
> number of M-R jobs that a Pig script will generate, similar to write M-R
> jobs in Java?
>
> Best regards,
>
> Ey-Chih Chow
>
>
> On Tue, Oct 15, 2013 at 6:14 AM, Shahab Yunus  >wrote:
>
> > And Geert's comment about using external-to-Pig approach reminds me that,
> > then you have Netflix's PigLipstick too. Nice visual tool for actual
> > execution and stores job history as well.
> >
> > Regards,
> > Shahab
> >
> >
> > On Tue, Oct 15, 2013 at 8:51 AM, Geert Van Landeghem  > >wrote:
> >
> > > You can also use ambrose to monitor execution of your pig script at
> > > runtime. Remark: from pig-0.11 on.
> > >
> > > It show you the DAG of MR jobs and which are currently being executed.
> As
> > > long as pig-ambrose is connected to the execution of your script
> > (workflow)
> > > you can replay the workflow.
> > >
> > > --
> > > kind regards,
> > >  Geert
> > >
> > >
> > >
> > >
> > > On 15-okt.-2013, at 14:43, Shahab Yunus 
> wrote:
> > >
> > > > Have you tried using ILLUSTRATE and EXPLAIN command? As far as I
> know,
> > I
> > > > don't think they give you the exact number as it depends on the
> actual
> > > data
> > > > but I believe you can interpret it/extrapolate it from the
> information
> > > > provided by these commands.
> > > >
> > > > Regards,
> > > > Shahab
> > > >
> > > >
> > > > On Tue, Oct 15, 2013 at 3:57 AM, ey-chih chow 
> > wrote:
> > > >
> > > >> Hi,
> > > >>
> > > >> I have a Pig script that has two group-by statements on the the
> input
> > > data
> > > >> set.  Is there anybody knows how many M-R jobs the script will
> > generate?
> > > >> Thanks.
> > > >>
> > > >> Best regards,
> > > >>
> > > >> Ey-Chih Chow
> > > >>
> > >
> > >
> >
>


Re: Reading simple json file

2013-09-23 Thread Pradeep Gollakota
Improper capitalization. Storage functions are case sensitive, try
JsonLoader.
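
For example, something along these lines; this is only a sketch, since the bag
schema for cats is my assumption about how the JSON array should map, and I
believe the built-in JsonLoader works best with JSON laid out the way
JsonStorage writes it (for arbitrary JSON, elephant-bird's JsonLoader may be a
better fit):

d = LOAD 'json_output'
    USING JsonLoader('ip: chararray, _id: chararray, cats: {(cat: chararray)}');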


On Mon, Sep 23, 2013 at 2:37 PM, jamal sasha  wrote:

> Hi,
>
> I am trying to read simple json data as:
> d =LOAD 'json_output' USING
> JSONLOADER(('ip:chararray,_id:chararray,cats:[chararray]');
> But I am getting this error:
> 2013-09-23 14:33:17,127 [main] ERROR org.apache.pig.tools.grunt.Grunt -
> ERROR 1070: Could not resolve JSONLOADER using imports: [,
> org.apache.pig.builtin., org.apache.pig.impl.builtin.]
> Details at logfile: /home/user/mohit/pig-0.11.1-src/pig_1379969371188.log
>
>
> What am i missing?
>


Re: ISOToUNix working in Pig 0.8.1 but not in Pig 0.11.0

2013-09-20 Thread Pradeep Gollakota
Doh!

I think I made a mistake myself...

"-MM-dd HH:mm:ss"

Since you don't have AM/PM, I'm assuming that your time is 24-hr format.
So, you need to use the 24 hour format symbol of 'H' for hour instead of
'h'.

I really hate time.
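
With both fixes, the line from your script would become (using the dt1 alias
you declared):

B = FOREACH A GENERATE (long) ISOToUnix(CustomFormatToISO(dt1, 'yyyy-MM-dd HH:mm:ss'));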


On Fri, Sep 20, 2013 at 6:25 PM, Pradeep Gollakota wrote:

> Be careful with your format definition... it looks like you might have a
> typo.
>
> I believe "-MM-dd hh:mm:ss" is the correct format.
> http://joda-time.sourceforge.net/apidocs/org/joda/time/format/DateTimeFormat.html
>
>
>
>
> On Fri, Sep 20, 2013 at 8:26 AM, Ruslan Al-Fakikh wrote:
>
>> What was the error?
>>
>> Not an issue, but why do you call the columns dt1, dt2, but not using the
>> name, using the ordinal number insted: $0?
>>
>>
>> On Fri, Sep 20, 2013 at 6:00 PM, Muni mahesh > >wrote:
>>
>> > Hi Hadoopers,
>> >
>> > I did the same thing in Pig 0.8.1 but not Pig 0.11.0
>> >
>> > register /usr/lib/pig/piggybank.jar;
>> > register /usr/lib/pig/lib/joda-time-2.1.jar;
>> >
>> > DEFINE CustomFormatToISO
>> >
>> org.apache.pig.piggybank.evaluation.datetime.convert.CustomFormatToISO();
>> > DEFINE ISOToUnix
>> > org.apache.pig.piggybank.evaluation.datetime.convert.ISOToUnix();
>> >
>> > A = load '/home/user/Desktop/1.tsv' USING PigStorage('\t') AS
>> > (dt1:chararray, dt2:chararray);
>> > B = foreach A generate (long) ISOToUnix(CustomFormatToISO($0,
>> '-mm-dd
>> > hh:mm:ss'));
>> >
>> >
>> > *input *
>> > 2013-01-16 04:01:182013-01-16 04:01:36
>> > 2013-01-16 04:02:192013-01-16 04:03:11
>> >
>> > *output* *expected*
>> > (1358308878000,1358308896000)
>> > (1358308939000,1358308991000)
>> >
>>
>
>


Re: ISOToUNix working in Pig 0.8.1 but not in Pig 0.11.0

2013-09-20 Thread Pradeep Gollakota
Be careful with your format definition... it looks like you might have a
typo.

I believe "-MM-dd hh:mm:ss" is the correct format.
http://joda-time.sourceforge.net/apidocs/org/joda/time/format/DateTimeFormat.html




On Fri, Sep 20, 2013 at 8:26 AM, Ruslan Al-Fakikh wrote:

> What was the error?
>
> Not an issue, but why do you call the columns dt1, dt2, but not using the
> name, using the ordinal number insted: $0?
>
>
> On Fri, Sep 20, 2013 at 6:00 PM, Muni mahesh  >wrote:
>
> > Hi Hadoopers,
> >
> > I did the same thing in Pig 0.8.1 but not Pig 0.11.0
> >
> > register /usr/lib/pig/piggybank.jar;
> > register /usr/lib/pig/lib/joda-time-2.1.jar;
> >
> > DEFINE CustomFormatToISO
> > org.apache.pig.piggybank.evaluation.datetime.convert.CustomFormatToISO();
> > DEFINE ISOToUnix
> > org.apache.pig.piggybank.evaluation.datetime.convert.ISOToUnix();
> >
> > A = load '/home/user/Desktop/1.tsv' USING PigStorage('\t') AS
> > (dt1:chararray, dt2:chararray);
> > B = foreach A generate (long) ISOToUnix(CustomFormatToISO($0, '-mm-dd
> > hh:mm:ss'));
> >
> >
> > *input *
> > 2013-01-16 04:01:182013-01-16 04:01:36
> > 2013-01-16 04:02:192013-01-16 04:03:11
> >
> > *output* *expected*
> > (1358308878000,1358308896000)
> > (1358308939000,1358308991000)
> >
>


Re: how to load custom Writable class from sequence file?

2013-09-16 Thread Pradeep Gollakota
It doesn't look like the SequenceFileLoader from the piggybank has much
support for custom Writable types. The elephant-bird version looks like it
does what you need:
https://github.com/kevinweil/elephant-bird/blob/master/pig/src/main/java/com/twitter/elephantbird/pig/load/SequenceFileLoader.java

You'll have to write converters from your types to Pig data types and
pass them into the constructor of the SequenceFileLoader.

Hope this helps!
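
As a rough sketch (the VisitKeyConverter class name is hypothetical, I'm
assuming your values are plain Text, and the jar names are placeholders), the
load with elephant-bird would look something like the following; I believe the
custom converter would implement elephant-bird's WritableConverter interface:

REGISTER elephant-bird.jar;
REGISTER my-converters.jar;

A = LOAD '/path/to/myfile'
    USING com.twitter.elephantbird.pig.load.SequenceFileLoader(
        '-c com.mycompany.pig.VisitKeyConverter',
        '-c com.twitter.elephantbird.pig.util.TextConverter')
    AS (key, value: chararray);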


On Mon, Sep 16, 2013 at 6:56 PM, Pradeep Gollakota wrote:

> Thats correct...
>
> The "load ... AS (k:chararray, v:charrary);" doesn't actually do what you
> think it does. The AS statement tell Pig what the schema types are, so it
> will call the appropriate LoadCaster method to get it into the right type.
> A LoadCaster object defines how to map byte[] into appropriate Pig
> datatypes. If the LoadFunc is not schema aware and you don't have the
> schema defined when you load, everything will be loaded as a bytearray.
>
> The problem you have is that the custom writable isn't a Pig datatype. I
> don't think you'll be able to do this without writing some custom code.
> I'll take a look at the source code for the SequenceFileLoader and see if
> there's a way to specify your own LoadCaster. If there is, then you'll just
> have to write a custom LoadCaster and specify it in the configuration. If
> not, you'll have to extend and roll out your own SequenceFileLoader.
>
>
> On Mon, Sep 16, 2013 at 6:43 PM, Yang  wrote:
>
>> I think my custom type has toString(), well at least writable() says it's
>> writable to bytes, so supposedly if I force it to bytes or string, pig
>> should be able to cast
>> like
>>
>> load ... AS ( k:chararray, v:chararray);
>>
>> but this actually fails
>>
>>
>> On Mon, Sep 16, 2013 at 6:22 PM, Pradeep Gollakota > >wrote:
>>
>> > The problem is that pig only speaks its data types. So you need to tell
>> it
>> > how to translate from your custom writable to a pig datatype.
>> >
>> > Apparently elephant-bird has some support for doing this type of
>> thing...
>> > take a look at this SO post
>> >
>> >
>> http://stackoverflow.com/questions/16540651/apache-pig-can-we-convert-a-custom-writable-object-to-pig-format
>> >
>> >
>> > On Mon, Sep 16, 2013 at 5:37 PM, Yang  wrote:
>> >
>> > > I tried to do a quick and dirty inspection of some of our data feeds,
>> > which
>> > > are encoded in gzipped SequenceFile.
>> > >
>> > > basically I did
>> > >
>> > > a = load 'myfile' using ..SequenceFileLoader() AS ( mykey,
>> myvalue);
>> > >
>> > > but it gave me some error:
>> > > 2013-09-16 17:34:28,915 [Thread-5] INFO
>> > >  org.apache.hadoop.io.compress.CodecPool - Got brand-new decompressor
>> > > 2013-09-16 17:34:28,915 [Thread-5] INFO
>> > >  org.apache.hadoop.io.compress.CodecPool - Got brand-new decompressor
>> > > 2013-09-16 17:34:28,915 [Thread-5] INFO
>> > >  org.apache.hadoop.io.compress.CodecPool - Got brand-new decompressor
>> > > 2013-09-16 17:34:28,961 [Thread-5] WARN
>> > >  org.apache.pig.piggybank.storage.SequenceFileLoader - Unable to
>> > translate
>> > > key class com.mycompany.model.VisitKey to a Pig datatype
>> > > 2013-09-16 17:34:28,962 [Thread-5] WARN
>> > >  org.apache.hadoop.mapred.FileOutputCommitter - Output path is null in
>> > > cleanup
>> > > 2013-09-16 17:34:28,963 [Thread-5] WARN
>> > >  org.apache.hadoop.mapred.LocalJobRunner - job_local_0001
>> > > org.apache.pig.backend.BackendException: ERROR 0: Unable to translate
>> > class
>> > > com.mycompany.model.VisitKey to a Pig datatype
>> > > at
>> > >
>> > >
>> >
>> org.apache.pig.piggybank.storage.SequenceFileLoader.setKeyType(SequenceFileLoader.java:78)
>> > >  at
>> > >
>> > >
>> >
>> org.apache.pig.piggybank.storage.SequenceFileLoader.getNext(SequenceFileLoader.java:133)
>> > >
>> > >
>> > > in the pig file, I have already REGISTERED the jar that contains the
>> > class
>> > >  com.mycompany.model.VisitKey
>> > >
>> > >
>> > > if PIG doesn't work, the only other approach is probably to use some
>> of
>> > the
>> > > newer "pseudo-scripting " languages like cascalog or scala
>> > > thanks
>> > > Yang
>> > >
>> >
>>
>
>


Re: how to load custom Writable class from sequence file?

2013-09-16 Thread Pradeep Gollakota
That's correct...

The "load ... AS (k:chararray, v:chararray);" doesn't actually do what you
think it does. The AS statement tells Pig what the schema types are, so it
will call the appropriate LoadCaster method to get it into the right type.
A LoadCaster object defines how to map byte[] into appropriate Pig
datatypes. If the LoadFunc is not schema aware and you don't have the
schema defined when you load, everything will be loaded as a bytearray.

The problem you have is that the custom writable isn't a Pig datatype. I
don't think you'll be able to do this without writing some custom code.
I'll take a look at the source code for the SequenceFileLoader and see if
there's a way to specify your own LoadCaster. If there is, then you'll just
have to write a custom LoadCaster and specify it in the configuration. If
not, you'll have to extend and roll out your own SequenceFileLoader.


On Mon, Sep 16, 2013 at 6:43 PM, Yang  wrote:

> I think my custom type has toString(), well at least writable() says it's
> writable to bytes, so supposedly if I force it to bytes or string, pig
> should be able to cast
> like
>
> load ... AS ( k:chararray, v:chararray);
>
> but this actually fails
>
>
> On Mon, Sep 16, 2013 at 6:22 PM, Pradeep Gollakota  >wrote:
>
> > The problem is that pig only speaks its data types. So you need to tell
> it
> > how to translate from your custom writable to a pig datatype.
> >
> > Apparently elephant-bird has some support for doing this type of thing...
> > take a look at this SO post
> >
> >
> http://stackoverflow.com/questions/16540651/apache-pig-can-we-convert-a-custom-writable-object-to-pig-format
> >
> >
> > On Mon, Sep 16, 2013 at 5:37 PM, Yang  wrote:
> >
> > > I tried to do a quick and dirty inspection of some of our data feeds,
> > which
> > > are encoded in gzipped SequenceFile.
> > >
> > > basically I did
> > >
> > > a = load 'myfile' using ..SequenceFileLoader() AS ( mykey,
> myvalue);
> > >
> > > but it gave me some error:
> > > 2013-09-16 17:34:28,915 [Thread-5] INFO
> > >  org.apache.hadoop.io.compress.CodecPool - Got brand-new decompressor
> > > 2013-09-16 17:34:28,915 [Thread-5] INFO
> > >  org.apache.hadoop.io.compress.CodecPool - Got brand-new decompressor
> > > 2013-09-16 17:34:28,915 [Thread-5] INFO
> > >  org.apache.hadoop.io.compress.CodecPool - Got brand-new decompressor
> > > 2013-09-16 17:34:28,961 [Thread-5] WARN
> > >  org.apache.pig.piggybank.storage.SequenceFileLoader - Unable to
> > translate
> > > key class com.mycompany.model.VisitKey to a Pig datatype
> > > 2013-09-16 17:34:28,962 [Thread-5] WARN
> > >  org.apache.hadoop.mapred.FileOutputCommitter - Output path is null in
> > > cleanup
> > > 2013-09-16 17:34:28,963 [Thread-5] WARN
> > >  org.apache.hadoop.mapred.LocalJobRunner - job_local_0001
> > > org.apache.pig.backend.BackendException: ERROR 0: Unable to translate
> > class
> > > com.mycompany.model.VisitKey to a Pig datatype
> > > at
> > >
> > >
> >
> org.apache.pig.piggybank.storage.SequenceFileLoader.setKeyType(SequenceFileLoader.java:78)
> > >  at
> > >
> > >
> >
> org.apache.pig.piggybank.storage.SequenceFileLoader.getNext(SequenceFileLoader.java:133)
> > >
> > >
> > > in the pig file, I have already REGISTERED the jar that contains the
> > class
> > >  com.mycompany.model.VisitKey
> > >
> > >
> > > if PIG doesn't work, the only other approach is probably to use some of
> > the
> > > newer "pseudo-scripting " languages like cascalog or scala
> > > thanks
> > > Yang
> > >
> >
>


Re: how to load custom Writable class from sequence file?

2013-09-16 Thread Pradeep Gollakota
The problem is that Pig only speaks its own data types. So you need to tell it
how to translate from your custom writable to a Pig datatype.

Apparently elephant-bird has some support for doing this type of thing...
take a look at this SO post
http://stackoverflow.com/questions/16540651/apache-pig-can-we-convert-a-custom-writable-object-to-pig-format


On Mon, Sep 16, 2013 at 5:37 PM, Yang  wrote:

> I tried to do a quick and dirty inspection of some of our data feeds, which
> are encoded in gzipped SequenceFile.
>
> basically I did
>
> a = load 'myfile' using ..SequenceFileLoader() AS ( mykey, myvalue);
>
> but it gave me some error:
> 2013-09-16 17:34:28,915 [Thread-5] INFO
>  org.apache.hadoop.io.compress.CodecPool - Got brand-new decompressor
> 2013-09-16 17:34:28,915 [Thread-5] INFO
>  org.apache.hadoop.io.compress.CodecPool - Got brand-new decompressor
> 2013-09-16 17:34:28,915 [Thread-5] INFO
>  org.apache.hadoop.io.compress.CodecPool - Got brand-new decompressor
> 2013-09-16 17:34:28,961 [Thread-5] WARN
>  org.apache.pig.piggybank.storage.SequenceFileLoader - Unable to translate
> key class com.mycompany.model.VisitKey to a Pig datatype
> 2013-09-16 17:34:28,962 [Thread-5] WARN
>  org.apache.hadoop.mapred.FileOutputCommitter - Output path is null in
> cleanup
> 2013-09-16 17:34:28,963 [Thread-5] WARN
>  org.apache.hadoop.mapred.LocalJobRunner - job_local_0001
> org.apache.pig.backend.BackendException: ERROR 0: Unable to translate class
> com.mycompany.model.VisitKey to a Pig datatype
> at
>
> org.apache.pig.piggybank.storage.SequenceFileLoader.setKeyType(SequenceFileLoader.java:78)
>  at
>
> org.apache.pig.piggybank.storage.SequenceFileLoader.getNext(SequenceFileLoader.java:133)
>
>
> in the pig file, I have already REGISTERED the jar that contains the class
>  com.mycompany.model.VisitKey
>
>
> if PIG doesn't work, the only other approach is probably to use some of the
> newer "pseudo-scripting " languages like cascalog or scala
> thanks
> Yang
>


Re: Problem while using merge join

2013-09-13 Thread Pradeep Gollakota
I think a better option is to completely bypass the HBaseStorage mechanism.
Since you've already modified it, just put your 2nd UDF in there and have
it return the data that you need right away.

Another question I have is, are you absolutely positive that your data will
continue to be sorted if you've projected away the row key? The columns are
only sorted intra-row.


On Fri, Sep 13, 2013 at 12:06 PM, John  wrote:

> Sure, it is not so fast while loading, but on the other hand I can safe the
> foreach operation after the load function. The best way would be to get all
> Columns and return a bag, but I see there no way because the LoadFunc
> return a Tuple and no Bag. I will try this way and see how fast it is. If
> there are other ideas to make that faster I will try it.
>
> regards,
> john
>
>
> 2013/9/13 Shahab Yunus 
>
> > Wouldn't this slow down your data retrieval? Once column in each call
> > instead of a batch?
> >
> > Regards,
> > Shahab
> >
> >
> > On Fri, Sep 13, 2013 at 2:34 PM, John 
> wrote:
> >
> > > I think I might have found a way to transform it directly into a bag.
> > > Inside the HBaseStorage() Load Function I have set the HBase scan batch
> > to
> > > 1, so I got for every scan.next() one column instead of all columns.
> See
> > >
> http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Scan.html
> > >
> > > setBatch(int batch)
> > > Set the maximum number of values to return for each call to next()
> > >
> > > I think this will work. Any idea if this way have disadvantages?
> > >
> > > regards
> > >
> > >
> > > 2013/9/13 John 
> > >
> > > > hi,
> > > >
> > > > the join key is in the bag, thats the problem. The Load Function
> > returns
> > > > only one element 0$ and that is the map. This map is transformed in
> the
> > > > next step with the UDF "MapToBagUDF" into a bag. for example the load
> > > > functions returns this ([col1,col2,col3), then this map inside the
> > tuple
> > > is
> > > > transformed to:
> > > >
> > > > (col1)
> > > > (col2)
> > > > (col3)
> > > >
> > > > Maybe there is is way to transform the map directly in the load
> > function
> > > > into a bag? The problem I see is that the next() Method in the
> LoadFunc
> > > has
> > > > to be a Tuple and no Bag. :/
> > > >
> > > >
> > > >
> > > > 2013/9/13 Pradeep Gollakota 
> > > >
> > > >> Since your join key is not in the Bag, can you do your join first
> and
> > > then
> > > >> execute your UDF?
> > > >>
> > > >>
> > > >> On Fri, Sep 13, 2013 at 10:04 AM, John 
> > > >> wrote:
> > > >>
> > > >> > Okay, I think I have found the problem here:
> > > >> > http://pig.apache.org/docs/r0.11.1/perf.html#merge-joins ...
> there
> > is
> > > >> > wirtten;
> > > >> >
> > > >> > There may be filter statements and foreach statements between the
> > > sorted
> > > >> > data source and the join statement. The foreach statement should
> > meet
> > > >> the
> > > >> > following conditions:
> > > >> >
> > > >> >- There should be no UDFs in the foreach statement.
> > > >> >- The foreach statement should not change the position of the
> > join
> > > >> keys.
> > > >> >- There should be no transformation on the join keys which will
> > > >> change
> > > >> >the sort order.
> > > >> >
> > > >> >
> > > >> > I have to use a UDF to transform the Map into a Bag ... any
> > Workaround
> > > >> > idea?
> > > >> >
> > > >> > thanks
> > > >> >
> > > >> >
> > > >> > 2013/9/13 John 
> > > >> >
> > > >> > > Hi,
> > > >> > >
> > > >> > > I try to use a merge join for 2 bags. Here is my pig code:
> > > >> > > http://pastebin.com/Y9b2UtNk .
> > > >> > >
> > > >> > > But I got this error:
> > > >> > >
> > > >> > > Caused by:
> > > >> > >
> > > >> >
> > > >>
> > >
> >
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.LogicalToPhysicalTranslatorException:
> > > >> > > ERROR 1103: Merge join/Cogroup only supports Filter, Foreach,
> > > >> Ascending
> > > >> > > Sort, or Load as its predecessors. Found
> > > >> > >
> > > >> > > I think the reason is that there is no sort function or
> something
> > > like
> > > >> > > this. But the bags are definitely sorted. How can I do the merge
> > > join?
> > > >> > >
> > > >> > > thanks
> > > >> > >
> > > >> >
> > > >>
> > > >
> > > >
> > >
> >
>


Re: Problem while using merge join

2013-09-13 Thread Pradeep Gollakota
Since your join key is not in the Bag, can you do your join first and then
execute your UDF?


On Fri, Sep 13, 2013 at 10:04 AM, John  wrote:

> Okay, I think I have found the problem here:
> http://pig.apache.org/docs/r0.11.1/perf.html#merge-joins ... there is
> wirtten;
>
> There may be filter statements and foreach statements between the sorted
> data source and the join statement. The foreach statement should meet the
> following conditions:
>
>- There should be no UDFs in the foreach statement.
>- The foreach statement should not change the position of the join keys.
>- There should be no transformation on the join keys which will change
>the sort order.
>
>
> I have to use a UDF to transform the Map into a Bag ... any Workaround
> idea?
>
> thanks
>
>
> 2013/9/13 John 
>
> > Hi,
> >
> > I try to use a merge join for 2 bags. Here is my pig code:
> > http://pastebin.com/Y9b2UtNk .
> >
> > But I got this error:
> >
> > Caused by:
> >
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.LogicalToPhysicalTranslatorException:
> > ERROR 1103: Merge join/Cogroup only supports Filter, Foreach, Ascending
> > Sort, or Load as its predecessors. Found
> >
> > I think the reason is that there is no sort function or something like
> > this. But the bags are definitely sorted. How can I do the merge join?
> >
> > thanks
> >
>


Re: Sort Order in HBase with Pig/Piglatin in Java

2013-09-13 Thread Pradeep Gollakota
No problem! In this case, insertion order is the same as natural order, so
I think a LinkedHashMap is probably a better choice for this particular use
case.

Here's a great SO post about the differences between HashMap, TreeMap and
LinkedHashMap.
http://stackoverflow.com/questions/2889777/difference-between-hashmap-linkedhashmap-and-sortedmap-in-java




On Fri, Sep 13, 2013 at 9:29 AM, John  wrote:

> Hi, thanks for your quick answer! I figured it out by my self since the
> mailing server was down the last 2hours?!  Btw. I did option 1. But I used
> a LinkedHashMap insead. Do you knows whats the better choice? TreeMap
> or LinkedHashMap?
>
> Anyway thanks :)
>
>
> 2013/9/13 Pradeep Gollakota 
>
> > Thats a great observation John! The problem is that HBaseStorage maps
> > columns families into a HashMap, so the sort ordering is completely lost.
> >
> > You have two options:
> >
> > 1. Modify HBaseStorage to use a SortedMap data structure (i.e. TreeMap)
> and
> > use the modified HBaseStorage. (or make it configurable)
> > 2. Since you convert the map to a bag, you can sort the bag in a nested
> > foreach statement.
> >
> > I prefer option 1 myself because it would be more performant than option
> 2.
> >
> >
> > On Fri, Sep 13, 2013 at 7:31 AM, John 
> wrote:
> >
> > > I have created a HBase Table in the hbase shell and added some data. In
> > > http://hbase.apache.org/book/dm.sort.html is written that the datasets
> > are
> > > first sorted by the rowkey and then the column. So I tried something in
> > the
> > > HBase Shell: http://pastebin.com/gLVAX0rJ
> > >
> > > Everything looks fine. I got the right order a -> c -> d like expected.
> > >
> > > Now I tried the same with Apache Pig in Java:
> > http://pastebin.com/jdTpj4Fu
> > >
> > > I got this result:
> > >
> > > (key1,[c#val,d#val,a#val])
> > >
> > > So, now the order is c -> d -> a. That seems a little odd to me,
> > shouldn't
> > > it be the same like in HBase? It's important for me to get the right
> > order
> > > because I transform the map afterwards into a bag and then join it with
> > > other tables. If both inputs are sorted I could use a merge join
> without
> > > sorting these two datasets. So does anyone know how it is possible to
> get
> > > the sorted map (or bag) of the columns?
> > >
> > >
> > > thanks
> > >
> >
>


Re: Sort Order in HBase with Pig/Piglatin in Java

2013-09-13 Thread Pradeep Gollakota
That's a great observation, John! The problem is that HBaseStorage maps
column families into a HashMap, so the sort ordering is completely lost.

You have two options:

1. Modify HBaseStorage to use a SortedMap data structure (i.e. TreeMap) and
use the modified HBaseStorage. (or make it configurable)
2. Since you convert the map to a bag, you can sort the bag in a nested
foreach statement.

I prefer option 1 myself because it would be more performant than option 2.
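
Option 2 would look roughly like this (aliases are assumptions: withBag is the
relation holding rowkey plus the bag, cols, produced by your map-to-bag UDF):

sorted = FOREACH withBag {
    ordered = ORDER cols BY $0 ASC;
    GENERATE rowkey, ordered;
};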


On Fri, Sep 13, 2013 at 7:31 AM, John  wrote:

> I have created a HBase Table in the hbase shell and added some data. In
> http://hbase.apache.org/book/dm.sort.html is written that the datasets are
> first sorted by the rowkey and then the column. So I tried something in the
> HBase Shell: http://pastebin.com/gLVAX0rJ
>
> Everything looks fine. I got the right order a -> c -> d like expected.
>
> Now I tried the same with Apache Pig in Java: http://pastebin.com/jdTpj4Fu
>
> I got this result:
>
> (key1,[c#val,d#val,a#val])
>
> So, now the order is c -> d -> a. That seems a little odd to me, shouldn't
> it be the same like in HBase? It's important for me to get the right order
> because I transform the map afterwards into a bag and then join it with
> other tables. If both inputs are sorted I could use a merge join without
> sorting these two datasets. So does anyone know how it is possible to get
> the sorted map (or bag) of the columns?
>
>
> thanks
>


Re: Join Question

2013-09-04 Thread Pradeep Gollakota
I think there's probably some convoluted way to do this. First thing you'll
have to do is flatten your data.

data1 = A, B
_
X, X1
X, X2
Y, Y1
Y, Y2
Y, Y3

Then do a join by "B" onto your second dataset. This should produce the
following:

data2 = data1::A, data1::B, data2::A, data2::B, data2::C, data2::D (I'm
assuming the second data set has exactly 4 columns).
___
X, X1, X1, 4, 5, 6
X, X2, X2, 3, 7, 3

Now do a group by data1::A to get
{X, {(X, X1, X1, 4, 5, 6), (X, X2, X2, 3, 7, 3), ...}}
{Y, {(Y, Y1, Y1, ...), (Y, Y2, Y2, ...), ...}}

This is as far as I got, I'm not sure if there's a built-in UDF to
transform that into what you're looking for. I thought maybe BagToTuple,
but it will return a single tuple with all elements of all tuples in the
bag. If the above data format supports your use cases, you're done. If not,
you can write a UDF to transform it into the required format.
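
Putting the steps above into a sketch (the schemas are assumptions based on
your example data):

data1 = LOAD 'data1' AS (a: chararray, members: bag{t: (m: chararray)});
data2 = LOAD 'data2' AS (m: chararray, v1: int, v2: int, v3: int);

flat1 = FOREACH data1 GENERATE a, FLATTEN(members) AS m;
joined = JOIN flat1 BY m, data2 BY m;
grouped = GROUP joined BY flat1::a;
dump grouped;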


On Wed, Sep 4, 2013 at 4:39 PM, F. Jerrell Schivers
wrote:

> Howdy folks,
>
> Let's say I have a set of data that looks like this:
>
> X, (X1, X2)
> Y, (Y1, Y2, Y3)
>
> So there could be an unknown number of members of each tuple per row.
>
> I also have a second set of data that looks like this:
>
> X1, 4, 5, 6
> X2, 3, 7, 3
>
> I'd like to join these such that I get:
>
> X, (X1, 4, 5, 6), (X2, 3, 7, 3)
> Y, (Y1, etc), (Y2, etc), (Y3, etc)
>
> Is this possible with Pig?
>
> Thanks,
> Jerrell
>


Re: Pig upgrade

2013-08-23 Thread Pradeep Gollakota
Most of the major changes were introduced in 0.9.

The documentation listing the backward compatibility issues with 0.9 can be
found at
https://cwiki.apache.org/confluence/display/PIG/Pig+0.9+Backward+Compatibility

I believe another change not listed there is the introduction of macros.

A complete list of features/bug fixes/improvements by version number can be
found on JIRA at
https://issues.apache.org/jira/browse/PIG#selectedTab=com.atlassian.jira.plugin.system.project%3Achangelog-panel

Hope this helps.


On Sat, Aug 24, 2013 at 1:16 AM, Viswanathan J
wrote:

> Hi,
>
> I'm planning to upgrade pig version from 0.8.0 to 0.11.0, hope this is
> stable release.
>
> So what are the improvements, key features, benefits, advantages by
> upgrading this?
>
> Thanks, Viswa.J
>


Re: dev How can I add a row number per input file to the data

2013-08-21 Thread Pradeep Gollakota
That's an interesting approach! However, I'm not sure RANK is supported as a
nested FOREACH operator. If it is supported, then this approach would work.
The documentation doesn't list RANK as a supported nested FOREACH operator:

http://pig.apache.org/docs/r0.11.1/basic.html#foreach


On Wed, Aug 21, 2013 at 11:03 AM, Ruslan Al-Fakikh wrote:

> Hi!
>
> Probably these can help:
> http://pig.apache.org/docs/r0.11.1/basic.html#rank
> http://pig.apache.org/docs/r0.11.1/func.html#pigstorage (look for
> -tagsource)
>
> I've never tried this, but probably you could group by tagsource and then
> apply RANK
>
> Ruslan
>
>
> On Fri, Aug 16, 2013 at 6:17 AM, Leo  wrote:
>
> > Hi, I want to add a row/line number to the data I read from multiple
> CSVs.
> > However I want the running number reflect the line number *per input
> file*,
> > not overall.
> >
> > I am happy to write a Python UDF for this. So far I have in the UDF:
> >
> > --- Python file udf.py ---
> > lineNum = 0
> >
> > @outputSchema("lnum:int, f1:chararray")
> > def makeData(line):
> > global lineNum
> > lineNum += 1
> > return lineNum, line.tostring()
> >
> > which is called from Pig:
> >
> > --- Pig file use-udf.pig ---
> > register 'udf.py' using jython as udfs;
> >
> > data = load 'datadir' using TextLoader() as line;
> > udfified = foreach data generate udfs.makeData(line);
> >
> > dump udfified;
> >
> > This approach works, *but* the running number increases over multiple
> > files in the directory "datadir". That is *not* what I want! I need the
> row
> > number starting with 1 for each file in datadir. Maybe I can reset the
> > lineNum variable per input file?
> >
> > Any idea how to achieve this? Either with plain Pig or with Python UDFs?
> >
> > Many thanks, Leo
> >
>


Re: Pig Latin Program with special Load Function

2013-08-21 Thread Pradeep Gollakota
In your eval function, you can use the HBase Get/Scan API to retrieve the
data rather than using the MapReduce API.


On Wed, Aug 21, 2013 at 7:12 AM, John  wrote:

> Im currently writing a Pig Latin programm:
>
> A = load 'hbase://mytable1' my.packages.CustomizeHBaseStorage('VALUE',
> '-loadKey true', 'myrowkey1') as (rowkey:chararray, columncontent:map[]);
> ABag = foreach PATTERN_0 generate flatten(my.packages.MapToBag($1)) as
> (output:chararray);
>
> the CustimizeHbaseStorage is loading the row "myrowkey1" and after that the
> map for this rowkey is transformed to a Bag. That works fine so far.
>
> So, in the ABag are now some entries. With this entries I try to do load
> new row keys (every entry in the bag is the information for a new rowkey I
> have to load next). So I tried something like this:
>
> X= FOREACH ABag {
>  TMP = load 'hbase://mytable2'
> my.packages.customizeHBaseStorage('VALUE', '-loadKey true', '$0') as
> (rowkey:chararray, columncontent:map[]);
> GENERATE (TMP.$0);
> }
>
> This doesn't does not work, because as far as I now the load statement is
> not allowed for FOREACH.
>
> So I tried to build my own EvalFunc:
>
> X = FOREACH INTERMEDIATE_BAG_0 GENERATE my.packages.MyNewUDF($0);
>
> Here is the Java Code for the MyNewUDF:
>
> ...
> public DataBag exec(Tuple input) throws IOException {
>  DataBag result = null;
> try {
> result = bagFactory.newDefaultBag();
>  CustomizeHBaseStorage loader = new CustomizeHBaseStorage("VALUE",
> "-loadKey true", input
> .get(0).toString());
>  loader.getInputFormat();
> Tuple curTuple = loader.getNext();
>  while (curTuple != null) {
> result.add(curTuple);
> curTuple = loader.getNext();
>  }
> } catch (ParseException e) {
> e.printStackTrace();
>  }
> return result;
>
> }
> ...
>
> I think this would work, but the problem is I got a NullpointerException
> because the RecordReader in the HBaseStorage is not initialized when
> executing getNext(). So if anybody can say me how I can initialized the
> RecordReader (and I think the PigSplit too, because its necessary for
> CustomizeHBaseStorage .prepareToRead(RecordReader reader, PigSplit split))
> or maybe another approach I would be thankful.
>
>
> BTW. I know that I can load the whole mytable2 in a new alias and then JOIN
> ABag and the new alias, but I try to optimize my program, beacuse it is not
> necessary to load the whole mytable2. I try to build a "join" with
> information passing.
>
> Thanks
>


Re: How to optimize my request

2013-08-19 Thread Pradeep Gollakota
I have a couple of ideas that MAY help. I'm not familiar with your data,
but these techniques are worth trying.

First, this probably won't affect the performance, but rather than having 3
FILTER statements at the top of your script, you can use the SPLIT operator
to split your dataset into 3 datasets.
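
Something like this (sketch only, reusing your aliases):

SPLIT A INTO
    filter_response_time_less_than_1_s IF response_time < 1000.0,
    filter_response_time_between_1_s_and_2_s IF (response_time >= 1000.0 AND response_time < 1999.0),
    filter_response_time_between_greater_than_2_s IF response_time >= 2000.0;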

I'm not sure what purpose the COGROUP is serving, but this seems to be the
source of the bottleneck. One optimization technique you can try is to
GROUP your data first and then use nested FILTER statements to get your
counts.

For example, you have the following:

A = LOAD 'data' USING MyUDFLoader('data.xml');
filter_response_time_less_than_1_s = FILTER A BY (response_time < 1000.0);
filter_response_time_between_1_s_and_2_s = FILTER A BY (response_time >= 1000.0 AND response_time < 1999.0);
filter_response_time_between_greater_than_2_s = FILTER A BY (response_time >= 2000.0);
star__zne_asfo_access_log = FOREACH ( COGROUP A BY (date_day,url,date_minute,ret_code,serveur),
        filter_response_time_between_greater_than_2_s BY (date_day,url,date_minute,ret_code,serveur),
        filter_response_time_less_than_1_s BY (date_day,url,date_minute,ret_code,serveur),
        filter_response_time_between_1_s_and_2_s BY (date_day,url,date_minute,ret_code,serveur) )
{
    GENERATE
        FLATTEN(group) AS (date_day,zne_asfo_url,date_minute,zne_http_code,zne_asfo_server),
        (long)SUM((bag{tuple(long)})A.response_time) AS response_time,
        COUNT(filter_response_time_less_than_1_s) AS response_time_less_than_1_s,
        COUNT(filter_response_time_between_1_s_and_2_s) AS response_time_between_1_s_and_2_s,
        COUNT(filter_response_time_between_greater_than_2_s) AS response_time_between_greater_than_2_s,
        COUNT(A) AS nb_hit;
};

This can possibly be changed to

A = LOAD 'data' USING MyUDFLoader('data.xml');
star__zne_asfo_access_log = FOREACH (GROUP A BY (date_day,url,date_minute,ret_code,serveur)) {
    filter_response_time_less_than_1_s = FILTER A BY (response_time < 1000.0);
    filter_response_time_between_1_s_and_2_s = FILTER A BY (response_time >= 1000.0 AND response_time < 1999.0);
    filter_response_time_between_greater_than_2_s = FILTER A BY (response_time >= 2000.0);
    GENERATE
        FLATTEN(group) AS (date_day,zne_asfo_url,date_minute,zne_http_code,zne_asfo_server),
        (long) SUM((bag{tuple(long)})A.response_time) AS response_time,
        COUNT(filter_response_time_less_than_1_s) AS response_time_less_than_1_s,
        COUNT(filter_response_time_between_1_s_and_2_s) AS response_time_between_1_s_and_2_s,
        COUNT(filter_response_time_between_greater_than_2_s) AS response_time_between_greater_than_2_s,
        COUNT(A) AS nb_hit;
};

I think in the COGROUP, your data has been duplicated 3 times (plus the
original), so you're joining 4 times the original size.


On Mon, Aug 19, 2013 at 10:49 AM, 35niavlys <35niav...@gmail.com> wrote:

> Hi,
>
> I want to execute a pig command in embedded java program. For moment, I try
> Pig in local mode. My data file size is around 15MB but the execution of
> this command is very long so I think my script need optimizations...
>
> My script :
>
> A = LOAD 'data' USING MyUDFLoader('data.xml');
> > filter_response_time_less_than_1_s = FILTER A BY (response_time <
> 1000.0);
> > filter_response_time_between_1_s_and_2_s = FILTER A BY (response_time >=
> 1000.0 AND response_time < 1999.0);
> > filter_response_time_between_greater_than_2_s = FILTER A BY
> (response_time >= 2000.0);
> > star__zne_asfo_access_log = FOREACH ( COGROUP A BY
> (date_day,url,date_minute,ret_code,serveur),
> filter_response_time_between_greater_than_2_s BY
> (date_day,url,date_minute,ret_code,serveur),
> filter_response_time_less_than_1_s BY
> (date_day,url,date_minute,ret_code,serveur),
> filter_response_time_between_1_s_and_2_s BY
> (date_day,url,date_minute,ret_code,serveur) )
> > {
> > GENERATE
> > FLATTEN(group) AS
> (date_day,zne_asfo_url,date_minute,zne_http_code,zne_asfo_server),
> > (long)SUM((bag{tuple(long)})A.response_time) AS
> response_time,
> > COUNT(filter_response_time_less_than_1_s) AS
> response_time_less_than_1_s,
> > COUNT(filter_response_time_between_1_s_and_2_s) AS
> response_time_between_1_s_and_2_s,
> > COUNT(filter_response_time_between_greater_than_2_s) AS
> response_time_between_greater_than_2_s,
> > COUNT(A) AS nb_hit;
> > };
> > agg__zne_asfo_access_log_ymd = FOREACH ( COGROUP A BY
> (date_day,date_year,date_month),
> filter_response_time_between_greater_than_2_s BY
> (date_day,date_year,date_month), filter_response_time_less_than_1_s BY
> (date_day,date_year,date_month), filter_response_time_between_1_s_and_2_s
> BY (date_day,date_year,date_month) )
> > {
> > GENERATE
> > FLATTEN(group) AS (date_day,date_year

Re: I think an example in the docs is wrong

2013-08-08 Thread Pradeep Gollakota
I believe the procedure is to file a bug report on JIRA and set the
component field to 'documentation'.

Pig veterans, please correct me if I'm wrong.


On Thu, Aug 8, 2013 at 10:19 PM, Paul Houle  wrote:

> I recently wrote a load function and to get started I cut-n-pasted from the
> SimpleTextLoader example on the page
>
> http://pig.apache.org/docs/r0.11.1/udf.html#load-store-functions
>
> This contains the following code:
>
> boolean notDone = in.nextKeyValue();
> if (notDone) {
> return null;
> }
>
> when data is available,  notDone is true,  and then null gets returned
> rather than proceeding to process the row.  Putting a ! operator in
> there quickly cleared up the problem.  I've seen the problem too in
> other versions of the doc.
>
> It would be nice to get this fixed so other people don't make this
> mistake I made.
>
> Is there an issue tracking system where I should put things like this?
>


Re: field name reference - alias

2013-08-08 Thread Pradeep Gollakota
This is expected behavior. The disambiguation comes only after two or more
relations are brought together.

As per the docs at
http://pig.apache.org/docs/r0.11.1/basic.html#disambiguate, the
disambiguate operator can only be used to identify field names after JOIN,
COGROUP, CROSS, or FLATTEN operators.

The difference between the first and third example is that in your first
example, you have a JOIN operator. You would get a syntax error if you
tried to say

C = JOIN A by A::x LEFT OUTER, B by a;

There are no fields named 'A::x' in A. However, in C, you have a field
named 'A::x'. You can refer to this field by 'x' (because no other field is
also named 'x') or by 'A::x'.
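
In other words, after the JOIN both of these are fine (sketch):

D1 = FOREACH C GENERATE x AS toto;     -- short name works, no ambiguity
D2 = FOREACH C GENERATE A::x AS toto;  -- fully qualified name also works

In your second example, TOMAP is applied directly to A, where the field is
only known as 'x', so 'A::x' doesn't exist yet.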

Hope that helps.


On Thu, Aug 8, 2013 at 9:59 PM, Keren Ouaknine  wrote:

> Hello,
>
> Can one refer to a field name with no ambiguity by its full name (A::x
> instead of x)? Below are two contradictory behaviors:
> *
> *
> *First example:*
> A = load '1.txt'  using PigStorage(' ')  as (x:int, y:chararray,
> z:chararray);
> B = load '1_ext.txt'  using PigStorage(' ')  as (a:int, b:chararray,
> c:chararray);
> C = JOIN A by x LEFT OUTER, B BY a;
> D = FOREACH C GENERATE A::x as toto;
> describe C;
> describe D;
>
> *output:*
> C: {A::x: int,A::y: chararray,A::z: chararray,B::a: int,B::b:
> chararray,B::c: chararray}
> D: {toto: int}
>
> Works fine also if you refer to A:: x as x.
>
> *Second example with toMap:*
> A = load '1.txt'  using PigStorage(' ')  as (x:int, y:chararray,
> z:chararray);
> B = FOREACH A GENERATE TOMAP('toto', x);
> describe B;
> DUMP B;
> store B into '/home/kereno/Documents/pig-0.11.1/workspace/res';
>
> *output:*
> C: {map[]}
>
> If you change the script to refer to A::x, you would get an error as
> follow:
> A = load '1.txt'  using PigStorage(' ')  as (x:int, y:chararray,
> z:chararray);
> B = FOREACH A GENERATE TOMAP('toto', A::x);
> describe B;
> DUMP B;
> store B into '/home/kereno/Documents/pig-0.11.1/workspace/res';
>
> output
>  Invalid field projection. Projected
> field [A::x] does not exist in schema: x:int,y:chararray,z:chararray.
>
> My question is why is it that for the FOREACH I can use either and not for
> the TOMAP??
> side node: I am asking cause I am generating schemas of a Pig script and
> use these as input for another language (project translating Pig to
> Algebricks) and would like to be consistent with the Pig behavior :).
>
> Thanks,
> Keren
>
> --
> Keren Ouaknine
> Web: www.kereno.com
>


Re: Replace join with custom implementation

2013-08-02 Thread Pradeep Gollakota
Oh... sorry... I missed the part where you were saying that you want to
reimplement the replicated join algorithm.


On Fri, Aug 2, 2013 at 9:13 AM, Pradeep Gollakota wrote:

> join BIG by key, SMALL by key using 'replicated';
>
>
> On Fri, Aug 2, 2013 at 5:29 AM, Serega Sheypak 
> wrote:
>
>> Hi. I've met a problem wth replicated join in pig 0.11
>> I have two relations:
>> BIG (3-6GB) and SMALL (100MB)
>> I do join them on four integer fields.
>> It takes  up to 30 minutes to join them.
>>
>> Join runs on 18 reducers: -Xmx=3072mb for Java, 128 GB in total
>> 32 cores on each TaskTracker.
>>
>> So our hardware is really powerful.
>>
>> I've ran a part of join locally and met terrible situation:
>> 50% of heap:
>> is Integers,
>> arrays of integers these integers
>> and ArrayLists for arrays with integers.
>>
>> GC overhead limit happens. The same happend on cluster. I did raise Xms,
>> Xms on cluster and problem is gone.
>>
>> Anyway, joining 6GB/18 and 00Mb  for 30 minutes is too much.
>> I would like to reiplement replicated join.
>> How can I do it?
>>
>
>


Re: Replace join with custom implementation

2013-08-02 Thread Pradeep Gollakota
join BIG by key, SMALL by key using 'replicated';
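
With your four join fields, it would look something like this (sketch; the
field names are made up):

joined = JOIN BIG BY (k1, k2, k3, k4), SMALL BY (k1, k2, k3, k4) USING 'replicated';
-- everything after the first relation is loaded into memory, so the small input goes last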


On Fri, Aug 2, 2013 at 5:29 AM, Serega Sheypak wrote:

> Hi. I've met a problem wth replicated join in pig 0.11
> I have two relations:
> BIG (3-6GB) and SMALL (100MB)
> I do join them on four integer fields.
> It takes  up to 30 minutes to join them.
>
> Join runs on 18 reducers: -Xmx=3072mb for Java, 128 GB in total
> 32 cores on each TaskTracker.
>
> So our hardware is really powerful.
>
> I've ran a part of join locally and met terrible situation:
> 50% of heap:
> is Integers,
> arrays of integers these integers
> and ArrayLists for arrays with integers.
>
> GC overhead limit happens. The same happend on cluster. I did raise Xms,
> Xms on cluster and problem is gone.
>
> Anyway, joining 6GB/18 and 00Mb  for 30 minutes is too much.
> I would like to reiplement replicated join.
> How can I do it?
>


Re: error while executing command after creating own udf

2013-07-31 Thread Pradeep Gollakota
I seem to remember another person asking a similar question on the mailing
list before.

I think the answer was a mismatch between the version of Pig you're
running and the version of Pig you compiled your UDF against.


On Wed, Jul 31, 2013 at 4:52 AM, manish dunani  wrote:

> Hello,
>
> I created my own udf .which convert the word from lowercase to uppercase.
>
> *Input file:*
> *
> *
> *
> *
> manish,23,7.06
> vigs,23,7.3
> amardas,23,8.9
>
> *Commands:*
>
> grunt> register /home/manish/Desktop/myownudf.jar  /*successfully done*/
> grunt> define mine com.my.own.udf.upper();  /*successfully done*/
> grunt> a = load '/home/manish/Desktop/marks' using PigStorage(',') as
> (name:chararray,age:int,gpa:float); /*successfully done*/
>
> *Got error while executing:*
> *
> *
> grunt> b = foreach a generate mine($0);
>
> *error:*
> *
> *
> 2013-07-31 01:33:34,256 [main] ERROR org.apache.pig.tools.grunt.Grunt -
> ERROR 1069: Problem resolving class version numbers for class
> com.my.own.udf.upper.
>
>
> I already use piggybanks' udf with same procedure.But,i couldn't face any
> error it's simply done..
>
> i can not resolve this error why this happen..
>
> can any one have an idea???
>
> Your help will be appreciated.
>
> --
> MANISH DUNANI
> -THANX
>


Re: Get the tree structure of a HDFS dir, similar to dir/files

2013-07-27 Thread Pradeep Gollakota
Huy,

I think this question probably belongs on the Hadoop mailing list rather
than the Pig mailing list.
However, I think you're looking for
http://hadoop.apache.org/docs/r1.0.4/api/org/apache/hadoop/fs/FileStatus.html
A FileStatus object can be acquired from a FileSystem object by calling the
.getFileStatus(Path path) method.

http://hadoop.apache.org/docs/r1.0.4/api/org/apache/hadoop/fs/FileSystem.html
https://hadoop.apache.org/docs/r1.0.4/api/org/apache/hadoop/fs/Path.html

Hope this helps.



On Tue, Jul 23, 2013 at 2:05 PM, Huy Pham  wrote:

> Hi All,
>Do any of you have or can refer me to some sample Java code that get
> the tree structure of a HDFS directory, similar to the file system?
>For example: I have a HDFS dir, called /data, inside data, there is
> /data/valid and /data/invalid, and so on, so I would need to be able to get
> the whole tree structure of that and know which is is a dir, which one is a
> file. Both program and HDFS are LOCAL.
>In other words, what I look for is something similar to File class in
> Java, which has isDirectory() and list() to list all the children (files
> and dirs) of a dir. Found something in stackoverflow but it does not work.
> Thanks
> Huy
>
>
>


Pig and Storm

2013-07-23 Thread Pradeep Gollakota
Hi Pig Users and Developers,

I asked a question on the dev mailing list earlier today about Pig and
Storm. However, having thought more about it, I think the user list is more
appropriate. Here's the original email verbatim.

"I wanted to reach out to you all and ask for you opinion on something.

As a Pig user, I have come to love Pig as a framework. Pig provides a great
set of abstractions that make working with large datasets easy. Currently
Pig is only backed by hadoop. However, with the new rise of Twitter Storm
as a distributed real time processing engine, Pig users are missing out on
a great opportunity to be able to work with Pig in Storm. As a user of Pig,
Hadoop and Storm, and keeping with the Pig philosophy of "Pigs live
anywhere," I'd like to get your thoughts on starting the implementation of
a Pig backend for Storm."

Thanks
Pradeep


Re: pig 0.8.1 - Iterating contents of a Bag

2013-07-23 Thread Pradeep Gollakota
Amit,

It looks like the FLATTEN operator is exactly what you're looking for
(based on both the 'output you'd like to see' and the fact that your UDF
accepts chararrays and not bags).

I'm not sure I understand your question about iterating over bags. Do you
want to call your UDF on each tuple in the bag without flattening it first,
so that your bag will be transformed in place and that data is still
grouped?

If that is the case, you can't do it in Pig 0.8... nested FOREACH
statements were introduced in Pig 0.9. That being said, if you want your
data to be transformed in place without having to flatten first and then
regroup, you could either rewrite your UDF to accept bags instead of
chararrays, or write a wrapper UDF that calls your existing UDF.
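
For the flatten route, a rough sketch (untested; using A1 from your DESCRIBE
output and assuming your existing UDF is registered as MyUDF):

B = FOREACH A1 GENERATE key, FLATTEN(keywords.keyword) AS keyword;
C = FOREACH B GENERATE key, MyUDF(keyword);  -- one call per keyword, key kept alongside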


On Tue, Jul 23, 2013 at 4:25 PM, Amit  wrote:

> Thanks for the quick response.
> However I do not want to flatten because I plan to invoke a previously
> written UDF which accepts a chararray to using each value in the Bag.
>
> I am not sure if it at all is possible with 0.8.1 but just thought to seek
> view from experts on this mailing list.
>
>
> Regards,
> Amit
>
>  From: Serega Sheypak 
>
> To: user@pig.apache.org; Amit 
> Sent: Tuesday, July 23, 2013 4:23 PM
> Subject: Re: pig 0.8.1 - Iterating contents of a Bag
>
>
>
> Hi, I'm new to pig, will try to help you.
> B = FOREACH A {
> GENERATE FLATTEN(keywords.keyword) as keyword;
> };
>
>
> OR
> B = FOREACH A {
> GENERATE FLATTEN(keywords.keyword) as (keyword);
> };
>
>
> You need flatten the bag.
>
>
>
>
> 2013/7/24 Amit 
>
> Hello there,
> >I am loading a data in form of
> >
> >A1: {key: chararray,keywords: {keywords_tuple: (keyword: chararray)}}
> >
> >I believe the Sample data would look like the following
> >
> >{1, {('amit'),('yahoo'),('pig')}
> >
> >I am trying to write a foreach where I can loop through the each keyword
> in the bag.
> >
> >I tried writing this but it seems to not dump the output the way I want
> to see
> >
> >
> >B = FOREACH A {
> >GENERATE keywords.keyword;
> >};
> >
> >I would like to see
> >
> >('amit')
> >('yahoo')
> >('pig')
> >
> >Instead it prints the entire bag at once like the one below.
> >
> >{('amit'),('yahoo'),('pig')}
> >
> >
> >
> >Please note I do not want to flatten the bag as what I want to process
> each keyword in the bag using a UDF later on.
> >
> >Appreciate any of your inputs.
> >
> >Regards,
> >Amit
> >
>


Re: Filter bag with multiple output

2013-07-23 Thread Pradeep Gollakota
You can do the SPLIT outside the nested FOREACH. I'm assuming you have a
UDF named VALID defined.

So, your script can be written as:

rawRecords = LOAD '/data' as ...;
grouped = GROUP rawRecords BY msisdn;
validAndNotValidRecords = FOREACH grouped {
 ordered = ORDER rawRecords BY ts;
 GENERATE group as group_key, ordered as data;
};
SPLIT validAndNotValidRecords INTO validRecords IF VALID(data),
invalidRecords OTHERWISE;
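
Equivalently, two top-level FILTER statements instead of the SPLIT would also work:

validRecords = FILTER validAndNotValidRecords BY VALID(data);
invalidRecords = FILTER validAndNotValidRecords BY NOT VALID(data);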




On Tue, Jul 23, 2013 at 8:58 AM, Serega Sheypak wrote:

> Omg, thanks it's exactly the thing I need.
>
> I can't do it before GROUP. I need group by key, then sort by timestamp
> field inside each group.
> After sort is done I do can determine non valid records.
> I've provided simplified case.
>
> The only problem is that SPLIT is not allowed in nested FOREACH statement.
>
>
> 2013/7/23 Pradeep Gollakota 
>
> > You can use the SPLIT operator to split a relation into two (or more)
> > relations. http://pig.apache.org/docs/r0.11.1/basic.html#SPLIT
> >
> > Also, you should probably do this before GROUP. As a best practice (and
> > general pig optimization strategy), you should filter (and project) early
> > and often.
> >
> >
> > On Tue, Jul 23, 2013 at 4:27 AM, Serega Sheypak <
> serega.shey...@gmail.com
> > >wrote:
> >
> > > Hi, I have rather simple problem and I can't create nice solution.
> > > Here is my input:
> > > msisdn longitude latitude ts
> > > 1 20.30 40.50 123
> > > 1 0.0 null 456
> > > 2 60.70 34.67 678
> > > 2 null null 978
> > >
> > > I need:
> > > group by msisdn
> > > order by ts inside each group
> > > filter records in each group:
> > > 1. put all records where longitude, latitude are valid on one side
> > > 2. put all records where longitude/latidude = 0.0/null to the othe side
> > >
> > > Here is pig pseudo-code:
> > > rawRecords = LOAD '/data' as ...;
> > > grouped = GROUP rawRecords BY msisdn;
> > > validAndNotValidRecords = FOREACH grouped{
> > >  ordered = ORDER rawRecords BY ts;
> > >  --do sometihing here to filter valid and not valid
> > records
> > > }
> > > STORE notValidRecords INTO /not_valid_data;
> > >
> > > someOtherProjection = GROUP validRecords By msisdn;
> > > --continue to work with filtered valid records...
> > >
> > > Can I do it in a single pig script, or I need to create two scripts:
> > > the first one would filter not valid records and store them
> > > the second one will continue to process filtered set of records?
> > >
> >
>


Re: Filter bag with multiple output

2013-07-23 Thread Pradeep Gollakota
You can use the SPLIT operator to split a relation into two (or more)
relations. http://pig.apache.org/docs/r0.11.1/basic.html#SPLIT

Also, you should probably do this before GROUP. As a best practice (and
general pig optimization strategy), you should filter (and project) early
and often.


On Tue, Jul 23, 2013 at 4:27 AM, Serega Sheypak wrote:

> Hi, I have rather simple problem and I can't create nice solution.
> Here is my input:
> msisdn longitude latitude ts
> 1 20.30 40.50 123
> 1 0.0 null 456
> 2 60.70 34.67 678
> 2 null null 978
>
> I need:
> group by msisdn
> order by ts inside each group
> filter records in each group:
> 1. put all records where longitude, latitude are valid on one side
> 2. put all records where longitude/latidude = 0.0/null to the othe side
>
> Here is pig pseudo-code:
> rawRecords = LOAD '/data' as ...;
> grouped = GROUP rawRecords BY msisdn;
> validAndNotValidRecords = FOREACH grouped{
>  ordered = ORDER rawRecords BY ts;
>  --do sometihing here to filter valid and not valid records
> }
> STORE notValidRecords INTO /not_valid_data;
>
> someOtherProjection = GROUP validRecords By msisdn;
> --continue to work with filtered valid records...
>
> Can I do it in a single pig script, or I need to create two scripts:
> the first one would filter not valid records and store them
> the second one will continue to process filtered set of records?
>


Re: Large Bag (100GB of Data) in Reduce Step

2013-07-22 Thread Pradeep Gollakota
There's only one thing that comes to mind for this particular toy example.

From the "Programming Pig" book,
"pig.cached.bag.memusage" property is the "Percentage of the heap that Pig
will allocate for all of the bags in a map or reduce task. Once the bags
fill up this amount, the data is spilled to disk. Setting this to a higher
value will reduce spills to disk during execution but increase the
likelihood of a task running out of heap."
The default value of this property is 0.1

So, you can try setting this to a higher value to see if it can improve
performance.
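
For example, I believe you can set it right in your script (0.3 here is just
an arbitrary value to experiment with):

set pig.cached.bag.memusage 0.3;  -- default is 0.1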

Other than the above setting, I can only quote the basic patterns for
optimizing performance (also from Programming Pig):
Filter early and often
Project early and often
Set up your joins properly
etc.



On Mon, Jul 22, 2013 at 9:31 AM, Jerry Lam  wrote:

> Hi Pig users,
>
> I have a question regarding how to handle a large bag of data in reduce
> step.
> It happens that after I do the following (see below), each group has about
> 100GB of data to process. The bag is spilled continuously and the job is
> very slow. What is your recommendation of speeding the processing when you
> find yourself a large bag of data (over 100GB) to process?
>
> A = LOAD '/tmp/data';
> B = GROUP A by $0;
> C = FOREACH B generate FLATTEN($1); -- this takes very very long because of
> a large bag
>
> Best Regards,
>
> Jerry
>


Re: Execute multiple PIG scripts parallely

2013-07-22 Thread Pradeep Gollakota
You could probably just use nohup if they're all parallel and send them
into the background.

nohup pig script1.pig &
nohup pig script2.pig &
etc.
On Jul 22, 2013 7:12 AM, "manishbh...@rocketmail.com" <
manishbh...@rocketmail.com> wrote:

> You can create job flow in oozie.
>
> Sent via Rocket from my HTC
>
> - Reply message -
> From: "Bhavesh Shah" 
> To: "user@pig.apache.org" 
> Subject: Execute multiple PIG scripts parallely
> Date: Mon, Jul 22, 2013 4:04 PM
>
>
> Hello All,
>
>
>
> I have multiple PIG Script with and currently I am executing it in
> sequential manner using command
>
> pig -x mapreduce /path/to/Script/Script1.pig &&
> /path/to/Script/Script2.pig && /path/to/Script/Script3.pig
>
>
>
> But now I am looking for executing those scripts in parallel as all are
> independent of each other. I searched for it but not getting exactly.
>
>
>
> So is there any way through which I can execute my all scripts parallely?
>
>
>
>
>
> Thanks,
>
> Bhavesh Shah
>


Re: Getting dimension values for Facts

2013-07-18 Thread Pradeep Gollakota
Unfortunately I can't think of any good way of doing this (other than what
Bertrand suggested: using a different language to generate the script).

I'd also recommend Hive... it may be easier to do this in Hive since you
have SQL-like syntax. (I haven't used Hive, but it looks like this type of
thing would be far more natural there.)


On Thu, Jul 18, 2013 at 12:09 PM, Something Something <
mailinglist...@gmail.com> wrote:

> I don't think this is macro-able, Pradeep.  Every step of the way a
> different column gets updated.  For example, for FACT_TABLE3 we update
> 'col1' from DIMENSION1, for FACT_TABLE5 we update 'col2' from DIMENSION2 &
> so on.
>
> Feel free to correct me if I am wrong.  Thanks.
>
>
>
>
>
> On Thu, Jul 18, 2013 at 8:25 AM, Pradeep Gollakota  >wrote:
>
> > Looks like this might be macroable. Not entirely sure how that can be
> done
> > yet... but I'd look into that if I were you.
> >
> >
> > On Thu, Jul 18, 2013 at 11:16 AM, Something Something <
> > mailinglist...@gmail.com> wrote:
> >
> > > Wow, Bertrand, on the Pig mailing list you're recommending not to use
> > > Pig... LOL!  Jokes apart, I would think this would be a common use case
> > for
> > > Pig, no?  Generating a Pig script on the fly is a decent idea, but
> we're
> > > hoping to avoid that - unless there's no other way.  Thanks for the
> > > pointers.
> > >
> > >
> > > On Thu, Jul 18, 2013 at 2:52 AM, Bertrand Dechoux  > > >wrote:
> > >
> > > > I would say either generate the script using another language (eg
> > Python)
> > > > or use a true programming language with an API having the same level
> of
> > > > abstraction (eg Java and Cascading).
> > > >
> > > > Bertrand
> > > >
> > > >
> > > > On Thu, Jul 18, 2013 at 8:44 AM, Something Something <
> > > > mailinglist...@gmail.com> wrote:
> > > >
> > > > > There must be a better way to do this in Pig.  Here's how my script
> > > looks
> > > > > like right now:  (omitted some snippet for saving space, but you
> will
> > > get
> > > > > the idea).
> > > > >
> > > > > FACT_TABLE = LOAD 'XYZ'  as (col1 :chararray,………. col30:
> chararray);
> > > > >
> > > > > FACT_TABLE1  = FOREACH FACT_TABLE GENERATE col1, udf1(col2) as
> > col2,…..
> > > > > udf10(col30) as col30;
> > > > >
> > > > > DIMENSION1 = LOAD 'DIM1' as (key, value);
> > > > >
> > > > > FACT_TABLE2 = JOIN FACT_TABLE1 BY col1 LEFT OUTER, DIMENSION1 BY
> key;
> > > > >
> > > > > FACT_TABLE3  = FOREACH FACT_TABLE2 GENERATE DIMENSION1::value as
> > > col1,…….
> > > > >  FACT_TABLE1::col30 as col30;
> > > > >
> > > > > DIMENSION2 = LOAD 'DIM2' as (key, value);
> > > > >
> > > > > FACT_TABLE4 = JOIN FACT_TABLE3 BY col2 LEFT OUTER, DIMENSION2 BY
> key;
> > > > >
> > > > > FACT_TABLE5  = FOREACH FACT_TABLE4 GENERATE  FACT_TABLE3::col1 as
> > > > > col1, DIMENSION2::value as col2,…….  FACT_TABLE3::col30 as col30;
> > > > >
> > > > > & so on!  There are 10 more such dimension tables to join.
> > > > >
> > > > > In short, each row on the fact table needs to be joined to a key
> > field
> > > > on a
> > > > > dimension table to get it's associated value.
> > > > >
> > > > > This is beginning to look ugly.  Plus it's maintenance nightmare
> when
> > > it
> > > > > comes to adding new fields.  What's the best way to code this in
> Pig?
> > > > >
> > > > > Thanks in advance.
> > > > >
> > > >
> > >
> >
>


Re: Getting dimension values for Facts

2013-07-18 Thread Pradeep Gollakota
Looks like this might be macroable. Not entirely sure how that can be done
yet... but I'd look into that if I were you.


On Thu, Jul 18, 2013 at 11:16 AM, Something Something <
mailinglist...@gmail.com> wrote:

> Wow, Bertrand, on the Pig mailing list you're recommending not to use
> Pig... LOL!  Jokes apart, I would think this would be a common use case for
> Pig, no?  Generating a Pig script on the fly is a decent idea, but we're
> hoping to avoid that - unless there's no other way.  Thanks for the
> pointers.
>
>
> On Thu, Jul 18, 2013 at 2:52 AM, Bertrand Dechoux  >wrote:
>
> > I would say either generate the script using another language (eg Python)
> > or use a true programming language with an API having the same level of
> > abstraction (eg Java and Cascading).
> >
> > Bertrand
> >
> >
> > On Thu, Jul 18, 2013 at 8:44 AM, Something Something <
> > mailinglist...@gmail.com> wrote:
> >
> > > There must be a better way to do this in Pig.  Here's how my script
> looks
> > > like right now:  (omitted some snippet for saving space, but you will
> get
> > > the idea).
> > >
> > > FACT_TABLE = LOAD 'XYZ'  as (col1 :chararray,………. col30: chararray);
> > >
> > > FACT_TABLE1  = FOREACH FACT_TABLE GENERATE col1, udf1(col2) as col2,…..
> > > udf10(col30) as col30;
> > >
> > > DIMENSION1 = LOAD 'DIM1' as (key, value);
> > >
> > > FACT_TABLE2 = JOIN FACT_TABLE1 BY col1 LEFT OUTER, DIMENSION1 BY key;
> > >
> > > FACT_TABLE3  = FOREACH FACT_TABLE2 GENERATE DIMENSION1::value as
> col1,…….
> > >  FACT_TABLE1::col30 as col30;
> > >
> > > DIMENSION2 = LOAD 'DIM2' as (key, value);
> > >
> > > FACT_TABLE4 = JOIN FACT_TABLE3 BY col2 LEFT OUTER, DIMENSION2 BY key;
> > >
> > > FACT_TABLE5  = FOREACH FACT_TABLE4 GENERATE  FACT_TABLE3::col1 as
> > > col1, DIMENSION2::value as col2,…….  FACT_TABLE3::col30 as col30;
> > >
> > > & so on!  There are 10 more such dimension tables to join.
> > >
> > > In short, each row on the fact table needs to be joined to a key field
> > on a
> > > dimension table to get it's associated value.
> > >
> > > This is beginning to look ugly.  Plus it's maintenance nightmare when
> it
> > > comes to adding new fields.  What's the best way to code this in Pig?
> > >
> > > Thanks in advance.
> > >
> >
>


RE: Want to add data in same file in Apache PIG?

2013-07-18 Thread Pradeep Gollakota
If you want persistent storage like that, your best bet is to use a
database like HBase.
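
For example, appending rows from Pig into an HBase table would look something
like this (sketch; the table and column names are made up, and the first field
of the relation is used as the row key):

STORE data INTO 'hbase://mytable' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('info:f1 info:f2');
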
On Jul 18, 2013 7:56 AM, "Bhavesh Shah"  wrote:

> Thanks for reply. :)
>
> I just came across one command -getmerge
>
>
>
> -getmerge  :  Get all the files in the directories that
>   match the source file pattern and merge and sort them to only
>   one file on local fs.  is kept.
>
>
>
> I am thinking if I STORE the data in some other file say TMP_Name
>
> and later If I use this command to dump the data in the required file.
>
>
>
> Is it possible to merge the data using this command in PIG? If yes, then
> is it good way to achieve my goal?
>
> Please let me know.
>
>
>
>
>
> Many Thanks,
>
> Bhavesh Shah
>
>
>
>
>
>
> > Date: Thu, 18 Jul 2013 15:49:47 +0400
> > Subject: Re: Want to add data in same file in Apache PIG?
> > From: serega.shey...@gmail.com
> > To: user@pig.apache.org
> > CC: d...@pig.apache.org
> >
> > it's not possible. It's HDFS.
> >
> >
> > 2013/7/18 Bhavesh Shah 
> >
> > > Hello,
> > >
> > > Actually I have a use case in which I will receive the data from some
> > > source and I have to dump it in the same file after every regular
> interval
> > > and use that file for further operation. I tried to search on it, but I
> > > didn't see the anything related to this.
> > >
> > > I am using STORE function, but STORE function always create new file
> with
> > > specified name and gives error if the specified file already exists.
> > > How should I do store the data in same file? Is it possible in Pig or
> have
> > > some work around for it?
> > > Please suggest me some solution over this.
> > >
> > >
> > > Thanks,
> > > Bhavesh Shah
>


Re: header of a tuple/bag

2013-07-16 Thread Pradeep Gollakota
It generally depends on what type of Storage mechanism is used. If it's
PigStorage() then this information is not encoded into the data.

Assuming that the storage is PigStorage() and that cookie_id is the first
field in the data, your load function should look as follows:

Data = LOAD '/user/xx/20130523/*' using PigStorage() as (cookie_id: chararray, ...);
x = FOREACH Data GENERATE cookie_id;

So, not only do you have to specify which storage function to use, you (may)
also have to declare the schema when you load the data.
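
Once the schema is declared, you can double-check what Pig thinks it is with:

DESCRIBE Data;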


On Tue, Jul 16, 2013 at 2:04 PM, Mix Nin  wrote:

> Hi,
>
> I am trying query a data set on HDFS using PIG.
>
> Data = LOAD '/user/xx/20130523/*;
> x = FOREACH Data GENERATE cookie_id;
>
> I get below error.
>
>  Invalid field projection. Projected field [cookie_id]
> does not exist
>
> How do i find the column names in the bag "Data" .  The developer who
> created the file says, it is coookie_id.
> Is there any way I could get schema/header for this?
>
>
> Thanks
>


Re: Problem with nested FOREACH, bag semantics and UDF

2013-07-15 Thread Pradeep Gollakota
Just to confirm... you want your output to read as follows,

{1, {(1, count), (2, count), ..., (10, count)}}
{1, {(11, count), (12, count), ..., (20, count)}}
...
correct?

I think you also have a syntax error... I'm pretty sure you can't do
FOREACH and GROUP in the same statement. You can try the following:

B = GROUP A BY season;
C = FOREACH B {
    sorted = ORDER A BY count DESC;
    quantiles = FOREACH sorted GENERATE BagSplit(10, sorted) as (mybag, index);
    GENERATE group AS season, FLATTEN(quantiles.mybag);
};



On Mon, Jul 15, 2013 at 12:31 PM, Lars Francke wrote:

> Hi!
>
> I have a problem with the following Pig script:
>
> DESCRIBE A;
> A: {id: int, season: int, count: long}
>
> foo = FOREACH (GROUP A BY season) {
> sorted = ORDER A BY count DESC;
> quantiles = FOREACH sorted GENERATE BagSplit(10, sorted);
> GENERATE ;
>   };
>
> DESCRIBE foo::quantiles;
> foo::quantiles: {datafu.pig.bags.bagsplit_sorted_14586: {(data: {(id:
> int, season: int, count: long)},index: int)}}
>
>
> What I'd like to do is order A by "count" and then use DataFu's
> BagSplit UDF to create equal splits (deciles). I'm very very new to
> Pig and I think this can all be attributed to the fact that I
> misunderstand bags and FOREACH - especially the nested variant.
>
> I'd like my output to be:
> {season, {(id, count), (id, count), ...}}
>
> GENERATE quantiles: Is accepted but leads to "ERROR 2015: Invalid
> physical operators in the physical plan" on execution.
>
> GENERATE quantiles.$0: Same as above. In fact I can stick as many
> ".$0" at the end as I want to and it is always accepted but generates
> an error when duming the data.
>
> I'll reread the Pig Lating Basics tonight but if anyone has an idea
> what I'm doing wrong or how I can achieve my goal I'd be very
> grateful.
>
> Thanks,
> Lars
>


Re: Concatenate strings within a group

2013-07-14 Thread Pradeep Gollakota
I'm not aware of any native Pig commands that can do this, so you'll have
to implement a UDF. My implementation would look as follows:

A = load 'data' as (id: int, seg_num: int, text: chararray);
B = group A by id;
C = foreach B {
    D = order A by seg_num; -- assuming that data is not sorted by seg_num
    generate group as id, CONCAT_UDF(D);
};
dump C;

Within the CONCAT_UDF implementation, you have a DataBag as input whose
tuples are sorted by seg_num, so you can use a StringBuilder to concat the
strings together and return the resulting string.

Hope this helps.


On Sun, Jul 14, 2013 at 10:39 AM, Shahab Yunus wrote:

> At least I am not aware of a PIG command which can do this. You can start
> by grouping on 'id',  and then try flattening the 'text' field. But then
> you run into the issue that you have lost the sorting order ('seg_no')
> which is required to construct a meaningful sentence. Here I think you need
> UDF where you pass both 'seq_no' and 'text' and do the work.
>
> I can think of doing some convoluted processing like concatenating the
> 'seg_no' and 'text' fields as one and then grouping on 'id' and then
> sorting on the new concatenated field within the group. But then once,
> you've done that, you will have to split back the combined field again. And
> doing all this might not help either. The main thing here is that, as far
> as I know, you cannot impose sort order in a bag or while flattening a
> group in one row. I would be interested to know if this is possible through
> native Pig.
>
> Regards,
> Shahab
>
>
> On Sat, Jul 13, 2013 at 9:45 PM, Karthik Natarajan <
> kan7...@dbmi.columbia.edu> wrote:
>
> > Hi,
> >
> > I'm new to Pig. I have a file that contains the contents of documents.
> The
> > problem is that the contents are not in one line of the file. The file is
> > actually an export of a database table. Below is an example of the table:
> >
> > id seg_no  text
> > -- -  -
> > 1  0  This is
> > 1  1  a
> > 1  2  test for
> > 1  3  Hello
> > 1  4  World!
> > 2  0  Test
> > 2  1  number
> > 2  2  two.
> >
> >
> > How do I get an output like this:
> >
> > id  text
> > --  
> > 1   This is a test for Hello World!
> > 2   Test number two.
> >
> >
> > I can do this in SQL, but I want to try it using Hadoop and Pig. I'm not
> > sure how to concatenate values of a column w/in a group. I wondering if
> > Pig's built-in functions can handle this or if I have to create a UDF.
> I'm
> > thinking I need to create a UDF, but am not sure how to go about this.
> Any
> > help/advice would be appreciated.
> >
> > Thanks.
> >
>

