Re: Problem while using merge join

2013-09-13 Thread Pradeep Gollakota
Since your join key is not in the Bag, can you do your join first and then execute your UDF? On Fri, Sep 13, 2013 at 10:04 AM, John johnnyenglish...@gmail.com wrote: Okay, I think I have found the problem here: http://pig.apache.org/docs/r0.11.1/perf.html#merge-joins ... there it is written:
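
A minimal sketch of the ordering suggested here (join first, apply the UDF afterwards), assuming both inputs are already sorted on the join key as the merge-join documentation linked above requires; the relation, field, and UDF names are hypothetical.

    A = LOAD 'input_a' AS (id:chararray, payload:chararray);
    B = LOAD 'input_b' AS (id:chararray, value:int);
    -- Merge join runs directly on the loaded, pre-sorted relations.
    J = JOIN A BY id, B BY id USING 'merge';
    -- The UDF is applied only after the join has produced flat tuples.
    R = FOREACH J GENERATE A::id, myudfs.Process(A::payload, B::value);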

Re: Problem while using merge join

2013-09-13 Thread Pradeep Gollakota
? The problem I see is that the next() method in the LoadFunc has to return a Tuple and not a Bag. :/ 2013/9/13 Pradeep Gollakota pradeep...@gmail.com Since your join key is not in the Bag, can you do your join first and then execute your UDF? On Fri, Sep 13

[jira] [Commented] (PIG-3453) Implement a Storm backend to Pig

2013-09-11 Thread Pradeep Gollakota (JIRA)
[ https://issues.apache.org/jira/browse/PIG-3453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13764812#comment-13764812 ] Pradeep Gollakota commented on PIG-3453: I personally don't have a concrete use case

Re: [How to optimize MapReduce performance using Pig]

2013-09-06 Thread Pradeep Gollakota
Hi Florencia, Welcome to Pig! Unfortunately without knowing the actual script that you're trying to execute, we won't be able to help you with optimizations. There are some very general guidelines for optimizing Pig scripts though. Take a look at

[jira] [Created] (PIG-3453) Implement a Storm backend to Pig

2013-09-06 Thread Pradeep Gollakota (JIRA)
Pradeep Gollakota created PIG-3453: -- Summary: Implement a Storm backend to Pig Key: PIG-3453 URL: https://issues.apache.org/jira/browse/PIG-3453 Project: Pig Issue Type: New Feature

Re: Join Question

2013-09-04 Thread Pradeep Gollakota
I think there's probably some convoluted way to do this. First thing you'll have to do is flatten your data. data1 = A, B _ X, X1 X, X2 Y, Y1 Y, Y2 Y, Y3 Then do a join by B onto your second dataset. This should produce the following data2 = data1::A, data1::B, data2::A, data2::B, data2::C
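
A sketch of the flatten-then-join approach described here, with hypothetical schemas (the original bag and field names are not shown in the snippet).

    data1  = LOAD 'input1' AS (A:chararray, vals:bag{t:(B:chararray)});
    -- Flatten the bag so B becomes a top-level field that can be joined on.
    flat1  = FOREACH data1 GENERATE A, FLATTEN(vals) AS B;
    data2  = LOAD 'input2' AS (A:chararray, B:chararray, C:int);
    joined = JOIN flat1 BY B, data2 BY B;
    -- Fields from each input are referenced with the :: prefix after the join.
    result = FOREACH joined GENERATE flat1::A, flat1::B, data2::C;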

Re: Pig upgrade

2013-08-24 Thread Pradeep Gollakota
Most of the major changes were introduced in 0.9 The documentation listing the backward compatibility issues with 0.9 can be found at https://cwiki.apache.org/confluence/display/PIG/Pig+0.9+Backward+Compatibility I believe other changes that are not listed there are the introduction of macros.

Re: Pig Latin Program with special Load Function

2013-08-21 Thread Pradeep Gollakota
In your eval function, you can use the HBase Get/Scan API to retrieve the data rather than using the MapReduce API. On Wed, Aug 21, 2013 at 7:12 AM, John johnnyenglish...@gmail.com wrote: I'm currently writing a Pig Latin program: A = load 'hbase://mytable1'

Re: dev How can I add a row number per input file to the data

2013-08-21 Thread Pradeep Gollakota
That's an interesting approach! However, I'm not sure if RANK is supported as a nested foreach operator. If it is supported, then this approach would work. The documentation doesn't show that RANK is a supported nested foreach operator. http://pig.apache.org/docs/r0.11.1/basic.html#foreach On

Re: How to optimize my request

2013-08-19 Thread Pradeep Gollakota
I have a couple of ideas that MAY help. I'm not familiar with your data, but these techniques might help. First, this probably won't affect the performance, but rather than having 3 FILTER statements at the top of your script, you can use the SPLIT operator to split your dataset into 3 datasets.
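
A minimal sketch of the SPLIT suggestion, with hypothetical field names and filter conditions standing in for the three FILTER statements.

    data = LOAD 'input' AS (type:chararray, value:int);
    -- One pass over the data produces all three relations.
    SPLIT data INTO
        clicks IF type == 'click',
        views  IF type == 'view',
        other  OTHERWISE;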

Re: field name reference - alias

2013-08-08 Thread Pradeep Gollakota
This is expected behavior. The disambiguation comes only after two or more relations are brought together. As per the docs at http://pig.apache.org/docs/r0.11.1/basic.html#disambiguate, the disambiguate operator can only be used to identify field names after JOIN, COGROUP, CROSS, or FLATTEN
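
A short sketch of the rule described here, using hypothetical relations: before a JOIN there is nothing to disambiguate, and afterwards the :: prefix is required for any field name that both inputs share.

    A = LOAD 'a' AS (id:int, x:chararray);
    B = LOAD 'b' AS (id:int, y:chararray);
    J = JOIN A BY id, B BY id;
    -- Both inputs contribute an id, so the prefixed form must be used.
    R = FOREACH J GENERATE A::id, A::x, B::y;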

Re: I think an example in the docs is wrong

2013-08-08 Thread Pradeep Gollakota
I believe the procedure is to file a bug report on JIRA and set the component field to 'documentation'. Pig veterans, please correct me if I'm wrong. On Thu, Aug 8, 2013 at 10:19 PM, Paul Houle ontolo...@gmail.com wrote: I recently wrote a load function and to get started I cut-n-pasted from

Re: Replace join with custom implementation

2013-08-02 Thread Pradeep Gollakota
join BIG by key, SMALL by key using 'replicated'; On Fri, Aug 2, 2013 at 5:29 AM, Serega Sheypak serega.shey...@gmail.com wrote: Hi. I've met a problem with replicated join in pig 0.11 I have two relations: BIG (3-6GB) and SMALL (100MB) I join them on four integer fields. It takes up to
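
A sketch of the replicated join one-liner above, written out for the composite four-field key mentioned in the question; the relation and field names are hypothetical, and SMALL must be small enough to fit in memory on each map task.

    BIG   = LOAD 'big_input'   AS (k1:int, k2:int, k3:int, k4:int, payload:chararray);
    SMALL = LOAD 'small_input' AS (k1:int, k2:int, k3:int, k4:int, label:chararray);
    -- SMALL is replicated to every map task; no reduce phase is needed.
    J = JOIN BIG BY (k1, k2, k3, k4), SMALL BY (k1, k2, k3, k4) USING 'replicated';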

Re: Get the tree structure of a HDFS dir, similar to dir/files

2013-07-27 Thread Pradeep Gollakota
Huy, I think this question probably belongs on the Hadoop mailing list rather than the Pig mailing list. However, I think you're looking for http://hadoop.apache.org/docs/r1.0.4/api/org/apache/hadoop/fs/FileStatus.html A FileStatus object can be acquired from a FileSystem object by calling the

Re: Pig and Storm

2013-07-24 Thread Pradeep Gollakota
layer and all the planning separate? Do you envision needing extensions/changes to the language to support Storm? Feel free to add a page to Pig's wiki with your thoughts on an approach. Alan. On Jul 23, 2013, at 9:52 AM, Pradeep Gollakota wrote: Hi Pig Developers, I wanted to reach

Pig and Storm

2013-07-23 Thread Pradeep Gollakota
Hi Pig Developers, I wanted to reach out to you all and ask for your opinion on something. As a Pig user, I have come to love Pig as a framework. Pig provides a great set of abstractions that make working with large datasets easy. Currently Pig is only backed by Hadoop. However, with the new rise

[jira] [Commented] (PIG-3391) Issue with DataType- Long conversion in New AvroStorage()

2013-07-23 Thread Pradeep Gollakota (JIRA)
[ https://issues.apache.org/jira/browse/PIG-3391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13717697#comment-13717697 ] Pradeep Gollakota commented on PIG-3391: I have a couple of quick questions: 1

[jira] [Commented] (PIG-2495) Using merge JOIN from a HBaseStorage produces an error

2013-07-23 Thread Pradeep Gollakota (JIRA)
[ https://issues.apache.org/jira/browse/PIG-2495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13717915#comment-13717915 ] Pradeep Gollakota commented on PIG-2495: Hi Kevin, I have a very minor request

Re: Filter bag with multiple output

2013-07-23 Thread Pradeep Gollakota
You can use the SPLIT operator to split a relation into two (or more) relations. http://pig.apache.org/docs/r0.11.1/basic.html#SPLIT Also, you should probably do this before GROUP. As a best practice (and general pig optimization strategy), you should filter (and project) early and often. On
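
A sketch of the "filter and project early" advice, applied before the GROUP; relation and field names are hypothetical.

    raw     = LOAD 'events' AS (key:chararray, ts:long, status:chararray, extra:map[]);
    -- Drop unwanted records and unneeded columns before the expensive GROUP.
    valid   = FILTER raw BY status == 'ok';
    slim    = FOREACH valid GENERATE key, ts;
    grouped = GROUP slim BY key;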

Re: Filter bag with multiple output

2013-07-23 Thread Pradeep Gollakota
it before GROUP. I need to group by key, then sort by the timestamp field inside each group. After the sort is done I can determine the invalid records. I've provided a simplified case. The only problem is that SPLIT is not allowed in a nested FOREACH statement. 2013/7/23 Pradeep Gollakota pradeep...@gmail.com

Pig and Storm

2013-07-23 Thread Pradeep Gollakota
Hi Pig Users and Developers, I asked a question on the dev mailing list earlier today about Pig and Storm. However, having thought more about it, I think the user list is more appropriate. Here's the original email verbatim. I wanted to reach out to you all and ask for your opinion on something.

Re: Execute multiple PIG scripts parallely

2013-07-22 Thread Pradeep Gollakota
You could probably just use nohup if they're all parallel and send them into the background: nohup pig script1.pig & nohup pig script2.pig & etc. On Jul 22, 2013 7:12 AM, manishbh...@rocketmail.com manishbh...@rocketmail.com wrote: You can create a job flow in Oozie. Sent via Rocket from my HTC

Re: Large Bag (100GB of Data) in Reduce Step

2013-07-22 Thread Pradeep Gollakota
There's only one thing that comes to mind for this particular toy example. From the Programming Pig book, pig.cached.bag.memusage property is the Percentage of the heap that Pig will allocate for all of the bags in a map or reduce task. Once the bags fill up this amount, the data is spilled to
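
A sketch of lowering that spill threshold from within a script; the 0.1 value is only an illustrative assumption, not a recommendation from the original message.

    -- Spill bags to disk once they collectively use more than 10% of the heap.
    SET pig.cached.bag.memusage '0.1';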

Re: Getting dimension values for Facts

2013-07-18 Thread Pradeep Gollakota
, for FACT_TABLE5 we update 'col2' from DIMENSION2, and so on. Feel free to correct me if I am wrong. Thanks. On Thu, Jul 18, 2013 at 8:25 AM, Pradeep Gollakota pradeep...@gmail.com wrote: Looks like this might be macroable. Not entirely sure how that can be done yet... but I'd look

[jira] [Commented] (HBASE-3732) New configuration option for client-side compression

2013-07-17 Thread Pradeep Gollakota (JIRA)
[ https://issues.apache.org/jira/browse/HBASE-3732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13711038#comment-13711038 ] Pradeep Gollakota commented on HBASE-3732: -- Yes it does. I misread his comment

[jira] [Commented] (HBASE-3732) New configuration option for client-side compression

2013-07-16 Thread Pradeep Gollakota (JIRA)
[ https://issues.apache.org/jira/browse/HBASE-3732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13710623#comment-13710623 ] Pradeep Gollakota commented on HBASE-3732: -- I'd like to reopen discussion

Re: include a script in another script? ¿error in macro Pig?

2013-07-05 Thread Pradeep Gollakota
If it's just registering common jars and defining aliases for UDFs, I think you can do this in .pigrc or in .pigbootup On Fri, Jul 5, 2013 at 6:56 AM, Miguel Angel Martin junquera mianmarjun.mailingl...@gmail.com wrote: hi all: I am using pig 0.11.1 and I want to modularize my pig scripts.
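
A sketch of what such a bootup file might contain: plain Pig statements that are run before every script. The jar path and UDF class here are hypothetical.

    REGISTER /path/to/common-udfs.jar;
    DEFINE MYUDF com.example.pig.MyUdf();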

Re: dereferencing bag of map

2013-06-21 Thread Pradeep Gollakota
dump X ({sId:003_w,cId:k}) ({sId:001_rf,cId:r}) ({sId:001_rf,cId:r}) ({sId:004_rf,cId:r}) Any idea how I can generate cId and sId as separate chararray columns? TIA Ss On Tue, Jun 18, 2013 at 5:52 AM, Pradeep Gollakota pradeep...@gmail.com wrote: What's the error you

Re: Loading data from ranges of ordered subdirs

2013-06-10 Thread Pradeep Gollakota
There are two possibilities that come to mind. 1. Write a custom LoadFunc in which you can handle these regular expressions. *Not the most ideal solution* 2. Use HCatalog. The example they have in their documentation seems to fit your use case perfectly.

Re: Tracking parts of a job taking the most time

2013-06-06 Thread Pradeep Gollakota
This may not be what you're looking for, but you can also try using Twitter Ambrose to monitor your Pig scripts as a whole. https://github.com/twitter/ambrose Not sure what you mean by specific parts of the script. If you mean each operation, I don't think there's a mechanism for that. Pig

[jira] [Commented] (ACCUMULO-391) Multi-table Accumulo input format

2013-06-05 Thread Pradeep Gollakota (JIRA)
[ https://issues.apache.org/jira/browse/ACCUMULO-391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13676043#comment-13676043 ] Pradeep Gollakota commented on ACCUMULO-391: This would be a great addition

[jira] [Commented] (ACCUMULO-391) Multi-table Accumulo input format

2013-06-05 Thread Pradeep Gollakota (JIRA)
[ https://issues.apache.org/jira/browse/ACCUMULO-391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13676290#comment-13676290 ] Pradeep Gollakota commented on ACCUMULO-391: I'm also available to help

Re: Join on custom LoadFunc not working correctly

2013-06-03 Thread Pradeep Gollakota
Does anyone have any thoughts on this? I'm completely out of ideas on this. On Thu, May 30, 2013 at 3:12 PM, Pradeep Gollakota pradeep...@gmail.com wrote: Hey guys, I have a custom Storage function that loads from the Accumulo database (similar to HBase). I have the following script

Join on custom LoadFunc not working correctly

2013-05-30 Thread Pradeep Gollakota
Hey guys, I have a custom Storage function that loads from the Accumulo database (similar to HBase). I have the following script that I'm trying to execute: A = load 'accumulo://table_a' using org.apache.accumulo.pig.AccumuloStorage('cf:cq1 cf:cq2', '-loadKey') as (id:

Re: Synthetic keys

2013-05-28 Thread Pradeep Gollakota
I ran into a similar problem where I had a relation (A) which was massive and another relation (B) which had exactly 1 record. I needed to do a cross product of these two relations, and the default implementation was very slow. I worked around it by generating a synthetic key myself and then used
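
A sketch of the synthetic-key workaround described here, with hypothetical names; the use of a replicated join for the final step is an assumption, since the message is truncated before the join strategy is named.

    A = LOAD 'big_input'  AS (x:chararray);
    B = LOAD 'single_row' AS (y:chararray);
    -- Give every record on both sides the same constant key...
    A1 = FOREACH A GENERATE 1 AS k, x;
    B1 = FOREACH B GENERATE 1 AS k, y;
    -- ...then a join on that key behaves like the cross product, and the
    -- single-record side is small enough for a replicated join.
    C = JOIN A1 BY k, B1 BY k USING 'replicated';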

Re: Synthetic keys

2013-05-28 Thread Pradeep Gollakota
in it. An explicit replicated cross would be good though, since the implementation probably is pretty simple. On 5/28/13 10:30 AM, Pradeep Gollakota pradeep...@gmail.com wrote: I ran into a similar problem where I had a relation (A) which was massive and another relation (B) which had exactly

Re: Complex joins

2013-05-23 Thread Pradeep Gollakota
, but we don't need the duplicate x in the inner tuples, is there an efficient way to just render this? ( x, {(a1,b1), (a2,b2)} ) -Original Message- From: Pradeep Gollakota [mailto:pradeep...@gmail.com] Sent: Thursday, May 23, 2013 10:05 AM To: user@pig.apache.org Subject: Re

Re: Complex joins

2013-05-22 Thread Pradeep Gollakota
Hi All, I'm a beginner pig user and this is my first post to the Pig mailing list. Anyway, to answer your question, the first thing that comes to my mind is that Pig may not be able to do a complex join like that. However, you can first flatten the bag in A, then do your join and then do a

Re: Complex joins

2013-05-22 Thread Pradeep Gollakota
( '3', {( '5' ),('6')} ) dump X (( '3', {( '5' ),('6')} ),) dump Y ({}) dump Z (( '3', {( '5' ),('6')} )) On Wed, May 22, 2013 at 8:25 PM, Pradeep Gollakota pradeep...@gmail.com wrote: Hi All, I'm a beginner pig user and this is my first post to the Pig mailing list. Anyway

Re: Complex joins

2013-05-22 Thread Pradeep Gollakota
')} ),) dump Y ({}) dump Z (( '3', {( '5' ),('6')} )) On Wed, May 22, 2013 at 8:25 PM, Pradeep Gollakota pradeep...@gmail.com wrote: Hi All, I'm a beginner pig user and this is my first post to the Pig mailing list. Anyway, to answer

[jira] [Updated] (JENA-402) Move etc/*.rules to src/main/resources/etc/*.rules

2013-02-24 Thread Pradeep Gollakota (JIRA)
[ https://issues.apache.org/jira/browse/JENA-402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pradeep Gollakota updated JENA-402: --- Attachment: JENA-402-1.patch Moved *.rules from jena-core/etc/ to jena-core/src/main/resources

[jira] [Commented] (JENA-402) Move etc/*.rules to src/main/resources/etc/*.rules

2013-02-23 Thread Pradeep Gollakota (JIRA)
[ https://issues.apache.org/jira/browse/JENA-402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13585287#comment-13585287 ] Pradeep Gollakota commented on JENA-402: This appears to be complete. Should

[jira] [Commented] (GIRAPH-285) Release Giraph-0.2

2013-02-20 Thread Pradeep Gollakota (JIRA)
[ https://issues.apache.org/jira/browse/GIRAPH-285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13582405#comment-13582405 ] Pradeep Gollakota commented on GIRAPH-285: -- Any progress on this guys? The 0.2

[jira] [Updated] (JENA-228) Limiting query output centrally

2013-01-27 Thread Pradeep Gollakota (JIRA)
[ https://issues.apache.org/jira/browse/JENA-228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pradeep Gollakota updated JENA-228: --- Attachment: JENA-228-1.patch Submitting an initial patch. I chose to intercept the query

[jira] [Commented] (JENA-228) Limiting query output centrally

2012-12-25 Thread Pradeep Gollakota (JIRA)
[ https://issues.apache.org/jira/browse/JENA-228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13539448#comment-13539448 ] Pradeep Gollakota commented on JENA-228: I'd like to start working on this if I may

[jira] [Commented] (AVRO-575) AvroOutputFormat doesn't work for map-only jobs if only the map output schema has been set

2012-12-25 Thread Pradeep Gollakota (JIRA)
[ https://issues.apache.org/jira/browse/AVRO-575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13539460#comment-13539460 ] Pradeep Gollakota commented on AVRO-575: This JIRA seems to be OBE. The patch

[jira] [Created] (AVRO-1180) Broken links on Code Review Checklist page on confluence

2012-10-18 Thread Pradeep Gollakota (JIRA)
Pradeep Gollakota created AVRO-1180: --- Summary: Broken links on Code Review Checklist page on confluence Key: AVRO-1180 URL: https://issues.apache.org/jira/browse/AVRO-1180 Project: Avro

[jira] [Created] (ACCUMULO-736) Add Column Pagination Filter

2012-08-22 Thread Pradeep Gollakota (JIRA)
Pradeep Gollakota created ACCUMULO-736: -- Summary: Add Column Pagination Filter Key: ACCUMULO-736 URL: https://issues.apache.org/jira/browse/ACCUMULO-736 Project: Accumulo Issue Type

[jira] [Updated] (ACCUMULO-736) Add Column Pagination Filter

2012-08-22 Thread Pradeep Gollakota (JIRA)
[ https://issues.apache.org/jira/browse/ACCUMULO-736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pradeep Gollakota updated ACCUMULO-736: --- Issue Type: Wish (was: Bug) Add Column Pagination Filter

[jira] [Commented] (ACCUMULO-736) Add Column Pagination Filter

2012-08-22 Thread Pradeep Gollakota (JIRA)
[ https://issues.apache.org/jira/browse/ACCUMULO-736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13440033#comment-13440033 ] Pradeep Gollakota commented on ACCUMULO-736: I myself have extremely limited

[jira] [Commented] (GIRAPH-182) Provide SequenceFileVertexOutputFormat as an available OutputFormat

2012-04-11 Thread Pradeep Gollakota (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/GIRAPH-182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13252127#comment-13252127 ] Pradeep Gollakota commented on GIRAPH-182: -- Thanks for the review Jakob. * I

[jira] [Created] (GIRAPH-182) Provide SequenceFileVertexOutputFormat as an available OutputFormat

2012-04-10 Thread Pradeep Gollakota (Created) (JIRA)
Feature Components: lib Reporter: Pradeep Gollakota Priority: Minor SequenceFiles are heavily used in Hadoop. We should provide SequenceFileVertexOutputFormat. Since SequenceFileVertexInputFormat is already provided, it makes sense to also provide a mirroring

[jira] [Commented] (GIRAPH-182) Provide SequenceFileVertexOutputFormat as an available OutputFormat

2012-04-10 Thread Pradeep Gollakota (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/GIRAPH-182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13251183#comment-13251183 ] Pradeep Gollakota commented on GIRAPH-182: -- Would be glad

[jira] [Updated] (GIRAPH-182) Provide SequenceFileVertexOutputFormat as an available OutputFormat

2012-04-10 Thread Pradeep Gollakota (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/GIRAPH-182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pradeep Gollakota updated GIRAPH-182: - Attachment: GIRAPH-182-1.patch Implemented an abstract SequenceFileVertexOutputFormat
