Since your join key is not in the Bag, can you do your join first and then
execute your UDF?
On Fri, Sep 13, 2013 at 10:04 AM, John johnnyenglish...@gmail.com wrote:
Okay, I think I have found the problem here:
http://pig.apache.org/docs/r0.11.1/perf.html#merge-joins ... there it is
written. The problem I see is that the next() method in the LoadFunc has to
return a Tuple and not a Bag. :/
2013/9/13 Pradeep Gollakota pradeep...@gmail.com
Since your join key is not in the Bag, can you do your join first and then
execute your UDF?
On Fri, Sep 13
[
https://issues.apache.org/jira/browse/PIG-3453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13764812#comment-13764812
]
Pradeep Gollakota commented on PIG-3453:
I personally don't have a concrete use case
Hi Florencia,
Welcome to Pig!
Unfortunately without knowing the actual script that you're trying to
execute, we won't be able to help you with optimizations. There are some
very general guidelines for optimizing Pig scripts though.
Take a look at
Pradeep Gollakota created PIG-3453:
--
Summary: Implement a Storm backend to Pig
Key: PIG-3453
URL: https://issues.apache.org/jira/browse/PIG-3453
Project: Pig
Issue Type: New Feature
I think there's probably some convoluted way to do this. The first thing
you'll have to do is flatten your data.
data1 = A, B
------------
X, X1
X, X2
Y, Y1
Y, Y2
Y, Y3
Then do a join by B onto your second dataset. This should produce the
following:
data2 = data1::A, data1::B, data2::A, data2::B, data2::C
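The flatten-then-join approach above might look roughly like this in Pig
Latin (all relation names, paths, and schemas here are illustrative
assumptions, not taken from the original script):

```pig
-- Hypothetical schemas; adjust to the real data.
data1 = LOAD 'input1' AS (A:chararray, B:bag{t:(v:chararray)});
data2 = LOAD 'input2' AS (A:chararray, B:chararray, C:chararray);

-- Flatten the bag so B becomes a plain field, one row per bag element.
data1_flat = FOREACH data1 GENERATE A, FLATTEN(B) AS B;

-- Join the flattened relation onto the second dataset by B.
joined = JOIN data1_flat BY B, data2 BY B;
```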
Most of the major changes were introduced in 0.9
The documentation listing the backward compatibility issues with 0.9 can be
found at
https://cwiki.apache.org/confluence/display/PIG/Pig+0.9+Backward+Compatibility
I believe other changes that are not listed there are the introduction of
macros.
In your eval function, you can use the HBase Get/Scan API to retrieve the
data rather than using the MapReduce API.
On Wed, Aug 21, 2013 at 7:12 AM, John johnnyenglish...@gmail.com wrote:
I'm currently writing a Pig Latin program:
A = load 'hbase://mytable1'
That's an interesting approach! Although I'm not sure RANK is supported as
a nested FOREACH operator. If it is supported, then this approach would
work. The documentation doesn't list RANK as a supported nested FOREACH
operator.
http://pig.apache.org/docs/r0.11.1/basic.html#foreach
On
I have a couple of ideas that MAY help. I'm not familiar with your data,
but these techniques might be worth trying.
First, this probably won't affect the performance, but rather than having 3
FILTER statements at the top of your script, you can use the SPLIT operator
to split your dataset into 3 datasets.
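A minimal sketch of the SPLIT idea (the field name and the three conditions
are made up for illustration):

```pig
logs = LOAD 'input' AS (type:chararray, payload:chararray);

-- One pass over the data instead of three separate FILTER statements.
SPLIT logs INTO
    type_a IF type == 'a',
    type_b IF type == 'b',
    type_c IF type == 'c';
```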
This is expected behavior. The disambiguation comes only after two or more
relations are brought together.
As per the docs at
http://pig.apache.org/docs/r0.11.1/basic.html#disambiguate, the
disambiguate operator can only be used to identify field names after JOIN,
COGROUP, CROSS, or FLATTEN
I believe the procedure is to file a bug report on JIRA and set the
component field to 'documentation'.
Pig veterans, please correct me if I'm wrong.
On Thu, Aug 8, 2013 at 10:19 PM, Paul Houle ontolo...@gmail.com wrote:
I recently wrote a load function and to get started I cut-n-pasted from
join BIG by key, SMALL by key using 'replicated';
On Fri, Aug 2, 2013 at 5:29 AM, Serega Sheypak serega.shey...@gmail.comwrote:
Hi. I've met a problem with replicated join in pig 0.11
I have two relations:
BIG (3-6GB) and SMALL (100MB)
I do join them on four integer fields.
It takes up to
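For a small relation like this, a fragment-replicated join might be
sketched as follows (paths and field names are placeholders; the replicated
relation must be listed last and has to fit in memory):

```pig
BIG   = LOAD 'big_input'   AS (k1:int, k2:int, k3:int, k4:int, payload:chararray);
SMALL = LOAD 'small_input' AS (k1:int, k2:int, k3:int, k4:int, dim:chararray);

-- SMALL is loaded into memory on every map task; no reduce phase is needed.
J = JOIN BIG BY (k1, k2, k3, k4), SMALL BY (k1, k2, k3, k4) USING 'replicated';
```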
Huy,
I think this question probably belongs on the Hadoop mailing list rather
than the Pig mailing list.
However, I think you're looking for
http://hadoop.apache.org/docs/r1.0.4/api/org/apache/hadoop/fs/FileStatus.html
A FileStatus object can be acquired from a FileSystem object by calling the
layer and all the planning separate? Do you envision needing
extensions/changes to the language to support Storm? Feel free to add a
page to Pig's wiki with your thoughts on an approach.
Alan.
On Jul 23, 2013, at 9:52 AM, Pradeep Gollakota wrote:
Hi Pig Developers,
I wanted to reach
Hi Pig Developers,
I wanted to reach out to you all and ask for your opinion on something.
As a Pig user, I have come to love Pig as a framework. Pig provides a great
set of abstractions that make working with large datasets easy. Currently
Pig is only backed by Hadoop. However, with the new rise
[
https://issues.apache.org/jira/browse/PIG-3391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13717697#comment-13717697
]
Pradeep Gollakota commented on PIG-3391:
I have a couple of quick questions:
1
[
https://issues.apache.org/jira/browse/PIG-2495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13717915#comment-13717915
]
Pradeep Gollakota commented on PIG-2495:
Hi Kevin,
I have a very minor request
You can use the SPLIT operator to split a relation into two (or more)
relations. http://pig.apache.org/docs/r0.11.1/basic.html#SPLIT
Also, you should probably do this before GROUP. As a best practice (and a
general Pig optimization strategy), you should filter (and project) early
and often.
On
it before GROUP. I need to group by key, then sort by the timestamp field
inside each group.
After the sort is done I can determine the invalid records.
I've provided a simplified case.
The only problem is that SPLIT is not allowed in nested FOREACH statement.
2013/7/23 Pradeep Gollakota pradeep...@gmail.com
Hi Pig Users and Developers,
I asked a question on the dev mailing list, earlier today about Pig and
Storm. However, having thought more about it, I think the user list is more
appropriate. Here's the original email verbatim.
I wanted to reach out to you all and ask for your opinion on something.
You could probably just use nohup if they're all parallel and send them
into the background:
nohup pig script1.pig &
nohup pig script2.pig &
Etc.
On Jul 22, 2013 7:12 AM, manishbh...@rocketmail.com
manishbh...@rocketmail.com wrote:
You can create job flow in oozie.
Sent via Rocket from my HTC
There's only one thing that comes to mind for this particular toy example.
From the Programming Pig book,
pig.cached.bag.memusage property is the percentage of the heap that Pig
will allocate for all of the bags in a map or reduce task. Once the bags
fill up this amount, the data is spilled to disk.
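If spilling is suspected, that property can be tuned per script with SET
(the 0.1 here is just an example value, not a recommendation):

```pig
-- Let bags use only 10% of the heap before spilling to disk.
set pig.cached.bag.memusage 0.1;
```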
, for FACT_TABLE5 we update 'col2' from DIMENSION2, and so on.
Feel free to correct me if I am wrong. Thanks.
On Thu, Jul 18, 2013 at 8:25 AM, Pradeep Gollakota pradeep...@gmail.com
wrote:
Looks like this might be macroable. Not entirely sure how that can be done
yet... but I'd look
[
https://issues.apache.org/jira/browse/HBASE-3732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13711038#comment-13711038
]
Pradeep Gollakota commented on HBASE-3732:
--
Yes it does. I misread his comment
[
https://issues.apache.org/jira/browse/HBASE-3732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13710623#comment-13710623
]
Pradeep Gollakota commented on HBASE-3732:
--
I'd like to reopen discussion
If it's just registering common jars and defining aliases for UDFs, I think
you can do this in .pigrc or in .pigbootup
On Fri, Jul 5, 2013 at 6:56 AM, Miguel Angel Martin junquera
mianmarjun.mailingl...@gmail.com wrote:
hi all:
I am using pig 0.11.1 and I want to modularize my pig scripts.
dump X
({sId:003_w,cId:k})
({sId:001_rf,cId:r})
({sId:001_rf,cId:r})
({sId:004_rf,cId:r})
Any idea how I can generate cId and sId as separate chararray columns? TIA
Ss
On Tue, Jun 18, 2013 at 5:52 AM, Pradeep Gollakota pradeep...@gmail.com
wrote:
What's the error you
There are two possibilities that come to mind.
1. Write a custom LoadFunc in which you can handle these regular
expressions. *Not the most ideal solution*
2. Use HCatalog. The example they have in their documentation seems to fit
your use case perfectly.
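A hedged sketch of option 2, assuming an HCatalog-managed table partitioned
by date (the table name, partition column, date values, and HCatLoader
package path are all assumptions and may differ by version):

```pig
-- Partition filters on an HCatalog table replace path regular expressions.
A = LOAD 'web_logs' USING org.apache.hcatalog.pig.HCatLoader();
B = FILTER A BY datestamp >= '20130601' AND datestamp <= '20130630';
```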
This may not be what you're looking for, but you can also try using Twitter
Ambrose to monitor your Pig scripts as a whole.
https://github.com/twitter/ambrose
Not sure what you mean by specific parts of the script. If you mean each
operation, I don't think there's a mechanism for that. Pig
[
https://issues.apache.org/jira/browse/ACCUMULO-391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13676043#comment-13676043
]
Pradeep Gollakota commented on ACCUMULO-391:
This would be a great addition
[
https://issues.apache.org/jira/browse/ACCUMULO-391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13676290#comment-13676290
]
Pradeep Gollakota commented on ACCUMULO-391:
I'm also available to help
Does anyone have any thoughts on this?
I'm completely out of ideas on this.
On Thu, May 30, 2013 at 3:12 PM, Pradeep Gollakota pradeep...@gmail.comwrote:
Hey guys,
I have a custom Storage function that loads from the Accumulo database
(similar to HBase).
I have the following script
Hey guys,
I have a custom Storage function that loads from the Accumulo database
(similar to HBase).
I have the following script that I'm trying to execute:
A = load 'accumulo://table_a'
using org.apache.accumulo.pig.AccumuloStorage('cf:cq1 cf:cq2',
'-loadKey')
as (id:
I ran into a similar problem where I had a relation (A) which was massive
and another relation (B) which had exactly 1 record. I needed to do a cross
product of these two relations, and the default implementation was very
slow. I worked around it by generating a synthetic key myself and then
using a replicated join on it. An explicit replicated cross would be good
though, since the implementation is probably pretty simple.
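The synthetic-key workaround described above might be sketched like this
(relation names and schemas are invented for illustration):

```pig
A = LOAD 'massive_input' AS (x:chararray);
B = LOAD 'single_row'    AS (y:chararray);

-- Tag every row of both relations with the same constant key...
A_keyed = FOREACH A GENERATE 1 AS k, x;
B_keyed = FOREACH B GENERATE 1 AS k, y;

-- ...then a replicated join on that key behaves like a cheap CROSS.
crossed = JOIN A_keyed BY k, B_keyed BY k USING 'replicated';
```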
On 5/28/13 10:30 AM, Pradeep Gollakota pradeep...@gmail.com wrote:
I ran into a similar problem where I had a relation (A) which was massive
and another relation (B) which had exactly
, but we don't need the duplicate x in the inner tuples; is there an
efficient way to just render this?
( x, {(a1,b1), (a2,b2)} )
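One hedged possibility, assuming the bag came from a GROUP on x: project
only the non-key columns out of the grouped bag (relation and field names
here are assumptions):

```pig
R = LOAD 'input' AS (x:chararray, a:chararray, b:chararray);
G = GROUP R BY x;

-- R.(a, b) projects just a and b from each tuple in the bag, dropping x.
OUT = FOREACH G GENERATE group AS x, R.(a, b);
```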
-Original Message-
From: Pradeep Gollakota [mailto:pradeep...@gmail.com]
Sent: Thursday, May 23, 2013 10:05 AM
To: user@pig.apache.org
Subject: Re
Hi All,
I'm a beginner pig user and this is my first post to the Pig mailing list.
Anyway, to answer your question, the first thing that comes to my mind is
that Pig may not be able to do a complex join like that.
However, you can first flatten the bag in A, then do your join and then do
a
( '3', {( '5' ),('6')} )
dump X
(( '3', {( '5' ),('6')} ),)
dump Y
({})
dump Z
(( '3', {( '5' ),('6')} ))
On Wed, May 22, 2013 at 8:25 PM, Pradeep Gollakota pradeep...@gmail.com
wrote:
Hi All,
I'm a beginner pig user and this is my first post to the Pig mailing
list.
Anyway
[
https://issues.apache.org/jira/browse/JENA-402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Pradeep Gollakota updated JENA-402:
---
Attachment: JENA-402-1.patch
Moved *.rules from jena-core/etc/ to jena-core/src/main/resources
[
https://issues.apache.org/jira/browse/JENA-402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13585287#comment-13585287
]
Pradeep Gollakota commented on JENA-402:
This appears to be complete. Should
[
https://issues.apache.org/jira/browse/GIRAPH-285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13582405#comment-13582405
]
Pradeep Gollakota commented on GIRAPH-285:
--
Any progress on this guys? The 0.2
[
https://issues.apache.org/jira/browse/JENA-228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Pradeep Gollakota updated JENA-228:
---
Attachment: JENA-228-1.patch
Submitting an initial patch. I chose to intercept the query
[
https://issues.apache.org/jira/browse/JENA-228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13539448#comment-13539448
]
Pradeep Gollakota commented on JENA-228:
I'd like to start working on this if I may
[
https://issues.apache.org/jira/browse/AVRO-575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13539460#comment-13539460
]
Pradeep Gollakota commented on AVRO-575:
This JIRA seems to be OBE. The patch
Pradeep Gollakota created AVRO-1180:
---
Summary: Broken links on Code Review Checklist page on confluence
Key: AVRO-1180
URL: https://issues.apache.org/jira/browse/AVRO-1180
Project: Avro
Pradeep Gollakota created ACCUMULO-736:
--
Summary: Add Column Pagination Filter
Key: ACCUMULO-736
URL: https://issues.apache.org/jira/browse/ACCUMULO-736
Project: Accumulo
Issue Type
[
https://issues.apache.org/jira/browse/ACCUMULO-736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Pradeep Gollakota updated ACCUMULO-736:
---
Issue Type: Wish (was: Bug)
Add Column Pagination Filter
[
https://issues.apache.org/jira/browse/ACCUMULO-736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13440033#comment-13440033
]
Pradeep Gollakota commented on ACCUMULO-736:
I myself have extremely limited
[
https://issues.apache.org/jira/browse/GIRAPH-182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13252127#comment-13252127
]
Pradeep Gollakota commented on GIRAPH-182:
--
Thanks for the review Jakob.
* I
Feature
Components: lib
Reporter: Pradeep Gollakota
Priority: Minor
SequenceFile's are heavily used in Hadoop. We should provide
SequenceFileVertexOutputFormat. Since SequenceFileVertexInputFormat is already
provided, it makes sense to also provide a mirroring
[
https://issues.apache.org/jira/browse/GIRAPH-182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13251183#comment-13251183
]
Pradeep Gollakota commented on GIRAPH-182:
--
Would be glad
[
https://issues.apache.org/jira/browse/GIRAPH-182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Pradeep Gollakota updated GIRAPH-182:
-
Attachment: GIRAPH-182-1.patch
Implemented an abstract SequenceFileVertexOutputFormat