Re: HDFS issue

2014-03-25 Thread Keren Ouaknine
Issue solved. I configured Eclipse with additional environment variables, and
that fixed the error :)
Thanks.


On Mon, Mar 24, 2014 at 2:12 PM, Keren Ouaknine ker...@gmail.com wrote:

 Hello,

 I encounter an HDFS error running Pig from eclipse. The error doesn't
 occur when I run Pig from the command line, as I successfully connect to:
 *Connecting to hadoop file system at: hdfs://localhost:54310*

 However, trying to debug Pig's Main class from eclipse I get the following
 error:
 *2014-03-24 14:01:13,629 [main] INFO
  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine -
 Connecting to hadoop file system at: file:///*
 *2014-03-24 14:01:14,775 [main] ERROR org.apache.pig.PigServer - exception
 during parsing: Error during parsing. Failed to create DataStorage*

 I added an entry pointing to the conf file to the Eclipse classpath, and also
 made sure the environment variables in the Run Configurations were set as
 follows:
 HADOOP_HOME=/home/kereno/Documents/hadoop/hadoop-1.2.0
 HADOOP_CONF_DIR=/home/kereno/Documents/hadoop/hadoop-1.2.0/conf
 HADOOPDIR=/home/kereno/Documents/hadoop/hadoop-1.2.0/conf

 Any clue what can be the problem?

 I added a screenshot of the error on:
 http://kereno.com/hdfs_error.jpg


 Thanks,
 Keren

 --
 Keren Ouaknine
 www.kereno.com




-- 
Keren Ouaknine
www.kereno.com
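
The log line "Connecting to hadoop file system at: file:///" suggests that the Eclipse launch did not see the cluster's core-site.xml, since file:/// is Hadoop's built-in default for fs.default.name. Below is a minimal Python sketch of that check, assuming a Hadoop 1.x style conf directory. The helper name default_fs is hypothetical and is not part of Hadoop or Pig.

```python
import os
import xml.etree.ElementTree as ET

# Hadoop's fallback value for fs.default.name when no site config overrides it.
HADOOP_BUILTIN_DEFAULT = "file:///"


def default_fs(conf_dir):
    """Return fs.default.name from conf_dir/core-site.xml, or the built-in default.

    Hypothetical helper for illustration; it only mimics the lookup, it does
    not call into Hadoop.
    """
    if not conf_dir:
        return HADOOP_BUILTIN_DEFAULT
    path = os.path.join(conf_dir, "core-site.xml")
    if not os.path.isfile(path):
        return HADOOP_BUILTIN_DEFAULT
    root = ET.parse(path).getroot()
    for prop in root.findall("property"):
        if prop.findtext("name", "").strip() == "fs.default.name":
            return prop.findtext("value", "").strip()
    return HADOOP_BUILTIN_DEFAULT


if __name__ == "__main__":
    # HADOOP_CONF_DIR is the variable named in the original message.
    print(default_fs(os.environ.get("HADOOP_CONF_DIR")))
```

If this prints file:/// when run with the same environment as the Eclipse launch, the conf directory is probably not reaching the process.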


pig-0.12.0+PIG-3285: Encounter NoClassDefFoundError: org.cloudera.htrace.Trace during reading hbase table in pig grunt

2014-03-25 Thread lulynn_2008
Hi All,
I am reading an HBase table as follows:

 A = LOAD 'APE1_RATED_EVENT' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('', '-loadKey true') AS (id:bytearray);
 B = GROUP A BY id;
 X = FOREACH B GENERATE COUNT_STAR(A);
 DUMP X;

The job failed, and I found the following error in the Hadoop task log. In
PIG-3285, htrace*.jar was added via
addClassToJobIfExists(job, "org.cloudera.htrace.Trace");
Any idea why this issue still happens? Thanks


ERROR:
2014-03-24 23:44:52,090 ERROR org.apache.hadoop.hbase.mapreduce.TableInputFormat: java.io.IOException: java.lang.reflect.InvocationTargetException
    at org.apache.hadoop.hbase.client.HConnectionManager.createConnection(HConnectionManager.java:383)
    at org.apache.hadoop.hbase.client.HConnectionManager.createConnection(HConnectionManager.java:360)
    at org.apache.hadoop.hbase.client.HConnectionManager.getConnection(HConnectionManager.java:244)
    at org.apache.hadoop.hbase.client.HTable.<init>(HTable.java:187)
    at org.apache.hadoop.hbase.client.HTable.<init>(HTable.java:149)
    at org.apache.hadoop.hbase.mapreduce.TableInputFormat.setConf(TableInputFormat.java:99)
    at org.apache.pig.backend.hadoop.hbase.HBaseTableInputFormat$HBaseTableIFBuilder.build(HBaseTableInputFormat.java:78)
    at org.apache.pig.backend.hadoop.hbase.HBaseStorage.getInputFormat(HBaseStorage.java:669)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.createRecordReader(PigInputFormat.java:117)
    at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.<init>(MapTask.java:487)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:729)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:368)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
    at java.security.AccessController.doPrivileged(AccessController.java:362)
    at javax.security.auth.Subject.doAs(Subject.java:573)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1502)
    at org.apache.hadoop.mapred.Child.main(Child.java:249)
Caused by: java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:56)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:39)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:527)
    at org.apache.hadoop.hbase.client.HConnectionManager.createConnection(HConnectionManager.java:381)
    ... 16 more
Caused by: java.lang.NoClassDefFoundError: org.cloudera.htrace.Trace
    at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.exists(RecoverableZooKeeper.java:196)
    at org.apache.hadoop.hbase.zookeeper.ZKUtil.checkExists(ZKUtil.java:479)
    at org.apache.hadoop.hbase.zookeeper.ZKClusterId.readClusterIdZNode(ZKClusterId.java:65)
    at org.apache.hadoop.hbase.client.ZooKeeperRegistry.getClusterId(ZooKeeperRegistry.java:83)
    at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.retrieveClusterId(HConnectionManager.java:794)
    at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.<init>(HConnectionManager.java:627)
    ... 21 more
Caused by: java.lang.ClassNotFoundException: org.cloudera.htrace.Trace
    at java.net.URLClassLoader.findClass(URLClassLoader.java:434)
    at java.lang.ClassLoader.loadClassHelper(ClassLoader.java:703)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:682)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:358)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:665)
    ... 27 more
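
The final ClassNotFoundException means that no jar on the task's classpath provided org/cloudera/htrace/Trace.class. Before looking at how jars are shipped with the job, it may help to confirm which local jars contain that class at all. Below is a minimal Python sketch of such a check. The helper names are hypothetical, and the directory you pass in (for example an HBase lib directory) is an assumption about your installation.

```python
import glob
import os
import zipfile


def class_entry(class_name):
    """Convert a dotted Java class name to its entry path inside a jar."""
    return class_name.replace(".", "/") + ".class"


def jars_providing(class_name, jar_dir):
    """Return the jars in jar_dir (sorted by path) that contain class_name."""
    entry = class_entry(class_name)
    hits = []
    for jar in sorted(glob.glob(os.path.join(jar_dir, "*.jar"))):
        try:
            with zipfile.ZipFile(jar) as zf:
                if entry in zf.namelist():
                    hits.append(jar)
        except zipfile.BadZipFile:
            continue  # skip files that are not valid jar archives
    return hits
```

Calling jars_providing("org.cloudera.htrace.Trace", some_lib_dir) and getting an empty list would suggest that the jar is missing on the client side, so there is nothing for the job setup to ship.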




Re: Could not estimate number of reducers

2014-03-25 Thread Vincent Barat

I hit https://issues.apache.org/jira/browse/PIG-3512

On 24/03/2014 14:40, Vincent Barat wrote:

Hi,

Since I moved from Pig 0.10.0 to 0.11.0 or 0.12.0, the estimation
of the number of reducers no longer works.


My script:

A = load 'data';
B = group A by $0;
store B into 'out';

My data:

grunt> ls
hdfs://computation-master.dev.ubithere.com:9000/user/root/.staging dir
hdfs://computation-master.dev.ubithere.com:9000/user/root/datar 31908911680


When I run my script (see the last line):

Apache Pig version 0.12.1-SNAPSHOT (rexported) compiled Feb 06 
2014, 16:57:49

Logging error messages to: /root/pig.log
Default bootup file /root/.pigbootup not found
Connecting to hadoop file system at: 
hdfs://computation-master.dev.ubithere.com:9000
Connecting to map-reduce job tracker at: 
computation-master.dev.ubithere.com:9001

Pig features used in the script: GROUP_BY
{RULES_ENABLED=[AddForEach, ColumnMapKeyPrune, 
DuplicateForEachColumnRewrite, GroupByConstParallelSetter, 
ImplicitSplitInserter, LimitOptimizer, LoadTypeCastInserter, 
MergeFilter, MergeForEach, NewPartitionFilterOptimizer, 
PartitionFilterOptimizer, PushDownForEachFlatten, PushUpFilter, 
SplitFilter, StreamTypeCastInserter], 
RULES_DISABLED=[FilterLogicExpressionSimplifier]}

File concatenation threshold: 100 optimistic? false
MR plan size before optimization: 1
MR plan size after optimization: 1
Pig script settings are added to the job
mapred.job.reduce.markreset.buffer.percent is not set, set to 
default 0.3

creating jar file Job7470230163933306330.jar
jar file Job7470230163933306330.jar created
Setting up single store job
Reduce phase detected, estimating # of required reducers.
Using reducer estimator: 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.InputSizeReducerEstimator

BytesPerReducer=10 maxReducers=999 totalInputFileSize=-1
Could not estimate number of reducers and no requested or default 
parallelism set. Defaulting to 1 reducer.

Setting Parallelism to 1

I tried to debug; in the source code below,
PlanHelper.getPhysicalOperators always returns an empty list.


    public int estimateNumberOfReducers(Job job, MapReduceOper mapReduceOper) throws IOException {
        Configuration conf = job.getConfiguration();

        long bytesPerReducer = conf.getLong(BYTES_PER_REDUCER_PARAM, DEFAULT_BYTES_PER_REDUCER);
        int maxReducers = conf.getInt(MAX_REDUCER_COUNT_PARAM, DEFAULT_MAX_REDUCER_COUNT_PARAM);

        List<POLoad> poLoads = PlanHelper.getPhysicalOperators(mapReduceOper.mapPlan, POLoad.class);
        long totalInputFileSize = getTotalInputFileSize(conf, poLoads, job);


Any idea ?

Thanks for your help
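
In the log above, totalInputFileSize=-1 means the estimator had no input size to work with, which is consistent with an empty list of load operators. For reference, here is a Python sketch of the arithmetic a size-based estimator of this kind performs: one reducer per bytes_per_reducer of input, clamped to the range 1..max_reducers. This is an illustration, not a copy of Pig's code. The function name, the -1 "unknown" return value, and the 1,000,000,000 bytes-per-reducer figure in the comment are assumptions made for the example.

```python
def estimate_reducers(total_input_bytes, bytes_per_reducer, max_reducers):
    """Size-based reducer estimate, clamped to [1, max_reducers].

    Returns -1 when the input size is unknown (negative), so the caller has
    to fall back to some default. Hypothetical helper for illustration.
    """
    if total_input_bytes < 0:
        return -1
    # Integer ceiling division avoids floating-point rounding on large sizes.
    reducers = (total_input_bytes + bytes_per_reducer - 1) // bytes_per_reducer
    return max(1, min(max_reducers, reducers))


# With the 31908911680-byte input listed above and an assumed
# 1,000,000,000 bytes per reducer, the estimate would be 32.
print(estimate_reducers(31908911680, 1000000000, 999))
```

As the log message itself notes, the fallback to 1 reducer only applies when no parallelism is requested, so setting it explicitly (a PARALLEL clause on the GROUP, or set default_parallel) should sidestep the estimator while the bug is open.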




Recordings from Pig user meetup at Linkedin, Mar 14

2014-03-25 Thread Jarek Jarcec Cecho
Sadly I was not able to attend the last Bay Area user meetup at LinkedIn that 
was held on March 14. I'm very interested to see some of the presentations, so 
I'm wondering if there are plans to publish the recordings?

Jarcec




Any way to join two aliases without using CROSS

2014-03-25 Thread Christopher Surage
I am trying to perform the following action, but the only solution I have
been able to come up with is using a CROSS, but I don't want to use that
statement as it is a very expensive process.

(1,2,3,4,5)  (10,11)
(1,2,4,5,7)  (10,11)
(1,5,7,8,9)  (10,11)


I want to make it
(1,2,3,4,5,10,11)
(1,2,4,5,7,10,11)
(1,5,7,8,9,10,11)

any help would be much appreciated,

Chris


Re: Any way to join two aliases without using CROSS

2014-03-25 Thread Pradeep Gollakota
I don't understand what you're trying to do from your example.

If you perform a cross on the data you have, the output will be the
following:

(1,2,3,4,5,10,11)
(1,2,3,4,5,10,11)
(1,2,3,4,5,10,11)
(1,2,4,5,7,10,11)
(1,2,4,5,7,10,11)
(1,2,4,5,7,10,11)
(1,5,7,8,9,10,11)
(1,5,7,8,9,10,11)
(1,5,7,8,9,10,11)

On this, you'll have to do a distinct to get what you're looking for.

Let's change the example a little bit so we get a more clear understanding
of your problem. What would be the output if your two relations looked as
follows:

(1,2,3,4,5)  (10,11)
(1,2,4,5,7)  (10,12)
(1,5,7,8,9)  (10,13)


On Tue, Mar 25, 2014 at 12:18 PM, Shahab Yunus shahab.yu...@gmail.com wrote:

 Have you tried iterating over the first relation and in the nested
 *generate* clause, always appending the second relation? Your top level
 looping is on first relation but in the nested block you are sort of
 hardcoding appending of second relation.

 I am referring to the examples like in  Example: Nested Blocks section
 http://pig.apache.org/docs/r0.10.0/basic.html#foreach


 On Tue, Mar 25, 2014 at 3:01 PM, Christopher Surage csur...@gmail.com
 wrote:

  I am trying to perform the following action, but the only solution I have
  been able to come up with is using a CROSS, but I don't want to use that
  statement as it is a very expensive process.
 
  (1,2,3,4,5)  (10,11)
  (1,2,4,5,7)  (10,11)
  (1,5,7,8,9)  (10,11)
 
 
  I want to make it
  (1,2,3,4,5,10,11)
  (1,2,4,5,7,10,11)
  (1,5,7,8,9,10,11)
 
  any help would be much appreciated,
 
  Chris
 



Re: Any way to join two aliases without using CROSS

2014-03-25 Thread Christopher Surage
The output I would like to see is

(1,2,3,4,5,10,11)
(1,2,4,5,7,10,12)
(1,5,7,8,9,10,13)


On Tue, Mar 25, 2014 at 3:58 PM, Pradeep Gollakota pradeep...@gmail.com wrote:

 I don't understand what you're trying to do from your example.

 If you perform a cross on the data you have, the output will be the
 following:

 (1,2,3,4,5,10,11)
 (1,2,3,4,5,10,11)
 (1,2,3,4,5,10,11)
 (1,2,4,5,7,10,11)
 (1,2,4,5,7,10,11)
 (1,2,4,5,7,10,11)
 (1,5,7,8,9,10,11)
 (1,5,7,8,9,10,11)
 (1,5,7,8,9,10,11)

 On this, you'll have to do a distinct to get what you're looking for.

 Let's change the example a little bit so we get a more clear understanding
 of your problem. What would be the output if your two relations looked as
 follows:

 (1,2,3,4,5)  (10,11)
 (1,2,4,5,7)  (10,12)
 (1,5,7,8,9)  (10,13)


 On Tue, Mar 25, 2014 at 12:18 PM, Shahab Yunus shahab.yu...@gmail.com
 wrote:

  Have you tried iterating over the first relation and in the nested
  *generate* clause, always appending the second relation? Your top level
  looping is on first relation but in the nested block you are sort of
  hardcoding appending of second relation.
 
  I am referring to the examples like in  Example: Nested Blocks section
  http://pig.apache.org/docs/r0.10.0/basic.html#foreach
 
 
  On Tue, Mar 25, 2014 at 3:01 PM, Christopher Surage csur...@gmail.com
  wrote:
 
   I am trying to perform the following action, but the only solution I
 have
   been able to come up with is using a CROSS, but I don't want to use
 that
   statement as it is a very expensive process.
  
   (1,2,3,4,5)  (10,11)
   (1,2,4,5,7)  (10,11)
   (1,5,7,8,9)  (10,11)
  
  
   I want to make it
   (1,2,3,4,5,10,11)
   (1,2,4,5,7,10,11)
   (1,5,7,8,9,10,11)
  
   any help would be much appreciated,
  
   Chris
  
 



Re: Any way to join two aliases without using CROSS

2014-03-25 Thread John Meagher
Try this:  http://pig.apache.org/docs/r0.11.0/basic.html#rank
Rank each data set then join on the rank.

On Tue, Mar 25, 2014 at 4:03 PM, Christopher Surage csur...@gmail.com wrote:
 The output I would like to see is

 (1,2,3,4,5,10,11)
 (1,2,4,5,7,10,12)
 (1,5,7,8,9,10,13)


 On Tue, Mar 25, 2014 at 3:58 PM, Pradeep Gollakota 
 pradeep...@gmail.com wrote:

 I don't understand what you're trying to do from your example.

 If you perform a cross on the data you have, the output will be the
 following:

 (1,2,3,4,5,10,11)
 (1,2,3,4,5,10,11)
 (1,2,3,4,5,10,11)
 (1,2,4,5,7,10,11)
 (1,2,4,5,7,10,11)
 (1,2,4,5,7,10,11)
 (1,5,7,8,9,10,11)
 (1,5,7,8,9,10,11)
 (1,5,7,8,9,10,11)

 On this, you'll have to do a distinct to get what you're looking for.

 Let's change the example a little bit so we get a more clear understanding
 of your problem. What would be the output if your two relations looked as
 follows:

 (1,2,3,4,5)  (10,11)
 (1,2,4,5,7)  (10,12)
 (1,5,7,8,9)  (10,13)


 On Tue, Mar 25, 2014 at 12:18 PM, Shahab Yunus shahab.yu...@gmail.com
 wrote:

  Have you tried iterating over the first relation and in the nested
  *generate* clause, always appending the second relation? Your top level
  looping is on first relation but in the nested block you are sort of
  hardcoding appending of second relation.
 
  I am referring to the examples like in  Example: Nested Blocks section
  http://pig.apache.org/docs/r0.10.0/basic.html#foreach
 
 
  On Tue, Mar 25, 2014 at 3:01 PM, Christopher Surage csur...@gmail.com
  wrote:
 
   I am trying to perform the following action, but the only solution I
 have
   been able to come up with is using a CROSS, but I don't want to use
 that
   statement as it is a very expensive process.
  
   (1,2,3,4,5)  (10,11)
   (1,2,4,5,7)  (10,11)
   (1,5,7,8,9)  (10,11)
  
  
   I want to make it
   (1,2,3,4,5,10,11)
   (1,2,4,5,7,10,11)
   (1,5,7,8,9,10,11)
  
   any help would be much appreciated,
  
   Chris
  
 



Re: Any way to join two aliases without using CROSS

2014-03-25 Thread Christopher Surage
yes


On Tue, Mar 25, 2014 at 4:07 PM, Shahab Yunus shahab.yu...@gmail.com wrote:

 Oh, sorry. This new example is something different from what I understood
 before. I thought you were only trying to append one relation (with one
 tuple) to another (which has more than one tuple).

 So essentially you want to loop over 2 collection and combine their tuples.
 Are they always going to be same size (number of tuples)?


 On Tue, Mar 25, 2014 at 4:03 PM, Christopher Surage csur...@gmail.com
 wrote:

  The output I would like to see is
 
  (1,2,3,4,5,10,11)
  (1,2,4,5,7,10,12)
  (1,5,7,8,9,10,13)
 
 
  On Tue, Mar 25, 2014 at 3:58 PM, Pradeep Gollakota pradeep...@gmail.com
  wrote:
 
   I don't understand what you're trying to do from your example.
  
   If you perform a cross on the data you have, the output will be the
   following:
  
   (1,2,3,4,5,10,11)
   (1,2,3,4,5,10,11)
   (1,2,3,4,5,10,11)
   (1,2,4,5,7,10,11)
   (1,2,4,5,7,10,11)
   (1,2,4,5,7,10,11)
   (1,5,7,8,9,10,11)
   (1,5,7,8,9,10,11)
   (1,5,7,8,9,10,11)
  
   On this, you'll have to do a distinct to get what you're looking for.
  
   Let's change the example a little bit so we get a more clear
  understanding
   of your problem. What would be the output if your two relations looked
 as
   follows:
  
   (1,2,3,4,5)  (10,11)
   (1,2,4,5,7)  (10,12)
   (1,5,7,8,9)  (10,13)
  
  
   On Tue, Mar 25, 2014 at 12:18 PM, Shahab Yunus shahab.yu...@gmail.com
   wrote:
  
Have you tried iterating over the first relation and in the nested
*generate* clause, always appending the second relation? Your top
 level
looping is on first relation but in the nested block you are sort of
hardcoding appending of second relation.
   
I am referring to the examples like in  Example: Nested Blocks
  section
http://pig.apache.org/docs/r0.10.0/basic.html#foreach
   
   
On Tue, Mar 25, 2014 at 3:01 PM, Christopher Surage 
 csur...@gmail.com
wrote:
   
 I am trying to perform the following action, but the only solution
 I
   have
 been able to come up with is using a CROSS, but I don't want to use
   that
 statement as it is a very expensive process.

 (1,2,3,4,5)  (10,11)
 (1,2,4,5,7)  (10,11)
 (1,5,7,8,9)  (10,11)


 I want to make it
 (1,2,3,4,5,10,11)
 (1,2,4,5,7,10,11)
 (1,5,7,8,9,10,11)

 any help would be much appreciated,

 Chris

   
  
 



Re: Any way to join two aliases without using CROSS

2014-03-25 Thread Christopher Surage
@ pradeep, I know what the cross product will do, but I have many lines in
many files. So the cross will take far too long to complete.


On Tue, Mar 25, 2014 at 3:58 PM, Pradeep Gollakota pradeep...@gmail.com wrote:

 I don't understand what you're trying to do from your example.

 If you perform a cross on the data you have, the output will be the
 following:

 (1,2,3,4,5,10,11)
 (1,2,3,4,5,10,11)
 (1,2,3,4,5,10,11)
 (1,2,4,5,7,10,11)
 (1,2,4,5,7,10,11)
 (1,2,4,5,7,10,11)
 (1,5,7,8,9,10,11)
 (1,5,7,8,9,10,11)
 (1,5,7,8,9,10,11)

 On this, you'll have to do a distinct to get what you're looking for.

 Let's change the example a little bit so we get a more clear understanding
 of your problem. What would be the output if your two relations looked as
 follows:

 (1,2,3,4,5)  (10,11)
 (1,2,4,5,7)  (10,12)
 (1,5,7,8,9)  (10,13)


 On Tue, Mar 25, 2014 at 12:18 PM, Shahab Yunus shahab.yu...@gmail.com
 wrote:

  Have you tried iterating over the first relation and in the nested
  *generate* clause, always appending the second relation? Your top level
  looping is on first relation but in the nested block you are sort of
  hardcoding appending of second relation.
 
  I am referring to the examples like in  Example: Nested Blocks section
  http://pig.apache.org/docs/r0.10.0/basic.html#foreach
 
 
  On Tue, Mar 25, 2014 at 3:01 PM, Christopher Surage csur...@gmail.com
  wrote:
 
   I am trying to perform the following action, but the only solution I
 have
   been able to come up with is using a CROSS, but I don't want to use
 that
   statement as it is a very expensive process.
  
   (1,2,3,4,5)  (10,11)
   (1,2,4,5,7)  (10,11)
   (1,5,7,8,9)  (10,11)
  
  
   I want to make it
   (1,2,3,4,5,10,11)
   (1,2,4,5,7,10,11)
   (1,5,7,8,9,10,11)
  
   any help would be much appreciated,
  
   Chris
  
 



Re: Any way to join two aliases without using CROSS

2014-03-25 Thread Andrew Musselman
John's answer about RANK sounds like it should solve your problem

 On Mar 25, 2014, at 1:13 PM, Christopher Surage csur...@gmail.com wrote:
 
 @ pradeep, I know what the cross product will do, but I have many lines in
 many files. So the cross will take far too long to complete.
 
 
 On Tue, Mar 25, 2014 at 3:58 PM, Pradeep Gollakota 
 pradeep...@gmail.com wrote:
 
 I don't understand what you're trying to do from your example.
 
 If you perform a cross on the data you have, the output will be the
 following:
 
 (1,2,3,4,5,10,11)
 (1,2,3,4,5,10,11)
 (1,2,3,4,5,10,11)
 (1,2,4,5,7,10,11)
 (1,2,4,5,7,10,11)
 (1,2,4,5,7,10,11)
 (1,5,7,8,9,10,11)
 (1,5,7,8,9,10,11)
 (1,5,7,8,9,10,11)
 
 On this, you'll have to do a distinct to get what you're looking for.
 
 Let's change the example a little bit so we get a more clear understanding
 of your problem. What would be the output if your two relations looked as
 follows:
 
 (1,2,3,4,5)  (10,11)
 (1,2,4,5,7)  (10,12)
 (1,5,7,8,9)  (10,13)
 
 
 On Tue, Mar 25, 2014 at 12:18 PM, Shahab Yunus shahab.yu...@gmail.com
 wrote:
 
 Have you tried iterating over the first relation and in the nested
 *generate* clause, always appending the second relation? Your top level
 looping is on first relation but in the nested block you are sort of
 hardcoding appending of second relation.
 
 I am referring to the examples like in  Example: Nested Blocks section
 http://pig.apache.org/docs/r0.10.0/basic.html#foreach
 
 
 On Tue, Mar 25, 2014 at 3:01 PM, Christopher Surage csur...@gmail.com
 wrote:
 
 I am trying to perform the following action, but the only solution I
 have
 been able to come up with is using a CROSS, but I don't want to use
 that
 statement as it is a very expensive process.
 
 (1,2,3,4,5)  (10,11)
 (1,2,4,5,7)  (10,11)
 (1,5,7,8,9)  (10,11)
 
 
 I want to make it
 (1,2,3,4,5,10,11)
 (1,2,4,5,7,10,11)
 (1,5,7,8,9,10,11)
 
 any help would be much appreciated,
 
 Chris
 


RE: Any way to join two aliases without using CROSS

2014-03-25 Thread william.dowling
Here is how to use rank and join for this problem:

sh cat xxx
1,2,3,4,5
1,2,4,5,7
1,5,7,8,9

sh cat yyy
10,11
10,12
10,13


a= load 'xxx' using PigStorage(',');
b= load 'yyy' using PigStorage(',');

a2 = rank a;
b2 = rank b;

c = join a2 by $0, b2 by $0;
c2 = order c by $6;
c3 = foreach c2 generate $1 .. $5, $7 ..;

dump c3
(1,2,3,4,5,10,11)
(1,2,4,5,7,10,12)
(1,5,7,8,9,10,13)
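
The script above pairs the n-th row of one relation with the n-th row of the other: each relation gets a row number, the join matches equal row numbers, and the final projection drops both number columns. The same idea is sketched below in Python, using the sample rows from this message. rank_rows and join_by_rank are hypothetical helper names, and the sketch runs in memory, so it says nothing about how Pig executes RANK on a cluster.

```python
def rank_rows(rows):
    """Prepend a 1-based row number to each row, like a rank with no sort key."""
    return [(i,) + tuple(row) for i, row in enumerate(rows, start=1)]


def join_by_rank(left, right):
    """Inner-join two ranked relations on their first field, dropping both rank fields."""
    right_by_rank = {row[0]: row[1:] for row in right}
    return [row[1:] + right_by_rank[row[0]]
            for row in sorted(left)          # keep output in rank order
            if row[0] in right_by_rank]      # inner join: unmatched ranks are dropped


a = [(1, 2, 3, 4, 5), (1, 2, 4, 5, 7), (1, 5, 7, 8, 9)]
b = [(10, 11), (10, 12), (10, 13)]
for row in join_by_rank(rank_rows(a), rank_rows(b)):
    print(row)
```

Because the join is an inner join, rows without a partner at the same position are dropped when the two relations differ in length.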


William F Dowling
Senior Technologist
Thomson Reuters


-Original Message-
From: Christopher Surage [mailto:csur...@gmail.com] 
Sent: Tuesday, March 25, 2014 4:03 PM
To: user@pig.apache.org
Subject: Re: Any way to join two aliases without using CROSS

The output I would like to see is

(1,2,3,4,5,10,11)
(1,2,4,5,7,10,12)
(1,5,7,8,9,10,13)


On Tue, Mar 25, 2014 at 3:58 PM, Pradeep Gollakota pradeep...@gmail.com wrote:

 I don't understand what you're trying to do from your example.

 If you perform a cross on the data you have, the output will be the
 following:

 (1,2,3,4,5,10,11)
 (1,2,3,4,5,10,11)
 (1,2,3,4,5,10,11)
 (1,2,4,5,7,10,11)
 (1,2,4,5,7,10,11)
 (1,2,4,5,7,10,11)
 (1,5,7,8,9,10,11)
 (1,5,7,8,9,10,11)
 (1,5,7,8,9,10,11)

 On this, you'll have to do a distinct to get what you're looking for.

 Let's change the example a little bit so we get a more clear understanding
 of your problem. What would be the output if your two relations looked as
 follows:

 (1,2,3,4,5)  (10,11)
 (1,2,4,5,7)  (10,12)
 (1,5,7,8,9)  (10,13)


 On Tue, Mar 25, 2014 at 12:18 PM, Shahab Yunus shahab.yu...@gmail.com
 wrote:

  Have you tried iterating over the first relation and in the nested
  *generate* clause, always appending the second relation? Your top level
  looping is on first relation but in the nested block you are sort of
  hardcoding appending of second relation.
 
  I am referring to the examples like in  Example: Nested Blocks section
  http://pig.apache.org/docs/r0.10.0/basic.html#foreach
 
 
  On Tue, Mar 25, 2014 at 3:01 PM, Christopher Surage csur...@gmail.com
  wrote:
 
   I am trying to perform the following action, but the only solution I
 have
   been able to come up with is using a CROSS, but I don't want to use
 that
   statement as it is a very expensive process.
  
   (1,2,3,4,5)  (10,11)
   (1,2,4,5,7)  (10,11)
   (1,5,7,8,9)  (10,11)
  
  
   I want to make it
   (1,2,3,4,5,10,11)
   (1,2,4,5,7,10,11)
   (1,5,7,8,9,10,11)
  
   any help would be much appreciated,
  
   Chris
  
 



Re: Any way to join two aliases without using CROSS

2014-03-25 Thread Pradeep Gollakota
CROSS is by definition a very very expensive operation. Regardless, CROSS
is the wrong operator for what you're trying to do.

As was suggested by others, you want to RANK the relations then do a JOIN
by the rank.


On Tue, Mar 25, 2014 at 1:27 PM, william.dowl...@thomsonreuters.com wrote:

 Here is how to use rank and join for this problem:

 sh cat xxx
 1,2,3,4,5
 1,2,4,5,7
 1,5,7,8,9

 sh cat yyy
 10,11
 10,12
 10,13


 a= load 'xxx' using PigStorage(',');
 b= load 'yyy' using PigStorage(',');

 a2 = rank a;
 b2 = rank b;

 c = join a2 by $0, b2 by $0;
 c2 = order c by $6;
 c3 = foreach c2 generate $1 .. $5, $7 ..;

 dump c3
 (1,2,3,4,5,10,11)
 (1,2,4,5,7,10,12)
 (1,5,7,8,9,10,13)


 William F Dowling
 Senior Technologist
 Thomson Reuters


 -Original Message-
 From: Christopher Surage [mailto:csur...@gmail.com]
 Sent: Tuesday, March 25, 2014 4:03 PM
 To: user@pig.apache.org
 Subject: Re: Any way to join two aliases without using CROSS

 The output I would like to see is

 (1,2,3,4,5,10,11)
 (1,2,4,5,7,10,12)
 (1,5,7,8,9,10,13)


 On Tue, Mar 25, 2014 at 3:58 PM, Pradeep Gollakota pradeep...@gmail.com
 wrote:

  I don't understand what you're trying to do from your example.
 
  If you perform a cross on the data you have, the output will be the
  following:
 
  (1,2,3,4,5,10,11)
  (1,2,3,4,5,10,11)
  (1,2,3,4,5,10,11)
  (1,2,4,5,7,10,11)
  (1,2,4,5,7,10,11)
  (1,2,4,5,7,10,11)
  (1,5,7,8,9,10,11)
  (1,5,7,8,9,10,11)
  (1,5,7,8,9,10,11)
 
  On this, you'll have to do a distinct to get what you're looking for.
 
  Let's change the example a little bit so we get a more clear
 understanding
  of your problem. What would be the output if your two relations looked as
  follows:
 
  (1,2,3,4,5)  (10,11)
  (1,2,4,5,7)  (10,12)
  (1,5,7,8,9)  (10,13)
 
 
  On Tue, Mar 25, 2014 at 12:18 PM, Shahab Yunus shahab.yu...@gmail.com
  wrote:
 
   Have you tried iterating over the first relation and in the nested
   *generate* clause, always appending the second relation? Your top level
   looping is on first relation but in the nested block you are sort of
   hardcoding appending of second relation.
  
   I am referring to the examples like in  Example: Nested Blocks
 section
   http://pig.apache.org/docs/r0.10.0/basic.html#foreach
  
  
   On Tue, Mar 25, 2014 at 3:01 PM, Christopher Surage csur...@gmail.com
   wrote:
  
I am trying to perform the following action, but the only solution I
  have
been able to come up with is using a CROSS, but I don't want to use
  that
statement as it is a very expensive process.
   
(1,2,3,4,5)  (10,11)
(1,2,4,5,7)  (10,11)
(1,5,7,8,9)  (10,11)
   
   
I want to make it
(1,2,3,4,5,10,11)
(1,2,4,5,7,10,11)
(1,5,7,8,9,10,11)
   
any help would be much appreciated,
   
Chris
   
  
 



Re: Any way to join two aliases without using CROSS

2014-03-25 Thread Christopher Surage
I don't think my version of PIG supports the rank function, I keep getting
Internal Error. I would update it, but I am not in control of the cluster.


On Tue, Mar 25, 2014 at 4:16 PM, Andrew Musselman 
andrew.mussel...@gmail.com wrote:

 John's answer about RANK sounds like it should solve your problem

  On Mar 25, 2014, at 1:13 PM, Christopher Surage csur...@gmail.com
 wrote:
 
  @ pradeep, I know what the cross product will do, but I have many lines
 in
  many files. So the cross will take far too long to complete.
 
 
  On Tue, Mar 25, 2014 at 3:58 PM, Pradeep Gollakota pradeep...@gmail.com
 wrote:
 
  I don't understand what you're trying to do from your example.
 
  If you perform a cross on the data you have, the output will be the
  following:
 
  (1,2,3,4,5,10,11)
  (1,2,3,4,5,10,11)
  (1,2,3,4,5,10,11)
  (1,2,4,5,7,10,11)
  (1,2,4,5,7,10,11)
  (1,2,4,5,7,10,11)
  (1,5,7,8,9,10,11)
  (1,5,7,8,9,10,11)
  (1,5,7,8,9,10,11)
 
  On this, you'll have to do a distinct to get what you're looking for.
 
  Let's change the example a little bit so we get a more clear
 understanding
  of your problem. What would be the output if your two relations looked
 as
  follows:
 
  (1,2,3,4,5)  (10,11)
  (1,2,4,5,7)  (10,12)
  (1,5,7,8,9)  (10,13)
 
 
  On Tue, Mar 25, 2014 at 12:18 PM, Shahab Yunus shahab.yu...@gmail.com
  wrote:
 
  Have you tried iterating over the first relation and in the nested
  *generate* clause, always appending the second relation? Your top level
  looping is on first relation but in the nested block you are sort of
  hardcoding appending of second relation.
 
  I am referring to the examples like in  Example: Nested Blocks
 section
  http://pig.apache.org/docs/r0.10.0/basic.html#foreach
 
 
  On Tue, Mar 25, 2014 at 3:01 PM, Christopher Surage csur...@gmail.com
  wrote:
 
  I am trying to perform the following action, but the only solution I
  have
  been able to come up with is using a CROSS, but I don't want to use
  that
  statement as it is a very expensive process.
 
  (1,2,3,4,5)  (10,11)
  (1,2,4,5,7)  (10,11)
  (1,5,7,8,9)  (10,11)
 
 
  I want to make it
  (1,2,3,4,5,10,11)
  (1,2,4,5,7,10,11)
  (1,5,7,8,9,10,11)
 
  any help would be much appreciated,
 
  Chris
 



Re: Any way to join two aliases without using CROSS

2014-03-25 Thread Andrew Musselman
In that situation you could write a script that tacks on the equivalent value 
that rank does, and stream the ordered relations through it.

I'm assuming you have a sense of order on both these relations.

After that join like you would after rank.

I'm not at a computer so can't type up an example.
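
A sketch of the kind of script described above, in Python: it reads rows and writes each one back prefixed with a running row number and a tab. The important assumption is that all rows of a relation pass through a single instance of the script, in the desired order. If the stream is split across several tasks, each instance starts counting from 1 again and the numbers collide. The function names are hypothetical.

```python
import sys


def number_lines(lines, start=1):
    """Yield each input line prefixed with a running row number and a tab."""
    for n, line in enumerate(lines, start=start):
        yield "%d\t%s" % (n, line.rstrip("\n"))


def main(stdin=sys.stdin, stdout=sys.stdout):
    """Copy stdin to stdout, numbering the rows as they pass through."""
    for out in number_lines(stdin):
        stdout.write(out + "\n")
```

Saved to a file with a final call to main(), this could be plugged into Pig's STREAM ... THROUGH operator for each relation, after which the two numbered relations can be joined on the first field and the number columns projected away.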

 On Mar 25, 2014, at 1:57 PM, Christopher Surage csur...@gmail.com wrote:
 
 I don't think my version of PIG supports the rank function, I keep getting
 Internal Error. I would update it, but I am not in control of the cluster.
 
 
 On Tue, Mar 25, 2014 at 4:16 PM, Andrew Musselman 
 andrew.mussel...@gmail.com wrote:
 
 John's answer about RANK sounds like it should solve your problem
 
 On Mar 25, 2014, at 1:13 PM, Christopher Surage csur...@gmail.com
 wrote:
 
 @ pradeep, I know what the cross product will do, but I have many lines
 in
 many files. So the cross will take far too long to complete.
 
 
 On Tue, Mar 25, 2014 at 3:58 PM, Pradeep Gollakota pradeep...@gmail.com
 wrote:
 
 I don't understand what you're trying to do from your example.
 
 If you perform a cross on the data you have, the output will be the
 following:
 
 (1,2,3,4,5,10,11)
 (1,2,3,4,5,10,11)
 (1,2,3,4,5,10,11)
 (1,2,4,5,7,10,11)
 (1,2,4,5,7,10,11)
 (1,2,4,5,7,10,11)
 (1,5,7,8,9,10,11)
 (1,5,7,8,9,10,11)
 (1,5,7,8,9,10,11)
 
 On this, you'll have to do a distinct to get what you're looking for.
 
 Let's change the example a little bit so we get a more clear
 understanding
 of your problem. What would be the output if your two relations looked
 as
 follows:
 
 (1,2,3,4,5)  (10,11)
 (1,2,4,5,7)  (10,12)
 (1,5,7,8,9)  (10,13)
 
 
 On Tue, Mar 25, 2014 at 12:18 PM, Shahab Yunus shahab.yu...@gmail.com
 wrote:
 
 Have you tried iterating over the first relation and in the nested
 *generate* clause, always appending the second relation? Your top level
 looping is on first relation but in the nested block you are sort of
 hardcoding appending of second relation.
 
 I am referring to the examples like in  Example: Nested Blocks
 section
 http://pig.apache.org/docs/r0.10.0/basic.html#foreach
 
 
 On Tue, Mar 25, 2014 at 3:01 PM, Christopher Surage csur...@gmail.com
 wrote:
 
 I am trying to perform the following action, but the only solution I
 have
 been able to come up with is using a CROSS, but I don't want to use
 that
 statement as it is a very expensive process.
 
 (1,2,3,4,5)  (10,11)
 (1,2,4,5,7)  (10,11)
 (1,5,7,8,9)  (10,11)
 
 
 I want to make it
 (1,2,3,4,5,10,11)
 (1,2,4,5,7,10,11)
 (1,5,7,8,9,10,11)
 
 any help would be much appreciated,
 
 Chris
 


Re: Any way to join two aliases without using CROSS

2014-03-25 Thread James
Hello,

There is a similar UDF in DataFu named Enumerate. 
http://datafu.incubator.apache.org/docs/datafu/1.2.0/datafu/pig/bags/Enumerate.html

I hope it helps.

James

Re: Re: Any way to join two aliases without using CROSS

2014-03-25 Thread Pradeep Gollakota
Unfortunately, the Enumerate UDF from DataFu would not work in this case.
The UDF works on Bags and in this case, we want to enumerate a relation.
Implementing RANK is a very tricky thing to do correctly. I'm not even sure
if it's doable just by using Pig operators, UDFs or macros. Best option is
probably to request a Pig upgrade.
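
Part of what makes this tricky is that a relation is processed in parallel splits, so no single task sees a global row count. One common way around that is two passes: first count the rows in each split, then give each split a starting number equal to one plus the rows in all earlier splits. This is sketched below in Python as an illustration only, not as a description of how Pig implements RANK. The function names and the list-of-lists split layout are hypothetical, and the sketch assumes the splits have a stable order.

```python
def split_offsets(split_counts):
    """Return the 1-based starting row number for each split, given per-split row counts."""
    starts = []
    next_start = 1
    for count in split_counts:
        starts.append(next_start)
        next_start += count
    return starts


def number_splits(splits):
    """Assign globally unique, consecutive row numbers across ordered splits."""
    counts = [len(rows) for rows in splits]      # pass 1: count rows per split
    numbered = []
    for offset, rows in zip(split_offsets(counts), splits):
        # pass 2: each split numbers its own rows, starting at its offset
        numbered.extend((offset + i,) + tuple(row) for i, row in enumerate(rows))
    return numbered
```

The second pass needs only one integer per split, which is why each split can number its rows independently once the counts are known.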


On Tue, Mar 25, 2014 at 6:21 PM, James alcaid1...@gmail.com wrote:

 Hello,

 There is a similar UDF in DataFu named Enumerate.

 http://datafu.incubator.apache.org/docs/datafu/1.2.0/datafu/pig/bags/Enumerate.html

 I hope it helps.

 James