Re: HDFS issue
Issue solved. I configured Eclipse with additional environment variables and that fixed the error :) Thanks. On Mon, Mar 24, 2014 at 2:12 PM, Keren Ouaknine ker...@gmail.com wrote: Hello, I encounter an HDFS error when running Pig from Eclipse. The error doesn't occur when I run Pig from the command line, as I successfully connect to: *Connecting to hadoop file system at: hdfs://localhost:54310* However, trying to debug Pig's Main class from Eclipse I get the following error: *2014-03-24 14:01:13,629 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: file:///* *2014-03-24 14:01:14,775 [main] ERROR org.apache.pig.PigServer - exception during parsing: Error during parsing. Failed to create DataStorage* I added an entry pointing to the conf file to the Eclipse classpath, and also made sure my environment variables in the Run configurations were set as follows: HADOOP_HOME=/home/kereno/Documents/hadoop/hadoop-1.2.0 HADOOP_CONF_DIR=/home/kereno/Documents/hadoop/hadoop-1.2.0/conf HADOOPDIR=/home/kereno/Documents/hadoop/hadoop-1.2.0/conf Any clue what the problem could be? I added a screenshot of the error at: http://kereno.com/hdfs_error.jpg Thanks, Keren -- Keren Ouaknine www.kereno.com -- Keren Ouaknine www.kereno.com
pig-0.12.0+PIG-3285: Encounter NoClassDefFoundError: org.cloudera.htrace.Trace during reading hbase table in pig grunt
Hi All, I am reading an HBase table as follows: A = LOAD 'APE1_RATED_EVENT' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('', '-loadKey true') AS (id:bytearray); B = GROUP A BY id; X = FOREACH B GENERATE COUNT_STAR(A); DUMP X; The job failed, and I found the following error in the Hadoop task log. In PIG-3285, htrace*.jar was added via addClassToJobIfExists(job, org.cloudera.htrace.Trace);. Any idea why this issue still happens? Thanks ERROR: 2014-03-24 23:44:52,090 ERROR org.apache.hadoop.hbase.mapreduce.TableInputFormat: java.io.IOException: java.lang.reflect.InvocationTargetException at org.apache.hadoop.hbase.client.HConnectionManager.createConnection(HConnectionManager.java:383) at org.apache.hadoop.hbase.client.HConnectionManager.createConnection(HConnectionManager.java:360) at org.apache.hadoop.hbase.client.HConnectionManager.getConnection(HConnectionManager.java:244) at org.apache.hadoop.hbase.client.HTable.init(HTable.java:187) at org.apache.hadoop.hbase.client.HTable.init(HTable.java:149) at org.apache.hadoop.hbase.mapreduce.TableInputFormat.setConf(TableInputFormat.java:99) at org.apache.pig.backend.hadoop.hbase.HBaseTableInputFormat$HBaseTableIFBuilder.build(HBaseTableInputFormat.java:78) at org.apache.pig.backend.hadoop.hbase.HBaseStorage.getInputFormat(HBaseStorage.java:669) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.createRecordReader(PigInputFormat.java:117) at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.init(MapTask.java:487) at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:729) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:368) at org.apache.hadoop.mapred.Child$4.run(Child.java:255) at java.security.AccessController.doPrivileged(AccessController.java:362) at javax.security.auth.Subject.doAs(Subject.java:573) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1502) at org.apache.hadoop.mapred.Child.main(Child.java:249) Caused by:
java.lang.reflect.InvocationTargetException at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:56) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:39) at java.lang.reflect.Constructor.newInstance(Constructor.java:527) at org.apache.hadoop.hbase.client.HConnectionManager.createConnection(HConnectionManager.java:381) ... 16 more Caused by: java.lang.NoClassDefFoundError: org.cloudera.htrace.Trace at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.exists(RecoverableZooKeeper.java:196) at org.apache.hadoop.hbase.zookeeper.ZKUtil.checkExists(ZKUtil.java:479) at org.apache.hadoop.hbase.zookeeper.ZKClusterId.readClusterIdZNode(ZKClusterId.java:65) at org.apache.hadoop.hbase.client.ZooKeeperRegistry.getClusterId(ZooKeeperRegistry.java:83) at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.retrieveClusterId(HConnectionManager.java:794) at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.init(HConnectionManager.java:627) ... 21 more Caused by: java.lang.ClassNotFoundException: org.cloudera.htrace.Trace at java.net.URLClassLoader.findClass(URLClassLoader.java:434) at java.lang.ClassLoader.loadClassHelper(ClassLoader.java:703) at java.lang.ClassLoader.loadClass(ClassLoader.java:682) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:358) at java.lang.ClassLoader.loadClass(ClassLoader.java:665) ... 27 more
Re: Could not estimate number of reducers
I hit https://issues.apache.org/jira/browse/PIG-3512 On 24/03/2014 14:40, Vincent Barat wrote: Hi, Since I moved from Pig 0.10.0 to 0.11.0 or 0.12.0, the estimation of the number of reducers no longer works. My script: A = load 'data'; B = group A by $0; store B into 'out'; My data: grunt> ls hdfs://computation-master.dev.ubithere.com:9000/user/root/.staging <dir> hdfs://computation-master.dev.ubithere.com:9000/user/root/datar 31908911680 When I run my script (see the last line): Apache Pig version 0.12.1-SNAPSHOT (rexported) compiled Feb 06 2014, 16:57:49 Logging error messages to: /root/pig.log Default bootup file /root/.pigbootup not found Connecting to hadoop file system at: hdfs://computation-master.dev.ubithere.com:9000 Connecting to map-reduce job tracker at: computation-master.dev.ubithere.com:9001 Pig features used in the script: GROUP_BY {RULES_ENABLED=[AddForEach, ColumnMapKeyPrune, DuplicateForEachColumnRewrite, GroupByConstParallelSetter, ImplicitSplitInserter, LimitOptimizer, LoadTypeCastInserter, MergeFilter, MergeForEach, NewPartitionFilterOptimizer, PartitionFilterOptimizer, PushDownForEachFlatten, PushUpFilter, SplitFilter, StreamTypeCastInserter], RULES_DISABLED=[FilterLogicExpressionSimplifier]} File concatenation threshold: 100 optimistic? false MR plan size before optimization: 1 MR plan size after optimization: 1 Pig script settings are added to the job mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3 creating jar file Job7470230163933306330.jar jar file Job7470230163933306330.jar created Setting up single store job Reduce phase detected, estimating # of required reducers. Using reducer estimator: org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.InputSizeReducerEstimator BytesPerReducer=10 maxReducers=999 totalInputFileSize=-1 Could not estimate number of reducers and no requested or default parallelism set. Defaulting to 1 reducer.
Setting Parallelism to 1 I tried to debug; in the source code below, PlanHelper.getPhysicalOperators always returns an empty list. public int estimateNumberOfReducers(Job job, MapReduceOper mapReduceOper) throws IOException { Configuration conf = job.getConfiguration(); long bytesPerReducer = conf.getLong(BYTES_PER_REDUCER_PARAM, DEFAULT_BYTES_PER_REDUCER); int maxReducers = conf.getInt(MAX_REDUCER_COUNT_PARAM, DEFAULT_MAX_REDUCER_COUNT_PARAM); List<POLoad> poLoads = PlanHelper.getPhysicalOperators(mapReduceOper.mapPlan, POLoad.class); long totalInputFileSize = getTotalInputFileSize(conf, poLoads, job); Any idea? Thanks for your help
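For context, the log above reports totalInputFileSize=-1, which is why the estimate fails and Pig falls back to 1 reducer. The sketch below shows, in Python rather than Pig's Java, the kind of calculation an input-size-based estimator typically performs. The function name, the ceiling division and the clamping are assumptions made for illustration; this is not a copy of Pig's source.

```python
def estimate_reducers(total_input_bytes, bytes_per_reducer, max_reducers):
    """Hypothetical sketch of an input-size-based reducer estimate.

    Returns -1 when the input size is unknown (negative), which mirrors
    the totalInputFileSize=-1 case in the log; the caller would then
    fall back to a default parallelism.
    """
    if total_input_bytes < 0:
        return -1
    # Integer ceiling division: one reducer per bytes_per_reducer of input.
    reducers = -(-total_input_bytes // bytes_per_reducer)
    # Clamp the result to the range [1, max_reducers].
    return max(1, min(max_reducers, reducers))
```

Under these assumptions, an empty list of load operators leaves the input size unknown, so no amount of tuning bytes_per_reducer or max_reducers changes the outcome; setting an explicit parallelism in the script would sidestep the estimate.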
Recordings from Pig user meetup at Linkedin, Mar 14
Sadly I was not able to attend the last bay area user meetup at Linkedin that was held on March 14. I'm very interested to see some of the presentations, so I'm wondering if there are plans to publish the recordings? Jarcec
Any way to join two aliases without using CROSS
I am trying to perform the following action, but the only solution I have been able to come up with is using a CROSS, but I don't want to use that statement as it is a very expensive process. (1,2,3,4,5) (10,11) (1,2,4,5,7) (10,11) (1,5,7,8,9) (10,11) I want to make it (1,2,3,4,5,10,11) (1,2,4,5,7,10,11) (1,5,7,8,9,10,11) any help would be much appreciated, Chris
Re: Any way to join two aliases without using CROSS
I don't understand what you're trying to do from your example. If you perform a cross on the data you have, the output will be the following: (1,2,3,4,5,10,11) (1,2,3,4,5,10,11) (1,2,3,4,5,10,11) (1,2,4,5,7,10,11) (1,2,4,5,7,10,11) (1,2,4,5,7,10,11) (1,5,7,8,9,10,11) (1,5,7,8,9,10,11) (1,5,7,8,9,10,11) On this, you'll have to do a distinct to get what you're looking for. Let's change the example a little bit so we get a more clear understanding of your problem. What would be the output if your two relations looked as follows: (1,2,3,4,5) (10,11) (1,2,4,5,7) (10,12) (1,5,7,8,9) (10,13) On Tue, Mar 25, 2014 at 12:18 PM, Shahab Yunus shahab.yu...@gmail.com wrote: Have you tried iterating over the first relation and in the nested *generate* clause, always appending the second relation? Your top level looping is on first relation but in the nested block you are sort of hardcoding appending of second relation. I am referring to the examples like in Example: Nested Blocks section http://pig.apache.org/docs/r0.10.0/basic.html#foreach On Tue, Mar 25, 2014 at 3:01 PM, Christopher Surage csur...@gmail.com wrote: I am trying to perform the following action, but the only solution I have been able to come up with is using a CROSS, but I don't want to use that statement as it is a very expensive process. (1,2,3,4,5) (10,11) (1,2,4,5,7) (10,11) (1,5,7,8,9) (10,11) I want to make it (1,2,3,4,5,10,11) (1,2,4,5,7,10,11) (1,5,7,8,9,10,11) any help would be much appreciated, Chris
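To make the cost of the CROSS-then-DISTINCT route concrete, here is a small Python sketch that mimics it on the sample data from this thread. It is only an in-memory illustration of the semantics, not Pig code: every tuple of the first relation is paired with every tuple of the second, so the intermediate result has len(a) * len(b) rows before the duplicates are removed.

```python
from itertools import product

# Sample relations from the original question.
a = [(1, 2, 3, 4, 5), (1, 2, 4, 5, 7), (1, 5, 7, 8, 9)]
b = [(10, 11), (10, 11), (10, 11)]

# CROSS: concatenate every tuple of a with every tuple of b.
crossed = [left + right for left, right in product(a, b)]

# DISTINCT: collapse the repeated rows (sorted only to get a stable order).
distinct = sorted(set(crossed))

print(len(crossed))  # number of intermediate rows
print(distinct)
```

With three rows on each side the intermediate result is small, but the row count grows as the product of the two relation sizes, which is the expense the original poster wants to avoid.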
Re: Any way to join two aliases without using CROSS
The output I would like to see is (1,2,3,4,5,10,11) (1,2,4,5,7,10,12) (1,5,7,8,9,10,13) On Tue, Mar 25, 2014 at 3:58 PM, Pradeep Gollakota pradeep...@gmail.com wrote: I don't understand what you're trying to do from your example. If you perform a cross on the data you have, the output will be the following: (1,2,3,4,5,10,11) (1,2,3,4,5,10,11) (1,2,3,4,5,10,11) (1,2,4,5,7,10,11) (1,2,4,5,7,10,11) (1,2,4,5,7,10,11) (1,5,7,8,9,10,11) (1,5,7,8,9,10,11) (1,5,7,8,9,10,11) On this, you'll have to do a distinct to get what you're looking for. Let's change the example a little bit so we get a more clear understanding of your problem. What would be the output if your two relations looked as follows: (1,2,3,4,5) (10,11) (1,2,4,5,7) (10,12) (1,5,7,8,9) (10,13)
Re: Any way to join two aliases without using CROSS
Try this: http://pig.apache.org/docs/r0.11.0/basic.html#rank Rank each data set then join on the rank. On Tue, Mar 25, 2014 at 4:03 PM, Christopher Surage csur...@gmail.com wrote: The output I would like to see is (1,2,3,4,5,10,11) (1,2,4,5,7,10,12) (1,5,7,8,9,10,13)
Re: Any way to join two aliases without using CROSS
yes On Tue, Mar 25, 2014 at 4:07 PM, Shahab Yunus shahab.yu...@gmail.com wrote: Oh, sorry. This new example is something different from what I understood before. I thought you were only trying to append one relation (with one tuple) to another (which has more than one tuple). So essentially you want to loop over 2 collection and combine their tuples. Are they always going to be same size (number of tuples)?
Re: Any way to join two aliases without using CROSS
@ pradeep, I know what the cross product will do, but I have many lines in many files. So the cross will take far too long to complete. On Tue, Mar 25, 2014 at 3:58 PM, Pradeep Gollakota pradeep...@gmail.com wrote: I don't understand what you're trying to do from your example. If you perform a cross on the data you have, the output will be the following: (1,2,3,4,5,10,11) (1,2,3,4,5,10,11) (1,2,3,4,5,10,11) (1,2,4,5,7,10,11) (1,2,4,5,7,10,11) (1,2,4,5,7,10,11) (1,5,7,8,9,10,11) (1,5,7,8,9,10,11) (1,5,7,8,9,10,11) On this, you'll have to do a distinct to get what you're looking for. Let's change the example a little bit so we get a more clear understanding of your problem. What would be the output if your two relations looked as follows: (1,2,3,4,5) (10,11) (1,2,4,5,7) (10,12) (1,5,7,8,9) (10,13)
Re: Any way to join two aliases without using CROSS
John's answer about RANK sounds like it should solve your problem On Mar 25, 2014, at 1:13 PM, Christopher Surage csur...@gmail.com wrote: @ pradeep, I know what the cross product will do, but I have many lines in many files. So the cross will take far too long to complete.
RE: Any way to join two aliases without using CROSS
Here is how to use rank and join for this problem: sh cat xxx 1,2,3,4,5 1,2,4,5,7 1,5,7,8,9 sh cat yyy 10,11 10,12 10,13 a = load 'xxx' using PigStorage(','); b = load 'yyy' using PigStorage(','); a2 = rank a; b2 = rank b; c = join a2 by $0, b2 by $0; c2 = order c by $6; c3 = foreach c2 generate $1 .. $5, $7 ..; dump c3 (1,2,3,4,5,10,11) (1,2,4,5,7,10,12) (1,5,7,8,9,10,13) William F Dowling Senior Technologist Thomson Reuters -Original Message- From: Christopher Surage [mailto:csur...@gmail.com] Sent: Tuesday, March 25, 2014 4:03 PM To: user@pig.apache.org Subject: Re: Any way to join two aliases without using CROSS The output I would like to see is (1,2,3,4,5,10,11) (1,2,4,5,7,10,12) (1,5,7,8,9,10,13)
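For readers who want to check the logic of the rank-and-join approach without a Pig installation, the following Python sketch reproduces it in memory. The helper names rank and join_by_rank are hypothetical and exist only for this illustration; the sketch assumes both relations are already in the desired order and that the rank is simply a 1-based row number.

```python
def rank(rows):
    """Prepend a 1-based row number to every tuple, like the extra first field RANK adds."""
    return [(i,) + tuple(row) for i, row in enumerate(rows, start=1)]


def join_by_rank(left, right):
    """Join two ranked relations on their first field and drop both rank columns."""
    right_by_rank = {row[0]: row[1:] for row in right}
    return [row[1:] + right_by_rank[row[0]] for row in left if row[0] in right_by_rank]


# Contents of the files xxx and yyy from the message above.
xxx = [(1, 2, 3, 4, 5), (1, 2, 4, 5, 7), (1, 5, 7, 8, 9)]
yyy = [(10, 11), (10, 12), (10, 13)]

paired = join_by_rank(rank(xxx), rank(yyy))
print(paired)
```

Unlike a cross product, each row is matched with exactly one partner, so the work grows with the number of rows rather than with the product of the two sizes.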
Re: Any way to join two aliases without using CROSS
CROSS is by definition a very very expensive operation. Regardless, CROSS is the wrong operator for what you're trying to do. As was suggested by others, you want to RANK the relations then do a JOIN by the rank. On Tue, Mar 25, 2014 at 1:27 PM, william.dowl...@thomsonreuters.com wrote: Here is how to use rank and join for this problem: sh cat xxx 1,2,3,4,5 1,2,4,5,7 1,5,7,8,9 sh cat yyy 10,11 10,12 10,13 a = load 'xxx' using PigStorage(','); b = load 'yyy' using PigStorage(','); a2 = rank a; b2 = rank b; c = join a2 by $0, b2 by $0; c2 = order c by $6; c3 = foreach c2 generate $1 .. $5, $7 ..; dump c3 (1,2,3,4,5,10,11) (1,2,4,5,7,10,12) (1,5,7,8,9,10,13) William F Dowling Senior Technologist Thomson Reuters
Re: Any way to join two aliases without using CROSS
I don't think my version of PIG supports the rank function, I keep getting Internal Error. I would update it, but I am not in control of the cluster. On Tue, Mar 25, 2014 at 4:16 PM, Andrew Musselman andrew.mussel...@gmail.com wrote: John's answer about RANK sounds like it should solve your problem
Re: Any way to join two aliases without using CROSS
In that situation you could write a script that tacks on the equivalent value that rank does, and stream the ordered relations through it. I'm assuming you have a sense of order on both these relations. After that join like you would after rank. I'm not at a computer so can't type up an example. On Mar 25, 2014, at 1:57 PM, Christopher Surage csur...@gmail.com wrote: I don't think my version of PIG supports the rank function, I keep getting Internal Error. I would update it, but I am not in control of the cluster.
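A minimal sketch of the kind of streaming script suggested above, written in Python as an assumption since the thread names no language: it reads lines and prefixes each with a running counter and a tab, so the counter can later serve as the join key. The function names are hypothetical. One caveat to keep in mind: as far as I know, a streamed script is started separately in each task, so the counter is only a global row number if the ordered relation passes through a single task.

```python
import sys


def number_lines(lines, start=1):
    """Yield each input line prefixed with a running counter and a tab."""
    for n, line in enumerate(lines, start=start):
        text = line.rstrip("\n")
        yield f"{n}\t{text}\n"


def main(stdin=sys.stdin, stdout=sys.stdout):
    """Copy stdin to stdout, adding the counter as a new first field."""
    stdout.writelines(number_lines(stdin))
```

To use it as a standalone script, call main() at the bottom of the file; both relations would be streamed through it and then joined on the new first field, as described above.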
Re: Any way to join two aliases without using CROSS
Hello, There is a similar UDF in DataFu named Enumerate. http://datafu.incubator.apache.org/docs/datafu/1.2.0/datafu/pig/bags/Enumerate.html I hope it helps. James
Re: Any way to join two aliases without using CROSS
Unfortunately, the Enumerate UDF from DataFu would not work in this case. The UDF works on Bags and in this case, we want to enumerate a relation. Implementing RANK is a very tricky thing to do correctly. I'm not even sure if it's doable just by using Pig operators, UDFs or macros. Best option is probably to request a Pig upgrade. On Tue, Mar 25, 2014 at 6:21 PM, James alcaid1...@gmail.com wrote: Hello, There is a similar UDF in DataFu named Enumerate. http://datafu.incubator.apache.org/docs/datafu/1.2.0/datafu/pig/bags/Enumerate.html I wish it may help. James