[jira] Created: (PIG-1660) Consider passing result of COUNT/COUNT_STAR to LIMIT
Consider passing result of COUNT/COUNT_STAR to LIMIT - Key: PIG-1660 URL: https://issues.apache.org/jira/browse/PIG-1660 Project: Pig Issue Type: Improvement Affects Versions: 0.7.0 Reporter: Viraj Bhat Fix For: 0.9.0

In realistic scenarios we need to split a dataset into segments using LIMIT, and we would like to achieve that goal within a single Pig script. Here is a case:

{code}
A = load '$DATA' using PigStorage(',') as (id, pvs);
B = group A by ALL;
C = foreach B generate COUNT_STAR(A) as row_cnt;

-- get the low 20% segment
D = order A by pvs;
E = limit D (C.row_cnt * 0.2);
store E into '$Eoutput';

-- get the high 20% segment
F = order A by pvs DESC;
G = limit F (C.row_cnt * 0.2);
store G into '$Goutput';
{code}

Since LIMIT only accepts constants, we have to split the operation into two steps in order to pass in the constants for the LIMIT statements. Please consider adding this feature so the processing can be more efficient.

Viraj

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
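Until LIMIT accepts expressions, the usual two-step workaround is a driver script: one Pig run materializes the count, the driver reads it back and passes the computed constant into a second run via -param. A minimal sketch of the driver arithmetic (the paths, param name, and the faked count are illustrative assumptions, not from the issue):

```shell
#!/bin/sh
# Step 1 would normally be: pig -f count.pig, storing alias C into HDFS,
# then reading it back with: hadoop fs -cat /user/viraj/rowcount/part-*
# Here the count is faked so the sketch stays self-contained.
ROW_CNT=12345

# LIMIT needs an integer constant, so compute 20% with integer arithmetic.
LIMIT_N=$((ROW_CNT * 20 / 100))
echo "limit=$LIMIT_N"

# Step 2 would then pass the constant into the real script:
# pig -param LIMIT_N=$LIMIT_N -f segments.pig
```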
[jira] Created: (PIG-1634) Multiple names for the "group" field
Multiple names for the "group" field Key: PIG-1634 URL: https://issues.apache.org/jira/browse/PIG-1634 Project: Pig Issue Type: New Feature Affects Versions: 0.7.0, 0.6.0, 0.5.0, 0.4.0, 0.3.0, 0.2.0, 0.1.0 Reporter: Viraj Bhat

I am hoping that in Pig, if I type
{quote}
c = cogroup a by foo, b by bar;
{quote}
the fields c.group, c.foo and c.bar should all map to c.$0. This would improve the readability of the Pig script. Here's a real use case:

{code}
pages = LOAD 'pages.dat' AS (url, pagerank);
visits = LOAD 'user_log.dat' AS (user_id, url);
page_visits = COGROUP pages BY url, visits BY url;
frequent_visits = FILTER page_visits BY COUNT(visits) >= 2;
answer = FOREACH frequent_visits GENERATE url, FLATTEN(pages.pagerank);
{code}

(The important part is the final GENERATE statement, which references the field "url", which was the grouping field in the earlier COGROUP.) To get it to work I have to write it in a less intuitive way. Maybe with the new parser changes in Pig 0.9 it would be easier to specify that.

Viraj

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-1633) Using an alias within Nested Foreach causes indeterminate behaviour
Using an alias within Nested Foreach causes indeterminate behaviour Key: PIG-1633 URL: https://issues.apache.org/jira/browse/PIG-1633 Project: Pig Issue Type: Bug Affects Versions: 0.7.0, 0.6.0, 0.5.0, 0.4.0 Reporter: Viraj Bhat

I have created a RANDOMINT function which generates random numbers between 0 and a specified value; for example, RANDOMINT(4) gives random numbers between 0 and 3 (inclusive).

{code}
$ hadoop fs -cat rand.dat
f
g
h
i
j
k
l
m
{code}

The pig script is as follows:

{code}
register math.jar;
A = load 'rand.dat' using PigStorage() as (data);
B = foreach A {
    r = math.RANDOMINT(4);
    generate data, r as random, ((r == 3)?1:0) as quarter;
};
dump B;
{code}

The results are as follows:

{code}
(f,0,0)
(g,3,0)
(h,0,0)
(i,2,0)
(j,3,0)
(k,2,0)
(l,0,1)
(m,1,0)
{code}

Observe the inconsistent tuples such as (j,3,0) and (l,0,1): r is referenced both for the "random" column and inside the "quarter" expression, and the two references evaluate the UDF separately, producing different values. Modifying the above script as below solves the issue. The M/R jobs from both scripts are the same; it is just a matter of convenience.

{code}
A = load 'rand.dat' using PigStorage() as (data);
B = foreach A generate data, math.RANDOMINT(4) as r;
C = foreach B generate data, r, ((r == 3)?1:0) as quarter;
dump C;
{code}

Is this issue related to PIG-747?

Viraj

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
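The behaviour falls out of each reference to r being a separate evaluation of a non-deterministic function rather than a single cached value. The same trap is easy to reproduce in shell, with $RANDOM standing in for RANDOMINT (a stand-in chosen for illustration, not the actual UDF):

```shell
#!/bin/bash
# Buggy pattern: each mention draws from the random source independently,
# so "random" and "quarter" can disagree (the (j,3,0) / (l,0,1) symptom).
random=$((RANDOM % 4))
quarter=$(( (RANDOM % 4) == 3 ? 1 : 0 ))   # second, independent draw

# Fixed pattern, mirroring the two-foreach rewrite in the report:
# draw once, then derive every column from the cached value.
r=$((RANDOM % 4))
random_fixed=$r
quarter_fixed=$(( r == 3 ? 1 : 0 ))
echo "$random_fixed $quarter_fixed"
```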
[jira] Created: (PIG-1631) Support to 2 level nested foreach
Support to 2 level nested foreach - Key: PIG-1631 URL: https://issues.apache.org/jira/browse/PIG-1631 Project: Pig Issue Type: New Feature Affects Versions: 0.7.0 Reporter: Viraj Bhat

What I would like to do is generate certain metrics for every listing impression in the context of a page, like clicks on the page etc. So, I first group by to get clicks and impressions together. Now, I would want to iterate through the mini-table (one per serve-id) and compute metrics. Since a nested foreach within a foreach is not supported, I ended up writing a UDF that took both the bags and computed the metric. It would have been elegant to keep the logic of iterating over the records outside in the PIG script. Here is some pseudocode of how I would have liked to write it:

{code}
-- Let us say in our page context there was a click on rank 2, for which there were 3 ads
A1 = LOAD '...' AS (page_id, rank); -- clicks
A2 = LOAD '...' AS (page_id, rank); -- impressions
B = COGROUP A1 by (page_id), A2 by (page_id);
-- Let us say B contains the following schema
-- (group, {(A1...)} {(A2...)})
-- Each record in B would be:
-- page_id_1, {(page_id_1, 2)} {(page_id_1, 1) (page_id_1, 2) (page_id_1, 3)}
C = FOREACH B GENERATE {
    D = FLATTEN(A1), FLATTEN(A2); -- This won't work in current pig as well.
    -- Basically, I would like a mini-table which represents an entire serve.
    FOREACH D GENERATE page_id_1, A2::rank, SOMEUDF(A1::rank, A2::rank);
    -- This UDF returns a value (like v1, v2, v3 depending on A1::rank and A2::rank)
};
-- output:
-- page_id, 1, v1
-- page_id, 2, v2
-- page_id, 3, v3
DUMP C;
{code}

P.S: I understand that I could have alternatively flattened the fields of B, then done a GROUP on page_id and then iterated through the records calling 'SOMEUDF' appropriately, but that would be 2 map-reduce operations AFAIK.

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-1630) Support param_files to be loaded into HDFS
Support param_files to be loaded into HDFS -- Key: PIG-1630 URL: https://issues.apache.org/jira/browse/PIG-1630 Project: Pig Issue Type: New Feature Affects Versions: 0.7.0 Reporter: Viraj Bhat

I want to place the parameters of a Pig script in a param_file. But instead of this file being in the local file system where I run my java command, I want this to be on HDFS:

{code}
$ java -cp pig.jar org.apache.pig.Main -param_file hdfs://namenode/paramfile myscript.pig
{code}

Viraj

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1615) Return code from Pig is 0 even if the job fails when using -M flag
[ https://issues.apache.org/jira/browse/PIG-1615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12910414#action_12910414 ] Viraj Bhat commented on PIG-1615: -

I tested this on Pig 0.8, but with a downloaded version which was a little old. I re-downloaded the latest source; this seems to be fixed.

Viraj

> Return code from Pig is 0 even if the job fails when using -M flag
> --
>
> Key: PIG-1615
> URL: https://issues.apache.org/jira/browse/PIG-1615
> Project: Pig
> Issue Type: Bug
> Affects Versions: 0.6.0, 0.7.0
> Reporter: Viraj Bhat
> Fix For: 0.8.0
>
> I have a Pig script of this form, which I used inside a workflow system such as Oozie.
> {code}
> A = load '$INPUT' using PigStorage();
> store A into '$OUTPUT';
> {code}
> I run this with Multi-query optimization turned off:
> {quote}
> $ java -cp ~/pig-svn/trunk/pig.jar:$HADOOP_CONF_DIR org.apache.pig.Main -p INPUT=/user/viraj/junk1 -M -p OUTPUT=/user/viraj/junk2 loadpigstorage.pig
> {quote}
> The directory "/user/viraj/junk1" is not present, and I get the following results:
> {quote}
> Input(s):
> Failed to read data from "/user/viraj/junk1"
> Output(s):
> Failed to produce result in "/user/viraj/junk2"
> {quote}
> This is expected, but the return code is still 0:
> {code}
> $ echo $?
> 0
> {code}
> If I run this script with Multi-query optimization turned on, it gives a return code of 2, which is correct.
> {code}
> $ java -cp ~/pig-svn/trunk/pig.jar:$HADOOP_CONF_DIR org.apache.pig.Main -p INPUT=/user/viraj/junk1 -p OUTPUT=/user/viraj/junk2 loadpigstorage.pig
> ...
> $ echo $?
> 2
> {code}
> I believe the wrong return code from Pig is causing Oozie to believe that the Pig script succeeded.
> Viraj

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-1615) Return code from Pig is 0 even if the job fails when using -M flag
Return code from Pig is 0 even if the job fails when using -M flag -- Key: PIG-1615 URL: https://issues.apache.org/jira/browse/PIG-1615 Project: Pig Issue Type: Bug Affects Versions: 0.7.0, 0.6.0 Reporter: Viraj Bhat Fix For: 0.8.0

I have a Pig script of this form, which I used inside a workflow system such as Oozie.

{code}
A = load '$INPUT' using PigStorage();
store A into '$OUTPUT';
{code}

I run this with Multi-query optimization turned off:

{quote}
$ java -cp ~/pig-svn/trunk/pig.jar:$HADOOP_CONF_DIR org.apache.pig.Main -p INPUT=/user/viraj/junk1 -M -p OUTPUT=/user/viraj/junk2 loadpigstorage.pig
{quote}

The directory "/user/viraj/junk1" is not present, and I get the following results:

{quote}
Input(s):
Failed to read data from "/user/viraj/junk1"
Output(s):
Failed to produce result in "/user/viraj/junk2"
{quote}

This is expected, but the return code is still 0:

{code}
$ echo $?
0
{code}

If I run this script with Multi-query optimization turned on, it gives a return code of 2, which is correct.

{code}
$ java -cp ~/pig-svn/trunk/pig.jar:$HADOOP_CONF_DIR org.apache.pig.Main -p INPUT=/user/viraj/junk1 -p OUTPUT=/user/viraj/junk2 loadpigstorage.pig
...
$ echo $?
2
{code}

I believe the wrong return code from Pig is causing Oozie to believe that the Pig script succeeded.

Viraj

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
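The reason the exit code matters: a workflow engine decides success or failure solely from the exit status of the launched command. A minimal sketch of that decision (the `sh -c 'exit 2'` is a stand-in for the Pig invocation, not Oozie's actual launcher code); with the -M bug, the failing run would return 0 and take the "succeeded" branch:

```shell
#!/bin/sh
# Stand-in for: java -cp pig.jar org.apache.pig.Main -M ... script.pig
sh -c 'exit 2'
rc=$?

# This is essentially all a workflow engine can see of the job.
if [ "$rc" -eq 0 ]; then
  echo "succeeded"
else
  echo "failed with rc=$rc"
fi
```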
[jira] Updated: (PIG-282) Custom Partitioner
[ https://issues.apache.org/jira/browse/PIG-282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Viraj Bhat updated PIG-282: ---

Release Note:
This feature allows specifying a Hadoop Partitioner for the following operations: GROUP/COGROUP, CROSS, DISTINCT, JOIN (except 'skewed' join). The Partitioner controls the partitioning of the keys of the intermediate map-outputs. See http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/Partitioner.html for more details.

To use this feature, add a PARTITION BY clause to the appropriate operator:

{code}
A = load 'input_data';
B = group A by $0 PARTITION BY org.apache.pig.test.utils.SimpleCustomPartitioner parallel 2;
{code}

Here is the code for SimpleCustomPartitioner:

{code}
public class SimpleCustomPartitioner extends Partitioner<PigNullableWritable, Writable> {
    //@Override
    public int getPartition(PigNullableWritable key, Writable value, int numPartitions) {
        if (key.getValueAsPigType() instanceof Integer) {
            int ret = (((Integer) key.getValueAsPigType()).intValue() % numPartitions);
            return ret;
        }
        else {
            return (key.hashCode()) % numPartitions;
        }
    }
}
{code}

was:
This feature allows specifying a Hadoop Partitioner for the following operations: GROUP/COGROUP, CROSS, DISTINCT, JOIN (except 'skewed' join). The Partitioner controls the partitioning of the keys of the intermediate map-outputs. See http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/Partitioner.html for more details.

To use this feature, add a PARTITION BY clause to the appropriate operator:

{code}
A = load 'input_data';
B = group A by $0 PARTITION BY org.apache.pig.test.utils.SimpleCustomPartitioner parallel 2;
{code}

Here is the code for SimpleCustomPartitioner:

{code}
public class SimpleCustomPartitioner extends Partitioner<PigNullableWritable, Writable> {
    //@Override
    public int getPartition(PigNullableWritable key, Writable value, int numPartitions) {
        if (key.getValueAsPigType() instanceof Integer) {
            int ret = (((Integer) key.getValueAsPigType()).intValue() % numPartitions);
            return ret;
        }
        else {
            return (key.hashCode()) % numPartitions;
        }
    }
}
{code}

> Custom Partitioner
> --
>
> Key: PIG-282
> URL: https://issues.apache.org/jira/browse/PIG-282
> Project: Pig
> Issue Type: New Feature
> Affects Versions: 0.7.0
> Reporter: Amir Youssefi
> Assignee: Aniket Mokashi
> Priority: Minor
> Fix For: 0.8.0
>
> Attachments: CustomPartitioner.patch, CustomPartitionerFinale.patch, CustomPartitionerTest.patch
>
> By adding a custom partitioner we can give control over which output partition a key (/value) goes to. We can add keywords to the language, e.g.
> PARTITION BY UDF(...)
> or a similar syntax. The UDF returns a number between 0 and n-1 where n is the number of output partitions.

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
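One caveat worth noting about the sample partitioner (an editorial observation, not from the release note): Java's % operator keeps the sign of the dividend, so key.hashCode() % numPartitions can be negative for keys with negative hash codes, while Hadoop expects getPartition to return a value in [0, numPartitions). Shell arithmetic follows the same sign convention, so the pitfall and the usual normalization can be sketched as:

```shell
#!/bin/sh
# Like Java, shell arithmetic gives the remainder the sign of the dividend.
hash=-5
n=3
echo "plain modulo:      $((hash % n))"             # -2: out of [0, n), invalid partition
echo "normalized modulo: $(( (hash % n + n) % n ))" # 1: always in [0, n)
```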
[jira] Updated: (PIG-1586) Parameter substitution using -param option runs into problems when substituting entire pig statements in a shell script (maybe this is a bash problem)
[ https://issues.apache.org/jira/browse/PIG-1586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Viraj Bhat updated PIG-1586:

Description:
I have a Pig script as a template:

{code}
register Countwords.jar;
A = $INPUT;
B = FOREACH A GENERATE examples.udf.SubString($0,0,1), $1 as num;
C = GROUP B BY $0;
D = FOREACH C GENERATE group, SUM(B.num);
STORE D INTO $OUTPUT;
{code}

I attempt to do parameter substitution using the following shell script:

{code}
#!/bin/bash
java -cp ~/pig-svn/trunk/pig.jar:$HADOOP_CONF_DIR org.apache.pig.Main -r -file sub.pig \
  -param INPUT="(foreach (COGROUP(load '/user/viraj/dataset1' USING PigStorage() AS (word:chararray,num:int)) by (word),(load '/user/viraj/dataset2' USING PigStorage() AS (word:chararray,num:int)) by (word)) generate flatten(examples.udf.CountWords(\\$0,\\$1,\\$2)))" \
  -param OUTPUT="\'/user/viraj/output\' USING PigStorage()"
{code}

The substituted script comes out as:

{code}
register Countwords.jar;
A = (foreach (COGROUP(load '/user/viraj/dataset1' USING PigStorage() AS (word:chararray,num:int)) by (word),(load '/user/viraj/dataset2' USING PigStorage() AS (word:chararray,num:int)) by (word)) generate flatten(examples.udf.CountWords(runsub.sh,,)));
B = FOREACH A GENERATE examples.udf.SubString($0,0,1), $1 as num;
C = GROUP B BY $0;
D = FOREACH C GENERATE group, SUM(B.num);
STORE D INTO /user/viraj/output;
{code}

The shell substitutes the $0 before passing it to java.
a) Is there a workaround for this?
b) Is this a Pig param problem?

Viraj

was:
I have a Pig script as a template:

{code}
register Countwords.jar;
A = $INPUT;
B = FOREACH A GENERATE examples.udf.SubString($0,0,1), $1 as num;
C = GROUP B BY $0;
D = FOREACH C GENERATE group, SUM(B.num);
STORE D INTO $OUTPUT;
{code}

I attempt to do parameter substitution using the following shell script:

{code}
#!/bin/bash
java -cp ~/pig-svn/trunk/pig.jar:$HADOOP_CONF_DIR org.apache.pig.Main -r -file sub.pig \
  -param INPUT="(foreach (COGROUP(load '/user/viraj/dataset1' USING PigStorage() AS (word:chararray,num:int)) by (word),(load '/user/viraj/dataset2' USING PigStorage() AS (word:chararray,num:int)) by (word)) generate flatten(examples.udf.CountWords(\\$0,\\$1,\\$2)))" \
  -param OUTPUT="\'/user/viraj/output\' USING PigStorage()"
{code}

The substituted script comes out as:

{code}
register Countwords.jar;
A = (foreach (COGROUP(load '/user/viraj/dataset1' USING PigStorage() AS (word:chararray,num:int)) by (word),(load '/user/viraj/dataset2' USING PigStorage() AS (word:chararray,num:int)) by (word)) generate flatten(examples.udf.CountWords(runsub.sh,,)));
B = FOREACH A GENERATE examples.udf.SubString($0,0,1), $1 as num;
C = GROUP B BY $0;
D = FOREACH C GENERATE group, SUM(B.num);
STORE D INTO /user/viraj/output;
{code}

The shell substitutes the $0 before passing it to java.
a) Is there a workaround for this?
b) Is this a Pig param problem?

Viraj

> Parameter substitution using -param option runs into problems when substituting entire pig statements in a shell script (maybe this is a bash problem)
>
> Key: PIG-1586
> URL: https://issues.apache.org/jira/browse/PIG-1586
> Project: Pig
> Issue Type: Bug
> Affects Versions: 0.8.0
> Reporter: Viraj Bhat
>
> I have a Pig script as a template:
> {code}
> register Countwords.jar;
> A = $INPUT;
> B = FOREACH A GENERATE
> examples.udf.SubString($0,0,1),
> $1 as num;
> C = GROUP B BY $0;
> D = FOREACH C GENERATE group, SUM(B.num);
> STORE D INTO $OUTPUT;
> {code}
> I attempt to do parameter substitution using the following shell script:
> {code}
> #!/bin/bash
> java -cp ~/pig-svn/trunk/pig.jar:$HADOOP_CONF_DIR org.apache.pig.Main -r -file sub.pig \
> -param INPUT="(foreach (COGROUP(load '/user/viraj/dataset1' USING PigStorage() AS (word:chararray,num:int)) by (word),(load '/user/viraj/dataset2' USING PigStorage() AS (word:chararray,num:int)) by (word)) generate flatten(examples.udf.CountWords(\\$0,\\$1,\\$2)))" \
> -param OUTPUT="\'/user/viraj/output\' USING PigStorage()"
> {code}
> {code}
> register Countwords.jar;
> A = (foreach (COGROUP(load '/user/viraj/dataset1' USING PigStorage() AS (word:chararray,num:int)) by (word),(load '/user/viraj/dataset2' USING PigStorage() AS (word:chararray,num:int)) by (word)) generate flatten(examples.udf.CountWords(runsub.sh,,)));
> B = FOREACH A GENERATE
> examples.udf.SubString($0,0,1),
> $1 as num;
> C = GROUP B BY $0;
> D = FOREACH C GENERATE group, SUM(B.num);
> STORE D INTO /user/viraj/output;
> {code}
> The shell substitutes the $0 before passing it to java.
> a) Is there a workaround for this?
> b) Is this a Pig param problem?
> Viraj

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-1586) Parameter substitution using -param option runs into problems when substituting entire pig statements in a shell script (maybe this is a bash problem)
Parameter substitution using -param option runs into problems when substituting entire pig statements in a shell script (maybe this is a bash problem) Key: PIG-1586 URL: https://issues.apache.org/jira/browse/PIG-1586 Project: Pig Issue Type: Bug Affects Versions: 0.8.0 Reporter: Viraj Bhat

I have a Pig script as a template:

{code}
register Countwords.jar;
A = $INPUT;
B = FOREACH A GENERATE examples.udf.SubString($0,0,1), $1 as num;
C = GROUP B BY $0;
D = FOREACH C GENERATE group, SUM(B.num);
STORE D INTO $OUTPUT;
{code}

I attempt to do parameter substitution using the following shell script:

{code}
#!/bin/bash
java -cp ~/pig-svn/trunk/pig.jar:$HADOOP_CONF_DIR org.apache.pig.Main -r -file sub.pig \
  -param INPUT="(foreach (COGROUP(load '/user/viraj/dataset1' USING PigStorage() AS (word:chararray,num:int)) by (word),(load '/user/viraj/dataset2' USING PigStorage() AS (word:chararray,num:int)) by (word)) generate flatten(examples.udf.CountWords(\\$0,\\$1,\\$2)))" \
  -param OUTPUT="\'/user/viraj/output\' USING PigStorage()"
{code}

The substituted script comes out as:

{code}
register Countwords.jar;
A = (foreach (COGROUP(load '/user/viraj/dataset1' USING PigStorage() AS (word:chararray,num:int)) by (word),(load '/user/viraj/dataset2' USING PigStorage() AS (word:chararray,num:int)) by (word)) generate flatten(examples.udf.CountWords(runsub.sh,,)));
B = FOREACH A GENERATE examples.udf.SubString($0,0,1), $1 as num;
C = GROUP B BY $0;
D = FOREACH C GENERATE group, SUM(B.num);
STORE D INTO /user/viraj/output;
{code}

The shell substitutes the $0 before passing it to java.
a) Is there a workaround for this?
b) Is this a Pig param problem?

Viraj

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
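The root cause is shell quoting, not Pig: inside double quotes the shell still expands positional parameters, so $0 becomes the script's own name (runsub.sh) before java ever runs, and the `\\$0` escaping yields a backslash plus the expanded value. Single quotes suppress all expansion, which suggests a workaround. A minimal demonstration of the quoting difference (the CountWords expression is reused from the report purely as a sample string):

```shell
#!/bin/sh
# Double quotes: $0 is expanded by the shell before the value is passed on.
expanded="flatten(examples.udf.CountWords($0,$1,$2))"

# Single quotes: the literal $0/$1/$2 survive and would reach Pig intact.
literal='flatten(examples.udf.CountWords($0,$1,$2))'

echo "double-quoted: $expanded"
echo "single-quoted: $literal"
```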
[jira] Created: (PIG-1576) Difference in Semantics between Load statement in Pig and HDFS client on Command line
Difference in Semantics between Load statement in Pig and HDFS client on Command line - Key: PIG-1576 URL: https://issues.apache.org/jira/browse/PIG-1576 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.7.0, 0.6.0 Reporter: Viraj Bhat

Here is my directory structure on HDFS which I want to access using Pig. This is a sample, but in the real use case I have more than 100 of these directories.

{code}
$ hadoop fs -ls /user/viraj/recursive/
Found 3 items
drwxr-xr-x - viraj supergroup 0 2010-08-26 11:25 /user/viraj/recursive/20080615
drwxr-xr-x - viraj supergroup 0 2010-08-26 11:25 /user/viraj/recursive/20080616
drwxr-xr-x - viraj supergroup 0 2010-08-26 11:25 /user/viraj/recursive/20080617
{code}

Using the command line I can access them using a variety of options:

{code}
$ hadoop fs -ls /user/viraj/recursive/{200806}{15..17}/
-rw-r--r-- 1 viraj supergroup 5791 2010-08-26 11:25 /user/viraj/recursive/20080615/kv2.txt
-rw-r--r-- 1 viraj supergroup 5791 2010-08-26 11:25 /user/viraj/recursive/20080616/kv2.txt
-rw-r--r-- 1 viraj supergroup 5791 2010-08-26 11:25 /user/viraj/recursive/20080617/kv2.txt
$ hadoop fs -ls /user/viraj/recursive/{20080615..20080617}/
-rw-r--r-- 1 viraj supergroup 5791 2010-08-26 11:25 /user/viraj/recursive/20080615/kv2.txt
-rw-r--r-- 1 viraj supergroup 5791 2010-08-26 11:25 /user/viraj/recursive/20080616/kv2.txt
-rw-r--r-- 1 viraj supergroup 5791 2010-08-26 11:25 /user/viraj/recursive/20080617/kv2.txt
{code}

I have written a Pig script; none of the combinations of load statements below work:

{code}
--A = load '/user/viraj/recursive/{200806}{15..17}/' using PigStorage('\u0001') as (k:int, v:chararray);
A = load '/user/viraj/recursive/{20080615..20080617}/' using PigStorage('\u0001') as (k:int, v:chararray);
AL = limit A 10;
dump AL;
{code}

I get the following error in Pig 0.8:

{noformat}
2010-08-27 16:34:27,704 [main] ERROR org.apache.pig.tools.pigstats.PigStatsUtil - 1 map reduce job(s) failed!
2010-08-27 16:34:27,711 [main] INFO org.apache.pig.tools.pigstats.PigStats - Script Statistics:
HadoopVersion PigVersion UserId StartedAt FinishedAt Features
0.20.2 0.8.0-SNAPSHOT viraj 2010-08-27 16:34:24 2010-08-27 16:34:27 LIMIT
Failed!
Failed Jobs:
JobId Alias Feature Message Outputs
N/A A,AL Message: org.apache.pig.backend.executionengine.ExecException: ERROR 2118: Unable to create input splits for: /user/viraj/recursive/{20080615..20080617}/
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:279)
    at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:885)
    at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:779)
    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
    at org.apache.hadoop.mapred.jobcontrol.Job.submit(Job.java:378)
    at org.apache.hadoop.mapred.jobcontrol.JobControl.startReadyJobs(JobControl.java:247)
    at org.apache.hadoop.mapred.jobcontrol.JobControl.run(JobControl.java:279)
    at java.lang.Thread.run(Thread.java:619)
Caused by: org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input Pattern hdfs://localhost:9000/user/viraj/recursive/{20080615..20080617} matches 0 files
    at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:224)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigTextInputFormat.listStatus(PigTextInputFormat.java:36)
    at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:241)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:268)
    ... 7 more
hdfs://localhost:9000/tmp/temp241388470/tmp987803889,
{noformat}

The following works:

{code}
A = load '/user/viraj/recursive/{200806}{15,16,17}/' using PigStorage('\u0001') as (k:int, v:chararray);
AL = limit A 10;
dump AL;
{code}

Why is there an inconsistency between the HDFS client and Pig?

Viraj

-- This message is automatically generated by JIRA.
- You can reply to this email to add a comment to the issue online.
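A likely explanation for the apparent inconsistency (an editorial note, not from the issue): on the command line, bash performs brace-range expansion before hadoop runs, so `hadoop fs -ls` receives three already-expanded paths. Pig, by contrast, receives the pattern as a literal string inside quotes, and Hadoop's own glob syntax supports comma alternation {a,b,c} but not ranges {a..b}, hence "matches 0 files". Who does the expansion can be shown directly (run under bash; paths shortened for illustration):

```shell
#!/bin/bash
# bash expands both brace forms before the command ever runs:
echo {20080615..20080617}     # range expansion
echo {20080615,20080616,20080617}  # comma alternation

# quoted, nothing is expanded -- this literal is what Pig's load clause sees,
# and Hadoop's glob matcher does not understand the '..' range form:
echo '{20080615..20080617}'
```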
[jira] Created: (PIG-1561) XMLLoader in Piggybank does not support bz2 or gzip compressed XML files
XMLLoader in Piggybank does not support bz2 or gzip compressed XML files Key: PIG-1561 URL: https://issues.apache.org/jira/browse/PIG-1561 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.7.0 Reporter: Viraj Bhat

I have a simple Pig script which uses the XMLLoader after the Piggybank is built:

{code}
register piggybank.jar;
A = load '/user/viraj/capacity-scheduler.xml.gz' using org.apache.pig.piggybank.storage.XMLLoader('property') as (docs:chararray);
B = limit A 1;
dump B;
--store B into '/user/viraj/handlegz' using PigStorage();
{code}

This returns an empty tuple:

{code}
()
{code}

If you supply the uncompressed XML file, you get:

{code}
(<property>
  <name>mapred.capacity-scheduler.queue.my.capacity</name>
  <value>10</value>
  <description>Percentage of the number of slots in the cluster that are guaranteed to be available for jobs in this queue.</description>
</property>)
{code}

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-1547) Piggybank MultiStorage does not scale when processing around 7k records per bucket
Piggybank MultiStorage does not scale when processing around 7k records per bucket -- Key: PIG-1547 URL: https://issues.apache.org/jira/browse/PIG-1547 Project: Pig Issue Type: Bug Affects Versions: 0.7.0 Reporter: Viraj Bhat

I am trying to use the MultiStorage piggybank UDF:

{code}
register pig-svn/trunk/contrib/piggybank/java/piggybank.jar;
A = load '/user/viraj/largebucketinput.txt' using PigStorage('\u0001') as (a,b,c);
STORE A INTO '/user/viraj/multistore' USING org.apache.pig.piggybank.storage.MultiStorage('/user/viraj/multistore', '1', 'none', '\u0001');
{code}

The file "largebucketinput.txt" is around 85MB in size; "b" takes 512 values from 0-511, and each value of b (a bucket) contains 7k records.

a) On a multi-node hadoop installation: the above Pig script, which spawns a single map-only job, does not succeed and is killed by the TT for running above the memory limit.

== Message ==
TaskTree [pid=24584,tipID=attempt_201008110143_101976_m_00_0] is running beyond memory-limits. Current usage : 1661034496bytes. Limit : 1610612736bytes.
== Message ==

We tried increasing the Map slots but it does not succeed.

b) On a single-node hadoop installation: the pig script fails with the following message in the mappers:

{noformat}
2010-08-17 16:37:24,597 INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.io.EOFException
2010-08-17 16:37:24,597 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_7687609983190239805_126509
2010-08-17 16:37:30,601 INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.io.EOFException
2010-08-17 16:37:30,601 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_2734778934507357565_126509
2010-08-17 16:37:36,606 INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.io.EOFException
2010-08-17 16:37:36,606 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_-1293917224803067377_126509
2010-08-17 16:37:42,611 INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.io.EOFException
2010-08-17 16:37:42,611 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_-2272713260404734116_126509
2010-08-17 16:37:48,614 WARN org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception: java.io.IOException: Unable to create new block.
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2781)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2046)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2232)
2010-08-17 16:37:48,614 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_-2272713260404734116_126509 bad datanode[0] nodes == null
2010-08-17 16:37:48,614 WARN org.apache.hadoop.hdfs.DFSClient: Could not get block locations. Source file "/user/viraj/multistore/_temporary/_attempt_201005141440_0178_m_01_0/444/444-1" - Aborting...
2010-08-17 16:37:48,619 WARN org.apache.hadoop.mapred.TaskTracker: Error running child java.io.EOFException
    at java.io.DataInputStream.readByte(DataInputStream.java:250)
    at org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java:298)
    at org.apache.hadoop.io.WritableUtils.readVInt(WritableUtils.java:319)
    at org.apache.hadoop.io.Text.readString(Text.java:400)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.createBlockOutputStream(DFSClient.java:2837)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2762)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2046)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2232)
2010-08-17 16:37:48,622 INFO org.apache.hadoop.mapred.TaskRunner: Runnning cleanup for the task
{noformat}

Need to investigate more.

Viraj

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1537) Column pruner causes wrong results when using both Custom Store UDF and PigStorage
[ https://issues.apache.org/jira/browse/PIG-1537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12895858#action_12895858 ] Viraj Bhat commented on PIG-1537: -

Hi Olga, I have given the specific script with UDFs to Daniel to test. Thanks Daniel for your help. The script which does not use the Column Pruner optimization, or disables it using -t, gives correct results.

Viraj

> Column pruner causes wrong results when using both Custom Store UDF and PigStorage
> --
>
> Key: PIG-1537
> URL: https://issues.apache.org/jira/browse/PIG-1537
> Project: Pig
> Issue Type: Bug
> Affects Versions: 0.7.0
> Reporter: Viraj Bhat
> Assignee: Daniel Dai
> Fix For: 0.8.0
>
> I have a script which is of this pattern and it uses 2 StoreFuncs:
> {code}
> register loader.jar
> register piggy-bank/java/build/storage.jar;
> %DEFAULT OUTPUTDIR /user/viraj/prunecol/
> ss_sc_0 = LOAD '/data/click/20100707/0' USING Loader() AS (a, b, c);
> ss_sc_filtered_0 = FILTER ss_sc_0 BY
> a#'id' matches '1.*' OR
> a#'id' matches '2.*' OR
> a#'id' matches '3.*' OR
> a#'id' matches '4.*';
> ss_sc_1 = LOAD '/data/click/20100707/1' USING Loader() AS (a, b, c);
> ss_sc_filtered_1 = FILTER ss_sc_1 BY
> a#'id' matches '65.*' OR
> a#'id' matches '466.*' OR
> a#'id' matches '043.*' OR
> a#'id' matches '044.*' OR
> a#'id' matches '0650.*' OR
> a#'id' matches '001.*';
> ss_sc_all = UNION ss_sc_filtered_0,ss_sc_filtered_1;
> ss_sc_all_proj = FOREACH ss_sc_all GENERATE
> a#'query' as query,
> a#'testid' as testid,
> a#'timestamp' as timestamp,
> a, b, c;
> ss_sc_all_ord = ORDER ss_sc_all_proj BY query,testid,timestamp PARALLEL 10;
> ss_sc_all_map = FOREACH ss_sc_all_ord GENERATE a, b, c;
> STORE ss_sc_all_map INTO '$OUTPUTDIR/data/20100707' using Storage();
> ss_sc_all_map_count = group ss_sc_all_map all;
> count = FOREACH ss_sc_all_map_count GENERATE 'record_count' as record_count, COUNT($1);
> STORE count INTO '$OUTPUTDIR/count/20100707' using PigStorage('\u0009');
> {code}
>
> I run this script using:
> a) java -cp pig0.7.jar script.pig
> b) java -cp pig0.7.jar -t PruneColumns script.pig
> What I observe is that the alias "count" produces the same number of records but "ss_sc_all_map" has different sizes when run with the above 2 options.
> Is this due to the fact that there are 2 StoreFuncs used?
> Viraj

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-1537) Column pruner causes wrong results when using both Custom Store UDF and PigStorage
Column pruner causes wrong results when using both Custom Store UDF and PigStorage -- Key: PIG-1537 URL: https://issues.apache.org/jira/browse/PIG-1537 Project: Pig Issue Type: Bug Affects Versions: 0.7.0 Reporter: Viraj Bhat

I have a script which is of this pattern and it uses 2 StoreFuncs:

{code}
register loader.jar
register piggy-bank/java/build/storage.jar;

%DEFAULT OUTPUTDIR /user/viraj/prunecol/

ss_sc_0 = LOAD '/data/click/20100707/0' USING Loader() AS (a, b, c);
ss_sc_filtered_0 = FILTER ss_sc_0 BY
    a#'id' matches '1.*' OR
    a#'id' matches '2.*' OR
    a#'id' matches '3.*' OR
    a#'id' matches '4.*';

ss_sc_1 = LOAD '/data/click/20100707/1' USING Loader() AS (a, b, c);
ss_sc_filtered_1 = FILTER ss_sc_1 BY
    a#'id' matches '65.*' OR
    a#'id' matches '466.*' OR
    a#'id' matches '043.*' OR
    a#'id' matches '044.*' OR
    a#'id' matches '0650.*' OR
    a#'id' matches '001.*';

ss_sc_all = UNION ss_sc_filtered_0,ss_sc_filtered_1;
ss_sc_all_proj = FOREACH ss_sc_all GENERATE
    a#'query' as query,
    a#'testid' as testid,
    a#'timestamp' as timestamp,
    a, b, c;

ss_sc_all_ord = ORDER ss_sc_all_proj BY query,testid,timestamp PARALLEL 10;
ss_sc_all_map = FOREACH ss_sc_all_ord GENERATE a, b, c;
STORE ss_sc_all_map INTO '$OUTPUTDIR/data/20100707' using Storage();

ss_sc_all_map_count = group ss_sc_all_map all;
count = FOREACH ss_sc_all_map_count GENERATE 'record_count' as record_count, COUNT($1);
STORE count INTO '$OUTPUTDIR/count/20100707' using PigStorage('\u0009');
{code}

I run this script using:
a) java -cp pig0.7.jar script.pig
b) java -cp pig0.7.jar -t PruneColumns script.pig

What I observe is that the alias "count" produces the same number of records but "ss_sc_all_map" has different sizes when run with the above 2 options. Is this due to the fact that there are 2 StoreFuncs used?

Viraj

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1537) Column pruner causes wrong results when using both Custom Store UDF and PigStorage
[ https://issues.apache.org/jira/browse/PIG-1537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Viraj Bhat updated PIG-1537: Description: I have script which is of this pattern and it uses 2 StoreFunc's: {code} register loader.jar register piggy-bank/java/build/storage.jar; %DEFAULT OUTPUTDIR /user/viraj/prunecol/ ss_sc_0 = LOAD '/data/click/20100707/0' USING Loader() AS (a, b, c); ss_sc_filtered_0 = FILTER ss_sc_0 BY a#'id' matches '1.*' OR a#'id' matches '2.*' OR a#'id' matches '3.*' OR a#'id' matches '4.*'; ss_sc_1 = LOAD '/data/click/20100707/1' USING Loader() AS (a, b, c); ss_sc_filtered_1 = FILTER ss_sc_1 BY a#'id' matches '65.*' OR a#'id' matches '466.*' OR a#'id' matches '043.*' OR a#'id' matches '044.*' OR a#'id' matches '0650.*' OR a#'id' matches '001.*'; ss_sc_all = UNION ss_sc_filtered_0,ss_sc_filtered_1; ss_sc_all_proj = FOREACH ss_sc_all GENERATE a#'query' as query, a#'testid' as testid, a#'timestamp' as timestamp, a, b, c; ss_sc_all_ord = ORDER ss_sc_all_proj BY query,testid,timestamp PARALLEL 10; ss_sc_all_map = FOREACH ss_sc_all_ord GENERATE a, b, c; STORE ss_sc_all_map INTO '$OUTPUTDIR/data/20100707' using Storage(); ss_sc_all_map_count = group ss_sc_all_map all; count = FOREACH ss_sc_all_map_count GENERATE 'record_count' as record_count,COUNT($1); STORE count INTO '$OUTPUTDIR/count/20100707' using PigStorage('\u0009'); {code} I run this script using: a) java -cp pig0.7.jar script.pig b) java -cp pig0.7.jar -t PruneColumns script.pig What I observe is that the alias "count" produces the same number of records but "ss_sc_all_map" have different sizes when run with above 2 options. Is due to the fact that there are 2 store func's used? 
Viraj was: I have script which is of this pattern and it uses 2 StoreFunc's: {code} register loader.jar register piggy-bank/java/build/storage.jar; %DEFAULT OUTPUTDIR /user/viraj/prunecol/ ss_sc_0 = LOAD '/data/click/20100707/0' USING Loader() AS (a, b, c); ss_sc_filtered_0 = FILTER ss_sc_0 BY a#'id' matches '1.*' OR a#'id' matches '2.*' OR a#'id' matches '3.*' OR a#'id' matches '4.*'; ss_sc_1 = LOAD '/data/click/20100707/1' USING Loader() AS (a, b, c); ss_sc_filtered_1 = FILTER ss_sc_1 BY a#'id' matches '65.*' OR a#'id' matches '466.*' OR a#'id' matches '043.*' OR a#'id' matches '044.*' OR a#'id' matches '0650.*' OR a#'id' matches '001.*'; ss_sc_all = UNION ss_sc_filtered_0,ss_sc_filtered_1; ss_sc_all_proj = FOREACH ss_sc_all GENERATE a#'query' as query, a#'testid' as testid, a#'timestamp' as timestamp, a, b, c; ss_sc_all_ord = ORDER ss_sc_all_proj BY query,testid,timestamp PARALLEL 10; ss_sc_all_map = FOREACH ss_sc_all_ord GENERATE a, b, c; STORE ss_sc_all_map INTO '$OUTPUTDIR/data/20100707' using Storage(); ss_sc_all_map_count = group ss_sc_all_map all; count = FOREACH ss_sc_all_map_count GENERATE 'record_count' as record_count,COUNT($1); STORE count INTO '$OUTPUTDIR/count/20100707' using PigStorage('\u0009'); I run this script using: a) java -cp pig0.7.jar script.pig b) java -cp pig0.7.jar -t PruneColumns script.pig What I observe is that the alias "count" produces the same number of records but "ss_sc_all_map" have different sizes when run with above 2 options. Is due to the fact that there are 2 store func's used? 
Viraj > Column pruner causes wrong results when using both Custom Store UDF and > PigStorage > -- > > Key: PIG-1537 > URL: https://issues.apache.org/jira/browse/PIG-1537 > Project: Pig > Issue Type: Bug >Affects Versions: 0.7.0 >Reporter: Viraj Bhat > > I have script which is of this pattern and it uses 2 StoreFunc's: > {code} > register loader.jar > register piggy-bank/java/build/storage.jar; > %DEFAULT OUTPUTDIR /user/viraj/prunecol/ > ss_sc_0 = LOAD '/data/click/20100707/0' USING Loader() AS (a, b, c); > ss_sc_filtered_0 = FILTER ss_sc_0 BY > a#'id' matches '1.*' OR > a#'id' matches '2.*' OR > a#'id' matches '3.*' OR > a#'id' matches '
[jira] Created: (PIG-1529) Equating aliases does not work (B = A)
Equating aliases does not work (B = A) -- Key: PIG-1529 URL: https://issues.apache.org/jira/browse/PIG-1529 Project: Pig Issue Type: Improvement Affects Versions: 0.7.0 Reporter: Viraj Bhat I wanted to do a self-join on this tab-separated dataset: {code} 1 one 1 uno 2 two 2 dos 3 three 3 tres {code} vi...@machine~/pigscripts >pig -x local script.pig -- since the below does not work {code} A = load 'Adataset.txt' as (key:int, value:chararray); C = join A by key, A by key; dump C; {code} -- I tried the below; it fails with: {code} A = load 'Adataset.txt' as (key:int, value:chararray); B = A; C = join A by key, B by key; dump C; {code} 2010-07-30 23:19:32,789 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1000: Error during parsing. Currently PIG does not support assigning an existing relation (B) to another alias (A) Details at logfile: /homes/viraj/pigscripts/pig_1280531249235.log There is a workaround currently: {code} A = load 'Adataset.txt' as (key:int, value:chararray); B = foreach A generate *; C = join A by key, B by key; dump C; {code} Viraj -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
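Since the workaround above joins two relations with identical field names, the joined output then needs Pig's :: disambiguation prefix. A minimal sketch under the same assumed Adataset.txt input (field names carried over from the report, not verified against any particular Pig release):

{code}
A = load 'Adataset.txt' as (key:int, value:chararray);
B = foreach A generate *;   -- workaround for the unsupported B = A
C = join A by key, B by key;
-- after a self-join, fields must be referenced with the :: prefix
D = foreach C generate A::key, A::value, B::value;
dump D;
{code}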
[jira] Created: (PIG-1528) Enable use of similar aliases when doing a join :(ERROR 1108: Duplicate schema alias:)
Enable use of similar aliases when doing a join :(ERROR 1108: Duplicate schema alias:) -- Key: PIG-1528 URL: https://issues.apache.org/jira/browse/PIG-1528 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.7.0 Reporter: Viraj Bhat I am doing a self join: Input file is tab separated: {code} 1 one 1 uno 2 two 2 dos 3 three 3 tres {code} vi...@machine~/pigscripts >pig -x local script.pig {code} A = load 'Adataset.txt' as (key:int, value:chararray); C = join A by key, A by key; dump C; {code} 2010-07-30 23:09:05,422 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1108: Duplicate schema alias: A::key in "C" Details at logfile: /homes/viraj/pigscripts/pig_1280531249235.log -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
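One possible way around the duplicate schema alias error (a hedged sketch, not from the report; the renamed columns bkey/bvalue are illustrative) is to project one side of the join under fresh names before joining:

{code}
A = load 'Adataset.txt' as (key:int, value:chararray);
-- fresh names prevent A::key appearing twice in C's schema
B = foreach A generate key as bkey, value as bvalue;
C = join A by key, B by bkey;
dump C;
{code}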
[jira] Commented: (PIG-1345) Link casting errors in POCast to actual line numbers in Pig script
[ https://issues.apache.org/jira/browse/PIG-1345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12864963#action_12864963 ] Viraj Bhat commented on PIG-1345: - Richard, thanks for suggesting a workaround. The error message is definitely more verbose than the original one. At least this way the user can know where the cast is an issue, maybe in some addition taking place in the script. This Jira was originally created as a task to correlate exactly on which line "int is implicitly cast to float" occurs, which I believe is hard to do in the current parser as we do not keep track of line numbers. Viraj > Link casting errors in POCast to actual line numbers in Pig script > --- > > Key: PIG-1345 > URL: https://issues.apache.org/jira/browse/PIG-1345 > Project: Pig > Issue Type: Sub-task > Components: impl >Affects Versions: 0.6.0 >Reporter: Viraj Bhat > > For the purpose of easy debugging, it would be nice to find out where my > warnings are coming from in the pig script. > The only known process is to comment out lines in the Pig script and see if > these warnings go away. > 2010-01-13 21:34:13,697 [main] WARN org.apache.pig.PigServer - Encountered > Warning IMPLICIT_CAST_TO_MAP 2 time(s) line 22 > 2010-01-13 21:34:13,698 [main] WARN org.apache.pig.PigServer - Encountered > Warning IMPLICIT_CAST_TO_LONG 2 time(s) line 23 > 2010-01-13 21:34:13,698 [main] WARN org.apache.pig.PigServer - Encountered > Warning IMPLICIT_CAST_TO_BAG 1 time(s). line 26 > I think this may need us to keep track of the line numbers of the Pig script > (via our javacc parser) and maintain them in the logical and physical plan. > It would help users in debugging simple errors/warnings related to casting. > Is this enhancement listed in the http://wiki.apache.org/pig/PigJournal? > Do we need to change the parser to something other than javacc to make this > task simpler?
> "Standardize on Parser and Scanner Technology" > Viraj -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Reopened: (PIG-1378) har url not usable in Pig scripts
[ https://issues.apache.org/jira/browse/PIG-1378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Viraj Bhat reopened PIG-1378: - Pradeep, After rerunning with patch the following revision Apache Pig version 0.8.0-dev (r940560) compiled May 03 2010, 12:22:35 {code} grunt> a = load 'har:///user/viraj/project/dev/subproject/5m/data/201003042355/0/0_1/part-0' using PigStorage('\u0001'); grunt> alimit = limit a 10; grunt> dump alimit; {code} {noformat} 2010-05-04 02:17:22,196 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2118: Unable to create input splits for: har:///user/viraj/project/dev/subproject/5m/data/201003042355/0/0_1/part-0 at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:269) at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:907) at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:801) at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:752) at org.apache.hadoop.mapred.jobcontrol.Job.submit(Job.java:378) at org.apache.hadoop.mapred.jobcontrol.JobControl.startReadyJobs(JobControl.java:247) at org.apache.hadoop.mapred.jobcontrol.JobControl.run(JobControl.java:279) at java.lang.Thread.run(Thread.java:619) Caused by: java.io.IOException: No FileSystem for scheme: myhdfs at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1375) at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66) at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1390) at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:196) at org.apache.hadoop.fs.HarFileSystem.initialize(HarFileSystem.java:104) at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1378) at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:193) at org.apache.hadoop.fs.Path.getFileSystem(Path.java:175) at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:208) at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigTextInputFormat.listStatus(PigTextInputFormat.java:36) at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:246) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:258) ... 7 more {noformat} Is this a problem with Hadoop/Pig? > har url not usable in Pig scripts > - > > Key: PIG-1378 > URL: https://issues.apache.org/jira/browse/PIG-1378 > Project: Pig > Issue Type: Bug > Components: impl >Affects Versions: 0.7.0 >Reporter: Viraj Bhat >Assignee: Pradeep Kamath > Fix For: 0.8.0 > > Attachments: PIG-1378-2.patch, PIG-1378-3.patch, PIG-1378-4.patch, > PIG-1378.patch > > > I am trying to use har (Hadoop Archives) in my Pig script. > I can use them through the HDFS shell > {noformat} > $hadoop fs -ls 'har:///user/viraj/project/subproject/files/size/data' > Found 1 items > -rw--- 5 viraj users1537234 2010-04-14 09:49 > user/viraj/project/subproject/files/size/data/part-1 > {noformat} > Using similar URL's in grunt yields > {noformat} > grunt> a = load 'har:///user/viraj/project/subproject/files/size/data'; > grunt> dump a; > {noformat} > {noformat} > 2010-04-14 22:08:48,814 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR > 2998: Unhandled internal error. > org.apache.pig.impl.logicalLayer.FrontendException: ERROR 0: Incompatible > file URI scheme: har : hdfs > 2010-04-14 22:08:48,814 [main] WARN org.apache.pig.tools.grunt.Grunt - There > is no log file to write to. 
> 2010-04-14 22:08:48,814 [main] ERROR org.apache.pig.tools.grunt.Grunt - > java.lang.Error: org.apache.pig.impl.logicalLayer.FrontendException: ERROR 0: > Incompatible file URI scheme: har : hdfs > at > org.apache.pig.impl.logicalLayer.parser.QueryParser.LoadClause(QueryParser.java:1483) > at > org.apache.pig.impl.logicalLayer.parser.QueryParser.BaseExpr(QueryParser.java:1245) > at > org.apache.pig.impl.logicalLayer.parser.QueryParser.Expr(QueryParser.java:911) > at > org.apache.pig.impl.logicalLayer.parser.QueryParser.Parse(QueryParser.java:700) > at > org.apache.pig.impl.logicalLayer.LogicalPlanBuilder.parse(LogicalPlanBuilder.java:63) > at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1164) > at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1114) > at org.apache.pig.PigServer.registerQuery(PigServer.java:425) > at > org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:737) > at > org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptPars
[jira] Commented: (PIG-798) Schema errors when using PigStorage and none when using BinStorage in FOREACH??
[ https://issues.apache.org/jira/browse/PIG-798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12861134#action_12861134 ] Viraj Bhat commented on PIG-798: Ashutosh, thanks for clarifying; we will wait till that bug is fixed in BinStorage. Viraj > Schema errors when using PigStorage and none when using BinStorage in > FOREACH?? > --- > > Key: PIG-798 > URL: https://issues.apache.org/jira/browse/PIG-798 > Project: Pig > Issue Type: Bug > Components: impl >Affects Versions: 0.2.0, 0.3.0, 0.4.0, 0.5.0, 0.6.0, 0.7.0, 0.8.0 >Reporter: Viraj Bhat > Attachments: binstoragecreateop, schemaerr.pig, visits.txt > > > In the following script I have a tab separated text file, which I load using > PigStorage() and store using BinStorage() > {code} > A = load '/user/viraj/visits.txt' using PigStorage() as (name:chararray, > url:chararray, time:chararray); > B = group A by name; > store B into '/user/viraj/binstoragecreateop' using BinStorage(); > dump B; > {code} > I later load file 'binstoragecreateop' in the following way. > {code} > A = load '/user/viraj/binstoragecreateop' using BinStorage(); > B = foreach A generate $0 as name:chararray; > dump B; > {code} > Result > === > (Amy) > (Fred) > === > The above code works properly and returns the right results. If I use > PigStorage() to achieve the same, I get the following error. > {code} > A = load '/user/viraj/visits.txt' using PigStorage(); > B = foreach A generate $0 as name:chararray; > dump B; > {code} > === > {code} > 2009-05-02 03:58:50,662 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR > 1022: Type mismatch merging schema prefix. Field Schema: bytearray. Other > Field Schema: name: chararray > Details at logfile: /home/viraj/pig-svn/trunk/pig_1241236728311.log > {code} > === > So why should the semantics of BinStorage() be different from PigStorage(), > where it is ok not to specify a schema? Should it not be consistent across > both.
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1211) Pig script runs half way after which it reports syntax error
[ https://issues.apache.org/jira/browse/PIG-1211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12861106#action_12861106 ] Viraj Bhat commented on PIG-1211: - Ashutosh, yes, as more and more people adopt Pig, they expect some type of guarantees, since Pig is designed to help people with no experience in writing M/R programs. If I am a novice user and I have a small typo, do I wait for 3-4 hours to discover that there is a syntax error? I have not only wasted CPU cycles but also the user's productivity. The problem here is that dump and hadoop shell commands are treated differently in Pig scripts, and Multi-query optimizations are ignored. I have listed what Milind and Dmitry are suggesting. Maybe this is the way a future Pig language will compile, giving you hadoop jar files to run in sequence or as a DAG. Pigcc -L myScript.pig -> parses the pig script, generates the logical plan, and stores it in myScript.pig.l Pigcc -P myScript.pig.l -> produces the physical plan from the logical plan, and stores it in myScript.pig.p Pigcc -M myScript.pig.p -> produces the map-reduce plan, myScript.pig.m Pig myScript.pig.m -> interprets the MR plan.
This can be split into multiple sequential MR job plans too, myScript.pig.m.{1,2,3..}, so that a way to execute the pig script is to run Hadoop jar pigRT.jar myScript.pig.m.1 Hadoop jar pigRT.jar myScript.pig.m.2 Hadoop jar pigRT.jar myScript.pig.m.3 Hadoop jar pigRT.jar myScript.pig.m.4 Thanks Viraj > Pig script runs half way after which it reports syntax error > > > Key: PIG-1211 > URL: https://issues.apache.org/jira/browse/PIG-1211 > Project: Pig > Issue Type: Improvement > Components: impl >Affects Versions: 0.6.0 >Reporter: Viraj Bhat > Fix For: 0.8.0 > > > I have a Pig script which is structured in the following way > {code} > register cp.jar > dataset = load '/data/dataset/' using PigStorage('\u0001') as (col1, col2, > col3, col4, col5); > filtered_dataset = filter dataset by (col1 == 1); > proj_filtered_dataset = foreach filtered_dataset generate col2, col3; > rmf $output1; > store proj_filtered_dataset into '$output1' using PigStorage(); > second_stream = foreach filtered_dataset generate col2, col4, col5; > group_second_stream = group second_stream by col4; > output2 = foreach group_second_stream { > a = second_stream.col2 > b = distinct second_stream.col5; > c = order b by $0; > generate 1 as key, group as keyword, MYUDF(c, 100) as finalcalc; > } > rmf $output2; > --syntax error here > store output2 to '$output2' using PigStorage(); > {code} > I run this script using the Multi-query option; it runs successfully till the > first store but later fails with a syntax error. > The usage of the HDFS option "rmf" causes the first store to execute. > The only option I have is to run an explain before running the script > grunt> explain -script myscript.pig -out explain.out > or moving the rmf statements to the top of the script > Here are some questions: > a) Can we have an option to do something like "checkscript" instead of > explain to get the same syntax error?
In this way I can ensure that I do not > run for 3-4 hours before encountering a syntax error > b) Can pig not figure out a way to re-order the rmf statements since all the > store directories are variables > Thanks > Viraj -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
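The reordering workaround mentioned in the report, moving the rmf statements to the top of the script, would look roughly like this (a sketch reusing the quoted script's aliases; $output1/$output2 remain script parameters and MYUDF remains the reporter's UDF):

{code}
-- delete both output directories up front, so no rmf forces an early store
rmf $output1;
rmf $output2;
register cp.jar;
dataset = load '/data/dataset/' using PigStorage('\u0001') as (col1, col2, col3, col4, col5);
filtered_dataset = filter dataset by (col1 == 1);
proj_filtered_dataset = foreach filtered_dataset generate col2, col3;
store proj_filtered_dataset into '$output1' using PigStorage();
{code}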
[jira] Commented: (PIG-798) Schema errors when using PigStorage and none when using BinStorage in FOREACH??
[ https://issues.apache.org/jira/browse/PIG-798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12861097#action_12861097 ] Viraj Bhat commented on PIG-798: Hi Ashutosh, yes that is possible; I know that we can do that in BinStorage(), but why can we not do this in PigStorage? Do I need to cast to (chararray)? {code} A = load 'somedata' using PigStorage(); B = foreach A generate $0 as name:chararray; dump B; {code} But this is possible in BinStorage(); why is this not consistent? Is it that BinStorage() has schemas embedded while PigStorage() does not? Should this not be fixed to make it consistent across storage formats? Viraj > Schema errors when using PigStorage and none when using BinStorage in > FOREACH?? > --- > > Key: PIG-798 > URL: https://issues.apache.org/jira/browse/PIG-798 > Project: Pig > Issue Type: Bug > Components: impl >Affects Versions: 0.2.0, 0.3.0, 0.4.0, 0.5.0, 0.6.0, 0.7.0, 0.8.0 >Reporter: Viraj Bhat > Attachments: binstoragecreateop, schemaerr.pig, visits.txt > > > In the following script I have a tab separated text file, which I load using > PigStorage() and store using BinStorage() > {code} > A = load '/user/viraj/visits.txt' using PigStorage() as (name:chararray, > url:chararray, time:chararray); > B = group A by name; > store B into '/user/viraj/binstoragecreateop' using BinStorage(); > dump B; > {code} > I later load file 'binstoragecreateop' in the following way. > {code} > A = load '/user/viraj/binstoragecreateop' using BinStorage(); > B = foreach A generate $0 as name:chararray; > dump B; > {code} > Result > === > (Amy) > (Fred) > === > The above code works properly and returns the right results. If I use > PigStorage() to achieve the same, I get the following error.
> {code} > A = load '/user/viraj/visits.txt' using PigStorage(); > B = foreach A generate $0 as name:chararray; > dump B; > {code} > === > {code} > 2009-05-02 03:58:50,662 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR > 1022: Type mismatch merging schema prefix. Field Schema: bytearray. Other > Field Schema: name: chararray > Details at logfile: /home/viraj/pig-svn/trunk/pig_1241236728311.log > {code} > === > So why should the semantics of BinStorage() be different from PigStorage() > where is ok not to specify a schema??? Should it not be consistent across > both. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
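One workaround worth trying for the error above (an unverified sketch, not suggested in the thread) is an explicit cast instead of a typed AS clause, which avoids the schema-prefix merge between the declared chararray and the loader's bytearray:

{code}
A = load '/user/viraj/visits.txt' using PigStorage();
-- cast the bytearray explicitly, then name the field without a type annotation
B = foreach A generate (chararray)$0 as name;
dump B;
{code}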
[jira] Updated: (PIG-798) Schema errors when using PigStorage and none when using BinStorage in FOREACH??
[ https://issues.apache.org/jira/browse/PIG-798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Viraj Bhat updated PIG-798: --- Affects Version/s: 0.6.0 0.5.0 0.4.0 0.3.0 0.7.0 0.8.0 > Schema errors when using PigStorage and none when using BinStorage in > FOREACH?? > --- > > Key: PIG-798 > URL: https://issues.apache.org/jira/browse/PIG-798 > Project: Pig > Issue Type: Bug > Components: impl >Affects Versions: 0.2.0, 0.3.0, 0.4.0, 0.5.0, 0.6.0, 0.7.0, 0.8.0 >Reporter: Viraj Bhat > Attachments: binstoragecreateop, schemaerr.pig, visits.txt > > > In the following script I have a tab separated text file, which I load using > PigStorage() and store using BinStorage() > {code} > A = load '/user/viraj/visits.txt' using PigStorage() as (name:chararray, > url:chararray, time:chararray); > B = group A by name; > store B into '/user/viraj/binstoragecreateop' using BinStorage(); > dump B; > {code} > I later load file 'binstoragecreateop' in the following way. > {code} > A = load '/user/viraj/binstoragecreateop' using BinStorage(); > B = foreach A generate $0 as name:chararray; > dump B; > {code} > Result > === > (Amy) > (Fred) > === > The above code work properly and returns the right results. If I use > PigStorage() to achieve the same, I get the following error. > {code} > A = load '/user/viraj/visits.txt' using PigStorage(); > B = foreach A generate $0 as name:chararray; > dump B; > {code} > === > {code} > 2009-05-02 03:58:50,662 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR > 1022: Type mismatch merging schema prefix. Field Schema: bytearray. Other > Field Schema: name: chararray > Details at logfile: /home/viraj/pig-svn/trunk/pig_1241236728311.log > {code} > === > So why should the semantics of BinStorage() be different from PigStorage() > where is ok not to specify a schema??? Should it not be consistent across > both. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-798) Schema errors when using PigStorage and none when using BinStorage in FOREACH??
[ https://issues.apache.org/jira/browse/PIG-798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12860452#action_12860452 ] Viraj Bhat commented on PIG-798: Hi Ashutosh, the problem here is not about using the data interchangeably between BinStorage() and PigStorage(); it is about the consistency issues in schema handling. Sorry if the description was unclear. I can see that it is possible to write statements such as this using BinStorage() {code} A = load 'somedata' using BinStorage(); B = foreach A generate $0 as name:chararray; dump B; {code} but not using PigStorage(). Should we not support the following statement? As a user, I am interested in projecting the first column and casting it to a chararray; I am not interested in knowing the schemas of the other columns! It fails when I do the following: {code} A = load 'somedata' using PigStorage(); B = foreach A generate $0 as name:chararray; dump B; {code} Can you tell me why the schema specification in FOREACH GENERATE works with BinStorage and not with PigStorage? Viraj > Schema errors when using PigStorage and none when using BinStorage in > FOREACH?? > --- > > Key: PIG-798 > URL: https://issues.apache.org/jira/browse/PIG-798 > Project: Pig > Issue Type: Bug > Components: impl >Affects Versions: 0.2.0 >Reporter: Viraj Bhat > Attachments: binstoragecreateop, schemaerr.pig, visits.txt > > > In the following script I have a tab separated text file, which I load using > PigStorage() and store using BinStorage() > {code} > A = load '/user/viraj/visits.txt' using PigStorage() as (name:chararray, > url:chararray, time:chararray); > B = group A by name; > store B into '/user/viraj/binstoragecreateop' using BinStorage(); > dump B; > {code} > I later load file 'binstoragecreateop' in the following way.
> {code} > A = load '/user/viraj/binstoragecreateop' using BinStorage(); > B = foreach A generate $0 as name:chararray; > dump B; > {code} > Result > === > (Amy) > (Fred) > === > The above code work properly and returns the right results. If I use > PigStorage() to achieve the same, I get the following error. > {code} > A = load '/user/viraj/visits.txt' using PigStorage(); > B = foreach A generate $0 as name:chararray; > dump B; > {code} > === > {code} > 2009-05-02 03:58:50,662 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR > 1022: Type mismatch merging schema prefix. Field Schema: bytearray. Other > Field Schema: name: chararray > Details at logfile: /home/viraj/pig-svn/trunk/pig_1241236728311.log > {code} > === > So why should the semantics of BinStorage() be different from PigStorage() > where is ok not to specify a schema??? Should it not be consistent across > both. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1339) International characters in column names not supported
[ https://issues.apache.org/jira/browse/PIG-1339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Viraj Bhat updated PIG-1339: Affects Version/s: 0.7.0 0.8.0 > International characters in column names not supported > -- > > Key: PIG-1339 > URL: https://issues.apache.org/jira/browse/PIG-1339 > Project: Pig > Issue Type: Bug > Components: impl >Affects Versions: 0.6.0, 0.7.0, 0.8.0 >Reporter: Viraj Bhat > > There is a particular use-case in which someone specifies a column name to be > in International characters. > {code} > inputdata = load '/user/viraj/inputdata.txt' using PigStorage() as (あいうえお); > describe inputdata; > dump inputdata; > {code} > == > Pig Stack Trace > --- > ERROR 1000: Error during parsing. Lexical error at line 1, column 64. > Encountered: "\u3042" (12354), after : "" > org.apache.pig.impl.logicalLayer.parser.TokenMgrError: Lexical error at line > 1, column 64. Encountered: "\u3042" (12354), after : "" > at > org.apache.pig.impl.logicalLayer.parser.QueryParserTokenManager.getNextToken(QueryParserTokenManager.java:1791) > at > org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_scan_token(QueryParser.java:8959) > at > org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_3R_51(QueryParser.java:7462) > at > org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_3R_120(QueryParser.java:7769) > at > org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_3R_106(QueryParser.java:7787) > at > org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_3R_63(QueryParser.java:8609) > at > org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_3R_32(QueryParser.java:8621) > at > org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_3_4(QueryParser.java:8354) > at > org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_2_4(QueryParser.java:6903) > at > org.apache.pig.impl.logicalLayer.parser.QueryParser.BaseExpr(QueryParser.java:1249) > at > org.apache.pig.impl.logicalLayer.parser.QueryParser.Expr(QueryParser.java:911) > at > 
org.apache.pig.impl.logicalLayer.parser.QueryParser.Parse(QueryParser.java:700) > at > org.apache.pig.impl.logicalLayer.LogicalPlanBuilder.parse(LogicalPlanBuilder.java:63) > at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1164) > at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1114) > at org.apache.pig.PigServer.registerQuery(PigServer.java:425) > at > org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:737) > at > org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:324) > at > org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:162) > at > org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:138) > at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:89) > at org.apache.pig.Main.main(Main.java:391) > == > Thanks Viraj -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1339) International characters in column names not supported
[ https://issues.apache.org/jira/browse/PIG-1339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12860445#action_12860445 ] Viraj Bhat commented on PIG-1339: - Hi Ashutosh this does not work in trunk. I am using the latest build: {code} $java -cp ~/pig-svn/trunk/pig.jar org.apache.pig.Main -version Apache Pig version 0.8.0-dev (r937554) compiled Apr 23 2010, 16:57:32 {code} 2010-04-23 17:31:41,448 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1000: Error during parsing. Lexical error at line 1, column 71. Encountered: "\u3042" (12354), after : "" This is a valid bug. Viraj > International characters in column names not supported > -- > > Key: PIG-1339 > URL: https://issues.apache.org/jira/browse/PIG-1339 > Project: Pig > Issue Type: Bug > Components: impl >Affects Versions: 0.6.0, 0.7.0, 0.8.0 >Reporter: Viraj Bhat > > There is a particular use-case in which someone specifies a column name to be > in International characters. > {code} > inputdata = load '/user/viraj/inputdata.txt' using PigStorage() as (あいうえお); > describe inputdata; > dump inputdata; > {code} > == > Pig Stack Trace > --- > ERROR 1000: Error during parsing. Lexical error at line 1, column 64. > Encountered: "\u3042" (12354), after : "" > org.apache.pig.impl.logicalLayer.parser.TokenMgrError: Lexical error at line > 1, column 64. 
Encountered: "\u3042" (12354), after : "" > at > org.apache.pig.impl.logicalLayer.parser.QueryParserTokenManager.getNextToken(QueryParserTokenManager.java:1791) > at > org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_scan_token(QueryParser.java:8959) > at > org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_3R_51(QueryParser.java:7462) > at > org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_3R_120(QueryParser.java:7769) > at > org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_3R_106(QueryParser.java:7787) > at > org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_3R_63(QueryParser.java:8609) > at > org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_3R_32(QueryParser.java:8621) > at > org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_3_4(QueryParser.java:8354) > at > org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_2_4(QueryParser.java:6903) > at > org.apache.pig.impl.logicalLayer.parser.QueryParser.BaseExpr(QueryParser.java:1249) > at > org.apache.pig.impl.logicalLayer.parser.QueryParser.Expr(QueryParser.java:911) > at > org.apache.pig.impl.logicalLayer.parser.QueryParser.Parse(QueryParser.java:700) > at > org.apache.pig.impl.logicalLayer.LogicalPlanBuilder.parse(LogicalPlanBuilder.java:63) > at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1164) > at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1114) > at org.apache.pig.PigServer.registerQuery(PigServer.java:425) > at > org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:737) > at > org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:324) > at > org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:162) > at > org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:138) > at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:89) > at org.apache.pig.Main.main(Main.java:391) > == > Thanks Viraj -- This message is automatically generated by JIRA. 
- You can reply to this email to add a comment to the issue online.
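A possible workaround for the PIG-1339 script above, until the lexer accepts non-ASCII identifiers, is to use an ASCII alias in the schema. This is only a sketch (the alias aiueo is a hypothetical stand-in for あいうえお; the loaded data itself is unaffected, since the alias only names the column):

{code}
inputdata = load '/user/viraj/inputdata.txt' using PigStorage() as (aiueo);
describe inputdata;
dump inputdata;
{code}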
[jira] Commented: (PIG-1211) Pig script runs half way after which it reports syntax error
[ https://issues.apache.org/jira/browse/PIG-1211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12860419#action_12860419 ] Viraj Bhat commented on PIG-1211: - Ashutosh, I feel that the user may not be interested in first running the script through explain to find the syntax error and then running it again to get the results. Users expect Pig to report all the errors upfront, before submitting an M/R job. Explain was not designed for checking syntax errors in scripts. I believe that if you have a dump statement, explain -script will cause the script to run. Is it not possible for Pig to find out that there is an error with the "store" syntax? Viraj > Pig script runs half way after which it reports syntax error > > > Key: PIG-1211 > URL: https://issues.apache.org/jira/browse/PIG-1211 > Project: Pig > Issue Type: Improvement > Components: impl > Affects Versions: 0.6.0 > Reporter: Viraj Bhat > Fix For: 0.8.0 > > > I have a Pig script which is structured in the following way > {code} > register cp.jar > dataset = load '/data/dataset/' using PigStorage('\u0001') as (col1, col2, > col3, col4, col5); > filtered_dataset = filter dataset by (col1 == 1); > proj_filtered_dataset = foreach filtered_dataset generate col2, col3; > rmf $output1; > store proj_filtered_dataset into '$output1' using PigStorage(); > second_stream = foreach filtered_dataset generate col2, col4, col5; > group_second_stream = group second_stream by col4; > output2 = foreach group_second_stream { > a = second_stream.col2 > b = distinct second_stream.col5; > c = order b by $0; > generate 1 as key, group as keyword, MYUDF(c, 100) as finalcalc; > } > rmf $output2; > --syntax error here > store output2 to '$output2' using PigStorage(); > {code} > When I run this script using the multi-query option, it runs successfully till > the first store but later fails with a syntax error. > The usage of the HDFS option "rmf" causes the first store to execute. 
> The only option that I have is to run an explain before running the script > grunt> explain -script myscript.pig -out explain.out > or moving the rmf statements to the top of the script > Here are some questions: > a) Can we have an option to do something like "checkscript" instead of > explain to get the same syntax error? In this way I can ensure that I do not > run for 3-4 hours before encountering a syntax error. > b) Can Pig not figure out a way to re-order the rmf statements, since all the > store directories are variables? > Thanks > Viraj -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
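A sketch of the second workaround (hoisting the rmf statements to the top, so the whole script is parsed before any store triggers execution). The aliases and parameters are taken from the script in this issue, and the final store uses the corrected "into" syntax:

{code}
register cp.jar
rmf $output1;
rmf $output2;
dataset = load '/data/dataset/' using PigStorage('\u0001') as (col1, col2, col3, col4, col5);
filtered_dataset = filter dataset by (col1 == 1);
proj_filtered_dataset = foreach filtered_dataset generate col2, col3;
store proj_filtered_dataset into '$output1' using PigStorage();
-- ... same second_stream/output2 pipeline as in the script above ...
store output2 into '$output2' using PigStorage();
{code}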
[jira] Commented: (PIG-1345) Link casting errors in POCast to actual lines numbers in Pig script
[ https://issues.apache.org/jira/browse/PIG-1345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12860397#action_12860397 ] Viraj Bhat commented on PIG-1345: - In which release will PIG-908 be fixed? Is it guaranteed that fixing PIG-908 will also solve this issue? > Link casting errors in POCast to actual lines numbers in Pig script > --- > > Key: PIG-1345 > URL: https://issues.apache.org/jira/browse/PIG-1345 > Project: Pig > Issue Type: Sub-task > Components: impl > Affects Versions: 0.6.0 > Reporter: Viraj Bhat > > For the purpose of easy debugging, it would be nice to find out where in the > Pig script my warnings are coming from. > The only known process is to comment out lines in the Pig script and see if > these warnings go away. > 2010-01-13 21:34:13,697 [main] WARN org.apache.pig.PigServer - Encountered > Warning IMPLICIT_CAST_TO_MAP 2 time(s) line 22 > 2010-01-13 21:34:13,698 [main] WARN org.apache.pig.PigServer - Encountered > Warning IMPLICIT_CAST_TO_LONG 2 time(s) line 23 > 2010-01-13 21:34:13,698 [main] WARN org.apache.pig.PigServer - Encountered > Warning IMPLICIT_CAST_TO_BAG 1 time(s). line 26 > I think this may need us to keep track of the line numbers of the Pig script > (via our javacc parser) and maintain them in the logical and physical plans. > It would help users in debugging simple errors/warnings related to casting. > Is this enhancement listed in the http://wiki.apache.org/pig/PigJournal? > Do we need to change the parser to something other than javacc to make this > task simpler? > "Standardize on Parser and Scanner Technology" > Viraj -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1378) har url not usable in Pig scripts
[ https://issues.apache.org/jira/browse/PIG-1378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12859384#action_12859384 ] Viraj Bhat commented on PIG-1378: - har:// currently works in Pig 0.7 when the hdfs location is specified. > har url not usable in Pig scripts > - > > Key: PIG-1378 > URL: https://issues.apache.org/jira/browse/PIG-1378 > Project: Pig > Issue Type: Bug > Components: impl > Affects Versions: 0.7.0 > Reporter: Viraj Bhat > Fix For: 0.8.0 > > > I am trying to use har (Hadoop Archives) in my Pig script. > I can use them through the HDFS shell > {noformat} > $hadoop fs -ls 'har:///user/viraj/project/subproject/files/size/data' > Found 1 items > -rw--- 5 viraj users 1537234 2010-04-14 09:49 > user/viraj/project/subproject/files/size/data/part-1 > {noformat} > Using similar URLs in grunt yields > {noformat} > grunt> a = load 'har:///user/viraj/project/subproject/files/size/data'; > grunt> dump a; > {noformat} > {noformat} > 2010-04-14 22:08:48,814 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR > 2998: Unhandled internal error. > org.apache.pig.impl.logicalLayer.FrontendException: ERROR 0: Incompatible > file URI scheme: har : hdfs > 2010-04-14 22:08:48,814 [main] WARN org.apache.pig.tools.grunt.Grunt - There > is no log file to write to. 
> 2010-04-14 22:08:48,814 [main] ERROR org.apache.pig.tools.grunt.Grunt - > java.lang.Error: org.apache.pig.impl.logicalLayer.FrontendException: ERROR 0: > Incompatible file URI scheme: har : hdfs > at > org.apache.pig.impl.logicalLayer.parser.QueryParser.LoadClause(QueryParser.java:1483) > at > org.apache.pig.impl.logicalLayer.parser.QueryParser.BaseExpr(QueryParser.java:1245) > at > org.apache.pig.impl.logicalLayer.parser.QueryParser.Expr(QueryParser.java:911) > at > org.apache.pig.impl.logicalLayer.parser.QueryParser.Parse(QueryParser.java:700) > at > org.apache.pig.impl.logicalLayer.LogicalPlanBuilder.parse(LogicalPlanBuilder.java:63) > at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1164) > at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1114) > at org.apache.pig.PigServer.registerQuery(PigServer.java:425) > at > org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:737) > at > org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:324) > at > org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:162) > at > org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:138) > at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:75) > at org.apache.pig.Main.main(Main.java:357) > Caused by: org.apache.pig.impl.logicalLayer.FrontendException: ERROR 0: > Incompatible file URI scheme: har : hdfs > at org.apache.pig.LoadFunc.getAbsolutePath(LoadFunc.java:249) > at org.apache.pig.LoadFunc.relativeToAbsolutePath(LoadFunc.java:62) > at > org.apache.pig.impl.logicalLayer.parser.QueryParser.LoadClause(QueryParser.java:1472) > ... 
13 more > {noformat} > According to Jira http://issues.apache.org/jira/browse/PIG-1234 I try the > following as stated in the original description > {noformat} > grunt> a = load > 'har://namenode-location/user/viraj/project/subproject/files/size/data'; > grunt> dump a; > {noformat} > {noformat} > Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 2118: > Unable to create input splits for: > har://namenode-location/user/viraj/project/subproject/files/size/data'; > ... 8 more > Caused by: java.io.IOException: No FileSystem for scheme: namenode-location > at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1375) > at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66) > at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1390) > at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:196) > at org.apache.hadoop.fs.HarFileSystem.initialize(HarFileSystem.java:104) > at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1378) > at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:193) > at org.apache.hadoop.fs.Path.getFileSystem(Path.java:175) > at > org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:208) > at > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigTextInputFormat.listStatus(PigTextInputFormat.java:36) > at > org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:246) > at > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:245) > {noformat} > Viraj -- This message is automatically generated by JIRA. 
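Per the comment above, har:// works in Pig 0.7 when the hdfs location is fully specified. A sketch of that fully qualified form (the host name and port below are placeholders; per the Hadoop archives documentation, the authority of a har URL takes the form underlying-scheme-host:port):

{noformat}
grunt> a = load 'har://hdfs-namenode.example.com:8020/user/viraj/project/subproject/files/size/data';
grunt> dump a;
{noformat}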
[jira] Resolved: (PIG-829) DECLARE statement stop processing after special characters such as dot "." , "+" "%" etc..
[ https://issues.apache.org/jira/browse/PIG-829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Viraj Bhat resolved PIG-829. Fix Version/s: 0.7.0 Resolution: Fixed Pig 0.7 yields the correct result: {code} x = LOAD 'something' as (a:chararray, b:chararray); y = FILTER x BY ( a MATCHES '^.*yahoo.*$' ); STORE y INTO 'foo.bar'; {code} > DECLARE statement stop processing after special characters such as dot "." , > "+" "%" etc.. > -- > > Key: PIG-829 > URL: https://issues.apache.org/jira/browse/PIG-829 > Project: Pig > Issue Type: Bug > Components: grunt > Affects Versions: 0.3.0 > Reporter: Viraj Bhat > Fix For: 0.7.0 > > > The below Pig script does not work well when special characters are used in > the DECLARE statement. > {code} > %DECLARE OUT foo.bar > x = LOAD 'something' as (a:chararray, b:chararray); > y = FILTER x BY ( a MATCHES '^.*yahoo.*$' ); > STORE y INTO '$OUT'; > {code} > When the above script is run in dry-run mode, the substituted file does > not contain the special character. > {code} > java -cp pig.jar:/homes/viraj/hadoop-0.18.0-dev/conf -Dhod.server='' > org.apache.pig.Main -r declaresp.pig > {code} > Resulting file: "declaresp.pig.substituted" > {code} > x = LOAD 'something' as (a:chararray, b:chararray); > y = FILTER x BY ( a MATCHES '^.*yahoo.*$' ); > STORE y INTO 'foo'; > {code} -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Resolved: (PIG-518) LOBinCond exception in LogicalPlanValidationExecutor when providing default values for bag
[ https://issues.apache.org/jira/browse/PIG-518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Viraj Bhat resolved PIG-518. Fix Version/s: 0.7.0 Resolution: Fixed > LOBinCond exception in LogicalPlanValidationExecutor when providing default > values for bag > --- > > Key: PIG-518 > URL: https://issues.apache.org/jira/browse/PIG-518 > Project: Pig > Issue Type: Bug > Components: impl >Affects Versions: 0.2.0 >Reporter: Viraj Bhat > Fix For: 0.7.0 > > Attachments: queries.txt, sports_views.txt > > > The following piece of Pig script, which provides default values for bags > {('','')} when the COUNT returns 0 fails with the following error. (Note: > Files used in this script are enclosed on this Jira.) > > a = load 'sports_views.txt' as (col1, col2, col3); > b = load 'queries.txt' as (colb1,colb2,colb3); > mycogroup = cogroup a by col1 inner, b by colb1; > mynewalias = foreach mycogroup generate flatten(a), flatten((COUNT(b) > 0L ? > b.(colb2,colb3) : {('','')})); > dump mynewalias; > > java.io.IOException: Unable to open iterator for alias: mynewalias [Unable to > store for alias: mynewalias [Can't overwrite cause]] > at java.lang.Throwable.initCause(Throwable.java:320) > at > org.apache.pig.impl.logicalLayer.validators.TypeCheckingVisitor.visit(TypeCheckingVisitor.java:1494) > at org.apache.pig.impl.logicalLayer.LOBinCond.visit(LOBinCond.java:85) > at org.apache.pig.impl.logicalLayer.LOBinCond.visit(LOBinCond.java:28) > at > org.apache.pig.impl.plan.DependencyOrderWalker.walk(DependencyOrderWalker.java:68) > at org.apache.pig.impl.plan.PlanVisitor.visit(PlanVisitor.java:51) > at > org.apache.pig.impl.logicalLayer.validators.TypeCheckingVisitor.checkInnerPlan(TypeCheckingVisitor.java:2345) > at > org.apache.pig.impl.logicalLayer.validators.TypeCheckingVisitor.visit(TypeCheckingVisitor.java:2252) > at org.apache.pig.impl.logicalLayer.LOForEach.visit(LOForEach.java:121) > at org.apache.pig.impl.logicalLayer.LOForEach.visit(LOForEach.java:40) > at > 
org.apache.pig.impl.plan.DependencyOrderWalker.walk(DependencyOrderWalker.java:68) > at org.apache.pig.impl.plan.PlanVisitor.visit(PlanVisitor.java:51) > at > org.apache.pig.impl.plan.PlanValidator.validateSkipCollectException(PlanValidator.java:101) > at > org.apache.pig.impl.logicalLayer.validators.TypeCheckingValidator.validate(TypeCheckingValidator.java:40) > at > org.apache.pig.impl.logicalLayer.validators.TypeCheckingValidator.validate(TypeCheckingValidator.java:30) > at > org.apache.pig.impl.logicalLayer.validators.LogicalPlanValidationExecutor.validate(LogicalPlanValidationExecutor.java: > 79) > at org.apache.pig.PigServer.compileLp(PigServer.java:684) > at org.apache.pig.PigServer.compileLp(PigServer.java:655) > at org.apache.pig.PigServer.store(PigServer.java:433) > at org.apache.pig.PigServer.store(PigServer.java:421) > at org.apache.pig.PigServer.openIterator(PigServer.java:384) > at > org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:269) > at > org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:178) > at > org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:84) > at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:64) > at org.apache.pig.Main.main(Main.java:306) > Caused by: java.io.IOException: Unable to store for alias: mynewalias [Can't > overwrite cause] > ... 26 more > Caused by: java.lang.IllegalStateException: Can't overwrite cause > ... 26 more > -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Commented: (PIG-518) LOBinCond exception in LogicalPlanValidationExecutor when providing default values for bag
[ https://issues.apache.org/jira/browse/PIG-518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12857157#action_12857157 ] Viraj Bhat commented on PIG-518: The above script generates the following error in Pig 0.7: 2010-04-14 17:10:49,807 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1048: Two inputs of BinCond must have compatible schemas. left hand side: b: bag({colb2: bytearray,colb3: bytearray}) right hand side: bag({(chararray,chararray)}) A cast to the right type solves the problem. {code} a = load 'sports_views.txt' as (col1:chararray, col2:chararray, col3:chararray); b = load 'queries.txt' as (colb1:chararray,colb2:chararray,colb3:chararray); mycogroup = cogroup a by col1 inner, b by colb1; mynewalias = foreach mycogroup generate flatten(a), flatten((COUNT(b) > 0L ? b.(colb2,colb3) : {('','')})); dump mynewalias; {code} (alice,lakers,3,ipod,3) (alice,warriors,7,ipod,3) (peter,sun,7,sun,4) (peter,nets,7,sun,4) Closing this bug, as Pig yields a correct error message which the user can use to fix the script. > LOBinCond exception in LogicalPlanValidationExecutor when providing default > values for bag > --- > > Key: PIG-518 > URL: https://issues.apache.org/jira/browse/PIG-518 > Project: Pig > Issue Type: Bug > Components: impl > Affects Versions: 0.2.0 > Reporter: Viraj Bhat > Attachments: queries.txt, sports_views.txt > > > The following piece of Pig script, which provides default values for bags > {('','')} when COUNT returns 0, fails with the following error. (Note: files > used in this script are attached to this Jira.) > > a = load 'sports_views.txt' as (col1, col2, col3); > b = load 'queries.txt' as (colb1,colb2,colb3); > mycogroup = cogroup a by col1 inner, b by colb1; > mynewalias = foreach mycogroup generate flatten(a), flatten((COUNT(b) > 0L ? 
> b.(colb2,colb3) : {('','')})); > dump mynewalias; > > java.io.IOException: Unable to open iterator for alias: mynewalias [Unable to > store for alias: mynewalias [Can't overwrite cause]] > at java.lang.Throwable.initCause(Throwable.java:320) > at > org.apache.pig.impl.logicalLayer.validators.TypeCheckingVisitor.visit(TypeCheckingVisitor.java:1494) > at org.apache.pig.impl.logicalLayer.LOBinCond.visit(LOBinCond.java:85) > at org.apache.pig.impl.logicalLayer.LOBinCond.visit(LOBinCond.java:28) > at > org.apache.pig.impl.plan.DependencyOrderWalker.walk(DependencyOrderWalker.java:68) > at org.apache.pig.impl.plan.PlanVisitor.visit(PlanVisitor.java:51) > at > org.apache.pig.impl.logicalLayer.validators.TypeCheckingVisitor.checkInnerPlan(TypeCheckingVisitor.java:2345) > at > org.apache.pig.impl.logicalLayer.validators.TypeCheckingVisitor.visit(TypeCheckingVisitor.java:2252) > at org.apache.pig.impl.logicalLayer.LOForEach.visit(LOForEach.java:121) > at org.apache.pig.impl.logicalLayer.LOForEach.visit(LOForEach.java:40) > at > org.apache.pig.impl.plan.DependencyOrderWalker.walk(DependencyOrderWalker.java:68) > at org.apache.pig.impl.plan.PlanVisitor.visit(PlanVisitor.java:51) > at > org.apache.pig.impl.plan.PlanValidator.validateSkipCollectException(PlanValidator.java:101) > at > org.apache.pig.impl.logicalLayer.validators.TypeCheckingValidator.validate(TypeCheckingValidator.java:40) > at > org.apache.pig.impl.logicalLayer.validators.TypeCheckingValidator.validate(TypeCheckingValidator.java:30) > at > org.apache.pig.impl.logicalLayer.validators.LogicalPlanValidationExecutor.validate(LogicalPlanValidationExecutor.java: > 79) > at org.apache.pig.PigServer.compileLp(PigServer.java:684) > at org.apache.pig.PigServer.compileLp(PigServer.java:655) > at org.apache.pig.PigServer.store(PigServer.java:433) > at org.apache.pig.PigServer.store(PigServer.java:421) > at org.apache.pig.PigServer.openIterator(PigServer.java:384) > at > 
org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:269) > at > org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:178) > at > org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:84) > at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:64) > at org.apache.pig.Main.main(Main.java:306) > Caused by: java.io.IOException: Unable to store for alias: mynewalias [Can't > overwrite cause] > ... 26 more > Caused by: java.lang.IllegalStateException: Can't overwrite cause > ... 26 more > ==
[jira] Updated: (PIG-1378) har url not usable in Pig scripts
[ https://issues.apache.org/jira/browse/PIG-1378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Viraj Bhat updated PIG-1378: Description: I am trying to use har (Hadoop Archives) in my Pig script. I can use them through the HDFS shell {noformat} $hadoop fs -ls 'har:///user/viraj/project/subproject/files/size/data' Found 1 items -rw--- 5 viraj users1537234 2010-04-14 09:49 user/viraj/project/subproject/files/size/data/part-1 {noformat} Using similar URL's in grunt yields {noformat} grunt> a = load 'har:///user/viraj/project/subproject/files/size/data'; grunt> dump a; {noformat} {noformat} 2010-04-14 22:08:48,814 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2998: Unhandled internal error. org.apache.pig.impl.logicalLayer.FrontendException: ERROR 0: Incompatible file URI scheme: har : hdfs 2010-04-14 22:08:48,814 [main] WARN org.apache.pig.tools.grunt.Grunt - There is no log file to write to. 2010-04-14 22:08:48,814 [main] ERROR org.apache.pig.tools.grunt.Grunt - java.lang.Error: org.apache.pig.impl.logicalLayer.FrontendException: ERROR 0: Incompatible file URI scheme: har : hdfs at org.apache.pig.impl.logicalLayer.parser.QueryParser.LoadClause(QueryParser.java:1483) at org.apache.pig.impl.logicalLayer.parser.QueryParser.BaseExpr(QueryParser.java:1245) at org.apache.pig.impl.logicalLayer.parser.QueryParser.Expr(QueryParser.java:911) at org.apache.pig.impl.logicalLayer.parser.QueryParser.Parse(QueryParser.java:700) at org.apache.pig.impl.logicalLayer.LogicalPlanBuilder.parse(LogicalPlanBuilder.java:63) at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1164) at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1114) at org.apache.pig.PigServer.registerQuery(PigServer.java:425) at org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:737) at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:324) at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:162) at 
org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:138) at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:75) at org.apache.pig.Main.main(Main.java:357) Caused by: org.apache.pig.impl.logicalLayer.FrontendException: ERROR 0: Incompatible file URI scheme: har : hdfs at org.apache.pig.LoadFunc.getAbsolutePath(LoadFunc.java:249) at org.apache.pig.LoadFunc.relativeToAbsolutePath(LoadFunc.java:62) at org.apache.pig.impl.logicalLayer.parser.QueryParser.LoadClause(QueryParser.java:1472) ... 13 more {noformat} According to Jira http://issues.apache.org/jira/browse/PIG-1234 I try the following as stated in the original description {noformat} grunt> a = load 'har://namenode-location/user/viraj/project/subproject/files/size/data'; grunt> dump a; {noformat} {noformat} Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 2118: Unable to create input splits for: har://namenode-location/user/viraj/project/subproject/files/size/data'; ... 8 more Caused by: java.io.IOException: No FileSystem for scheme: namenode-location at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1375) at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66) at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1390) at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:196) at org.apache.hadoop.fs.HarFileSystem.initialize(HarFileSystem.java:104) at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1378) at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:193) at org.apache.hadoop.fs.Path.getFileSystem(Path.java:175) at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:208) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigTextInputFormat.listStatus(PigTextInputFormat.java:36) at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:246) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:245) {noformat} Viraj 
[jira] Created: (PIG-1378) har url not usable in Pig scripts
har url not usable in Pig scripts - Key: PIG-1378 URL: https://issues.apache.org/jira/browse/PIG-1378 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.7.0 Reporter: Viraj Bhat Fix For: 0.7.0 I am trying to use har (Hadoop Archives) in my Pig script. I can use them through the HDFS shell {noformat} $hadoop fs -ls 'har:///user/viraj/project/subproject/files/size/data' Found 1 items -rw--- 5 viraj users1537234 2010-04-14 09:49 user/viraj/project/subproject/files/size/data/part-1 {noformat} Using similar URL's in grunt yields {noformat} grunt> a = load 'har:///user/viraj/project/subproject/files/size/data'; grunt> dump a; {noformat} {noformat} 2010-04-14 22:08:48,814 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2998: Unhandled internal error. org.apache.pig.impl.logicalLayer.FrontendException: ERROR 0: Incompatible file URI scheme: har : hdfs 2010-04-14 22:08:48,814 [main] WARN org.apache.pig.tools.grunt.Grunt - There is no log file to write to. 2010-04-14 22:08:48,814 [main] ERROR org.apache.pig.tools.grunt.Grunt - java.lang.Error: org.apache.pig.impl.logicalLayer.FrontendException: ERROR 0: Incompatible file URI scheme: har : hdfs at org.apache.pig.impl.logicalLayer.parser.QueryParser.LoadClause(QueryParser.java:1483) at org.apache.pig.impl.logicalLayer.parser.QueryParser.BaseExpr(QueryParser.java:1245) at org.apache.pig.impl.logicalLayer.parser.QueryParser.Expr(QueryParser.java:911) at org.apache.pig.impl.logicalLayer.parser.QueryParser.Parse(QueryParser.java:700) at org.apache.pig.impl.logicalLayer.LogicalPlanBuilder.parse(LogicalPlanBuilder.java:63) at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1164) at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1114) at org.apache.pig.PigServer.registerQuery(PigServer.java:425) at org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:737) at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:324) at 
org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:162) at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:138) at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:75) at org.apache.pig.Main.main(Main.java:357) Caused by: org.apache.pig.impl.logicalLayer.FrontendException: ERROR 0: Incompatible file URI scheme: har : hdfs at org.apache.pig.LoadFunc.getAbsolutePath(LoadFunc.java:249) at org.apache.pig.LoadFunc.relativeToAbsolutePath(LoadFunc.java:62) at org.apache.pig.impl.logicalLayer.parser.QueryParser.LoadClause(QueryParser.java:1472) ... 13 more {noformat} According to Jira http://issues.apache.org/jira/browse/PIG-1234 I try the following as stated in the original description {noformat} grunt> a = load 'har://namenode-location/user/viraj/project/subproject/files/size/data'; grunt> dump a; {noformat} {noformat} Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 2118: Unable to create input splits for: har://namenode-location/user/viraj/project/subproject/files/size/data'; ... 
8 more Caused by: java.io.IOException: No FileSystem for scheme: mithrilgold at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1375) at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66) at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1390) at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:196) at org.apache.hadoop.fs.HarFileSystem.initialize(HarFileSystem.java:104) at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1378) at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:193) at org.apache.hadoop.fs.Path.getFileSystem(Path.java:175) at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:208) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigTextInputFormat.listStatus(PigTextInputFormat.java:36) at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:246) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:245) {noformat} Viraj -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Created: (PIG-1377) Pig/Zebra fails without proper error message when the mapred.jobtracker.maxtasks.per.job exceeds threshold
Pig/Zebra fails without proper error message when the mapred.jobtracker.maxtasks.per.job exceeds threshold -- Key: PIG-1377 URL: https://issues.apache.org/jira/browse/PIG-1377 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.6.0, 0.7.0 Reporter: Viraj Bhat I have a Zebra script which generates a huge number of mappers, around 400K. The mapred.jobtracker.maxtasks.per.job is currently set at 200K. The job fails at the initialization phase, and it is very hard to find out the cause. We need a way to report the right error message to users. Unfortunately, for Pig to get this error from the backend, the Map/Reduce Jira https://issues.apache.org/jira/browse/MAPREDUCE-1049 needs to be fixed. {code} -- Sorted format %set default_parallel 100; raw = load '/user/viraj/generated/raw/zebra-sorted/20100203' USING org.apache.hadoop.zebra.pig.TableLoader('', 'sorted') as (id, timestamp, code, ip, host, reference, type, flag, params : map[] ); describe raw; user_events = filter raw by id == 'viraj'; describe user_events; dump user_events; sorted_events = order user_events by id, timestamp; dump sorted_events; store sorted_events into 'finalresult'; {code} -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Created: (PIG-1374) Order by fails with java.lang.String cannot be cast to org.apache.pig.data.DataBag
Order by fails with java.lang.String cannot be cast to org.apache.pig.data.DataBag -- Key: PIG-1374 URL: https://issues.apache.org/jira/browse/PIG-1374 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.6.0, 0.7.0 Reporter: Viraj Bhat The script loads data using BinStorage(), flattens the columns, and then sorts on the second column in descending order. The order by fails with a ClassCastException: {code} register loader.jar; a = load 'c2' using BinStorage(); b = foreach a generate org.apache.pig.CCMLoader(*); describe b; c = foreach b generate flatten($0); describe c; d = order c by $1 desc; dump d; {code} The sampling job fails with the following error: === java.lang.ClassCastException: java.lang.String cannot be cast to org.apache.pig.data.DataBag at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.processInputBag(POProject.java:407) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.getNext(POProject.java:188) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.getNext(POLocalRearrange.java:329) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:232) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:227) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:52) at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144) at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305) at org.apache.hadoop.mapred.Child.main(Child.java:159) === The schemas for b, c and d are as follows: b: {bag_of_tuples: {tuple: (uuid: chararray,velocity: double)}} c: {bag_of_tuples::uuid: chararray,bag_of_tuples::velocity: double} d: {bag_of_tuples::uuid: chararray,bag_of_tuples::velocity: double} If we modify this script to order on the first 
column, it seems to work: {code} register loader.jar; a = load 'c2' using BinStorage(); b = foreach a generate org.apache.pig.CCMLoader(*); describe b; c = foreach b generate flatten($0); describe c; d = order c by $0 desc; dump d; {code} (gc639c60-4267-11df-9879-0800200c9a66,2.4227339503478493) (ec639c60-4267-11df-9879-0800200c9a66,1.140175425099138) There is a workaround: do a projection before the ORDER. {code} register loader.jar; a = load 'c2' using BinStorage(); b = foreach a generate org.apache.pig.CCMLoader(*); describe b; c = foreach b generate flatten($0); describe c; newc = foreach c generate $0 as uuid, $1 as velocity; newd = order newc by velocity desc; dump newd; {code} (gc639c60-4267-11df-9879-0800200c9a66,2.4227339503478493) (ec639c60-4267-11df-9879-0800200c9a66,1.140175425099138) The schema for the loader is as follows: {code} public Schema outputSchema(Schema input) { try { List<Schema.FieldSchema> list = new ArrayList<Schema.FieldSchema>(); list.add(new Schema.FieldSchema("uuid", DataType.CHARARRAY)); list.add(new Schema.FieldSchema("velocity", DataType.DOUBLE)); Schema tupleSchema = new Schema(list); Schema.FieldSchema tupleFs = new Schema.FieldSchema("tuple", tupleSchema, DataType.TUPLE); Schema bagSchema = new Schema(tupleFs); bagSchema.setTwoLevelAccessRequired(true); Schema.FieldSchema bagFs = new Schema.FieldSchema("bag_of_tuples", bagSchema, DataType.BAG); return new Schema(bagFs); } catch (Exception e) { return null; } } {code} -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Resolved: (PIG-756) UDFs should have API for transparently opening and reading files from HDFS or from local file system with only relative path
[ https://issues.apache.org/jira/browse/PIG-756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Viraj Bhat resolved PIG-756. Resolution: Fixed Fix Version/s: 0.7.0 https://issues.apache.org/jira/browse/PIG-1053 fixes this issue. > UDFs should have API for transparently opening and reading files from HDFS or > from local file system with only relative path > > > Key: PIG-756 > URL: https://issues.apache.org/jira/browse/PIG-756 > Project: Pig > Issue Type: Bug >Reporter: David Ciemiewicz > Fix For: 0.7.0 > > > I have a utility function util.INSETFROMFILE() that I pass a file name during > initialization. > {code} > define inQuerySet util.INSETFROMFILE(analysis/queries); > A = load 'logs' using PigStorage() as ( date int, query chararray ); > B = filter A by inQuerySet(query); > {code} > This provides a computationally inexpensive way to effect map-side joins for > small sets plus functions of this style provide the ability to encapsulate > more complex matching rules. > For rapid development and debugging purposes, I want this code to run without > modification on both my local file system when I do pig -exectype local and > on HDFS. > Pig needs to provide an API for UDFs which allow them to either: > 1) "know" when they are in local or HDFS mode and let them open and read > from files as appropriate > 2) just provide a file name and read statements and have pig transparently > manage local or HDFS opens and reads for the UDF > UDFs need to read configuration information off the filesystem and it > simplifies the process if one can just flip the switch of -exectype local. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-756) UDFs should have API for transparently opening and reading files from HDFS or from local file system with only relative path
[ https://issues.apache.org/jira/browse/PIG-756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12854762#action_12854762 ] Viraj Bhat commented on PIG-756: In Pig 0.7 we have moved local mode of Pig to local mode of Hadoop. https://issues.apache.org/jira/browse/PIG-1053 Closing issue > UDFs should have API for transparently opening and reading files from HDFS or > from local file system with only relative path > > > Key: PIG-756 > URL: https://issues.apache.org/jira/browse/PIG-756 > Project: Pig > Issue Type: Bug >Reporter: David Ciemiewicz > > I have a utility function util.INSETFROMFILE() that I pass a file name during > initialization. > {code} > define inQuerySet util.INSETFROMFILE(analysis/queries); > A = load 'logs' using PigStorage() as ( date int, query chararray ); > B = filter A by inQuerySet(query); > {code} > This provides a computationally inexpensive way to effect map-side joins for > small sets plus functions of this style provide the ability to encapsulate > more complex matching rules. > For rapid development and debugging purposes, I want this code to run without > modification on both my local file system when I do pig -exectype local and > on HDFS. > Pig needs to provide an API for UDFs which allow them to either: > 1) "know" when they are in local or HDFS mode and let them open and read > from files as appropriate > 2) just provide a file name and read statements and have pig transparently > manage local or HDFS opens and reads for the UDF > UDFs need to read configuration information off the filesystem and it > simplifies the process if one can just flip the switch of -exectype local. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-1345) Link casting errors in POCast to actual line numbers in Pig script
Link casting errors in POCast to actual line numbers in Pig script --- Key: PIG-1345 URL: https://issues.apache.org/jira/browse/PIG-1345 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.6.0 Reporter: Viraj Bhat
For easier debugging, it would be nice to find out where in the Pig script the warnings are coming from. Currently the only known process is to comment out lines in the Pig script and see whether the warnings go away.
2010-01-13 21:34:13,697 [main] WARN org.apache.pig.PigServer - Encountered Warning IMPLICIT_CAST_TO_MAP 2 time(s) line 22
2010-01-13 21:34:13,698 [main] WARN org.apache.pig.PigServer - Encountered Warning IMPLICIT_CAST_TO_LONG 2 time(s) line 23
2010-01-13 21:34:13,698 [main] WARN org.apache.pig.PigServer - Encountered Warning IMPLICIT_CAST_TO_BAG 1 time(s). line 26
I think this may require us to keep track of the line numbers of the Pig script (via our javacc parser) and maintain them in the logical and physical plans. It would help users debug simple casting-related errors/warnings.
Is this enhancement listed in http://wiki.apache.org/pig/PigJournal ("Standardize on Parser and Scanner Technology")? Do we need to change the parser to something other than javacc to make this task simpler?
Viraj -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-1343) pig_log file missing even though Main tells it is creating one and an M/R job fails
pig_log file missing even though Main tells it is creating one and an M/R job fails Key: PIG-1343 URL: https://issues.apache.org/jira/browse/PIG-1343 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.6.0 Reporter: Viraj Bhat
I ran into a particular case while running with the latest trunk of Pig.
{code}
$ java -cp pig.jar:/home/path/hadoop20cluster org.apache.pig.Main testcase.pig
[main] INFO org.apache.pig.Main - Logging error messages to: /homes/viraj/pig_1263420012601.log

$ ls -l pig_1263420012601.log
ls: pig_1263420012601.log: No such file or directory
{code}
The job failed and the log file did not contain anything; the only way to debug was to look into the JobTracker logs. Some reasons which could have caused this behavior:
1) The underlying filer/NFS had some issues. In that case, should we not report an error on stdout?
2) There are some errors from the backend which are not being captured.
Viraj -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1341) Cannot convert DataByteArray to Chararray and results in FIELD_DISCARDED_TYPE_CONVERSION_FAILED
[ https://issues.apache.org/jira/browse/PIG-1341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Viraj Bhat updated PIG-1341: Component/s: impl Summary: Cannot convert DataByeArray to Chararray and results in FIELD_DISCARDED_TYPE_CONVERSION_FAILED (was: Cannot convert DataByeArray to Chararray and results in FIELD_DISCARDED_TYPE_CONVERSION_FAILED 20) > Cannot convert DataByeArray to Chararray and results in > FIELD_DISCARDED_TYPE_CONVERSION_FAILED > -- > > Key: PIG-1341 > URL: https://issues.apache.org/jira/browse/PIG-1341 > Project: Pig > Issue Type: Bug > Components: impl >Affects Versions: 0.6.0 >Reporter: Viraj Bhat > > Script reads in BinStorage data and tries to convert a column which is in > DataByteArray to Chararray. > {code} > raw = load 'sampledata' using BinStorage() as (col1,col2, col3); > --filter out null columns > A = filter raw by col1#'bcookie' is not null; > B = foreach A generate col1#'bcookie' as reqcolumn; > describe B; > --B: {regcolumn: bytearray} > X = limit B 5; > dump X; > B = foreach A generate (chararray)col1#'bcookie' as convertedcol; > describe B; > --B: {convertedcol: chararray} > X = limit B 5; > dump X; > {code} > The first dump produces: > (36co9b55onr8s) > (36co9b55onr8s) > (36hilul5oo1q1) > (36hilul5oo1q1) > (36l4cj15ooa8a) > The second dump produces: > () > () > () > () > () > It also throws an error message: FIELD_DISCARDED_TYPE_CONVERSION_FAILED 5 > time(s). > Viraj -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-1341) Cannot convert DataByteArray to Chararray and results in FIELD_DISCARDED_TYPE_CONVERSION_FAILED 20
Cannot convert DataByteArray to Chararray and results in FIELD_DISCARDED_TYPE_CONVERSION_FAILED 20 - Key: PIG-1341 URL: https://issues.apache.org/jira/browse/PIG-1341 Project: Pig Issue Type: Bug Affects Versions: 0.6.0 Reporter: Viraj Bhat
The script reads in BinStorage data and tries to convert a column which is a DataByteArray to Chararray.
{code}
raw = load 'sampledata' using BinStorage() as (col1, col2, col3);
--filter out null columns
A = filter raw by col1#'bcookie' is not null;
B = foreach A generate col1#'bcookie' as reqcolumn;
describe B;
--B: {reqcolumn: bytearray}
X = limit B 5;
dump X;
B = foreach A generate (chararray)col1#'bcookie' as convertedcol;
describe B;
--B: {convertedcol: chararray}
X = limit B 5;
dump X;
{code}
The first dump produces:
(36co9b55onr8s)
(36co9b55onr8s)
(36hilul5oo1q1)
(36hilul5oo1q1)
(36l4cj15ooa8a)
The second dump produces:
()
()
()
()
()
It also throws an error message: FIELD_DISCARDED_TYPE_CONVERSION_FAILED 5 time(s).
Viraj -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
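The cast's contract explains the shape of the failure: when a bytearray-to-chararray conversion cannot be performed, Pig discards the field, emits the FIELD_DISCARDED_TYPE_CONVERSION_FAILED warning, and leaves a null, which is why the second dump shows empty tuples. The dumped values here are plain ASCII, so the root cause in this report likely lies elsewhere (e.g. the runtime type of the map value), but the warn-and-set-null semantics can be sketched in a few lines of Python (an illustrative analogue, not Pig's actual POCast code):

```python
def to_chararray(raw: bytes):
    """Mimic Pig's bytearray -> chararray cast contract: return the
    decoded string, or None (Pig's null) when decoding fails."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Pig would count a FIELD_DISCARDED_TYPE_CONVERSION_FAILED here
        return None

assert to_chararray(b"36co9b55onr8s") == "36co9b55onr8s"  # clean ASCII survives
assert to_chararray(b"\xff\xfe\x01") is None              # undecodable -> null
```

A null produced this way is silent per-record; only the aggregated warning count at the end of the job hints that conversions were dropped.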
[jira] Created: (PIG-1339) International characters in column names not supported
International characters in column names not supported -- Key: PIG-1339 URL: https://issues.apache.org/jira/browse/PIG-1339 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.6.0 Reporter: Viraj Bhat
There is a particular use-case in which someone specifies a column name using international characters.
{code}
inputdata = load '/user/viraj/inputdata.txt' using PigStorage() as (あいうえお);
describe inputdata;
dump inputdata;
{code}
==
Pig Stack Trace
---
ERROR 1000: Error during parsing. Lexical error at line 1, column 64. Encountered: "\u3042" (12354), after : ""
org.apache.pig.impl.logicalLayer.parser.TokenMgrError: Lexical error at line 1, column 64. Encountered: "\u3042" (12354), after : ""
	at org.apache.pig.impl.logicalLayer.parser.QueryParserTokenManager.getNextToken(QueryParserTokenManager.java:1791)
	at org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_scan_token(QueryParser.java:8959)
	at org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_3R_51(QueryParser.java:7462)
	at org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_3R_120(QueryParser.java:7769)
	at org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_3R_106(QueryParser.java:7787)
	at org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_3R_63(QueryParser.java:8609)
	at org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_3R_32(QueryParser.java:8621)
	at org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_3_4(QueryParser.java:8354)
	at org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_2_4(QueryParser.java:6903)
	at org.apache.pig.impl.logicalLayer.parser.QueryParser.BaseExpr(QueryParser.java:1249)
	at org.apache.pig.impl.logicalLayer.parser.QueryParser.Expr(QueryParser.java:911)
	at org.apache.pig.impl.logicalLayer.parser.QueryParser.Parse(QueryParser.java:700)
	at org.apache.pig.impl.logicalLayer.LogicalPlanBuilder.parse(LogicalPlanBuilder.java:63)
	at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1164)
	at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1114)
	at org.apache.pig.PigServer.registerQuery(PigServer.java:425)
	at org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:737)
	at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:324)
	at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:162)
	at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:138)
	at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:89)
	at org.apache.pig.Main.main(Main.java:391)
==
Thanks Viraj -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1308) Infinite loop in JobClient when reading from BinStorage Message: [org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 2]
[ https://issues.apache.org/jira/browse/PIG-1308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Viraj Bhat updated PIG-1308: Description:
A simple script fails to read files from BinStorage() and fails to submit jobs to the JobTracker. This occurs with trunk and not with the Pig 0.6 branch.
{code}
data = load 'binstoragesample' using BinStorage() as (s, m, l);
A = foreach ULT generate s#'key' as value;
X = limit A 20;
dump X;
{code}
When this script is submitted to the JobTracker, we found the following message repeating roughly every 40 seconds:
2010-03-18 22:31:22,296 [main] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 2
2010-03-18 22:32:01,574 [main] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 2
2010-03-18 22:32:43,276 [main] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 2
... (the same line repeats through 2010-03-18 22:39:12,220) ...
The stack trace revealed:
	at org.apache.pig.impl.io.ReadToEndLoader.init(ReadToEndLoader.java:144)
	at org.apache.pig.impl.io.ReadToEndLoader.<init>(ReadToEndLoader.java:115)
	at org.apache.pig.builtin.BinStorage.getSchema(BinStorage.java:404)
	at org.apache.pig.impl.logicalLayer.LOLoad.determineSchema(LOLoad.java:167)
	at org.apache.pig.impl.logicalLayer.LOLoad.getProjectionMap(LOLoad.java:263)
	at org.apache.pig.impl.logicalLayer.ProjectionMapCalculator.visit(ProjectionMapCalculator.java:112)
	at org.apache.pig.impl.logicalLayer.LOLoad.visit(LOLoad.java:210)
	at org.apache.pig.impl.logicalLayer.LOLoad.visit(LOLoad.java:52)
	at org.apache.pig.impl.plan.DependencyOrderWalker.walk(DependencyOrderWalker.java:69)
	at org.apache.pig.impl.plan.PlanVisitor.visit(PlanVisitor.java:51)
	at org.apache.pig.impl.logicalLayer.optimizer.LogicalTransformer.rebuildProjectionMaps(LogicalTransformer.java:76)
	at org.apache.pig.impl.logicalLayer.optimizer.LogicalOptimizer.optimize(LogicalOptimizer.java:216)
	at org.apache.pig.PigServer.compileLp(PigServer.java:883)
	at org.apache.pig.PigServer.store(PigServer.java:564)
The BinStorage data was generated from 2 datasets using limit and union:
{code}
Large1 = load 'input1' using PigStorage();
Large2 = load 'input2' using PigStorage();
V = limit Large1 1;
C = limit Large2 1;
U = union V, C;
store U into 'binstoragesample' using BinStorage();
{code}
was: the same description, with the load path 'binstorage' in place of 'binstoragesample'.
[jira] Created: (PIG-1308) Infinite loop in JobClient when reading from BinStorage Message: [org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 2]
Infinite loop in JobClient when reading from BinStorage Message: [org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 2] Key: PIG-1308 URL: https://issues.apache.org/jira/browse/PIG-1308 Project: Pig Issue Type: Bug Reporter: Viraj Bhat Fix For: 0.7.0
A simple script fails to read files from BinStorage() and fails to submit jobs to the JobTracker. This occurs with trunk and not with the Pig 0.6 branch.
{code}
data = load 'binstorage' using BinStorage() as (s, m, l);
A = foreach ULT generate s#'key' as value;
X = limit A 20;
dump X;
{code}
When this script is submitted to the JobTracker, we found the following message repeating roughly every 40 seconds:
2010-03-18 22:31:22,296 [main] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 2
2010-03-18 22:32:01,574 [main] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 2
2010-03-18 22:32:43,276 [main] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 2
... (the same line repeats through 2010-03-18 22:39:12,220) ...
The stack trace revealed:
	at org.apache.pig.impl.io.ReadToEndLoader.init(ReadToEndLoader.java:144)
	at org.apache.pig.impl.io.ReadToEndLoader.<init>(ReadToEndLoader.java:115)
	at org.apache.pig.builtin.BinStorage.getSchema(BinStorage.java:404)
	at org.apache.pig.impl.logicalLayer.LOLoad.determineSchema(LOLoad.java:167)
	at org.apache.pig.impl.logicalLayer.LOLoad.getProjectionMap(LOLoad.java:263)
	at org.apache.pig.impl.logicalLayer.ProjectionMapCalculator.visit(ProjectionMapCalculator.java:112)
	at org.apache.pig.impl.logicalLayer.LOLoad.visit(LOLoad.java:210)
	at org.apache.pig.impl.logicalLayer.LOLoad.visit(LOLoad.java:52)
	at org.apache.pig.impl.plan.DependencyOrderWalker.walk(DependencyOrderWalker.java:69)
	at org.apache.pig.impl.plan.PlanVisitor.visit(PlanVisitor.java:51)
	at org.apache.pig.impl.logicalLayer.optimizer.LogicalTransformer.rebuildProjectionMaps(LogicalTransformer.java:76)
	at org.apache.pig.impl.logicalLayer.optimizer.LogicalOptimizer.optimize(LogicalOptimizer.java:216)
	at org.apache.pig.PigServer.compileLp(PigServer.java:883)
	at org.apache.pig.PigServer.store(PigServer.java:564)
The BinStorage data was generated from 2 datasets using limit and union:
{code}
Large1 = load 'input1' using PigStorage();
Large2 = load 'input2' using PigStorage();
V = limit Large1 1;
C = limit Large2 1;
U = union V, C;
store U into 'mobilesample' using BinStorage();
{code}
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-1305) Document in Load statement syntax that Pig and underlying M/R does not handle concatenated bz2 and gz files correctly
Document in Load statement syntax that Pig and underlying M/R does not handle concatenated bz2 and gz files correctly -- Key: PIG-1305 URL: https://issues.apache.org/jira/browse/PIG-1305 Project: Pig Issue Type: Bug Components: documentation Reporter: Viraj Bhat Fix For: 0.7.0
The Pig Reference Manual needs to be updated (Relational Operators, Syntax: LOAD 'data' [USING function] [AS schema];, under 'data'):
Please note: Pig reads both bz2 and gz formats correctly as long as they are not concatenated gzip or bz2 files generated in this manner: cat *.bz2 > text/concat.bz2. Your M/R jobs may succeed, but the results will not be accurate.
Viraj -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-1304) Fail underlying M/R jobs when concatenated gzip and bz2 files are provided as input
Fail underlying M/R jobs when concatenated gzip and bz2 files are provided as input --- Key: PIG-1304 URL: https://issues.apache.org/jira/browse/PIG-1304 Project: Pig Issue Type: New Feature Affects Versions: 0.6.0 Reporter: Viraj Bhat
I have the following txt files which are bzipped (\t denotes a tab):
{code}
$ bzcat A.txt.bz2
1\ta
2\taa
$ bzcat B.txt.bz2
1\tb
2\tbb
$ cat *.bz2 > test/mymerge.bz2
$ bzcat test/mymerge.bz2
1\ta
2\taa
1\tb
2\tbb
$ hadoop fs -put test/mymerge.bz2 /user/viraj
{code}
I now write a Pig script to print the values of the bz2 file:
{code}
A = load '/user/viraj/bzipgetmerge/mymerge.bz2' using PigStorage();
dump A;
{code}
I get only the records of the first bz2 file which I concatenated:
(1,a)
(2,aa)
My M/R jobs do not fail or throw any warning about this; they just drop records. Is there a way we can throw a warning or fail the underlying Map job? Can it be done in the Bzip2TextInputFormat class in Pig?
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
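The record-dropping comes from treating the concatenated file as a single bzip2 stream: a decompressor that stops at the first end-of-stream marker silently ignores everything after it. Python's bz2 module can demonstrate both behaviors with the exact data above (this is an analogue of the code path, not Pig's Bzip2TextInputFormat itself):

```python
import bz2

a = bz2.compress(b"1\ta\n2\taa\n")   # A.txt.bz2
b = bz2.compress(b"1\tb\n2\tbb\n")   # B.txt.bz2
merged = a + b                       # cat *.bz2 > test/mymerge.bz2

# A multi-stream-aware reader (like bzcat) recovers all four records.
assert bz2.decompress(merged) == b"1\ta\n2\taa\n1\tb\n2\tbb\n"

# A single-stream decompressor stops at the end of the first stream --
# the silent record-dropping behavior described in this issue.
d = bz2.BZ2Decompressor()
first = d.decompress(merged)
assert first == b"1\ta\n2\taa\n"
assert d.eof and d.unused_data  # trailing bytes were left over, unreported
```

The `unused_data` check also suggests where a warning could be raised: if the decompressor reports end-of-stream while unread bytes remain, the input was a concatenated archive.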
[jira] Created: (PIG-1281) Detect org.apache.pig.data.DataByteArray cannot be cast to org.apache.pig.data.Tuple type of errors at compile time during creation of logical plan
Detect org.apache.pig.data.DataByteArray cannot be cast to org.apache.pig.data.Tuple type of errors at compile time during creation of logical plan --- Key: PIG-1281 URL: https://issues.apache.org/jira/browse/PIG-1281 Project: Pig Issue Type: Improvement Affects Versions: 0.6.0 Reporter: Viraj Bhat Fix For: 0.8.0
This is more of an enhancement request: we could detect simple errors at compile time, during creation of the logical plan, rather than at the backend. I created a script containing an error which gets detected in the backend as a cast error, when in fact it could be detected in the front end (group is a single element, so the group.$0 projection operation will not work).
{code}
inputdata = LOAD '/user/viraj/mymapdata' AS (col1, col2, col3, col4);
projdata = FILTER inputdata BY (col1 is not null);
groupprojdata = GROUP projdata BY col1;
cleandata = FOREACH groupprojdata {
    bagproj = projdata.col1;
    dist_bags = DISTINCT bagproj;
    GENERATE group.$0 as newcol1, COUNT(dist_bags) as newcol2;
};
cleandata1 = GROUP cleandata by newcol2;
cleandata2 = FOREACH cleandata1 {
    GENERATE group.$0 as finalcol1, COUNT(cleandata.newcol1) as finalcol2;
};
ordereddata = ORDER cleandata2 by finalcol2;
store ordereddata into 'finalresult' using PigStorage();
{code}
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
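The requested check is cheap at logical-plan time: after GROUP ... BY col1 the schema of group is a scalar, so a positional projection like group.$0 can be rejected before any job is submitted. A toy sketch of such a front-end check (hypothetical names and schema encoding, not Pig's actual LogicalPlanBuilder API):

```python
def project(field, index):
    """field is a (type_name, subfields) pair; reject a $n projection
    into anything that is not a tuple, at plan-construction time."""
    ftype, subfields = field
    if ftype != "tuple":
        raise TypeError("cannot project $%d into a field of type %s"
                        % (index, ftype))
    return subfields[index]

# Schema of 'group' after GROUP projdata BY col1: a scalar key, not a tuple.
group_field = ("bytearray", None)

try:
    project(group_field, 0)  # the script's group.$0
    failed = False
except TypeError:
    failed = True
assert failed  # the error surfaces at plan construction, not in the backend
```

The same walk over the plan that builds projection maps could run this validation, turning a runtime ClassCastException into a parse-time error message with context.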
[jira] Created: (PIG-1278) Type mismatch in key from map: expected org.apache.pig.impl.io.NullableFloatWritable, recieved org.apache.pig.impl.io.NullableText
Type mismatch in key from map: expected org.apache.pig.impl.io.NullableFloatWritable, recieved org.apache.pig.impl.io.NullableText --- Key: PIG-1278 URL: https://issues.apache.org/jira/browse/PIG-1278 Project: Pig Issue Type: Bug Affects Versions: 0.6.0 Reporter: Viraj Bhat Fix For: 0.7.0
I have a script which uses Map data and runs a UDF which creates random numbers, and then orders the data by these random numbers.
{code}
REGISTER myloader.jar; --jar produced from the source code listed below
REGISTER math.jar;
DEFINE generator math.Random();
inputdata = LOAD '/user/viraj/mymapdata' USING MyMapLoader() AS (s:map[], m:map[], l:map[]);
queries = FILTER inputdata BY m#'key'#'query' IS NOT null;
queries_rand = FOREACH queries GENERATE generator('') AS rand_num, (CHARARRAY) m#'key'#'query' AS query_string;
queries_sorted = ORDER queries_rand BY rand_num PARALLEL 10;
queries_limit = LIMIT queries_sorted 1000;
rand_queries = FOREACH queries_limit GENERATE query_string;
STORE rand_queries INTO 'finalresult';
{code}
UDF source for Random.java:
{code}
package math;

import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.DataType;
import org.apache.pig.data.Tuple;
import org.apache.pig.impl.logicalLayer.schema.Schema;

/*
 * Implements a random float [0,1) generator.
 */
public class Random extends EvalFunc<Float> {
    private final java.util.Random m_rand = new java.util.Random();

    public Float exec(Tuple input) throws IOException {
        return new Float(m_rand.nextFloat());
    }

    public Schema outputSchema(Schema input) {
        final String name = getSchemaName(getClass().getName(), input);
        return new Schema(new Schema.FieldSchema(name, DataType.FLOAT));
    }
}
{code}
Running this script returns the following error in the Mapper:
=
java.io.IOException: Type mismatch in key from map: expected org.apache.pig.impl.io.NullableFloatWritable, recieved org.apache.pig.impl.io.NullableText
	at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:845)
	at org.apache.hadoop.mapred.MapTask$OldOutputCollector.collect(MapTask.java:466)
	at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.collect(PigMapReduce.java:109)
	at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:255)
	at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:244)
	at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.map(PigMapReduce.java:94)
	at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
	at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
	at org.apache.hadoop.mapred.Child.main(Child.java:159)
=
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
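Setting the type-mismatch bug aside, the sampling pattern the script uses — tag each record with a random key, ORDER by it, then LIMIT — is a standard way to draw a uniform random sample. The same idea in Python (illustrative only; the data and names are made up):

```python
import random

rng = random.Random(42)  # seeded so the run is reproducible
queries = ["query%d" % i for i in range(10_000)]

# Pair each record with a random key, sort by it, keep the first 1000:
# the ORDER ... BY rand_num + LIMIT 1000 pattern from the Pig script.
tagged = [(rng.random(), q) for q in queries]
tagged.sort(key=lambda kv: kv[0])
sample = [q for _, q in tagged[:1000]]

assert len(sample) == 1000
assert len(set(sample)) == 1000  # no duplicates: each record drawn at most once
```

In a distributed setting this is exactly why the random key's declared type matters: the sort key crosses the map/reduce boundary, so a key that arrives as text when a float was declared produces the mismatch above.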
[jira] Commented: (PIG-1272) Column pruner causes wrong results
[ https://issues.apache.org/jira/browse/PIG-1272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12840389#action_12840389 ] Viraj Bhat commented on PIG-1272: - Now with Pig 0.7 or trunk we have the following error: 2010-03-02 23:35:09,349 FATAL org.apache.hadoop.mapred.Child: Error running child : java.lang.NoSuchFieldError: sJobConf at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POJoinPackage.getNext(POJoinPackage.java:110) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.processOnePackageOutput(PigMapReduce.java:380) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.reduce(PigMapReduce.java:363) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.reduce(PigMapReduce.java:240) at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:176) at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:567) at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:409) at org.apache.hadoop.mapred.Child.main(Child.java:159) Viraj > Column pruner causes wrong results > -- > > Key: PIG-1272 > URL: https://issues.apache.org/jira/browse/PIG-1272 > Project: Pig > Issue Type: Bug > Components: impl >Affects Versions: 0.6.0 >Reporter: Viraj Bhat >Assignee: Daniel Dai > Fix For: 0.7.0 > > > For a simple script the column pruner optimization removes certain columns > from the original relation, which results in wrong results. 
> Input file "kv" contains the following columns (tab separated) > {code} > a 1 > a 2 > a 3 > b 4 > c 5 > c 6 > b 7 > d 8 > {code} > Now running this script in Pig 0.6 produces > {code} > kv = load 'kv' as (k,v); > keys= foreach kv generate k; > keys = distinct keys; > keys = limit keys 2; > rejoin = join keys by k, kv by k; > dump rejoin; > {code} > (a,a) > (a,a) > (a,a) > (b,b) > (b,b) > Running this in Pig 0.5 version without column pruner results in: > (a,a,1) > (a,a,2) > (a,a,3) > (b,b,4) > (b,b,7) > When we disable the "ColumnPruner" optimization it gives right results. > Viraj -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-1272) Column pruner causes wrong results
Column pruner causes wrong results -- Key: PIG-1272 URL: https://issues.apache.org/jira/browse/PIG-1272 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.6.0 Reporter: Viraj Bhat Fix For: 0.7.0
For a simple script, the column pruner optimization removes certain columns from the original relation, which results in wrong results.
Input file "kv" contains the following columns (tab separated):
{code}
a 1
a 2
a 3
b 4
c 5
c 6
b 7
d 8
{code}
Running this script in Pig 0.6 produces:
{code}
kv = load 'kv' as (k,v);
keys = foreach kv generate k;
keys = distinct keys;
keys = limit keys 2;
rejoin = join keys by k, kv by k;
dump rejoin;
{code}
(a,a)
(a,a)
(a,a)
(b,b)
(b,b)
Running this in Pig 0.5, without the column pruner, results in:
(a,a,1)
(a,a,2)
(a,a,3)
(b,b,4)
(b,b,7)
When we disable the "ColumnPruner" optimization it gives the right results.
Viraj -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
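For reference, the expected join semantics can be checked in a few lines of Python: distinct keys, limit 2, then an inner join back against kv must keep kv's value column (assuming, as in the reported run, that LIMIT happens to pick 'a' and 'b'):

```python
kv = [("a", 1), ("a", 2), ("a", 3), ("b", 4),
      ("c", 5), ("c", 6), ("b", 7), ("d", 8)]

# keys = distinct k; keys = limit keys 2  (dict preserves first-seen order)
keys = list(dict.fromkeys(k for k, _ in kv))[:2]

# rejoin = join keys by k, kv by k -- every kv column must survive the join
rejoin = [(k, k2, v) for k in keys for (k2, v) in kv if k2 == k]

assert keys == ["a", "b"]
assert rejoin == [("a", "a", 1), ("a", "a", 2), ("a", "a", 3),
                  ("b", "b", 4), ("b", "b", 7)]
```

The Pig 0.6 output (a,a), (b,b) shows the pruner dropped kv's v column even though the join projects it, which is what makes this a correctness bug rather than an optimization.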
[jira] Commented: (PIG-1252) Diamond splitter does not generate correct results when using Multi-query optimization
[ https://issues.apache.org/jira/browse/PIG-1252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12840339#action_12840339 ] Viraj Bhat commented on PIG-1252: - A modified version of the script works, does this have to do with nested foreach? {code} loadData = load '/user/viraj/zebradata' using org.apache.hadoop.zebra.pig.TableLoader('col1,col2, col3, col4, col5, col6, col7'); prjData = FOREACH loadData GENERATE (chararray) col1, (chararray) col2, (chararray) col3, (chararray) ((col4 is not null and col4 != '') ? col4 : ((col5 is not null) ? col5 : '')) as splitcond, (chararray) (col6 == 'c' ? 1 : IS_VALID ('200', '0', '0', 'input.txt')) as validRec; SPLIT prjData INTO trueDataTmp IF (validRec == '1' AND splitcond != ''), falseDataTmp IF (validRec == '1' AND splitcond == ''); grpData = GROUP trueDataTmp BY splitcond; finalData = FOREACH grpData GENERATE FLATTEN ( MYUDF (orderedData, 60, 1800, 'input.txt', 'input.dat','20100222','5', 'debug_on')) as (s,m,l); dump finalData; {code} > Diamond splitter does not generate correct results when using Multi-query > optimization > -- > > Key: PIG-1252 > URL: https://issues.apache.org/jira/browse/PIG-1252 > Project: Pig > Issue Type: Bug >Affects Versions: 0.6.0 >Reporter: Viraj Bhat >Assignee: Richard Ding > Fix For: 0.7.0 > > > I have script which uses split but somehow does not use one of the split > branch. The skeleton of the script is as follows > {code} > loadData = load '/user/viraj/zebradata' using > org.apache.hadoop.zebra.pig.TableLoader('col1,col2, col3, col4, col5, col6, > col7'); > prjData = FOREACH loadData GENERATE (chararray) col1, (chararray) col2, > (chararray) col3, (chararray) ((col4 is not null and col4 != '') ? col4 : > ((col5 is not null) ? col5 : '')) as splitcond, (chararray) (col6 == 'c' ? 
1 > : IS_VALID ('200', '0', '0', 'input.txt')) as validRec; > SPLIT prjData INTO trueDataTmp IF (validRec == '1' AND splitcond != ''), > falseDataTmp IF (validRec == '1' AND splitcond == ''); > grpData = GROUP trueDataTmp BY splitcond; > finalData = FOREACH grpData { >orderedData = ORDER trueDataTmp BY col1,col2; >GENERATE FLATTEN ( MYUDF (orderedData, 60, > 1800, 'input.txt', 'input.dat','20100222','5', 'debug_on')) as (s,m,l); > } > dump finalData; > {code} > You can see that "falseDataTmp" is untouched. > When I run this script with the no-multiquery (-M) option I get the right result. > This could be the result of complex BinConds in the POLoad. We can get rid > of this error by using FILTER instead of SPLIT. > Viraj -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1263) Script producing varying number of records when COGROUPing value of map data type with and without types
[ https://issues.apache.org/jira/browse/PIG-1263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Viraj Bhat updated PIG-1263: Description: I have a Pig script which I am experimenting upon. [[Albeit this is not optimized and can be done in variety of ways]] I get different record counts by placing load store pairs in the script. Case 1: Returns 424329 records Case 2: Returns 5859 records Case 3: Returns 5859 records Case 4: Returns 5578 records I am wondering what the correct result is? Here are the scripts. Case 1: {code} register udf.jar A = LOAD '/user/viraj/data/20100203' USING MapLoader() AS (s, m, l); B = FOREACH A GENERATE s#'key1' as key1, s#'key2' as key2; C = FOREACH B generate key2; D = filter C by (key2 IS NOT null); E = distinct D; store E into 'unique_key_list' using PigStorage('\u0001'); F = Foreach E generate key2, MapGenerate(key2) as m; G = FILTER F by (m IS NOT null); H = foreach G generate key2, m#'id1' as id1, m#'id2' as id2, m#'id3' as id3, m#'id4' as id4, m#'id5' as id5, m#'id6' as id6, m#'id7' as id7, m#'id8' as id8, m#'id9' as id9, m#'id10' as id10, m#'id11' as id11, m#'id12' as id12; I = GROUP H BY (id1, id2, id3, id4, id5, id6, id7, id8, id9, id10, id11, id12); J = Foreach I generate group.id1 as id1, group.id2 as id2, group.id3 as id3, group.id4 as id4,group.id5 as id5, group.id6 as id6, group.id7 as id7, group.id8 as id8, group.id9 as id9, group.id10 as id10, group.id11 as id11, group.id12 as id12; --load previous days data K = LOAD '/user/viraj/data/20100202' USING PigStorage('\u0001') as (id1, id2, id3, id4, id5, id6, id7, id8, id9, id10, id11, id12); L = COGROUP K by (id1, id2, id3, id4, id5, id6, id7, id8, id9, id10, id11, id12) OUTER, J by (id1, id2, id3, id4, id5, id6, id7, id8, id9, id10, id11, id12) OUTER; M = filter L by IsEmpty(K); store M into 'cogroupNoTypes' using PigStorage(); {code} Case 2: Storing and loading intermediate results in J {code} register udf.jar A = LOAD '/user/viraj/data/20100203' USING 
MapLoader() AS (s, m, l); B = FOREACH A GENERATE s#'key1' as key1, s#'key2' as key2; C = FOREACH B generate key2; D = filter C by (key2 IS NOT null); E = distinct D; store E into 'unique_key_list' using PigStorage('\u0001'); F = Foreach E generate key2, MapGenerate(key2) as m; G = FILTER F by (m IS NOT null); H = foreach G generate key2, m#'id1' as id1, m#'id2' as id2, m#'id3' as id3, m#'id4' as id4, m#'id5' as id5, m#'id6' as id6, m#'id7' as id7, m#'id8' as id8, m#'id9' as id9, m#'id10' as id10, m#'id11' as id11, m#'id12' as id12; I = GROUP H BY (id1, id2, id3, id4, id5, id6, id7, id8, id9, id10, id11, id12); J = Foreach I generate group.id1 as id1, group.id2 as id2, group.id3 as id3, group.id4 as id4,group.id5 as id5, group.id6 as id6, group.id7 as id7, group.id8 as id8, group.id9 as id9, group.id10 as id10, group.id11 as id11, group.id12 as id12; --store intermediate data to HDFS and re-read store J into 'output/20100203/J' using PigStorage('\u0001'); --load previous days data K = LOAD '/user/viraj/data/20100202' USING PigStorage('\u0001') as (id1, id2, id3, id4, id5, id6, id7, id8, id9, id10, id11, id12); --read J into K1 K1 = LOAD 'output/20100203/J' using PigStorage('\u0001') as (id1, id2, id3, id4, id5, id6, id7, id8, id9, id10, id11, id12); L = COGROUP K by (id1, id2, id3, id4, id5, id6, id7, id8, id9, id10, id11, id12) OUTER, K1 by (id1, id2, id3, id4, id5, id6, id7, id8, id9, id10, id11, id12) OUTER; M = filter L by IsEmpty(K); store M into 'cogroupNoTypesIntStore' using PigStorage(); {code} Case 3: Types information specified but no intermediate store of J {code} register udf.jar A = LOAD '/user/viraj/data/20100203' USING MapLoader() AS (s, m, l); B = FOREACH A GENERATE s#'key1' as key1, s#'key2' as key2; C = FOREACH B generate key2; D = filter C by (key2 IS NOT null); E = distinct D; store E into 'unique_key_list' using PigStorage('\u0001'); F = Foreach E generate key2, MapGenerate(key2) as m; G = FILTER F by (m IS NOT null); H = foreach G generate 
key2, (long)m#'id1' as id1, (long)m#'id2' as id2, (long)m#'id3' as id3, (long)m#'id4' as id4, (long)m#'id5' as id5, (long)m#'id6' as id6, (long)m#'id7' as id7, (chararray)m#'id8' as id8, (chararray)m#'id9' as id9, (chararray)m#'id10' as id10, (chararray)m#'id11' as id11, (chararray)m#'id12' as id12; I = GROUP H BY (id1, id2, id3, id4, id5, id6, id7, id8, id9, id10, id11, id12); J = Foreach I generate group.id1 as id1, group.id2 as id2, group.id3 as id3, group.id4 as id4,group.id5 as id5, group.id6 as id6, group.id7 as id7, group.id8 as id8, group.id9 as id9, group.id10 as id10, group.id11 as id11, group.id12 as id12; store J into 'output/20100203/J' using PigStorage('\u0001'); --load previous days data with type information K = LOAD '/user/viraj/data/20100202' USING PigStorage('\u0001') as (
[jira] Created: (PIG-1263) Script producing varying number of records when COGROUPing value of map data type with and without types
Script producing varying number of records when COGROUPing value of map data type with and without types Key: PIG-1263 URL: https://issues.apache.org/jira/browse/PIG-1263 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.6.0 Reporter: Viraj Bhat Fix For: 0.6.0 I have a Pig script which I am experimenting upon. [[Albeit this is not optimized and can be done in variety of ways]] I get different record counts by placing load store pairs in the script. Case 1: Returns 424329 records Case 2: Returns 5859 records Case 3: Returns 5859 records Case 4: Returns 5578 records I am wondering what the correct result is? Here are the scripts. Case 1: {code} register udf.jar A = LOAD '/user/viraj/data/20100203' USING MapLoader() AS (s, m, l); B = FOREACH A GENERATE s#'key1' as key1, s#'key2' as key2; C = FOREACH B generate key2; D = filter C by (key2 IS NOT null); E = distinct D; store E into 'unique_key_list' using PigStorage('\u0001'); F = Foreach E generate key2, MapGenerate(key2) as m; G = FILTER F by (m IS NOT null); H = foreach G generate key2, m#'id1' as id1, m#'id2' as id2, m#'id3' as id3, m#'id4' as id4, m#'id5' as id5, m#'id6' as id6, m#'id7' as id7, m#'id8' as id8, m#'id9' as id9, m#'id10' as id10, m#'id11' as id11, m#'id12' as id12; I = GROUP H BY (id1, id2, id3, id4, id5, id6, id7, id8, id9, id10, id11, id12); J = Foreach I generate group.id1 as id1, group.id2 as id2, group.id3 as id3, group.id4 as id4,group.id5 as id5, group.id6 as id6, group.id7 as id7, group.id8 as id8, group.id9 as id9, group.id10 as id10, group.id11 as id11, group.id12 as id12; --load previous days data K = LOAD '/user/viraj/data/20100202' USING PigStorage('\u0001') as (id1, id2, id3, id4, id5, id6, id7, id8, id9, id10, id11, id12); L = COGROUP K by (id1, id2, id3, id4, id5, id6, id7, id8, id9, id10, id11, id12) OUTER, J by (id1, id2, id3, id4, id5, id6, id7, id8, id9, id10, id11, id12) OUTER; M = filter L by IsEmpty(K); store M into 'cogroupNoTypes' using PigStorage(); {code} 
Case 2: Storing and loading intermediate results in J {code} register udf.jar A = LOAD '/user/viraj/data/20100203' USING MapLoader() AS (s, m, l); B = FOREACH A GENERATE s#'key1' as key1, s#'key2' as key2; C = FOREACH B generate key2; D = filter C by (key2 IS NOT null); E = distinct D; store E into 'unique_key_list' using PigStorage('\u0001'); F = Foreach E generate key2, MapGenerate(key2) as m; G = FILTER F by (m IS NOT null); H = foreach G generate key2, m#'id1' as id1, m#'id2' as id2, m#'id3' as id3, m#'id4' as id4, m#'id5' as id5, m#'id6' as id6, m#'id7' as id7, m#'id8' as id8, m#'id9' as id9, m#'id10' as id10, m#'id11' as id11, m#'id12' as id12; I = GROUP H BY (id1, id2, id3, id4, id5, id6, id7, id8, id9, id10, id11, id12); J = Foreach I generate group.id1 as id1, group.id2 as id2, group.id3 as id3, group.id4 as id4,group.id5 as id5, group.id6 as id6, group.id7 as id7, group.id8 as id8, group.id9 as id9, group.id10 as id10, group.id11 as id11, group.id12 as id12; --store intermediate data to HDFS and re-read store J into 'output/20100203/J' using PigStorage('\u0001'); --load previous days data K = LOAD '/user/viraj/data/20100202' USING PigStorage('\u0001') as (id1, id2, id3, id4, id5, id6, id7, id8, id9, id10, id11, id12); --read J into K1 K1 = LOAD 'output/20100203/J' using PigStorage('\u0001') as (id1, id2, id3, id4, id5, id6, id7, id8, id9, id10, id11, id12); L = COGROUP K by (id1, id2, id3, id4, id5, id6, id7, id8, id9, id10, id11, id12) OUTER, K1 by (id1, id2, id3, id4, id5, id6, id7, id8, id9, id10, id11, id12) OUTER; M = filter L by IsEmpty(K); store M into 'cogroupNoTypesIntStore' using PigStorage(); {code} Case 3: Types information specified but no intermediate store of J {code} register udf.jar A = LOAD '/user/viraj/data/20100203' USING MapLoader() AS (s, m, l); B = FOREACH A GENERATE s#'key1' as key1, s#'key2' as key2; C = FOREACH B generate key2; D = filter C by (key2 IS NOT null); E = distinct D; store E into 'unique_key_list' using 
PigStorage('\u0001'); F = Foreach E generate key2, MapGenerate(key2) as m; G = FILTER F by (m IS NOT null); H = foreach G generate key2, (long)m#'id1' as id1, (long)m#'id2' as id2, (long)m#'id3' as id3, (long)m#'id4' as id4, (long)m#'id5' as id5, (long)m#'id6' as id6, (long)m#'id7' as id7, (chararray)m#'id8' as id8, (chararray)m#'id9' as id9, (chararray)m#'id10' as id10, (chararray)m#'id11' as id11, (chararray)m#'id12' as id12; I = GROUP H BY (id1, id2, id3, id4, id5, id6, id7, id8, id9, id10, id11, id12); J = Foreach I generate group.id1 as id1, group.id2 as id2, group.id3 as id3, group.id4 as id4,group.id5 as id5, group.id6 as id6, group.id7 as id7,
[jira] Updated: (PIG-1252) Diamond splitter does not generate correct results when using Multi-query optimization
[ https://issues.apache.org/jira/browse/PIG-1252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Viraj Bhat updated PIG-1252: Description: I have a script which uses split but somehow does not use one of the split branches. The skeleton of the script is as follows: {code} loadData = load '/user/viraj/zebradata' using org.apache.hadoop.zebra.pig.TableLoader('col1,col2, col3, col4, col5, col6, col7'); prjData = FOREACH loadData GENERATE (chararray) col1, (chararray) col2, (chararray) col3, (chararray) ((col4 is not null and col4 != '') ? col4 : ((col5 is not null) ? col5 : '')) as splitcond, (chararray) (col6 == 'c' ? 1 : IS_VALID ('200', '0', '0', 'input.txt')) as validRec; SPLIT prjData INTO trueDataTmp IF (validRec == '1' AND splitcond != ''), falseDataTmp IF (validRec == '1' AND splitcond == ''); grpData = GROUP trueDataTmp BY splitcond; finalData = FOREACH grpData { orderedData = ORDER trueDataTmp BY col1,col2; GENERATE FLATTEN ( MYUDF (orderedData, 60, 1800, 'input.txt', 'input.dat','20100222','5', 'debug_on')) as (s,m,l); } dump finalData; {code} You can see that "falseDataTmp" is untouched. When I run this script with the no-multiquery (-M) option I get the right result. This could be the result of complex BinConds in the POLoad. We can get rid of this error by using FILTER instead of SPLIT. Viraj was: I have script which uses split but somehow does not use one of the split branch. The skeleton of the script is as follows {code} loadData = load '/user/viraj/zebradata' using org.apache.hadoop.zebra.pig.TableLoader('col1,col2, col3, col4, col5, col6, col7, col7'); prjData = FOREACH loadData GENERATE (chararray) col1, (chararray) col2, (chararray) col3, (chararray) ((col4 is not null and col4 != '') ? col4 : ((col5 is not null) ? col5 : '')) as splitcond, (chararray) (col6 == 'c' ? 
1 : IS_VALID ('200', '0', '0', 'input.txt')) as validRec; SPLIT prjData INTO trueDataTmp IF (validRec == '1' AND splitcond != ''), falseDataTmp IF (validRec == '1' AND splitcond == ''); grpData = GROUP trueDataTmp BY splitcond; finalData = FOREACH grpData { orderedData = ORDER trueDataTmp BY col1,col2; GENERATE FLATTEN ( MYUDF (orderedData, 60, 1800, 'input.txt', 'input.dat','20100222','5', 'debug_on')) as (s,m,l); } dump finalData; {code} You can see that "falseDataTmp" is untouched. When I run this script with no-Multiquery (-M) option I get the right result. This could be the result of complex BinCond's in the POLoad. We can get rid of this error by using FILTER instead of SPIT. Viraj > Diamond splitter does not generate correct results when using Multi-query > optimization > -- > > Key: PIG-1252 > URL: https://issues.apache.org/jira/browse/PIG-1252 > Project: Pig > Issue Type: Bug >Affects Versions: 0.6.0 >Reporter: Viraj Bhat > Fix For: 0.7.0 > > > I have a script which uses split but somehow does not use one of the split > branches. The skeleton of the script is as follows: > {code} > loadData = load '/user/viraj/zebradata' using > org.apache.hadoop.zebra.pig.TableLoader('col1,col2, col3, col4, col5, col6, > col7'); > prjData = FOREACH loadData GENERATE (chararray) col1, (chararray) col2, > (chararray) col3, (chararray) ((col4 is not null and col4 != '') ? col4 : > ((col5 is not null) ? col5 : '')) as splitcond, (chararray) (col6 == 'c' ? 1 > : IS_VALID ('200', '0', '0', 'input.txt')) as validRec; > SPLIT prjData INTO trueDataTmp IF (validRec == '1' AND splitcond != ''), > falseDataTmp IF (validRec == '1' AND splitcond == ''); > grpData = GROUP trueDataTmp BY splitcond; > finalData = FOREACH grpData { >orderedData = ORDER trueDataTmp BY col1,col2; >GENERATE FLATTEN ( MYUDF (orderedData, 60, > 1800, 'input.txt', 'input.dat','20100222','5', 'debug_on')) as (s,m,l); > } > dump finalData; > {code} > You can see that "falseDataTmp" is untouched. 
> When I run this script with the no-multiquery (-M) option I get the right result. > This could be the result of complex BinConds in the POLoad. We can get rid > of this error by using FILTER instead of SPLIT. > Viraj -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-1252) Diamond splitter does not generate correct results when using Multi-query optimization
Diamond splitter does not generate correct results when using Multi-query optimization -- Key: PIG-1252 URL: https://issues.apache.org/jira/browse/PIG-1252 Project: Pig Issue Type: Bug Affects Versions: 0.6.0 Reporter: Viraj Bhat Fix For: 0.7.0 I have a script which uses split but somehow does not use one of the split branches. The skeleton of the script is as follows: {code} loadData = load '/user/viraj/zebradata' using org.apache.hadoop.zebra.pig.TableLoader('col1,col2, col3, col4, col5, col6, col7, col7'); prjData = FOREACH loadData GENERATE (chararray) col1, (chararray) col2, (chararray) col3, (chararray) ((col4 is not null and col4 != '') ? col4 : ((col5 is not null) ? col5 : '')) as splitcond, (chararray) (col6 == 'c' ? 1 : IS_VALID ('200', '0', '0', 'input.txt')) as validRec; SPLIT prjData INTO trueDataTmp IF (validRec == '1' AND splitcond != ''), falseDataTmp IF (validRec == '1' AND splitcond == ''); grpData = GROUP trueDataTmp BY splitcond; finalData = FOREACH grpData { orderedData = ORDER trueDataTmp BY col1,col2; GENERATE FLATTEN ( MYUDF (orderedData, 60, 1800, 'input.txt', 'input.dat','20100222','5', 'debug_on')) as (s,m,l); } dump finalData; {code} You can see that "falseDataTmp" is untouched. When I run this script with the no-multiquery (-M) option I get the right result. This could be the result of complex BinConds in the POLoad. We can get rid of this error by using FILTER instead of SPLIT. Viraj -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
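The FILTER-instead-of-SPLIT workaround mentioned above rests on the fact that SPLIT is logically just several independent filters over the same input. A minimal Python sketch of that equivalence (the relation names mirror the Pig aliases, and the sample tuples, reduced to (validRec, splitcond) pairs, are hypothetical):

```python
# Model SPLIT rel INTO a IF cond1, b IF cond2 as independent filters.
def pig_split(rows, **branches):
    """Evaluate each branch condition separately over the same input rows."""
    return {name: [r for r in rows if cond(r)] for name, cond in branches.items()}

# Hypothetical (validRec, splitcond) pairs standing in for prjData.
prj_data = [("1", "x"), ("1", ""), ("0", "y")]
branches = pig_split(
    prj_data,
    trueDataTmp=lambda r: r[0] == "1" and r[1] != "",
    falseDataTmp=lambda r: r[0] == "1" and r[1] == "",
)
# branches["trueDataTmp"]  -> [("1", "x")]
# branches["falseDataTmp"] -> [("1", "")]
```

Because each branch is its own filter, rewriting the script with two FILTER statements sidesteps the multi-query diamond that triggers the bug.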
[jira] Created: (PIG-1247) Error Number makes it hard to debug: ERROR 2999: Unexpected internal error. org.apache.pig.backend.datastorage.DataStorageException cannot be cast to java.lang.Error
Error Number makes it hard to debug: ERROR 2999: Unexpected internal error. org.apache.pig.backend.datastorage.DataStorageException cannot be cast to java.lang.Error - Key: PIG-1247 URL: https://issues.apache.org/jira/browse/PIG-1247 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.6.0 Reporter: Viraj Bhat Fix For: 0.7.0 I have a large script in which there are intermediate stores statements, one of them writes to a directory I do not have permission to write to. The stack trace I get from Pig is this: 2010-02-20 02:16:32,055 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2999: Unexpected internal error. org.apache.pig.backend.datastorage.DataStorageException cannot be cast to java.lang.Error Details at logfile: /home/viraj/pig_1266632145355.log Pig Stack Trace --- ERROR 2999: Unexpected internal error. org.apache.pig.backend.datastorage.DataStorageException cannot be cast to java.lang.Error java.lang.ClassCastException: org.apache.pig.backend.datastorage.DataStorageException cannot be cast to java.lang.Error at org.apache.pig.impl.logicalLayer.parser.QueryParser.StoreClause(QueryParser.java:3583) at org.apache.pig.impl.logicalLayer.parser.QueryParser.BaseExpr(QueryParser.java:1407) at org.apache.pig.impl.logicalLayer.parser.QueryParser.Expr(QueryParser.java:949) at org.apache.pig.impl.logicalLayer.parser.QueryParser.Parse(QueryParser.java:762) at org.apache.pig.impl.logicalLayer.LogicalPlanBuilder.parse(LogicalPlanBuilder.java:63) at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1036) at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:986) at org.apache.pig.PigServer.registerQuery(PigServer.java:386) at org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:720) at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:324) at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:168) at 
org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:144) at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:89) at org.apache.pig.Main.main(Main.java:386) The only way to find the error was to look at the javacc-generated QueryParser.java code and add a System.out.println(). Here is a script to reproduce the problem: {code} A = load '/user/viraj/three.txt' using PigStorage(); B = foreach A generate ['a'#'12'] as b:map[] ; store B into '/user/secure/pigtest' using PigStorage(); {code} "three.txt" has 3 lines which contain nothing but the number "1". {code} $ hadoop fs -ls /user/secure/ ls: could not get get listing for 'hdfs://mynamenode/user/secure' : org.apache.hadoop.security.AccessControlException: Permission denied: user=viraj, access=READ_EXECUTE, inode="secure":secure:users:rwx-- {code} Viraj -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-1243) Passing Complex map types to and from streaming causes a problem
Passing Complex map types to and from streaming causes a problem Key: PIG-1243 URL: https://issues.apache.org/jira/browse/PIG-1243 Project: Pig Issue Type: Bug Affects Versions: 0.6.0 Reporter: Viraj Bhat Fix For: 0.7.0 I have a program which generates different types of map fields and stores them using PigStorage. {code} A = load '/user/viraj/three.txt' using PigStorage(); B = foreach A generate ['a'#'12'] as b:map[], ['b'#['c'#'12']] as c, ['c'#{(['d'#'15']),(['e'#'16'])}] as d; store B into '/user/viraj/pigtest' using PigStorage(); {code} Now I test the previous output with the script below to make sure I have the right results. I also pass this data to a Perl script, and I observe that the complex map types I have generated are lost when I get the result back. {code} DEFINE CMD `simple.pl` SHIP('simple.pl'); A = load '/user/viraj/pigtest' using PigStorage() as (simpleFields, mapFields, mapListFields); B = foreach A generate $0, $1, $2; dump B; C = foreach A generate (chararray)simpleFields#'a' as value, $0,$1,$2; D = stream C through CMD as (a0:map[], a1:map[], a2:map[]); dump D; {code} dumping B results in: ([a#12],[b#[c#12]],[c#{([d#15]),([e#16])}]) ([a#12],[b#[c#12]],[c#{([d#15]),([e#16])}]) ([a#12],[b#[c#12]],[c#{([d#15]),([e#16])}]) dumping D results in: ([a#12],,) ([a#12],,) ([a#12],,) The Perl script used here is: {code} #!/usr/local/bin/perl use warnings; use strict; while(<>) { my($bc,$s,$m,$l)=split/\t/; print("$s\t$m\t$l"); } {code} Is there an issue with the handling of complex map fields within streaming? How can I fix this to obtain the right result? Viraj -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
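To see that the Perl side of the pipeline preserves the map literals, its tab handling can be imitated in Python. This is a sketch under the assumption that Pig serializes each tuple of C as one tab-separated line; if the text survives the round trip unchanged, the nulls in D must come from Pig failing to re-parse the nested map syntax in the stream's output, not from the script:

```python
# Mimic simple.pl: split each line on tabs, drop the first field, echo the rest.
def stream_line(line):
    fields = line.rstrip("\n").split("\t")
    return "\t".join(fields[1:])   # drops $bc, prints "$s\t$m\t$l"

# One serialized tuple of C: (value, simpleFields, mapFields, mapListFields).
row = "12\t[a#12]\t[b#[c#12]]\t[c#{([d#15]),([e#16])}]"
out = stream_line(row)
# The nested map and bag-of-maps literals come back byte-for-byte intact.
```

That the literals survive verbatim points the investigation at the deserialization of the `as (a0:map[], a1:map[], a2:map[])` clause rather than at the streaming command itself.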
[jira] Reopened: (PIG-1194) ERROR 2055: Received Error while processing the map plan
[ https://issues.apache.org/jira/browse/PIG-1194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Viraj Bhat reopened PIG-1194: - Hi Richard, I ran the script attached on the ticket and found out that the map tasks fails with the following error: org.apache.pig.backend.executionengine.ExecException: ERROR 2055: Received Error while processing the map plan. at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:281) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:244) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.map(PigMapReduce.java:94) at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50) at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307) at org.apache.hadoop.mapred.Child.main(Child.java:159) I am using the latest pig.jar without hadoop. Viraj > ERROR 2055: Received Error while processing the map plan > > > Key: PIG-1194 > URL: https://issues.apache.org/jira/browse/PIG-1194 > Project: Pig > Issue Type: Bug > Components: impl >Affects Versions: 0.5.0, 0.6.0 >Reporter: Viraj Bhat >Assignee: Richard Ding > Fix For: 0.7.0 > > Attachments: inputdata.txt, PIG-1194.patch, PIG-1194.patch > > > I have a simple Pig script which takes 3 columns out of which one is null. > {code} > input = load 'inputdata.txt' using PigStorage() as (col1, col2, col3); > a = GROUP input BY (((double) col3)/((double) col2) > .001 OR col1 < 11 ? > col1 : -1); > b = FOREACH a GENERATE group as col1, SUM(input.col2) as col2, > SUM(input.col3) as col3; > store b into 'finalresult'; > {code} > When I run this script I get the following error: > ERROR 2055: Received Error while processing the map plan. > org.apache.pig.backend.executionengine.ExecException: ERROR 2055: Received > Error while processing the map plan. 
> at > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:277) > at > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:240) > at > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.map(PigMapReduce.java:93) > at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50) > at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358) > at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307) > > A more useful error message for the purpose of debugging would be helpful. > Viraj -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1131) Pig simple join does not work when it contains empty lines
[ https://issues.apache.org/jira/browse/PIG-1131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12831251#action_12831251 ] Viraj Bhat commented on PIG-1131: - Ashutosh I was able to recreate a similar problem using the trunk. java -cp pig-withouthadoop.jar org.apache.pig.Main -version Apache Pig version 0.7.0-dev (r907874) compiled Feb 08 2010, 17:35:04 Viraj > Pig simple join does not work when it contains empty lines > -- > > Key: PIG-1131 > URL: https://issues.apache.org/jira/browse/PIG-1131 > Project: Pig > Issue Type: Bug > Components: impl >Affects Versions: 0.7.0 >Reporter: Viraj Bhat >Assignee: Ashutosh Chauhan > Fix For: 0.7.0 > > Attachments: junk1.txt, junk2.txt, simplejoinscript.pig > > > I have a simple script, which does a JOIN. > {code} > input1 = load '/user/viraj/junk1.txt' using PigStorage(' '); > describe input1; > input2 = load '/user/viraj/junk2.txt' using PigStorage('\u0001'); > describe input2; > joineddata = JOIN input1 by $0, input2 by $0; > describe joineddata; > store joineddata into 'result'; > {code} > The input data contains empty lines. 
> The join fails in the Map phase with the following error in the > PRLocalRearrange.java > java.lang.IndexOutOfBoundsException: Index: 1, Size: 1 > at java.util.ArrayList.RangeCheck(ArrayList.java:547) > at java.util.ArrayList.get(ArrayList.java:322) > at org.apache.pig.data.DefaultTuple.get(DefaultTuple.java:143) > at > org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.constructLROutput(POLocalRearrange.java:464) > at > org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.getNext(POLocalRearrange.java:360) > at > org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POUnion.getNext(POUnion.java:162) > at > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:253) > at > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:244) > at > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.map(PigMapReduce.java:94) > at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50) > at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358) > at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307) > at org.apache.hadoop.mapred.Child.main(Child.java:159) > I am surprised that the test cases did not detect this error. Could we add > this data which contains empty lines to the testcases? > Viraj -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
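A plausible mechanism, offered here as an assumption rather than something the stack trace alone confirms, is that an empty line parsed by a delimited loader yields a tuple narrower than the declared schema, so touching a field position that exists in the plan but not in the tuple fails, much like the Index: 1, Size: 1 error above. A minimal Python illustration:

```python
# An empty line split on the delimiter yields one empty field, not two.
def parse(line, delim=" "):
    return line.split(delim)

tuples = [parse(line) for line in ["k1 v1", "", "k2 v2"]]
widths = [len(t) for t in tuples]   # -> [2, 1, 2]

# Accessing field 1 of the empty-line tuple raises IndexError, analogous
# to the IndexOutOfBoundsException: Index: 1, Size: 1 seen in the map task.
try:
    tuples[1][1]
    failed = False
except IndexError:
    failed = True
```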
[jira] Commented: (PIG-1131) Pig simple join does not work when it contains empty lines
[ https://issues.apache.org/jira/browse/PIG-1131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12831248#action_12831248 ] Viraj Bhat commented on PIG-1131: - Olga, I marked it as critical since we claim that Pig can eat any type of data, yet the example script shows that we need data with fixed schemas even to perform a simple join. Viraj > Pig simple join does not work when it contains empty lines > -- > > Key: PIG-1131 > URL: https://issues.apache.org/jira/browse/PIG-1131 > Project: Pig > Issue Type: Bug > Components: impl >Affects Versions: 0.7.0 >Reporter: Viraj Bhat >Assignee: Ashutosh Chauhan > Fix For: 0.7.0 > > Attachments: junk1.txt, junk2.txt, simplejoinscript.pig > > > I have a simple script which does a JOIN. > {code} > input1 = load '/user/viraj/junk1.txt' using PigStorage(' '); > describe input1; > input2 = load '/user/viraj/junk2.txt' using PigStorage('\u0001'); > describe input2; > joineddata = JOIN input1 by $0, input2 by $0; > describe joineddata; > store joineddata into 'result'; > {code} > The input data contains empty lines. 
> The join fails in the Map phase with the following error in the > PRLocalRearrange.java > java.lang.IndexOutOfBoundsException: Index: 1, Size: 1 > at java.util.ArrayList.RangeCheck(ArrayList.java:547) > at java.util.ArrayList.get(ArrayList.java:322) > at org.apache.pig.data.DefaultTuple.get(DefaultTuple.java:143) > at > org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.constructLROutput(POLocalRearrange.java:464) > at > org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.getNext(POLocalRearrange.java:360) > at > org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POUnion.getNext(POUnion.java:162) > at > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:253) > at > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:244) > at > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.map(PigMapReduce.java:94) > at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50) > at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358) > at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307) > at org.apache.hadoop.mapred.Child.main(Child.java:159) > I am surprised that the test cases did not detect this error. Could we add > this data which contains empty lines to the testcases? > Viraj -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-1220) Document unknown keywords as missing or to do in future
Document unknown keywords as missing or to do in future --- Key: PIG-1220 URL: https://issues.apache.org/jira/browse/PIG-1220 Project: Pig Issue Type: Bug Components: documentation Affects Versions: 0.6.0 Reporter: Viraj Bhat Fix For: 0.7.0 To get help at the grunt shell I do the following: grunt> touchz 2010-02-04 00:59:28,714 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1000: Error during parsing. Encountered " "touchz "" at line 1, column 1. Was expecting one of: "cat" ... "fs" ... "cd" ... "cp" ... "copyFromLocal" ... "copyToLocal" ... "dump" ... "describe" ... "aliases" ... "explain" ... "help" ... "kill" ... "ls" ... "mv" ... "mkdir" ... "pwd" ... "quit" ... "register" ... "rm" ... "rmf" ... "set" ... "illustrate" ... "run" ... "exec" ... "scriptDone" ... "" ... ... ";" ... I looked at the code and found that we do nothing for "scriptDone". Is there some future value for that command? Viraj -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-1211) Pig script runs half way after which it reports syntax error
Pig script runs half way after which it reports syntax error Key: PIG-1211 URL: https://issues.apache.org/jira/browse/PIG-1211 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.6.0 Reporter: Viraj Bhat Fix For: 0.8.0 I have a Pig script which is structured in the following way: {code} register cp.jar dataset = load '/data/dataset/' using PigStorage('\u0001') as (col1, col2, col3, col4, col5); filtered_dataset = filter dataset by (col1 == 1); proj_filtered_dataset = foreach filtered_dataset generate col2, col3; rmf $output1; store proj_filtered_dataset into '$output1' using PigStorage(); second_stream = foreach filtered_dataset generate col2, col4, col5; group_second_stream = group second_stream by col4; output2 = foreach group_second_stream { a = second_stream.col2 b = distinct second_stream.col5; c = order b by $0; generate 1 as key, group as keyword, MYUDF(c, 100) as finalcalc; } rmf $output2; --syntax error here store output2 to '$output2' using PigStorage(); {code} When I run this script with the Multi-query option, it runs successfully until the first store and then fails with a syntax error. The use of the HDFS command "rmf" causes the first store to execute. The only options I have are to run an explain before running the script: grunt> explain -script myscript.pig -out explain.out or to move the rmf statements to the top of the script. Here are some questions: a) Can we have an option to do something like "checkscript" instead of explain to get the same syntax error? That way I can ensure that I do not run for 3-4 hours before encountering a syntax error. b) Can Pig not figure out a way to re-order the rmf statements, since all the store directories are variables? Thanks Viraj -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
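The "checkscript" idea in question (a) amounts to parsing the whole script up front and executing nothing until every statement parses. A rough Python sketch of that two-pass behavior; the toy parser here is a stand-in for Pig's, and the sample statements are hypothetical:

```python
# Two-pass driver: collect parse errors first, execute only if there are none.
def check_then_run(statements, parse, execute):
    errors = [(i + 1, s) for i, s in enumerate(statements) if not parse(s)]
    if errors:
        return errors            # report every syntax error before any work runs
    for s in statements:
        execute(s)
    return []

# Toy parser: a statement is invalid if it uses "store ... to" instead of "into".
stmts = ["store a into 'out1';", "store b to 'out2';"]
errs = check_then_run(stmts, lambda s: " to '" not in s, lambda s: None)
# -> [(2, "store b to 'out2';")] and nothing is executed
```

Under this scheme the bad store on the last line would surface immediately, before the rmf statements or the first store ever run.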
[jira] Updated: (PIG-531) Way for explain to show 1 plan at a time
[ https://issues.apache.org/jira/browse/PIG-531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Viraj Bhat updated PIG-531: --- Fix Version/s: 0.5.0 Hi Olga, I think we have a way to handle it in multi-query optimization. Is it reasonable to close this as fixed? I see the following in the Multi-query document about explain: http://wiki.apache.org/pig/PigMultiQueryPerformanceSpecification explain [-out <path>] [-brief] [-dot] [-param <key>=<value>]* [-param_file <filename>]* [-script <pigscript>] [<alias>] Viraj > Way for explain to show 1 plan at a time > > > Key: PIG-531 > URL: https://issues.apache.org/jira/browse/PIG-531 > Project: Pig > Issue Type: Improvement >Affects Versions: 0.2.0 >Reporter: Olga Natkovich > Fix For: 0.5.0 > > > Several users complained that EXPLAIN output is too verbose and is hard to > make sense of. > One way to improve the situation is to realize that EXPLAIN actually > contains several plans: logical, physical, backend specific. So we can update > EXPLAIN to allow to show a particular plan. For instance > EXPLAIN LOGICAL A; > would show only logical plan. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
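As a concrete illustration of that usage line (the script name and output path below are made up), a single invocation writes the plans for a whole script to a file:

{code}
grunt> explain -out /tmp/explain.out -script myscript.pig
{code}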
[jira] Updated: (PIG-940) Cross site HDFS access using the default.fs.name not possible in Pig
[ https://issues.apache.org/jira/browse/PIG-940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Viraj Bhat updated PIG-940: --- Affects Version/s: (was: 0.3.0) 0.5.0 Fix Version/s: 0.7.0 > Cross site HDFS access using the default.fs.name not possible in Pig > > > Key: PIG-940 > URL: https://issues.apache.org/jira/browse/PIG-940 > Project: Pig > Issue Type: Bug > Components: impl >Affects Versions: 0.5.0 > Environment: Hadoop 20 >Reporter: Viraj Bhat > Fix For: 0.7.0 > > > I have a script which does the following.. access data from a remote HDFS > location (via a HDFS installed at:hdfs://remotemachine1.company.com/ ) [[as I > do not want to copy this huge amount of data between HDFS locations]]. > However I want my Pigscript to write data to the HDFS running on > localmachine.company.com. > Currently Pig does not support that behavior and complains that: > "hdfs://localmachine.company.com/user/viraj/A1.txt does not exist" > {code} > A = LOAD 'hdfs://remotemachine1.company.com/user/viraj/A1.txt' as (a, b); > B = LOAD 'hdfs://remotemachine1.company.com/user/viraj/B1.txt' as (c, d); > C = JOIN A by a, B by c; > store C into 'output' using PigStorage(); > {code} > === > 2009-09-01 00:37:24,032 [main] INFO > org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting > to hadoop file system at: hdfs://localmachine.company.com:8020 > 2009-09-01 00:37:24,277 [main] INFO > org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting > to map-reduce job tracker at: localmachine.company.com:50300 > 2009-09-01 00:37:24,567 [main] INFO > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler$LastInputStreamingOptimizer > - Rewrite: POPackage->POForEach to POJoinPackage > 2009-09-01 00:37:24,573 [main] INFO > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer > - MR plan size before optimization: 1 > 2009-09-01 00:37:24,573 [main] INFO > 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer > - MR plan size after optimization: 1 > 2009-09-01 00:37:26,197 [main] INFO > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler > - Setting up single store job > 2009-09-01 00:37:26,249 [Thread-9] WARN org.apache.hadoop.mapred.JobClient - > Use GenericOptionsParser for parsing the arguments. Applications should > implement Tool for the same. > 2009-09-01 00:37:26,746 [main] INFO > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher > - 0% complete > 2009-09-01 00:37:26,746 [main] INFO > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher > - 100% complete > 2009-09-01 00:37:26,747 [main] ERROR > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher > - 1 map reduce job(s) failed! > 2009-09-01 00:37:26,756 [main] ERROR > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher > - Failed to produce result in: > "hdfs:/localmachine.company.com/tmp/temp-1470407685/tmp-510854480" > 2009-09-01 00:37:26,756 [main] INFO > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher > - Failed! > 2009-09-01 00:37:26,758 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR > 2100: hdfs://localmachine.company.com/user/viraj/A1.txt does not exist. > Details at logfile: /home/viraj/pigscripts/pig_1251765443851.log > === > The error file in Pig contains: > === > ERROR 2998: Unhandled internal error. > org.apache.pig.backend.executionengine.ExecException: ERROR 2100: > hdfs://localmachine.company.com/user/viraj/A1.txt does not exist. 
> at > org.apache.pig.backend.executionengine.PigSlicer.validate(PigSlicer.java:126) > at > org.apache.pig.impl.io.ValidatingInputFileSpec.validate(ValidatingInputFileSpec.java:59) > at > org.apache.pig.impl.io.ValidatingInputFileSpec.<init>(ValidatingInputFileSpec.java:44) > at > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:228) > at > org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810) > at > org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781) > at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730) >
[jira] Updated: (PIG-1174) Creation of output path should be done by storage function
[ https://issues.apache.org/jira/browse/PIG-1174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Viraj Bhat updated PIG-1174: Fix Version/s: 0.7.0 > Creation of output path should be done by storage function > -- > > Key: PIG-1174 > URL: https://issues.apache.org/jira/browse/PIG-1174 > Project: Pig > Issue Type: Bug >Reporter: Bill Graham > Fix For: 0.7.0 > > > When executing a STORE command, Pig creates the output location before the > storage function gets called. This causes problems with storage functions > that have logic to determine the output location. See this thread: > http://www.mail-archive.com/pig-user%40hadoop.apache.org/msg01538.html > For example, when making a request like this: > STORE A INTO '/my/home/output' USING MultiStorage('/my/home/output','0', > 'none', '\t'); > Pig creates a file '/my/home/output' and then an exception is thrown when > MultiStorage tries to make a directory under '/my/home/output'. The > workaround is to instead specify a dummy location as the first path like so: > STORE A INTO '/my/home/output/temp' USING MultiStorage('/my/home/output','0', > 'none', '\t'); > Two changes should be made: > 1. The path specified in the INTO clause should be available to the storage > function so it doesn't need to be duplicated. > 2. The creation of the output paths should be delegated to the storage > function. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-1194) ERROR 2055: Received Error while processing the map plan
ERROR 2055: Received Error while processing the map plan Key: PIG-1194 URL: https://issues.apache.org/jira/browse/PIG-1194 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.5.0, 0.6.0 Reporter: Viraj Bhat Assignee: Richard Ding Fix For: 0.6.0 Attachments: inputdata.txt I have a simple Pig script which loads 3 columns, out of which one is null. {code} input = load 'inputdata.txt' using PigStorage() as (col1, col2, col3); a = GROUP input BY (((double) col3)/((double) col2) > .001 OR col1 < 11 ? col1 : -1); b = FOREACH a GENERATE group as col1, SUM(input.col2) as col2, SUM(input.col3) as col3; store b into 'finalresult'; {code} When I run this script I get the following error: ERROR 2055: Received Error while processing the map plan. org.apache.pig.backend.executionengine.ExecException: ERROR 2055: Received Error while processing the map plan. at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:277) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:240) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.map(PigMapReduce.java:93) at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50) at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307) A more useful error message would help with debugging. Viraj -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
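One hedged sketch of a way to sidestep the failure until the error message improves: filter out rows whose numeric columns are null before the bincond, so the casts and the division never see a null operand. This assumes dropping such rows is acceptable for the aggregation; alias names follow the reporter's script:

{code}
raw = load 'inputdata.txt' using PigStorage() as (col1, col2, col3);
-- drop rows that would make the cast/division in the group expression fail
clean = filter raw by col2 is not null and col3 is not null;
a = GROUP clean BY (((double) col3)/((double) col2) > .001 OR col1 < 11 ? col1 : -1);
b = FOREACH a GENERATE group as col1, SUM(clean.col2) as col2, SUM(clean.col3) as col3;
store b into 'finalresult';
{code}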
[jira] Updated: (PIG-1194) ERROR 2055: Received Error while processing the map plan
[ https://issues.apache.org/jira/browse/PIG-1194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Viraj Bhat updated PIG-1194: Attachment: inputdata.txt Testdata to run with this script > ERROR 2055: Received Error while processing the map plan > > > Key: PIG-1194 > URL: https://issues.apache.org/jira/browse/PIG-1194 > Project: Pig > Issue Type: Bug > Components: impl >Affects Versions: 0.5.0, 0.6.0 >Reporter: Viraj Bhat >Assignee: Richard Ding > Fix For: 0.6.0 > > Attachments: inputdata.txt > > > I have a simple Pig script which takes 3 columns out of which one is null. > {code} > input = load 'inputdata.txt' using PigStorage() as (col1, col2, col3); > a = GROUP input BY (((double) col3)/((double) col2) > .001 OR col1 < 11 ? > col1 : -1); > b = FOREACH a GENERATE group as col1, SUM(input.col2) as col2, > SUM(input.col3) as col3; > store b into 'finalresult'; > {code} > When I run this script I get the following error: > ERROR 2055: Received Error while processing the map plan. > org.apache.pig.backend.executionengine.ExecException: ERROR 2055: Received > Error while processing the map plan. > at > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:277) > at > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:240) > at > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.map(PigMapReduce.java:93) > at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50) > at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358) > at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307) > > A more useful error message for the purpose of debugging would be helpful. > Viraj -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1187) UTF-8 (international code) breaks with loader when load with schema is specified
[ https://issues.apache.org/jira/browse/PIG-1187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12800315#action_12800315 ] Viraj Bhat commented on PIG-1187: - Hi Jeff, This is specific to the data we are using and it looks like parser failed when it is trying to interpret some characters. As such we have tested this with Chinese characters and it works. Viraj > UTF-8 (international code) breaks with loader when load with schema is > specified > > > Key: PIG-1187 > URL: https://issues.apache.org/jira/browse/PIG-1187 > Project: Pig > Issue Type: Bug >Affects Versions: 0.6.0 >Reporter: Viraj Bhat > Fix For: 0.6.0 > > > I have a set of Pig statements which dump an international dataset. > {code} > INPUT_OBJECT = load 'internationalcode'; > describe INPUT_OBJECT; > dump INPUT_OBJECT; > {code} > Sample output > (756a6196-ebcd-4789-ad2f-175e5df65d55,{(labelAaÂâÀ),(labelあいうえお1),(labelஜார்க2),(labeladfadf)}) > It works and dumps results but when I use a schema for loading it fails. > {code} > INPUT_OBJECT = load 'internationalcode' AS (object_id:chararray, labels: bag > {T: tuple(label:chararray)}); > describe INPUT_OBJECT; > {code} > The error message is as follows:2010-01-14 02:23:27,320 FATAL > org.apache.hadoop.mapred.Child: Error running child : > org.apache.pig.data.parser.TokenMgrError: Error: Bailing out of infinite loop > caused by repeated empty string matches at line 1, column 21. 
> at > org.apache.pig.data.parser.TextDataParserTokenManager.TokenLexicalActions(TextDataParserTokenManager.java:620) > at > org.apache.pig.data.parser.TextDataParserTokenManager.getNextToken(TextDataParserTokenManager.java:569) > at > org.apache.pig.data.parser.TextDataParser.jj_ntk(TextDataParser.java:651) > at > org.apache.pig.data.parser.TextDataParser.Tuple(TextDataParser.java:152) > at > org.apache.pig.data.parser.TextDataParser.Bag(TextDataParser.java:100) > at > org.apache.pig.data.parser.TextDataParser.Datum(TextDataParser.java:382) > at > org.apache.pig.data.parser.TextDataParser.Parse(TextDataParser.java:42) > at > org.apache.pig.builtin.Utf8StorageConverter.parseFromBytes(Utf8StorageConverter.java:68) > at > org.apache.pig.builtin.Utf8StorageConverter.bytesToBag(Utf8StorageConverter.java:76) > at > org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POCast.getNext(POCast.java:845) > at > org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:250) > at > org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:204) > at > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:249) > at > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:240) > at > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapOnly$Map.map(PigMapOnly.java:65) > at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50) > at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358) > at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307) > at org.apache.hadoop.mapred.Child.main(Child.java:159) > Viraj -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-1187) UTF-8 (international code) breaks with loader when load with schema is specified
UTF-8 (international code) breaks with loader when load with schema is specified Key: PIG-1187 URL: https://issues.apache.org/jira/browse/PIG-1187 Project: Pig Issue Type: Bug Affects Versions: 0.6.0 Reporter: Viraj Bhat Fix For: 0.6.0 I have a set of Pig statements which dump an international dataset. {code} INPUT_OBJECT = load 'internationalcode'; describe INPUT_OBJECT; dump INPUT_OBJECT; {code} Sample output (756a6196-ebcd-4789-ad2f-175e5df65d55,{(labelAaÂâÀ),(labelあいうえお1),(labelஜார்க2),(labeladfadf)}) It works and dumps results but when I use a schema for loading it fails. {code} INPUT_OBJECT = load 'internationalcode' AS (object_id:chararray, labels: bag {T: tuple(label:chararray)}); describe INPUT_OBJECT; {code} The error message is as follows:2010-01-14 02:23:27,320 FATAL org.apache.hadoop.mapred.Child: Error running child : org.apache.pig.data.parser.TokenMgrError: Error: Bailing out of infinite loop caused by repeated empty string matches at line 1, column 21. at org.apache.pig.data.parser.TextDataParserTokenManager.TokenLexicalActions(TextDataParserTokenManager.java:620) at org.apache.pig.data.parser.TextDataParserTokenManager.getNextToken(TextDataParserTokenManager.java:569) at org.apache.pig.data.parser.TextDataParser.jj_ntk(TextDataParser.java:651) at org.apache.pig.data.parser.TextDataParser.Tuple(TextDataParser.java:152) at org.apache.pig.data.parser.TextDataParser.Bag(TextDataParser.java:100) at org.apache.pig.data.parser.TextDataParser.Datum(TextDataParser.java:382) at org.apache.pig.data.parser.TextDataParser.Parse(TextDataParser.java:42) at org.apache.pig.builtin.Utf8StorageConverter.parseFromBytes(Utf8StorageConverter.java:68) at org.apache.pig.builtin.Utf8StorageConverter.bytesToBag(Utf8StorageConverter.java:76) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POCast.getNext(POCast.java:845) at 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:250) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:204) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:249) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:240) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapOnly$Map.map(PigMapOnly.java:65) at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50) at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307) at org.apache.hadoop.mapred.Child.main(Child.java:159) Viraj -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1157) Successive replicated joins do not generate Map Reduce plan and fail due to OOM
[ https://issues.apache.org/jira/browse/PIG-1157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12792061#action_12792061 ] Viraj Bhat commented on PIG-1157: - Hi Richard, Thanks for your suggestion, it works. Additionally we could also use the "exec" statement before the alias E to prevent the implicit dependency. How hard/easy is it for Pig to find out if there is an implicit dependency or not. Pig anyway has a copy of the logical plan in memory, where it knows that alias E requires output from D which is generated in the previous step. Can we not warn the user about this implicit dependency? Viraj > Sucessive replicated joins do not generate Map Reduce plan and fails due to > OOM > --- > > Key: PIG-1157 > URL: https://issues.apache.org/jira/browse/PIG-1157 > Project: Pig > Issue Type: Bug > Components: impl >Affects Versions: 0.6.0 >Reporter: Viraj Bhat >Assignee: Richard Ding > Fix For: 0.6.0 > > Attachments: oomreplicatedjoin.pig, replicatedjoinexplain.log > > > Hi all, > I have a script which does 2 replicated joins in succession. Please note > that the inputs do not exist on the HDFS. > {code} > A = LOAD '/tmp/abc' USING PigStorage('\u0001') AS (a:long, b, c); > A1 = FOREACH A GENERATE a; > B = GROUP A1 BY a; > C = LOAD '/tmp/xyz' USING PigStorage('\u0001') AS (x:long, y); > D = JOIN C BY x, B BY group USING "replicated"; > E = JOIN A BY a, D by x USING "replicated"; > dump E; > {code} > 2009-12-16 19:12:00,253 [main] INFO > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer > - MR plan size before optimization: 4 > 2009-12-16 19:12:00,254 [main] INFO > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer > - Merged 1 map-only splittees. > 2009-12-16 19:12:00,254 [main] INFO > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer > - Merged 1 map-reduce splittees. 
> 2009-12-16 19:12:00,254 [main] INFO > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer > - Merged 2 out of total 2 splittees. > 2009-12-16 19:12:00,254 [main] INFO > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer > - MR plan size after optimization: 2 > 2009-12-16 19:12:00,713 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR > 2998: Unhandled internal error. unable to create new native thread > Details at logfile: pig_1260990666148.log > Looking at the log file: > Pig Stack Trace > --- > ERROR 2998: Unhandled internal error. unable to create new native thread > java.lang.OutOfMemoryError: unable to create new native thread > at java.lang.Thread.start0(Native Method) > at java.lang.Thread.start(Thread.java:597) > at > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:131) > at > org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.execute(HExecutionEngine.java:265) > at > org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:773) > at org.apache.pig.PigServer.store(PigServer.java:522) > at org.apache.pig.PigServer.openIterator(PigServer.java:458) > at > org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:532) > at > org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:190) > at > org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:166) > at > org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:142) > at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:89) > at org.apache.pig.Main.main(Main.java:397) > > If we want to look at the explain output, we find that there is no Map Reduce > plan that is generated. > Why is the M/R plan not generated? > Attaching the script and explain output. > Viraj -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
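One possible shape of the exec-based workaround mentioned above: materialize the first replicated join, force the batch to run, and reload the result before the second join. The store/reload step and the reloaded schema are assumptions for illustration, not taken from the thread:

{code}
A = LOAD '/tmp/abc' USING PigStorage('\u0001') AS (a:long, b, c);
A1 = FOREACH A GENERATE a;
B = GROUP A1 BY a;
C = LOAD '/tmp/xyz' USING PigStorage('\u0001') AS (x:long, y);
D = JOIN C BY x, B BY group USING "replicated";
STORE D INTO '/tmp/D_out';
exec;  -- run everything queued so far; aliases are cleared afterwards

A = LOAD '/tmp/abc' USING PigStorage('\u0001') AS (a:long, b, c);
D = LOAD '/tmp/D_out' AS (x:long, y, grp:long);  -- schema is illustrative
E = JOIN A BY a, D BY x USING "replicated";
DUMP E;
{code}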
[jira] Updated: (PIG-1157) Successive replicated joins do not generate Map Reduce plan and fail due to OOM
[ https://issues.apache.org/jira/browse/PIG-1157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Viraj Bhat updated PIG-1157: Attachment: oomreplicatedjoin.pig replicatedjoinexplain.log Explain output and Pig script. > Sucessive replicated joins do not generate Map Reduce plan and fails due to > OOM > --- > > Key: PIG-1157 > URL: https://issues.apache.org/jira/browse/PIG-1157 > Project: Pig > Issue Type: Bug > Components: impl >Affects Versions: 0.6.0 >Reporter: Viraj Bhat > Fix For: 0.6.0 > > Attachments: oomreplicatedjoin.pig, replicatedjoinexplain.log > > > Hi all, > I have a script which does 2 replicated joins in succession. Please note > that the inputs do not exist on the HDFS. > {code} > A = LOAD '/tmp/abc' USING PigStorage('\u0001') AS (a:long, b, c); > A1 = FOREACH A GENERATE a; > B = GROUP A1 BY a; > C = LOAD '/tmp/xyz' USING PigStorage('\u0001') AS (x:long, y); > D = JOIN C BY x, B BY group USING "replicated"; > E = JOIN A BY a, D by x USING "replicated"; > dump E; > {code} > 2009-12-16 19:12:00,253 [main] INFO > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer > - MR plan size before optimization: 4 > 2009-12-16 19:12:00,254 [main] INFO > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer > - Merged 1 map-only splittees. > 2009-12-16 19:12:00,254 [main] INFO > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer > - Merged 1 map-reduce splittees. > 2009-12-16 19:12:00,254 [main] INFO > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer > - Merged 2 out of total 2 splittees. > 2009-12-16 19:12:00,254 [main] INFO > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer > - MR plan size after optimization: 2 > 2009-12-16 19:12:00,713 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR > 2998: Unhandled internal error. 
unable to create new native thread > Details at logfile: pig_1260990666148.log > Looking at the log file: > Pig Stack Trace > --- > ERROR 2998: Unhandled internal error. unable to create new native thread > java.lang.OutOfMemoryError: unable to create new native thread > at java.lang.Thread.start0(Native Method) > at java.lang.Thread.start(Thread.java:597) > at > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:131) > at > org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.execute(HExecutionEngine.java:265) > at > org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:773) > at org.apache.pig.PigServer.store(PigServer.java:522) > at org.apache.pig.PigServer.openIterator(PigServer.java:458) > at > org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:532) > at > org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:190) > at > org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:166) > at > org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:142) > at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:89) > at org.apache.pig.Main.main(Main.java:397) > > If we want to look at the explain output, we find that there is no Map Reduce > plan that is generated. > Why is the M/R plan not generated? > Attaching the script and explain output. > Viraj -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-1157) Successive replicated joins do not generate Map Reduce plan and fail due to OOM
Successive replicated joins do not generate Map Reduce plan and fail due to OOM --- Key: PIG-1157 URL: https://issues.apache.org/jira/browse/PIG-1157 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.6.0 Reporter: Viraj Bhat Fix For: 0.6.0 Hi all, I have a script which does 2 replicated joins in succession. Please note that the inputs do not exist on HDFS. {code} A = LOAD '/tmp/abc' USING PigStorage('\u0001') AS (a:long, b, c); A1 = FOREACH A GENERATE a; B = GROUP A1 BY a; C = LOAD '/tmp/xyz' USING PigStorage('\u0001') AS (x:long, y); D = JOIN C BY x, B BY group USING "replicated"; E = JOIN A BY a, D by x USING "replicated"; dump E; {code} 2009-12-16 19:12:00,253 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 4 2009-12-16 19:12:00,254 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - Merged 1 map-only splittees. 2009-12-16 19:12:00,254 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - Merged 1 map-reduce splittees. 2009-12-16 19:12:00,254 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - Merged 2 out of total 2 splittees. 2009-12-16 19:12:00,254 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 2 2009-12-16 19:12:00,713 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2998: Unhandled internal error. unable to create new native thread Details at logfile: pig_1260990666148.log Looking at the log file: Pig Stack Trace --- ERROR 2998: Unhandled internal error. 
unable to create new native thread java.lang.OutOfMemoryError: unable to create new native thread at java.lang.Thread.start0(Native Method) at java.lang.Thread.start(Thread.java:597) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:131) at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.execute(HExecutionEngine.java:265) at org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:773) at org.apache.pig.PigServer.store(PigServer.java:522) at org.apache.pig.PigServer.openIterator(PigServer.java:458) at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:532) at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:190) at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:166) at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:142) at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:89) at org.apache.pig.Main.main(Main.java:397) If we want to look at the explain output, we find that there is no Map Reduce plan that is generated. Why is the M/R plan not generated? Attaching the script and explain output. Viraj -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1144) set default_parallelism construct does not set the number of reducers correctly
[ https://issues.apache.org/jira/browse/PIG-1144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12788481#action_12788481 ] Viraj Bhat commented on PIG-1144: - Hi Daniel, Thanks again for your input. This is more of a performance issue: users do not notice it until they see that the single-reducer job has failed in the sort phase. They safely assume that the default_parallel keyword will do the trick. Viraj > set default_parallelism construct does not set the number of reducers > correctly > --- > > Key: PIG-1144 > URL: https://issues.apache.org/jira/browse/PIG-1144 > Project: Pig > Issue Type: Bug > Components: impl >Affects Versions: 0.6.0 > Environment: Hadoop 20 cluster with multi-node installation >Reporter: Viraj Bhat >Assignee: Daniel Dai > Fix For: 0.7.0 > > Attachments: brokenparallel.out, genericscript_broken_parallel.pig, > PIG-1144-1.patch > > > Hi all, > I have a Pig script where I set the parallelism using the following set > construct: "set default_parallel 100" . I modified the "MRPrinter.java" to > printout the parallelism > {code} > ... > public void visitMROp(MapReduceOper mr) > mStream.println("MapReduce node " + mr.getOperatorKey().toString() + " > Parallelism " + mr.getRequestedParallelism()); > ... > {code} > When I run an explain on the script, I see that the last job which does the > actual sort, runs as a single reducer job. This can be corrected, by adding > the PARALLEL keyword in front of the ORDER BY. > Attaching the script and the explain output > Viraj -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1144) set default_parallelism construct does not set the number of reducers correctly
[ https://issues.apache.org/jira/browse/PIG-1144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12788439#action_12788439 ] Viraj Bhat commented on PIG-1144: - Hi Daniel, One more thing to note is that the Last Sort M/R job has a parallelism of 1. Should it not be -1? Viraj > set default_parallelism construct does not set the number of reducers > correctly > --- > > Key: PIG-1144 > URL: https://issues.apache.org/jira/browse/PIG-1144 > Project: Pig > Issue Type: Bug > Components: impl >Affects Versions: 0.7.0 > Environment: Hadoop 20 cluster with multi-node installation >Reporter: Viraj Bhat > Fix For: 0.7.0 > > Attachments: brokenparallel.out, genericscript_broken_parallel.pig > > > Hi all, > I have a Pig script where I set the parallelism using the following set > construct: "set default_parallel 100" . I modified the "MRPrinter.java" to > printout the parallelism > {code} > ... > public void visitMROp(MapReduceOper mr) > mStream.println("MapReduce node " + mr.getOperatorKey().toString() + " > Parallelism " + mr.getRequestedParallelism()); > ... > {code} > When I run an explain on the script, I see that the last job which does the > actual sort, runs as a single reducer job. This can be corrected, by adding > the PARALLEL keyword in front of the ORDER BY. > Attaching the script and the explain output > Viraj -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1144) set default_parallelism construct does not set the number of reducers correctly
[ https://issues.apache.org/jira/browse/PIG-1144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12788436#action_12788436 ] Viraj Bhat commented on PIG-1144: - This happens on the real cluster, where the sorting job did not complete because of a single reducer. > set default_parallelism construct does not set the number of reducers > correctly > --- > > Key: PIG-1144 > URL: https://issues.apache.org/jira/browse/PIG-1144 > Project: Pig > Issue Type: Bug > Components: impl >Affects Versions: 0.7.0 > Environment: Hadoop 20 cluster with multi-node installation >Reporter: Viraj Bhat > Fix For: 0.7.0 > > Attachments: brokenparallel.out, genericscript_broken_parallel.pig > > > Hi all, > I have a Pig script where I set the parallelism using the following set > construct: "set default_parallel 100" . I modified the "MRPrinter.java" to > printout the parallelism > {code} > ... > public void visitMROp(MapReduceOper mr) > mStream.println("MapReduce node " + mr.getOperatorKey().toString() + " > Parallelism " + mr.getRequestedParallelism()); > ... > {code} > When I run an explain on the script, I see that the last job which does the > actual sort, runs as a single reducer job. This can be corrected, by adding > the PARALLEL keyword in front of the ORDER BY. > Attaching the script and the explain output > Viraj -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1144) set default_parallelism construct does not set the number of reducers correctly
[ https://issues.apache.org/jira/browse/PIG-1144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Viraj Bhat updated PIG-1144: Attachment: brokenparallel.out, genericscript_broken_parallel.pig (script and explain output) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-1144) set default_parallelism construct does not set the number of reducers correctly
set default_parallelism construct does not set the number of reducers correctly --- Key: PIG-1144 URL: https://issues.apache.org/jira/browse/PIG-1144 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.7.0 Environment: Hadoop 20 cluster with multi-node installation Reporter: Viraj Bhat Fix For: 0.7.0

Hi all, I have a Pig script where I set the parallelism using the following set construct: "set default_parallel 100". I modified MRPrinter.java to print out the parallelism:

{code}
...
public void visitMROp(MapReduceOper mr)
mStream.println("MapReduce node " + mr.getOperatorKey().toString() + " Parallelism " + mr.getRequestedParallelism());
...
{code}

When I run an explain on the script, I see that the last job, which does the actual sort, runs as a single-reducer job. This can be corrected by adding the PARALLEL clause to the ORDER BY statement. Attaching the script and the explain output. Viraj -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
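Until default_parallel covers the sort job, the workaround mentioned above can be sketched as follows (the aliases and file names are illustrative, not from the attached script):

{code}
set default_parallel 100;
A = LOAD 'input' USING PigStorage() AS (key, value);
-- the sort job ignores default_parallel here, so state the parallelism explicitly
B = ORDER A BY key PARALLEL 100;
STORE B INTO 'sorted_output';
{code}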
[jira] Commented: (PIG-1131) Pig simple join does not work when it contains empty lines
[ https://issues.apache.org/jira/browse/PIG-1131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12788387#action_12788387 ] Viraj Bhat commented on PIG-1131: - Hi Pradeep, so the workaround for this is for the user to specify the schema of the largest tuple, i.e. the one with the maximum number of fields/columns? Viraj -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1131) Pig simple join does not work when it contains empty lines
[ https://issues.apache.org/jira/browse/PIG-1131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Viraj Bhat updated PIG-1131: Attachment: simplejoinscript.pig, junk2.txt, junk1.txt (dummy datasets and pig script) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-1131) Pig simple join does not work when it contains empty lines
Pig simple join does not work when it contains empty lines -- Key: PIG-1131 URL: https://issues.apache.org/jira/browse/PIG-1131 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.7.0 Reporter: Viraj Bhat Priority: Critical Fix For: 0.7.0

I have a simple script, which does a JOIN.

{code}
input1 = load '/user/viraj/junk1.txt' using PigStorage(' ');
describe input1;
input2 = load '/user/viraj/junk2.txt' using PigStorage('\u0001');
describe input2;
joineddata = JOIN input1 by $0, input2 by $0;
describe joineddata;
store joineddata into 'result';
{code}

The input data contains empty lines. The join fails in the Map phase with the following error in POLocalRearrange.java:

java.lang.IndexOutOfBoundsException: Index: 1, Size: 1
at java.util.ArrayList.RangeCheck(ArrayList.java:547)
at java.util.ArrayList.get(ArrayList.java:322)
at org.apache.pig.data.DefaultTuple.get(DefaultTuple.java:143)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.constructLROutput(POLocalRearrange.java:464)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.getNext(POLocalRearrange.java:360)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POUnion.getNext(POUnion.java:162)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:253)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:244)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.map(PigMapReduce.java:94)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
at org.apache.hadoop.mapred.Child.main(Child.java:159)

I am surprised that the test cases did not detect this error. Could we add this data, which contains empty lines, to the test cases? Viraj -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
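A workaround sketch, assuming the failure is triggered by the empty input lines: drop tuples whose join key is missing before the JOIN (the FILTER conditions are an assumption, not part of the original report):

{code}
input1 = load '/user/viraj/junk1.txt' using PigStorage(' ');
input2 = load '/user/viraj/junk2.txt' using PigStorage('\u0001');
-- assumed workaround: remove tuples with a null join key before joining
clean1 = FILTER input1 BY $0 is not null;
clean2 = FILTER input2 BY $0 is not null;
joineddata = JOIN clean1 by $0, clean2 by $0;
store joineddata into 'result';
{code}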
[jira] Created: (PIG-1124) Unable to set Custom Job Name using the -Dmapred.job.name parameter
Unable to set Custom Job Name using the -Dmapred.job.name parameter --- Key: PIG-1124 URL: https://issues.apache.org/jira/browse/PIG-1124 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.6.0 Reporter: Viraj Bhat Priority: Minor Fix For: 0.6.0

As a Hadoop user I want to control the job name for my analysis via the command line, using the following construct:

java -cp pig.jar:$HADOOP_HOME/conf -Dmapred.job.name=hadoop_junkie org.apache.pig.Main broken.pig

-Dmapred.job.name should normally set my Hadoop job name, but somewhere during the formation of the job.xml in Pig this information is lost and the job name turns out to be "PigLatin:broken.pig". The current workaround seems to be wiring it into the script itself (or using parameter substitution):

set job.name 'my job'

Viraj -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
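Combining the in-script workaround with parameter substitution keeps the job name controllable from the command line. A sketch (the parameter name jobname and the script body are assumptions, not from the report):

{code}
-- invoked as: java -cp pig.jar org.apache.pig.Main -param jobname=hadoop_junkie broken.pig
set job.name '$jobname';
A = load '/user/viraj/data.txt';
dump A;
{code}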
[jira] Created: (PIG-1123) Popularize usage of default_parallel keyword in Cookbook and Latin Manual
Popularize usage of default_parallel keyword in Cookbook and Latin Manual Key: PIG-1123 URL: https://issues.apache.org/jira/browse/PIG-1123 Project: Pig Issue Type: Improvement Components: documentation Affects Versions: 0.6.0 Reporter: Viraj Bhat Fix For: 0.6.0

In the Pig 0.5 release we have the option of setting the default reduce parallelism for a script using the following construct: set default_parallel 100. Unfortunately I do not see this documented in the Reference Manual, in the "SET" section (http://hadoop.apache.org/pig/docs/r0.5.0/piglatin_reference.html), or in the Cookbook (http://hadoop.apache.org/pig/docs/r0.5.0/cookbook.html), "Use PARALLEL Keyword" section. Viraj -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
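For the documentation, a minimal usage example could look like this (the load/group statements are illustrative, not from the report):

{code}
set default_parallel 100;
A = load 'input' as (k, v);
B = group A by k;  -- the reduce side of this GROUP now runs with 100 reducers
C = foreach B generate group, COUNT(A);
store C into 'output';
{code}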
[jira] Created: (PIG-1101) Pig parser does not recognize its own data type in LIMIT statement
Pig parser does not recognize its own data type in LIMIT statement -- Key: PIG-1101 URL: https://issues.apache.org/jira/browse/PIG-1101 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.6.0 Reporter: Viraj Bhat Priority: Minor Fix For: 0.6.0

I have a Pig script in which I specify the number of records to limit as a long type.

{code}
A = LOAD '/user/viraj/echo.txt' AS (txt:chararray);
B = LIMIT A 10L;
DUMP B;
{code}

I get a parser error:

2009-11-21 02:25:51,100 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1000: Error during parsing. Encountered " "10L "" at line 3, column 13. Was expecting: ...
at org.apache.pig.impl.logicalLayer.parser.QueryParser.generateParseException(QueryParser.java:8963)
at org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_consume_token(QueryParser.java:8839)
at org.apache.pig.impl.logicalLayer.parser.QueryParser.LimitClause(QueryParser.java:1656)
at org.apache.pig.impl.logicalLayer.parser.QueryParser.BaseExpr(QueryParser.java:1280)
at org.apache.pig.impl.logicalLayer.parser.QueryParser.Expr(QueryParser.java:893)
at org.apache.pig.impl.logicalLayer.parser.QueryParser.Parse(QueryParser.java:682)
at org.apache.pig.impl.logicalLayer.LogicalPlanBuilder.parse(LogicalPlanBuilder.java:63)
at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1017)

In fact 10L seems to work in the foreach ... generate construct. Viraj -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
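Until the parser accepts long literals in LIMIT, a sketch of the obvious workaround is a plain integer constant, which the grammar does accept:

{code}
A = LOAD '/user/viraj/echo.txt' AS (txt:chararray);
B = LIMIT A 10;  -- integer literal instead of 10L
DUMP B;
{code}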
[jira] Created: (PIG-1084) Pig CookBook documentation "Take Advantage of Join Optimization" additions: Merge and Skewed Join
Pig CookBook documentation "Take Advantage of Join Optimization" additions: Merge and Skewed Join Key: PIG-1084 URL: https://issues.apache.org/jira/browse/PIG-1084 Project: Pig Issue Type: Bug Components: documentation Affects Versions: 0.6.0 Reporter: Viraj Bhat Fix For: 0.6.0

Hi all, we have a host of join optimizations that have been implemented recently in Pig to improve performance. These include (http://hadoop.apache.org/pig/docs/r0.5.0/piglatin_reference.html#JOIN): 1) Merge Join 2) Skewed Join. It would be nice to mention the Merge Join and Skewed Join in the following section of the Pig Cookbook: http://hadoop.apache.org/pig/docs/r0.5.0/cookbook.html#Take+Advantage+of+Join+Optimization Can we update this for the 0.6 release? Thanks Viraj -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-1081) PigCookBook use of PARALLEL keyword
PigCookBook use of PARALLEL keyword --- Key: PIG-1081 URL: https://issues.apache.org/jira/browse/PIG-1081 Project: Pig Issue Type: Bug Components: documentation Affects Versions: 0.5.0 Reporter: Viraj Bhat Fix For: 0.5.0

Hi all, I am looking at some tips for optimizing Pig programs (Pig Cookbook) using the PARALLEL keyword: http://hadoop.apache.org/pig/docs/r0.5.0/cookbook.html#Use+PARALLEL+Keyword We know that currently Pig 0.5 uses Hadoop 20 (as its default), which launches 1 reducer for all cases. In this documentation we state that [...] 0.9 [...]; this guidance was valid for HoD (Hadoop on Demand), where you create your own Hadoop clusters, but if you are using either the Capacity Scheduler (http://hadoop.apache.org/common/docs/current/capacity_scheduler.html) or the Fair Share Scheduler (http://hadoop.apache.org/common/docs/current/fair_scheduler.html), these numbers could mean that you are using around 90% of the reducer slots on your cluster. We should change this to something like: The number of reducers you may need for a particular construct in Pig which forms a Map Reduce boundary depends entirely on your data and the number of intermediate keys you are generating in your mappers. In the best cases we have seen that a reducer processing about 500 MB of data behaves efficiently. Additionally, it is hard to define the optimum number of reducers, since it completely depends on the partitioner and the distribution of map (combiner) output keys. Viraj -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
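Applying the proposed 500 MB-per-reducer rule of thumb is simple arithmetic; for example, a job whose mappers emit roughly 50 GB of intermediate data would be sized like this (the 50 GB figure is illustrative):

{code}
-- about 50 GB of map output / about 500 MB per reducer -> about 100 reducers
set default_parallel 100;
{code}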
[jira] Commented: (PIG-1060) MultiQuery optimization throws error for multi-level splits
[ https://issues.apache.org/jira/browse/PIG-1060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12773744#action_12773744 ] Viraj Bhat commented on PIG-1060: - Hi Ankur and Richard, I have a script which demonstrates a similar problem, but it can be worked around by using the -M option. This script can reproduce the problem even without the UNION operator, but it has properties 1 and 2 of the original problem description. Try commenting out the F alias; it then works fine.

{code}
ORGINALDATA = load '/user/viraj/somedata.txt' using PigStorage() as (col1, col2, col3, col4, col5, col6, col7, col8);
--Check data
A = foreach ORGINALDATA generate col1, col2, col3, col4, col5, col6;
B = group A all;
C = foreach B generate COUNT(A);
store C into '/user/viraj/result1';
D = filter A by (col1 == col2) or (col1 == col3);
E = group D all;
F = foreach E generate COUNT(D); --try commenting F
store F into '/user/viraj/result2';
G = filter D by (col4 == col5);
H = group G all;
I = foreach H generate COUNT(G);
store I into '/user/viraj/result3';
J = filter G by (((col6 == 'm') or (col6 == 'M')) and (col6 == 1)) or (((col6 == 'f') or (col6 == 'F')) and (col6 == 0)) or ((col6 == '') and (col6 == -1));
K = group J all;
L = foreach K generate COUNT(J);
store L into '/user/viraj/result4';
{code}

> MultiQuery optimization throws error for multi-level splits > --- > > Key: PIG-1060 > URL: https://issues.apache.org/jira/browse/PIG-1060 > Project: Pig > Issue Type: Bug >Affects Versions: 0.5.0 >Reporter: Ankur >Assignee: Richard Ding > > Consider the following scenario :- > 1. Multi-level splits in the map plan. > 2. Each split branch further progressing across a local-global rearrange. > 3. Output of each of these finally merged via a UNION. > MultiQuery optimizer throws the following error in such a case: > "ERROR 2146: Internal Error. Inconsistency in key index found during > optimization." -- This message is automatically generated by JIRA. 
- You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-1065) Indeterminate behaviour of Union when there are 2 non-matching schemas
Indeterminate behaviour of Union when there are 2 non-matching schemas Key: PIG-1065 URL: https://issues.apache.org/jira/browse/PIG-1065 Project: Pig Issue Type: Bug Affects Versions: 0.6.0 Reporter: Viraj Bhat Fix For: 0.6.0

I have a script which first does a union of these schemas and then does an ORDER BY of the result.

{code}
f1 = LOAD '1.txt' as (key:chararray, v:chararray);
f2 = LOAD '2.txt' as (key:chararray);
u0 = UNION f1, f2;
describe u0;
dump u0;
u1 = ORDER u0 BY $0;
dump u1;
{code}

When I run in Map Reduce mode I get the following result:

$java -cp pig.jar:$HADOOP_HOME/conf org.apache.pig.Main broken.pig
Schema for u0 unknown.
(1,2)
(2,3)
(1)
(2)
org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1066: Unable to open iterator for alias u1
at org.apache.pig.PigServer.openIterator(PigServer.java:475)
at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:532)
at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:190)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:166)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:142)
at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:89)
at org.apache.pig.Main.main(Main.java:397)
Caused by: java.io.IOException: Type mismatch in key from map: expected org.apache.pig.impl.io.NullableBytesWritable, recieved org.apache.pig.impl.io.NullableText
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:415)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.collect(PigMapReduce.java:108)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:251)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:240)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.map(PigMapReduce.java:93)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:47)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:227)

When I run the same script in local mode I get a different result, since local mode does not use any Hadoop classes:

$java -cp pig.jar org.apache.pig.Main -x local broken.pig
Schema for u0 unknown
(1,2)
(1)
(2,3)
(2)
(1,2)
(1)
(2,3)
(2)

Here are some questions: 1) Why do we allow a union if the schemas do not match? 2) Should we not print an error message/warning so that the user knows that this is not allowed, or that he can get unexpected results? Viraj -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
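A workaround sketch until Pig warns about this, assuming the intent is a two-column result: pad the narrower relation to the wider schema before the UNION, so the union output has a known schema and the ORDER BY keys all carry the same type (the empty-string padding value is an assumption):

{code}
f1 = LOAD '1.txt' as (key:chararray, v:chararray);
f2 = LOAD '2.txt' as (key:chararray);
-- pad f2 to f1's shape so the union has a known, uniform schema
f2p = FOREACH f2 GENERATE key, '' as v;
u0 = UNION f1, f2p;
u1 = ORDER u0 BY $0;
dump u1;
{code}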
[jira] Created: (PIG-1064) Behaviour of COGROUP with and without schema when using "*" operator
Behaviour of COGROUP with and without schema when using "*" operator Key: PIG-1064 URL: https://issues.apache.org/jira/browse/PIG-1064 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.6.0 Reporter: Viraj Bhat Fix For: 0.6.0

I have 2 tab separated files, "1.txt" and "2.txt":

$ cat 1.txt
1 2
2 3
$ cat 2.txt
1 2
2 3

I use the COGROUP feature of Pig in the following way: $java -cp pig.jar:$HADOOP_HOME org.apache.pig.Main

{code}
grunt> A = load '1.txt';
grunt> B = load '2.txt' as (b0, b1);
grunt> C = cogroup A by *, B by *;
{code}

2009-10-29 12:46:04,150 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1012: Each COGroup input has to have the same number of inner plans Details at logfile: pig_1256845224752.log

== If I reverse the order of the schemas:

{code}
grunt> A = load '1.txt' as (a0, a1);
grunt> B = load '2.txt';
grunt> C = cogroup A by *, B by *;
{code}

2009-10-29 12:49:27,869 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1013: Grouping attributes can either be star (*) or a list of expressions, but not both. Details at logfile: pig_1256845224752.log

== Now running without schemas:

{code}
grunt> A = load '1.txt';
grunt> B = load '2.txt';
grunt> C = cogroup A by *, B by *;
grunt> dump C;
{code}

2009-10-29 12:55:37,202 [main] INFO org.apache.pig.backend.local.executionengine.LocalPigLauncher - Successfully stored result in: "file:/tmp/temp-319926700/tmp-1990275961"
2009-10-29 12:55:37,202 [main] INFO org.apache.pig.backend.local.executionengine.LocalPigLauncher - Records written : 2
2009-10-29 12:55:37,202 [main] INFO org.apache.pig.backend.local.executionengine.LocalPigLauncher - Bytes written : 154
2009-10-29 12:55:37,202 [main] INFO org.apache.pig.backend.local.executionengine.LocalPigLauncher - 100% complete!
2009-10-29 12:55:37,202 [main] INFO org.apache.pig.backend.local.executionengine.LocalPigLauncher - Success!!
((1,2),{(1,2)},{(1,2)})
((2,3),{(2,3)},{(2,3)})

== Is this a bug or a feature? 
Viraj -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
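A workaround sketch for the mismatched case, assuming both inputs really have two columns: declare a schema on both relations and list the grouping fields explicitly, which sidesteps the star-expansion mismatch:

{code}
A = load '1.txt' as (a0, a1);
B = load '2.txt' as (b0, b1);
-- explicit field lists instead of * on both sides
C = cogroup A by (a0, a1), B by (b0, b1);
dump C;
{code}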
[jira] Updated: (PIG-1031) PigStorage interpreting chararray/bytearray for a tuple element inside a bag as float or double
[ https://issues.apache.org/jira/browse/PIG-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Viraj Bhat updated PIG-1031: Description: minor edit to the issue description (the current text appears in the PIG-1031 created message below). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-1031) PigStorage interpreting chararray/bytearray for a tuple element inside a bag as float or double
PigStorage interpreting chararray/bytearray for a tuple element inside a bag as float or double --- Key: PIG-1031 URL: https://issues.apache.org/jira/browse/PIG-1031 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.5.0 Reporter: Viraj Bhat Fix For: 0.5.0, 0.6.0

I have data stored in a text file as:

{(4153E765)}
{(AF533765)}

I try reading it using PigStorage as:

{code}
A = load 'pigstoragebroken.dat' using PigStorage() as (intersectionBag:bag{T:tuple(term:bytearray)});
dump A;
{code}

I get the following results:

{code}
({(Infinity)})
({(AF533765)})
{code}

The problem seems to be with the method parseFromBytes(byte[] b) in class Utf8StorageConverter. This method uses the TextDataParser (a class generated via jjt) to interpret the type of the data from its content, even though the schema says it is a bytearray: "4153E765" matches the scientific-notation number token (4153 x 10^765, which overflows a double to Infinity), while "AF533765" does not parse as a number and so survives as text.

TextDataParser.jjt sample code:

{code}
TOKEN :
{
...
< DOUBLENUMBER: (["-","+"])? ( ["e","E"] ([ "-","+"])? )?>
< FLOATNUMBER: (["f","F"])? >
...
}
{code}

I tried the following options, but they will not work, as we still need to call bytesToBag(byte[] b) in the Utf8StorageConverter class:

{code}
A = load 'pigstoragebroken.dat' using PigStorage() as (intersectionBag:bag{T:tuple(term)});
A = load 'pigstoragebroken.dat' using PigStorage() as (intersectionBag:bag{T:tuple(term:chararray)});
{code}

Viraj -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-978) ERROR 2100 (hdfs://localhost/tmp/temp175740929/tmp-1126214010 does not exist) and ERROR 2999: (Unexpected internal error. null) when using Multi-Query optimization
ERROR 2100 (hdfs://localhost/tmp/temp175740929/tmp-1126214010 does not exist) and ERROR 2999: (Unexpected internal error. null) when using Multi-Query optimization --- Key: PIG-978 URL: https://issues.apache.org/jira/browse/PIG-978 Project: Pig Issue Type: Bug Affects Versions: 0.6.0 Reporter: Viraj Bhat Fix For: 0.6.0

I have a Pig script of this form, which I execute using Multi-query optimization.

{code}
A = load '/user/viraj/firstinput' using PigStorage();
B = group ...
C = ... aggregation function
store C into '/user/viraj/firstinputtempresult/days1';
..
Atab = load '/user/viraj/secondinput' using PigStorage();
Btab = group ...
Ctab = ... aggregation function
store Ctab into '/user/viraj/secondinputtempresult/days1';
..
E = load '/user/viraj/firstinputtempresult/' using PigStorage();
F = group ...
G = ... aggregation function
store G into '/user/viraj/finalresult1';
Etab = load '/user/viraj/secondinputtempresult/' using PigStorage();
Ftab = group ...
Gtab = ... aggregation function
store Gtab into '/user/viraj/finalresult2';
{code}

The resulting error (2009-07-20 22:05:44,507 [main] ERROR org.apache.pig.tools.grunt.GruntParser - ERROR 2100: hdfs://localhost/tmp/temp175740929/tmp-1126214010 does not exist. Details at logfile: /homes/viraj/pigscripts/pig_1248127173601.log) is due to the mismatch of store/load commands. The script first stores files into the 'days1' directory (store C into '/user/viraj/firstinputtempresult/days1' using PigStorage();), but it later loads from the top-level directory (E = load '/user/viraj/firstinputtempresult/' using PigStorage()) instead of the original directory (/user/viraj/firstinputtempresult/days1). The current multi-query optimizer can't see the dependency between these two commands, because they have different load file paths, so the jobs run concurrently and produce the errors. The solution is to add an 'exec' or 'run' command after the first two stores. This will force the first two store commands to run before the remaining commands. 
It would be nice to see this fixed as a part of an enhancement to the Multi-query. We either disable the Multi-query or throw a warning/error message, so that the user can correct his load/store statements. Viraj -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
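The suggested fix can be sketched as follows; a bare exec flushes the statements batched so far, so the first two stores complete before the later loads run (the elided aliases follow the report's pseudocode):

{code}
A = load '/user/viraj/firstinput' using PigStorage();
-- ... grouping and aggregation producing C ...
store C into '/user/viraj/firstinputtempresult/days1';
Atab = load '/user/viraj/secondinput' using PigStorage();
-- ... grouping and aggregation producing Ctab ...
store Ctab into '/user/viraj/secondinputtempresult/days1';
exec; -- barrier: run everything above before continuing
E = load '/user/viraj/firstinputtempresult/' using PigStorage();
-- ... remaining aggregations and stores ...
{code}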
[jira] Commented: (PIG-974) Issues with mv command when used after store when using -param_file/-param options
[ https://issues.apache.org/jira/browse/PIG-974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12758962#action_12758962 ] Viraj Bhat commented on PIG-974: It turns out that the problem was due to the single quotes.

{code}
mv '$finalop' '$finalmove';
{code}

This modified version of the statement works:

{code}
mv $finalop $finalmove;
{code}

The hard part here is knowing when to use single quotes around parameters and when not to; this is not documented in the manual. The error message is also confusing:

===
java.io.IOException: File or directory '/user/viraj/finaloutput' does not exist.
===

I thought that the single quotes around the filename printed in the error message referred to the correct file name.

{code}
$shell>hadoop fs -ls '/user/viraj/finaloutput'
Found 1 items
-rw--- 3 viraj users 420 2009-09-24 01:16 /user/viraj/finaloutput/part-0
{code}

Thanks Viraj

> Issues with mv command when used after store when using -param_file/-param > options > -- > > Key: PIG-974 > URL: https://issues.apache.org/jira/browse/PIG-974 > Project: Pig > Issue Type: Bug >Affects Versions: 0.6.0 > Environment: Hadoop 18 and 20 >Reporter: Viraj Bhat > Fix For: 0.6.0 > > Attachments: studenttab10k > > > I have a Pig script which moves the final output to another HDFS directory to > signal completion, so that another Pig script can start working on these > results. > {code} > studenttab = LOAD '/user/viraj/studenttab10k' AS (name:chararray, > age:int,gpa:float); > X = GROUP studenttab by age; > Y = FOREACH X GENERATE group, COUNT(studenttab); > store Y into '$finalop' using PigStorage(); > mv '$finalop' '$finalmove'; > {code} > where "finalop" and "finalmove" are parameters used for storing intermediate and > final results. 
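The quoting behavior described in this issue can be illustrated with a toy model in plain Python (an illustration only, not Pig's actual preprocessor): parameter substitution is textual, so single quotes written around `$finalop` survive substitution and become part of the literal path handed to the grunt `mv` command.

```python
# Toy model of textual parameter substitution (illustrative sketch,
# not Pig's real preprocessor code).
params = {"finalop": "/user/viraj/finaloutput",
          "finalmove": "/user/viraj/finalmove"}

def substitute(line: str, params: dict) -> str:
    """Replace each $name with its value, purely textually."""
    for name, value in params.items():
        line = line.replace("$" + name, value)
    return line

# Quoted form: the quotes are preserved, so grunt's mv looks up a path
# that literally begins and ends with a single-quote character.
print(substitute("mv '$finalop' '$finalmove';", params))
# -> mv '/user/viraj/finaloutput' '/user/viraj/finalmove';

# Unquoted form: the path handed to mv is the real HDFS path.
print(substitute("mv $finalop $finalmove;", params))
# -> mv /user/viraj/finaloutput /user/viraj/finalmove;
```

This is why the unquoted `mv $finalop $finalmove;` works while the quoted form fails with "File or directory '/user/viraj/finaloutput' does not exist".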
[jira] Updated: (PIG-974) Issues with mv command when used after store when using -param_file/-param options
[ https://issues.apache.org/jira/browse/PIG-974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Viraj Bhat updated PIG-974: --- Attachment: studenttab10k Testdata > Issues with mv command when used after store when using -param_file/-param > options
[jira] Created: (PIG-974) Issues with mv command when used after store when using -param_file/-param options
Issues with mv command when used after store when using -param_file/-param options -- Key: PIG-974 URL: https://issues.apache.org/jira/browse/PIG-974 Project: Pig Issue Type: Bug Affects Versions: 0.6.0 Environment: Hadoop 18 and 20 Reporter: Viraj Bhat Fix For: 0.6.0 Attachments: studenttab10k I have a Pig script which moves the final output to another HDFS directory to signal completion, so that another Pig script can start working on these results. {code} studenttab = LOAD '/user/viraj/studenttab10k' AS (name:chararray, age:int,gpa:float); X = GROUP studenttab by age; Y = FOREACH X GENERATE group, COUNT(studenttab); store Y into '$finalop' using PigStorage(); mv '$finalop' '$finalmove'; {code} where "finalop" and "finalmove" are parameters used for storing intermediate and final results. I run this script as follows: {code} $shell> java -cp pig20.jar:/path/tohadoop/site.xml -Dmapred.job.queue.name=default org.apache.pig.Main -M -param finalop=/user/viraj/finaloutput -param finalmove=/user/viraj/finalmove testmove.pig {code} or using the param_file option: {code} $shell> java -cp pig20.jar:/path/tohadoop/site.xml -Dmapred.job.queue.name=default org.apache.pig.Main -M -param_file moveparamfile testmove.pig {code} The underlying Map Reduce jobs run well but the move command seems to be failing: 2009-09-23 23:26:21,781 [main] INFO org.apache.pig.Main - Logging error messages to: /homes/viraj/pigscripts/pig_1253748381778.log 2009-09-23 23:26:21,963 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://localhost:8020 2009-09-23 23:26:22,227 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at: localhost:50300 2009-09-23 23:26:27,187 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.CombinerOptimizer - Choosing to move algebraic foreach to combiner 2009-09-23 23:26:27,203 [main] INFO 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 1 2009-09-23 23:26:27,203 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 1 2009-09-23 23:26:28,828 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up single store job 2009-09-23 23:26:29,423 [Thread-9] WARN org.apache.hadoop.mapred.JobClient - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same. 2009-09-23 23:26:29,478 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete 2009-09-23 23:27:29,828 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 50% complete 2009-09-23 23:27:59,764 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 50% complete 2009-09-23 23:28:57,249 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete 2009-09-23 23:28:57,249 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Successfully stored result in: "/user/viraj/finaloutput" 2009-09-23 23:28:57,267 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Records written : 60 2009-09-23 23:28:57,267 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Bytes written : 420 2009-09-23 23:28:57,267 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success! 2009-09-23 23:28:57,367 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2998: Unhandled internal error. File or directory '/user/viraj/finaloutput' does not exist. 
Details at logfile: /homes/viraj/pigscripts/pig_1253748381778.log {code} $shell> hadoop fs -ls /user/viraj/finaloutput Found 1 items -rw--- 3 viraj users420 2009-09-23 23:42 /user/viraj/finaloutput/part-0 {code} Opening the log file: Pig Stack Trace --- ERROR 2998: Unhandled internal error. File or directory '/user/viraj/finaloutput' does not exist. java.io.IOException: File or directory '/user/viraj/finaloutput' does not exist. at org.apache.pig.tools.grunt.GruntParser.processM
[jira] Commented: (PIG-940) Cross site HDFS access using the default.fs.name not possible in Pig
[ https://issues.apache.org/jira/browse/PIG-940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12749722#action_12749722 ] Viraj Bhat commented on PIG-940: One important point to add: {code} localmachine.company.com prompt> hadoop fs -ls hdfs://remotemachine1.company.com/user/viraj//*.txt -rw-r--r-- 3 viraj users 13 2009-08-13 23:42 /user/viraj/A1.txt -rw-r--r-- 3 viraj users 8 2009-08-29 00:51 /user/viraj/B1.txt {code} > Cross site HDFS access using the default.fs.name not possible in Pig > > > Key: PIG-940 > URL: https://issues.apache.org/jira/browse/PIG-940 > Project: Pig > Issue Type: Bug > Components: impl >Affects Versions: 0.3.0 > Environment: Hadoop 20 >Reporter: Viraj Bhat > Fix For: 0.3.0 > > > I have a script which does the following.. access data from a remote HDFS > location (via a HDFS installed at:hdfs://remotemachine1.company.com/ ) [[as I > do not want to copy this huge amount of data between HDFS locations]]. > However I want my Pigscript to write data to the HDFS running on > localmachine.company.com. 
[jira] Created: (PIG-940) Cross site HDFS access using the default.fs.name not possible in Pig
Cross site HDFS access using the default.fs.name not possible in Pig Key: PIG-940 URL: https://issues.apache.org/jira/browse/PIG-940 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.3.0 Environment: Hadoop 20 Reporter: Viraj Bhat Fix For: 0.3.0 I have a script which accesses data from a remote HDFS location (an HDFS instance at hdfs://remotemachine1.company.com/), as I do not want to copy this huge amount of data between HDFS locations. However, I want my Pig script to write data to the HDFS running on localmachine.company.com. Currently Pig does not support that behavior and complains that: "hdfs://localmachine.company.com/user/viraj/A1.txt does not exist" {code} A = LOAD 'hdfs://remotemachine1.company.com/user/viraj/A1.txt' as (a, b); B = LOAD 'hdfs://remotemachine1.company.com/user/viraj/B1.txt' as (c, d); C = JOIN A by a, B by c; store C into 'output' using PigStorage(); {code} === 2009-09-01 00:37:24,032 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://localmachine.company.com:8020 2009-09-01 00:37:24,277 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at: localmachine.company.com:50300 2009-09-01 00:37:24,567 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler$LastInputStreamingOptimizer - Rewrite: POPackage->POForEach to POJoinPackage 2009-09-01 00:37:24,573 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 1 2009-09-01 00:37:24,573 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 1 2009-09-01 00:37:26,197 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up single store job 2009-09-01 00:37:26,249 [Thread-9] WARN 
org.apache.hadoop.mapred.JobClient - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same. 2009-09-01 00:37:26,746 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete 2009-09-01 00:37:26,746 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete 2009-09-01 00:37:26,747 [main] ERROR org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map reduce job(s) failed! 2009-09-01 00:37:26,756 [main] ERROR org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Failed to produce result in: "hdfs:/localmachine.company.com/tmp/temp-1470407685/tmp-510854480" 2009-09-01 00:37:26,756 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Failed! 2009-09-01 00:37:26,758 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2100: hdfs://localmachine.company.com/user/viraj/A1.txt does not exist. Details at logfile: /home/viraj/pigscripts/pig_1251765443851.log === The error file in Pig contains: === ERROR 2998: Unhandled internal error. org.apache.pig.backend.executionengine.ExecException: ERROR 2100: hdfs://localmachine.company.com/user/viraj/A1.txt does not exist. 
at org.apache.pig.backend.executionengine.PigSlicer.validate(PigSlicer.java:126) at org.apache.pig.impl.io.ValidatingInputFileSpec.validate(ValidatingInputFileSpec.java:59) at org.apache.pig.impl.io.ValidatingInputFileSpec.(ValidatingInputFileSpec.java:44) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:228) at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810) at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781) at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730) at org.apache.hadoop.mapred.jobcontrol.Job.submit(Job.java:378) at org.apache.hadoop.mapred.jobcontrol.JobControl.startReadyJobs(JobControl.java:247) at org.apache.hadoop.mapred.jobcontrol.JobControl.run(JobControl.java:279) at java.lang.Thread.run(Thread.java:619) java.lang.Exception: org.apache.pig.backend.executionengine.ExecException: ERROR 2100: hdfs://localmachine.company.com/user/viraj/A1.txt does not
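The resolution rule the reporter expects can be sketched in plain Python (an illustration, not Pig's code): a location that already carries an explicit hdfs:// scheme should be used as-is, while an unqualified path should be resolved against the default filesystem (fs.default.name). The logs above show Pig instead re-rooting even fully-qualified remote URIs onto the local cluster.

```python
from urllib.parse import urlparse

# Stand-in for the cluster's fs.default.name setting.
DEFAULT_FS = "hdfs://localmachine.company.com"

def resolve(path: str, default_fs: str = DEFAULT_FS) -> str:
    """Resolve a load/store location the way one would expect:
    keep fully-qualified URIs, qualify bare paths with the default FS."""
    if urlparse(path).scheme:       # already has hdfs://, file://, ...
        return path
    return default_fs + path        # bare path: prepend the default FS

# Fully qualified: should go to the remote cluster untouched.
print(resolve("hdfs://remotemachine1.company.com/user/viraj/A1.txt"))
# -> hdfs://remotemachine1.company.com/user/viraj/A1.txt

# Unqualified: resolved against the local (default) cluster.
print(resolve("/user/viraj/output"))
# -> hdfs://localmachine.company.com/user/viraj/output
```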
[jira] Updated: (PIG-921) Strange use case for Join which produces different results in local and map reduce mode
[ https://issues.apache.org/jira/browse/PIG-921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Viraj Bhat updated PIG-921: --- Attachment: joinusecase.pig B.txt A.txt Script with test data. > Strange use case for Join which produces different results in local and map > reduce mode
[jira] Created: (PIG-921) Strange use case for Join which produces different results in local and map reduce mode
Strange use case for Join which produces different results in local and map reduce mode --- Key: PIG-921 URL: https://issues.apache.org/jira/browse/PIG-921 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.3.0 Environment: Hadoop 18 and Hadoop 20 Reporter: Viraj Bhat Fix For: 0.3.0 I have a script, shown below, which loads from two files, A.txt and B.txt: {code} A = LOAD 'A.txt' as (a:tuple(a1:int, a2:chararray)); B = LOAD 'B.txt' as (b:tuple(b1:int, b2:chararray)); C = JOIN A by a.a1, B by b.b1; DESCRIBE C; DUMP C; {code} A.txt contains the following lines: {code} (1,a) (2,aa) {code} B.txt contains the following lines: {code} (1,b) (2,bb) {code} Running the above script in local and map-reduce mode on Hadoop 18 & Hadoop 20 produces the following: Hadoop 18 = (1,1) (2,2) = Hadoop 20 = (1,1) (2,2) = Local Mode: Pig with Hadoop 18 jar release = 2009-08-13 17:15:13,473 [main] INFO org.apache.pig.Main - Logging error messages to: /homes/viraj/pig-svn/trunk/pigscripts/pig_1250208913472.log 09/08/13 17:15:13 INFO pig.Main: Logging error messages to: /homes/viraj/pig-svn/trunk/pigscripts/pig_1250208913472.log C: {a: (a1: int,a2: chararray),b: (b1: int,b2: chararray)} 2009-08-13 17:15:13,932 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1002: Unable to store alias C 09/08/13 17:15:13 ERROR grunt.Grunt: ERROR 1002: Unable to store alias C Details at logfile: /homes/viraj/pig-svn/trunk/pigscripts/pig_1250208913472.log = Caused by: java.lang.NullPointerException at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POPackage.getNext(POPackage.java:206) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:231) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:191) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:231) at 
org.apache.pig.backend.local.executionengine.physicalLayer.counters.POCounter.getNext(POCounter.java:71) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:231) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POStore.getNext(POStore.java:117) at org.apache.pig.backend.local.executionengine.LocalPigLauncher.runPipeline(LocalPigLauncher.java:146) at org.apache.pig.backend.local.executionengine.LocalPigLauncher.launchPig(LocalPigLauncher.java:109) at org.apache.pig.backend.local.executionengine.LocalExecutionEngine.execute(LocalExecutionEngine.java:165) ... 9 more = Local Mode: Pig with Hadoop 20 jar release = ((1,a),(1,b)) ((2,aa),(2,bb)) =
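The local-mode output with the Hadoop 20 jar matches what one would expect from the join. As a sanity check, the semantics of joining two relations whose rows are single tuples, on the first inner field (`a.a1` / `b.b1`), can be mimicked in plain Python (an illustration, not Pig itself):

```python
# Relations from A.txt and B.txt; each row is one tuple, as in the
# schemas a:tuple(a1:int, a2:chararray) and b:tuple(b1:int, b2:chararray).
A = [(1, "a"), (2, "aa")]
B = [(1, "b"), (2, "bb")]

def join_on_first(left, right):
    """Inner-join two relations of tuples on their first field,
    mirroring 'C = JOIN A by a.a1, B by b.b1;'."""
    return [(l, r) for l in left for r in right if l[0] == r[0]]

C = join_on_first(A, B)
for row in C:
    print(row)  # ((1, 'a'), (1, 'b')) then ((2, 'aa'), (2, 'bb'))
```

This agrees with the Hadoop 20 local-mode result and shows that the Hadoop 18 outputs (the truncated `(1,1)` / `(2,2)` rows and the local-mode NullPointerException) are the anomalous cases.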