Problem running Pig 0.60
Hi pig team,

I'm testing zebra v2 and trying to run the pig 0.60 jar that I got from Yan. However, I got the following error:

    Caused by: java.lang.ClassNotFoundException: jline.ConsoleReaderInputStream
        at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:252)
        at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:320)

Is there any additional jar file that I need to include with Hadoop or pig?

Thanks~

--
Yiping Han
y...@yahoo-inc.com
US phone: +1(408)349-4403
Beijing phone: +86(10)8215-9357
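For what it's worth, a ClassNotFoundException for jline.ConsoleReaderInputStream usually just means the jline jar is not on the classpath. A minimal sketch (plain Java, nothing Pig-specific assumed; `ClasspathCheck` is an illustrative name) for probing whether a class is visible at runtime:

```java
// Probe whether a class can be loaded from the current classpath.
public class ClasspathCheck {
    public static boolean isPresent(String className) {
        try {
            Class.forName(className);
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // The class name is the one from the stack trace above.
        System.out.println("jline on classpath: "
                + isPresent("jline.ConsoleReaderInputStream"));
    }
}
```

If this prints false, adding the jline jar to the launch classpath should clear the error.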
[jira] Created: (PIG-941) [zebra] Loading non-existing column generates error
[zebra] Loading non-existing column generates error
---------------------------------------------------

                 Key: PIG-941
                 URL: https://issues.apache.org/jira/browse/PIG-941
             Project: Pig
          Issue Type: Bug
          Components: data
            Reporter: Yiping Han

Loading a column that does not exist generates the following error:

    2009-09-01 21:29:15,161 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2999: Unexpected internal error. null

Example is like this:

    STORE urls2 into '$output' using org.apache.pig.table.pig.TableStorer('md5:string, url:string');

and then in another pig script, I load the table:

    input = LOAD '$output' USING org.apache.pig.table.pig.TableLoader('md5, url, domain');

where domain is a column that does not exist.

--
This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
Re: Proposal to create a branch for contrib project Zebra
+1

On 8/18/09 7:11 AM, Olga Natkovich <ol...@yahoo-inc.com> wrote:
> +1
>
> -----Original Message-----
> From: Raghu Angadi [mailto:rang...@yahoo-inc.com]
> Sent: Monday, August 17, 2009 4:06 PM
> To: pig-dev@hadoop.apache.org
> Subject: Proposal to create a branch for contrib project Zebra
>
> Thanks to the PIG team, the first version of contrib project Zebra (PIG-833)
> is committed to PIG trunk. In short, Zebra is a table storage layer built for
> use in PIG and other Hadoop applications.
>
> While we are stabilizing the current version V1 in the trunk, we plan to add
> more new features to it. We would like to create an svn branch for the new
> features. We will be responsible for managing Zebra in PIG trunk and in the
> new branch, and we will merge the branch when it is ready. We expect the
> changes to affect only the 'contrib/zebra' directory.
>
> As a regular contributor to Hadoop, I will be the initial committer for
> Zebra. As more patches are contributed by other Zebra developers, more
> committers may be added through the normal Hadoop/Apache procedure.
>
> I would like to create a branch called 'zebra-v2' with approval from the PIG
> team.
>
> Thanks, Raghu.

--
Yiping Han
F-3140
(408)349-4403
y...@yahoo-inc.com
Re: COUNT, AVG and nulls
+1.

--Yiping

On 7/6/09 10:58 AM, Dmitriy Ryaboy <dvrya...@cloudera.com> wrote:
> +1 for standard semantics. We need a COALESCE function to go along with this.
> -D
>
> On Mon, Jul 6, 2009 at 10:46 AM, Olga Natkovich <ol...@yahoo-inc.com> wrote:
>> Hi,
>>
>> The current implementation of COUNT and AVG in Pig counts null values. This
>> is inconsistent with SQL semantics and also with the semantics of other
>> aggregate functions such as SUM, MIN, and MAX. Originally we chose this
>> implementation for performance reasons; however, we re-implemented both
>> functions to support the multi-step combiner, and now the cost of checking
>> for null in the case where the combiner is invoked is trivial. (I ran some
>> tests with COUNT and they showed no performance difference.) We will pay a
>> penalty in the non-combinable case, including local mode, but I think it is
>> worth the price to have consistent semantics. Also, as we are working on SQL
>> support, having SQL-compliant semantics becomes very desirable.
>>
>> Please let us know if you have any concerns. I am planning to make the
>> change later this week.
>>
>> Olga

--
Yiping Han
F-3140
(408)349-4403
y...@yahoo-inc.com
[jira] Commented: (PIG-796) support conversion from numeric types to chararray
[ https://issues.apache.org/jira/browse/PIG-796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12714526#action_12714526 ]

Yiping Han commented on PIG-796:
--------------------------------

I have the same idea that Alan proposed. I agree that in the common case most values are of the same type. Caching the type, and changing the cached type only when a ClassCastException is caught, would be the most efficient way.

support conversion from numeric types to chararray
--------------------------------------------------

                 Key: PIG-796
                 URL: https://issues.apache.org/jira/browse/PIG-796
             Project: Pig
          Issue Type: Improvement
    Affects Versions: 0.2.0
            Reporter: Olga Natkovich
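The caching idea in the comment above can be sketched roughly as follows. This is an illustrative toy, not Pig code (`CachedCaster` and its method are made-up names): the last seen runtime type is kept as a fast path, and the cache is updated only when a ClassCastException fires.

```java
// Sketch: convert values to chararray (String), caching the runtime type.
// Most input streams are homogeneous, so the catch branch is rarely taken.
public class CachedCaster {
    private Class<?> cached = Integer.class; // initial guess for the common type

    public String toChararray(Object o) {
        try {
            return cached.cast(o).toString(); // fast path: no per-value dispatch
        } catch (ClassCastException e) {
            cached = o.getClass();            // rare: type changed, re-cache
            return o.toString();
        }
    }
}
```

The exception-driven re-cache is what keeps the per-value cost near zero when the column's type never changes.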
[jira] Commented: (PIG-807) PERFORMANCE: Provide a way for UDFs to use read-once bags (backed by the Hadoop values iterator)
[ https://issues.apache.org/jira/browse/PIG-807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12710818#action_12710818 ]

Yiping Han commented on PIG-807:
--------------------------------

David, the syntax

    B = foreach A generate SUM(m);

is confusing for both developers and the parser. I like the idea of removing the explicit GROUP ALL, but would rather use a different keyword for that, i.e.:

    B = FOR A GENERATE SUM(m);

Adding a new keyword for this purpose would also work as a hint for the parser to treat this as direct hadoop iterator access.

PERFORMANCE: Provide a way for UDFs to use read-once bags (backed by the Hadoop values iterator)
-----------------------------------------------------------------------------------------------

                 Key: PIG-807
                 URL: https://issues.apache.org/jira/browse/PIG-807
             Project: Pig
          Issue Type: Improvement
    Affects Versions: 0.2.1
            Reporter: Pradeep Kamath
             Fix For: 0.3.0

Currently all bags resulting from a group or cogroup are materialized as bags containing all of the contents. The issue with this is that if a particular key has many corresponding values, all these values get stuffed into a bag which may run out of memory and hence spill, causing a slowdown in performance and sometimes memory exceptions. In many cases, the udfs which use these bags coming out of a group or cogroup only need to iterate over the bag in a unidirectional, read-once manner. This can be implemented by having the bag implement its iterator by simply iterating over the underlying hadoop iterator provided in the reduce. This kind of bag is also needed in http://issues.apache.org/jira/browse/PIG-802, so the code can be reused for this issue too. The other part of this issue is to have some way for the udfs to communicate to Pig that any input bags that they need are read-once bags. This can be achieved by having an interface - say UsesReadOnceBags - which serves as a tag to indicate the intent to Pig. Pig can then rewire its execution plan to use ReadOnceBags if feasible.
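The "bag whose iterator simply wraps the underlying hadoop iterator" idea described above can be sketched like this. `ReadOnceBag` here is a hypothetical stand-in, not Pig's actual implementation; the read-once property is enforced by refusing a second call to iterator().

```java
import java.util.Iterator;

// A bag that never materializes its contents: iterator() hands back the
// underlying (e.g. reduce-side) iterator directly, so values stream through
// in one unidirectional pass instead of being buffered or spilled.
public class ReadOnceBag<T> implements Iterable<T> {
    private final Iterator<T> underlying;
    private boolean consumed = false;

    public ReadOnceBag(Iterator<T> underlying) {
        this.underlying = underlying;
    }

    @Override
    public Iterator<T> iterator() {
        if (consumed) {
            // The backing iterator cannot be rewound, so a second pass is an error.
            throw new IllegalStateException("read-once bag already consumed");
        }
        consumed = true;
        return underlying;
    }
}
```

A UDF like SUM only needs the single forward pass, which is exactly what this supports.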
[jira] Commented: (PIG-664) Semantics of * is not consistent
[ https://issues.apache.org/jira/browse/PIG-664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12672372#action_12672372 ]

Yiping Han commented on PIG-664:
--------------------------------

I would second Santhosh. In PIG 1.x, * in a UDF parameter list does expand to a flattened list of columns. While converting to PIG 2.0, this creates a lot of inconvenience. * should always generate flattened columns.

Semantics of * is not consistent
--------------------------------

                 Key: PIG-664
                 URL: https://issues.apache.org/jira/browse/PIG-664
             Project: Pig
          Issue Type: Bug
          Components: impl
    Affects Versions: types_branch
            Reporter: Santhosh Srinivasan
            Assignee: Santhosh Srinivasan
             Fix For: types_branch

The semantics of * is not consistent in Pig. The use of * with generate results in all the columns of the record being flattened. However, the use of * as an input to a UDF results in a tuple (wrapped in another tuple). For consistency, * should always result in all the columns of the record (i.e., flattened). The use of * occurs in:

1. Foreach generate, e.g.: foreach input generate *;
2. Input to UDFs, e.g.: foreach input generate myUDF(*);
3. Order by, e.g.: order input by *;
4. (Co)Group, e.g.: group a by *; cogroup a by *, b by *;

In terms of implementation, this involves rolling back the fix introduced in PIG-597 and fixing the following builtin UDFs:

1. ARITY - should return the size of the input tuple instead of extracting the first column of the input tuple
2. SIZE - should return the size of the input tuple instead of extracting the first column of the input tuple
[jira] Commented: (PIG-602) Pass global configurations to UDF
[ https://issues.apache.org/jira/browse/PIG-602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12672376#action_12672376 ]

Yiping Han commented on PIG-602:
--------------------------------

Alan, this plan looks good for our requirements.

Pass global configurations to UDF
---------------------------------

                 Key: PIG-602
                 URL: https://issues.apache.org/jira/browse/PIG-602
             Project: Pig
          Issue Type: New Feature
          Components: impl
            Reporter: Yiping Han
            Assignee: Alan Gates

We are seeking an easy way to pass a large number of global configurations to UDFs. Our application contains many pig jobs and has a large number of configurations, so passing configurations through the command line is not ideal (i.e. modifying a single parameter requires changing multiple command lines), and putting everything into the hadoop conf is not ideal either. We would like to see if Pig can provide a facility that allows us to pass a configuration file in some format (XML?) and then make it available throughout all the UDFs.
[jira] Commented: (PIG-282) Custom Partitioner
[ https://issues.apache.org/jira/browse/PIG-282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12672467#action_12672467 ]

Yiping Han commented on PIG-282:
--------------------------------

Any concerns on this issue?

Custom Partitioner
------------------

                 Key: PIG-282
                 URL: https://issues.apache.org/jira/browse/PIG-282
             Project: Pig
          Issue Type: New Feature
            Reporter: Amir Youssefi
            Priority: Minor

By adding a custom partitioner we can give control over which output partition a key (/value) goes to. We can add a keyword to the language, e.g. PARTITION BY UDF(...) or a similar syntax. The UDF returns a number between 0 and n-1, where n is the number of output partitions.
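A partition UDF of the kind proposed here - return a number between 0 and n-1 for each key - could look like the following sketch. `ModPartitioner` is illustrative only, not an existing Pig or Hadoop class.

```java
// Sketch of a partition UDF: map a key to one of numPartitions buckets.
public class ModPartitioner {
    public static int partition(Object key, int numPartitions) {
        // Mask off the sign bit so negative hashCodes still land in [0, n-1].
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}
```

The important contract is determinism: the same key must always map to the same partition, or grouped keys would be split across reducers.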
[jira] Commented: (PIG-610) Pig appears to continue when an underlying mapred job fails
[ https://issues.apache.org/jira/browse/PIG-610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12662084#action_12662084 ]

Yiping Han commented on PIG-610:
--------------------------------

We are on hadoop 0.18.2 and the latest pig_types branch. We tried to do `hadoop job -kill x` through a different terminal. I believe this happens every time; Ralf gave me instructions yesterday and I can easily reproduce it.

Pig appears to continue when an underlying mapred job fails
-----------------------------------------------------------

                 Key: PIG-610
                 URL: https://issues.apache.org/jira/browse/PIG-610
             Project: Pig
          Issue Type: Bug
            Reporter: Yiping Han
            Priority: Critical

We observed that sometimes pig appears to continue when an underlying mapred job fails.
[jira] Created: (PIG-601) Add finalize() interface to UDF
Add finalize() interface to UDF
-------------------------------

                 Key: PIG-601
                 URL: https://issues.apache.org/jira/browse/PIG-601
             Project: Pig
          Issue Type: New Feature
          Components: impl
            Reporter: Yiping Han

I would like to have a finalize() method on UDFs, which would be called when there are no more inputs and the UDF is about to be killed. The finalize() method should allow the UDF to generate extra output, which in many cases could benefit aggregations.

There are a couple of applications that can benefit from this feature. One example is that in some UDFs I need to open a resource (i.e. a local file), and when the task finishes I need to close the resource. Another example is that in one of my applications I compute statistics for a list of categories and need to generate a summary category and attach it to the end of the table. With the finalize method, I could achieve this in an efficient and neat way.
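The summary-category use case could look roughly like this once a finalize-style hook exists. Everything here, class and method names included, is a hypothetical sketch of the proposed behavior, not an existing Pig API.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: an aggregator that counts per-category totals and, when told no
// more input is coming (the proposed finalize()-style hook, called finish()
// here), appends a summary row to the end of its output.
public class CategoryCounter {
    private final List<String> out = new ArrayList<>();
    private long total = 0;

    // Called once per input record.
    public void exec(String category, long count) {
        out.add(category + ":" + count);
        total += count;
    }

    // Called once after the last record: emit the extra summary output.
    public List<String> finish() {
        out.add("ALL:" + total);
        return out;
    }
}
```

Without such a hook, producing the trailing "ALL" row requires a second pass or an extra join over the data.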
[jira] Created: (PIG-602) Pass global configurations to UDF
Pass global configurations to UDF
---------------------------------

                 Key: PIG-602
                 URL: https://issues.apache.org/jira/browse/PIG-602
             Project: Pig
          Issue Type: New Feature
          Components: impl
            Reporter: Yiping Han

We are seeking an easy way to pass a large number of global configurations to UDFs. Our application contains many pig jobs and has a large number of configurations, so passing configurations through the command line is not ideal (i.e. modifying a single parameter requires changing multiple command lines), and putting everything into the hadoop conf is not ideal either. We would like to see if Pig can provide a facility that allows us to pass a configuration file in some format (XML?) and then make it available throughout all the UDFs.
[jira] Created: (PIG-603) Pig Server
Pig Server
----------

                 Key: PIG-603
                 URL: https://issues.apache.org/jira/browse/PIG-603
             Project: Pig
          Issue Type: New Feature
          Components: grunt
            Reporter: Yiping Han

With a real Pig Server, when we lose the client, the pig job will not be killed. Also, a more important reason for a Pig Server is that we could talk with the Pig Server through APIs to query status, failures, etc.
[jira] Created: (PIG-604) Kill the Pig job should kill all associated Hadoop Jobs
Kill the Pig job should kill all associated Hadoop Jobs
-------------------------------------------------------

                 Key: PIG-604
                 URL: https://issues.apache.org/jira/browse/PIG-604
             Project: Pig
          Issue Type: Improvement
          Components: grunt
            Reporter: Yiping Han

Currently, if we kill the pig job on the client machine, the hadoop jobs already launched keep running, and we have to kill these jobs manually.
[jira] Created: (PIG-605) Better explain and console output
Better explain and console output
---------------------------------

                 Key: PIG-605
                 URL: https://issues.apache.org/jira/browse/PIG-605
             Project: Pig
          Issue Type: Improvement
          Components: grunt
            Reporter: Yiping Han

It would be nice if, when we explain a script, the corresponding mapred jobs were explicitly marked out in a neat way. While executing the script, the console output could print the name and URL of the corresponding hadoop jobs.
[jira] Created: (PIG-606) Setting replication factor in Pig
Setting replication factor in Pig
---------------------------------

                 Key: PIG-606
                 URL: https://issues.apache.org/jira/browse/PIG-606
             Project: Pig
          Issue Type: New Feature
            Reporter: Yiping Han

We would like the STORE clause to be able to set the replication factor. This is particularly useful for certain small files, e.g. those used in a replicated join.
[jira] Created: (PIG-607) Utilize intermediate results instead of re-execution
Utilize intermediate results instead of re-execution
----------------------------------------------------

                 Key: PIG-607
                 URL: https://issues.apache.org/jira/browse/PIG-607
             Project: Pig
          Issue Type: New Feature
            Reporter: Yiping Han
            Priority: Critical

This is a long-standing problem: intermediate results are not reused. Every STORE or DUMP is executed in a separate plan, and thus everything it depends on is re-executed. This is a serious issue that should be fixed asap.
[jira] Created: (PIG-608) Compile or validate the whole script before execution
Compile or validate the whole script before execution
-----------------------------------------------------

                 Key: PIG-608
                 URL: https://issues.apache.org/jira/browse/PIG-608
             Project: Pig
          Issue Type: Improvement
          Components: grunt
            Reporter: Yiping Han

This is a very common scenario: we are running a big pig job that contains several hadoop jobs. It has been running for a long time, the first hadoop job succeeds, and then pig suddenly reports that it found a syntax error in the script after the first hadoop job... so we have to repeat from the beginning.

It would be nice if pig could compile to the end of the script, finding all the syntax errors, type mismatches, etc., before it really starts execution.
[jira] Created: (PIG-609) PIG does not return the correct error code
PIG does not return the correct error code
------------------------------------------

                 Key: PIG-609
                 URL: https://issues.apache.org/jira/browse/PIG-609
             Project: Pig
          Issue Type: Bug
            Reporter: Yiping Han

Pig still does not always return a correct error code. When a hadoop job fails, pig sometimes still returns 0.
[jira] Created: (PIG-611) Better logging support
Better logging support
----------------------

                 Key: PIG-611
                 URL: https://issues.apache.org/jira/browse/PIG-611
             Project: Pig
          Issue Type: Improvement
          Components: tools
            Reporter: Yiping Han

I started this ticket to discuss future improvements to logging. The first thing I would like to suggest is that pig needs more comprehensive logs. If there were a debug mode in which pig could print extensive, detailed logs, that would be very helpful.
[jira] Updated: (PIG-609) PIG does not return the correct error code
[ https://issues.apache.org/jira/browse/PIG-609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yiping Han updated PIG-609:
---------------------------

    Priority: Critical  (was: Major)

PIG does not return the correct error code
------------------------------------------

                 Key: PIG-609
                 URL: https://issues.apache.org/jira/browse/PIG-609
             Project: Pig
          Issue Type: Bug
            Reporter: Yiping Han
            Priority: Critical

Pig still does not always return a correct error code. When a hadoop job fails, pig sometimes still returns 0.
[jira] Commented: (PIG-610) Pig appears to continue when an underlying mapred job fails
[ https://issues.apache.org/jira/browse/PIG-610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12661823#action_12661823 ]

Yiping Han commented on PIG-610:
--------------------------------

Create a pig job with multiple mapred jobs. Let the script run, then manually kill the running mapred job. Pig reports the failure of this mapred job but does not terminate itself; the next mapred job is still launched. Pig should fail immediately.

Pig appears to continue when an underlying mapred job fails
-----------------------------------------------------------

                 Key: PIG-610
                 URL: https://issues.apache.org/jira/browse/PIG-610
             Project: Pig
          Issue Type: Bug
            Reporter: Yiping Han
            Priority: Critical

We observed that sometimes pig appears to continue when an underlying mapred job fails.