Problem running Pig 0.60

2009-11-03 Thread Yiping Han
Hi pig team,

I'm testing Zebra v2 and trying to run the Pig 0.60 jar that I got from Yan.
However, I got the following error:

Caused by: java.lang.ClassNotFoundException: jline.ConsoleReaderInputStream
at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
at java.lang.ClassLoader.loadClass(ClassLoader.java:252)
at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:320)

Is there any additional jar file that I need to include with Hadoop or Pig?


Thanks~
--
Yiping Han
y...@yahoo-inc.com
US phone: +1(408)349-4403
Beijing phone: +86(10)8215-9357 



[jira] Created: (PIG-941) [zebra] Loading non-existing column generates error

2009-09-01 Thread Yiping Han (JIRA)
[zebra] Loading non-existing column generates error
---

 Key: PIG-941
 URL: https://issues.apache.org/jira/browse/PIG-941
 Project: Pig
  Issue Type: Bug
  Components: data
Reporter: Yiping Han


Loading a column that does not exist generates the following error:

2009-09-01 21:29:15,161 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
2999: Unexpected internal error. null

Here is an example:

STORE urls2 into '$output' using 
org.apache.pig.table.pig.TableStorer('md5:string, url:string');

and then in another pig script, I load the table:

input = LOAD '$output' USING org.apache.pig.table.pig.TableLoader('md5,url, 
domain');

where domain is a column that does not exist.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: Proposal to create a branch for contrib project Zebra

2009-08-17 Thread Yiping Han
+1


On 8/18/09 7:11 AM, Olga Natkovich ol...@yahoo-inc.com wrote:

 +1
 
 -Original Message-
 From: Raghu Angadi [mailto:rang...@yahoo-inc.com]
 Sent: Monday, August 17, 2009 4:06 PM
 To: pig-dev@hadoop.apache.org
 Subject: Proposal to create a branch for contrib project Zebra
 
 
 Thanks to the PIG team, the first version of the contrib project Zebra
 (PIG-833) has been committed to the PIG trunk.
 
 In short, Zebra is a table storage layer built for use in PIG and other
 Hadoop applications.
 
 While we are stabilizing the current version V1 in the trunk, we plan to add
 more new features to it. We would like to create an svn branch for the new
 features. We will be responsible for managing Zebra in the PIG trunk and in
 the new branch. We will merge the branch when it is ready. We expect the
 changes to affect only the 'contrib/zebra' directory.
 
 As a regular contributor to Hadoop, I will be the initial committer for
 Zebra. As more patches are contributed by other Zebra developers, there
 might be more commiters added through normal Hadoop/Apache procedure.
 
 I would like to create a branch called 'zebra-v2' with approval from the PIG
 team.
 
 Thanks,
 Raghu.

--
Yiping Han
F-3140 
(408)349-4403
y...@yahoo-inc.com



Re: COUNT, AVG and nulls

2009-07-06 Thread Yiping Han
+1.

--Yiping


On 7/6/09 10:58 AM, Dmitriy Ryaboy dvrya...@cloudera.com wrote:

 +1 for standard semantics.
 
 We need a COALESCE function to go along with this.
 
 -D
 
 On Mon, Jul 6, 2009 at 10:46 AM, Olga Natkovich ol...@yahoo-inc.com wrote:
 
 Hi,
 
 
 
 The current implementation of COUNT and AVG in Pig counts null values.
 This is inconsistent with SQL semantics and also with semantics of other
 aggregated functions such as SUM, MIN, and MAX. Originally we chose this
 implementation for performance reasons; however, we re-implemented both
 functions to support multi-step combiner and now the cost of checking
 for null for the case where combiner is invoked is trivial. (I ran some
 tests with COUNT and they showed no performance difference.) We will pay a
 penalty for the non-combinable case, including local mode, but I think it
 is worth the price to have consistent semantics. Also, as we are working
 on SQL support, having SQL-compliant semantics becomes very desirable.
 
 
 
 Please, let us know if you have any concerns. I am planning to make the
 change later this week.
 
 
 
 Olga
 
 




[jira] Commented: (PIG-796) support conversion from numeric types to chararray

2009-05-29 Thread Yiping Han (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12714526#action_12714526
 ] 

Yiping Han commented on PIG-796:


I have the same idea that Alan proposed. I agree that in the common case most
values are of the same type. Caching the type, and changing the cached type
only when a ClassCastException is caught, would be the most efficient way.
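The caching strategy discussed above can be sketched as follows. This is an illustrative sketch, not Pig's actual implementation; the class and method names are hypothetical. The fast path casts to the cached type; a ClassCastException signals that the input type changed, so the cache is refreshed and the conversion retried.

```java
// Hypothetical sketch of the type-caching conversion discussed in this thread.
public class CachedCaster {
    // Cached runtime type of recent values; most inputs share one type.
    private Class<?> cached = Long.class;

    public String toChararray(Object v) {
        if (v == null) return null;
        try {
            // Fast path: assume the value has the cached type.
            return castAs(cached, v);
        } catch (ClassCastException e) {
            // Rare path: the input type changed; update the cache and retry.
            cached = v.getClass();
            return castAs(cached, v);
        }
    }

    private static String castAs(Class<?> t, Object v) {
        if (t == Long.class)    return Long.toString((Long) v);
        if (t == Integer.class) return Integer.toString((Integer) v);
        if (t == Double.class)  return Double.toString((Double) v);
        if (t == Float.class)   return Float.toString((Float) v);
        throw new IllegalArgumentException("unsupported type: " + t);
    }
}
```

The point of the design is that the common case (a long run of same-typed values) pays no instanceof chain, while a type switch costs exactly one caught exception.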

 support conversion from numeric types to chararray
 ---

 Key: PIG-796
 URL: https://issues.apache.org/jira/browse/PIG-796
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.2.0
Reporter: Olga Natkovich






[jira] Commented: (PIG-807) PERFORMANCE: Provide a way for UDFs to use read-once bags (backed by the Hadoop values iterator)

2009-05-19 Thread Yiping Han (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12710818#action_12710818
 ] 

Yiping Han commented on PIG-807:


David, the syntax B = foreach A generate SUM(m); is confusing for both
developers and the parser.

I like the idea of removing the explicit GROUP ALL, but I would rather use a
different keyword for that, e.g. B = FOR A GENERATE SUM(m);

Adding a new keyword for this purpose would also work as a hint for the parser
to treat this as direct Hadoop iterator access.

 PERFORMANCE: Provide a way for UDFs to use read-once bags (backed by the 
 Hadoop values iterator)
 

 Key: PIG-807
 URL: https://issues.apache.org/jira/browse/PIG-807
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.2.1
Reporter: Pradeep Kamath
 Fix For: 0.3.0


 Currently all bags resulting from a group or cogroup are materialized as bags 
 containing all of the contents. The issue with this is that if a particular 
 key has many corresponding values, all these values get stuffed into a bag, 
 which may run out of memory and hence spill, causing a slowdown in performance 
 and sometimes memory exceptions. In many cases, the udfs which use the bags 
 coming out of a group or cogroup only need to iterate over the bag in a 
 unidirectional, read-once manner. This can be implemented by having the bag 
 implement its iterator by simply iterating over the underlying hadoop 
 iterator provided in the reduce. This kind of a bag is also needed in 
 http://issues.apache.org/jira/browse/PIG-802, so the code can be reused for 
 this issue too. The other part of this issue is to have some way for the udfs 
 to communicate to Pig that any input bags that they need are read-once bags. 
 This can be achieved by having an interface - say UsesReadOnceBags - which 
 serves as a tag to indicate the intent to Pig. Pig can then rewire its 
 execution plan to use ReadOnceBags where feasible.
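The core of the proposal above can be sketched in a few lines. This is an illustrative sketch, not Pig's actual ReadOnceBag class: a "bag" that wraps the reduce-side values iterator and hands it out exactly once, so tuples stream through instead of being materialized and possibly spilled.

```java
import java.util.Iterator;

// Illustrative read-once bag backed by an underlying (e.g. reduce-side) iterator.
public class ReadOnceBag<T> implements Iterable<T> {
    private final Iterator<T> source;  // the underlying values iterator
    private boolean consumed = false;

    public ReadOnceBag(Iterator<T> reduceValues) {
        this.source = reduceValues;
    }

    @Override
    public Iterator<T> iterator() {
        if (consumed) {
            // A second pass is impossible: the data was never stored.
            throw new IllegalStateException("read-once bag already iterated");
        }
        consumed = true;
        return source;
    }
}
```

A streaming UDF such as SUM would iterate the bag once, accumulating as it goes; any UDF that needs random access or a second pass could not be tagged with the proposed UsesReadOnceBags interface.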




[jira] Commented: (PIG-664) Semantics of * is not consistent

2009-02-10 Thread Yiping Han (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12672372#action_12672372
 ] 

Yiping Han commented on PIG-664:


I would second Santhosh. In Pig 1.x, * in a UDF parameter list does expand to a
flattened list of columns. While converting to Pig 2.0, this creates a lot of
inconvenience. * should always generate flattened columns.

 Semantics of * is not consistent
 

 Key: PIG-664
 URL: https://issues.apache.org/jira/browse/PIG-664
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: types_branch
Reporter: Santhosh Srinivasan
Assignee: Santhosh Srinivasan
 Fix For: types_branch


 The semantics of * is not consistent in Pig. The use of * with generate 
 results in all the columns of the record being flattened. However, the 
 use of * as an input to a UDF results in a tuple (wrapped in another tuple). 
 For consistency, * should always result in all the columns of the record 
 (i.e., flattened). The use of * occurs in:
 1. Foreach generate: E.g.: foreach input generate *;
 2. Input to UDFs: E.g. foreach input generate myUDF(*);
 3. Order by: E.g.: order input by *;
 4. (Co)Group: E.g.: group a by *; cogroup a by *, b by *;
 In terms of implementation, this involves rolling back the fix introduced in 
 PIG-597 and fixing the following builtin UDFs:
 1. ARITY - Should return the size of the input tuple instead of extracting 
 the first column of the input tuple
 2. SIZE - Should return the size of the input tuple instead of extracting the 
 first column of the input tuple




[jira] Commented: (PIG-602) Pass global configurations to UDF

2009-02-10 Thread Yiping Han (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12672376#action_12672376
 ] 

Yiping Han commented on PIG-602:


Alan, this plan looks good for our requirements.

 Pass global configurations to UDF
 -

 Key: PIG-602
 URL: https://issues.apache.org/jira/browse/PIG-602
 Project: Pig
  Issue Type: New Feature
  Components: impl
Reporter: Yiping Han
Assignee: Alan Gates

 We are seeking an easy way to pass a large number of global configurations to 
 UDFs.
 Since our application contains many Pig jobs and has a large number of 
 configurations, passing configurations through the command line is not ideal 
 (modifying a single parameter requires changing multiple command lines), and 
 putting everything into the Hadoop conf is not ideal either.
 We would like Pig to provide a facility that allows us to pass a configuration 
 file in some format (XML?) and then make it available throughout all the UDFs.




[jira] Commented: (PIG-282) Custom Partitioner

2009-02-10 Thread Yiping Han (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12672467#action_12672467
 ] 

Yiping Han commented on PIG-282:


Any concerns on this issue?

 Custom Partitioner
 --

 Key: PIG-282
 URL: https://issues.apache.org/jira/browse/PIG-282
 Project: Pig
  Issue Type: New Feature
Reporter: Amir Youssefi
Priority: Minor

 By adding custom partitioner we can give control over which output partition 
 a key (/value) goes to. We can add keywords to language e.g. 
 PARTITION BY UDF(...)
 or a similar syntax. UDF returns a number between 0 and n-1 where n is number 
 of output partitions.
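The contract described above (a UDF returning a number between 0 and n-1) can be sketched as follows. The class and method names are hypothetical, not an actual Pig interface; the arithmetic mirrors Hadoop's default hash partitioning.

```java
// Hypothetical sketch of the proposed PARTITION BY UDF(...) contract:
// map a key to a partition number in [0, numPartitions).
public class ModPartitioner {
    public int getPartition(Object key, int numPartitions) {
        // Mask off the sign bit so a negative hashCode still yields
        // a non-negative partition index.
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}
```

Any deterministic function of the key works here; the only requirements are that the result stays in range and that equal keys always map to the same partition.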




[jira] Commented: (PIG-610) Pig appears to continue when an underlying mapred job fails

2009-01-08 Thread Yiping Han (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12662084#action_12662084
 ] 

Yiping Han commented on PIG-610:


We are on Hadoop 0.18.2 and the latest pig_types branch. We tried to do hadoop 
job -kill x from a different terminal. I believe this happens every time; 
since Ralf gave me instructions yesterday, I can easily reproduce it.

 Pig appears to continue when an underlying mapred job fails 
 

 Key: PIG-610
 URL: https://issues.apache.org/jira/browse/PIG-610
 Project: Pig
  Issue Type: Bug
Reporter: Yiping Han
Priority: Critical

 We observed that sometimes Pig appears to continue when an underlying mapred 
 job fails.




[jira] Created: (PIG-601) Add finalize() interface to UDF

2009-01-07 Thread Yiping Han (JIRA)
Add finalize() interface to UDF
---

 Key: PIG-601
 URL: https://issues.apache.org/jira/browse/PIG-601
 Project: Pig
  Issue Type: New Feature
  Components: impl
Reporter: Yiping Han


I would like to add a finalize() method to UDFs, which would be called when 
there are no more inputs and the UDF is about to be destroyed. The finalize() 
method should be allowed to generate extra output, which in many cases could 
benefit aggregations.

There are a couple of applications that can benefit from this feature.

One example: in some UDFs I need to open a resource (e.g. a local file), and 
when the task finishes I need to close the resource.

Another example: in one of my applications I compute statistics for a list of 
categories, and I need to generate a summary category and append it to the end 
of the table. With a finalize method, I could achieve this in an efficient and 
neat way.
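The requested lifecycle can be sketched as below. The names are hypothetical, not a real Pig API (finish() is used instead of finalize() to avoid colliding with Java's Object.finalize()): a per-row exec() accumulates, and a once-at-the-end finish() releases resources and emits the extra summary record.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of a UDF with an end-of-input hook.
public class CategoryCounter {
    private final List<String> output = new ArrayList<>();
    private long total = 0;

    // Called once per input row: emit the per-category record, accumulate.
    public void exec(String category, long count) {
        output.add(category + "\t" + count);
        total += count;
    }

    // Called once after the last input: close any resources and
    // append the summary row described in the ticket.
    public List<String> finish() {
        output.add("ALL\t" + total);
        return output;
    }
}
```

Without such a hook, the summary row would require a second pass (e.g. a GROUP ALL over the same data), which is exactly the inefficiency the ticket wants to avoid.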




[jira] Created: (PIG-602) Pass global configurations to UDF

2009-01-07 Thread Yiping Han (JIRA)
Pass global configurations to UDF
-

 Key: PIG-602
 URL: https://issues.apache.org/jira/browse/PIG-602
 Project: Pig
  Issue Type: New Feature
  Components: impl
Reporter: Yiping Han


We are seeking an easy way to pass a large number of global configurations to 
UDFs.

Since our application contains many Pig jobs and has a large number of 
configurations, passing configurations through the command line is not ideal 
(modifying a single parameter requires changing multiple command lines), and 
putting everything into the Hadoop conf is not ideal either.

We would like Pig to provide a facility that allows us to pass a configuration 
file in some format (XML?) and then make it available throughout all the UDFs.
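One possible shape for this facility, using only the JDK's XML properties format, is sketched below. This is illustrative, not an existing Pig feature; the class name and static accessor are assumptions. The launcher would load the shipped XML file once per task, and every UDF would read settings through the accessor instead of the command line.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.util.Properties;

// Hypothetical global-configuration holder shared by all UDFs in a task.
public final class GlobalConf {
    private static final Properties PROPS = new Properties();

    private GlobalConf() {}

    // Called once, e.g. by the launcher, with the shipped XML config file.
    public static void load(String xmlPath) throws IOException {
        try (FileInputStream in = new FileInputStream(xmlPath)) {
            PROPS.loadFromXML(in);
        }
    }

    // Any UDF can read a setting without it appearing on the command line.
    public static String get(String key, String defaultValue) {
        return PROPS.getProperty(key, defaultValue);
    }
}
```

Changing one parameter then means editing one file rather than every pig invocation that mentions it.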






[jira] Created: (PIG-603) Pig Server

2009-01-07 Thread Yiping Han (JIRA)
Pig Server
--

 Key: PIG-603
 URL: https://issues.apache.org/jira/browse/PIG-603
 Project: Pig
  Issue Type: New Feature
  Components: grunt
Reporter: Yiping Han


With a real Pig server, the Pig job would not be killed when we lose the 
client. More importantly, we could talk to the Pig server through APIs to 
query status, failures, etc.







[jira] Created: (PIG-604) Kill the Pig job should kill all associated Hadoop Jobs

2009-01-07 Thread Yiping Han (JIRA)
Kill the Pig job should kill all associated Hadoop Jobs
---

 Key: PIG-604
 URL: https://issues.apache.org/jira/browse/PIG-604
 Project: Pig
  Issue Type: Improvement
  Components: grunt
Reporter: Yiping Han


Currently, if we kill the Pig job on the client machine, the Hadoop jobs 
already launched keep running. We have to kill these jobs manually.




[jira] Created: (PIG-605) Better explain and console output

2009-01-07 Thread Yiping Han (JIRA)
Better explain and console output
-

 Key: PIG-605
 URL: https://issues.apache.org/jira/browse/PIG-605
 Project: Pig
  Issue Type: Improvement
  Components: grunt
Reporter: Yiping Han


It would be nice if, when we explain a script, the corresponding mapred jobs 
were explicitly marked out in a neat way. While executing the script, the 
console output could print the name and URL of the corresponding Hadoop jobs.





[jira] Created: (PIG-606) Setting replication factor in Pig

2009-01-07 Thread Yiping Han (JIRA)
Setting replication factor in Pig
-

 Key: PIG-606
 URL: https://issues.apache.org/jira/browse/PIG-606
 Project: Pig
  Issue Type: New Feature
Reporter: Yiping Han


We would like the STORE clause to be able to set the replication factor. This 
is particularly useful for certain small files, e.g. those used in a 
replicated join.




[jira] Created: (PIG-607) Utilize intermediate results instead of re-execution

2009-01-07 Thread Yiping Han (JIRA)
Utilize intermediate results instead of re-execution


 Key: PIG-607
 URL: https://issues.apache.org/jira/browse/PIG-607
 Project: Pig
  Issue Type: New Feature
Reporter: Yiping Han
Priority: Critical


This is a long-standing problem: intermediate results are not reused. Every 
STORE or DUMP is executed as a separate plan, and thus everything it depends 
on is re-executed. This is a serious issue that should be fixed as soon as 
possible.







[jira] Created: (PIG-608) Compile or validate the whole script before execution

2009-01-07 Thread Yiping Han (JIRA)
Compile or validate the whole script before execution
-

 Key: PIG-608
 URL: https://issues.apache.org/jira/browse/PIG-608
 Project: Pig
  Issue Type: Improvement
  Components: grunt
Reporter: Yiping Han


This is a very common scenario:

We are running a big Pig job that contains several Hadoop jobs. It has been 
running for a long time and the first Hadoop job succeeds; then Pig suddenly 
reports a syntax error in the script after the first Hadoop job, and we have 
to repeat from the beginning.

It would be nice if Pig could compile to the end of the script, finding all 
syntax errors, type mismatches, etc., before it really starts execution.




[jira] Created: (PIG-609) PIG does not return the correct error code

2009-01-07 Thread Yiping Han (JIRA)
PIG does not return the correct error code
--

 Key: PIG-609
 URL: https://issues.apache.org/jira/browse/PIG-609
 Project: Pig
  Issue Type: Bug
Reporter: Yiping Han


Pig still does not always return a correct error code. When a Hadoop job 
fails, sometimes Pig still returns 0.




[jira] Created: (PIG-611) Better logging support

2009-01-07 Thread Yiping Han (JIRA)
Better logging support
--

 Key: PIG-611
 URL: https://issues.apache.org/jira/browse/PIG-611
 Project: Pig
  Issue Type: Improvement
  Components: tools
Reporter: Yiping Han


I started this ticket to discuss future improvements to logging.

The first thing I would like to suggest is that Pig needs more comprehensive 
logs. A debug mode in which Pig could print extensive, detailed logs would be 
very helpful.




[jira] Updated: (PIG-609) PIG does not return the correct error code

2009-01-07 Thread Yiping Han (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yiping Han updated PIG-609:
---

Priority: Critical  (was: Major)

 PIG does not return the correct error code
 --

 Key: PIG-609
 URL: https://issues.apache.org/jira/browse/PIG-609
 Project: Pig
  Issue Type: Bug
Reporter: Yiping Han
Priority: Critical

 Pig still does not always return a correct error code. When a Hadoop job 
 fails, sometimes Pig still returns 0.




[jira] Commented: (PIG-610) Pig appears to continue when an underlying mapred job fails

2009-01-07 Thread Yiping Han (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12661823#action_12661823
 ] 

Yiping Han commented on PIG-610:


Create a Pig job with multiple mapred jobs. Let the script run, then manually 
kill the running mapred job. Pig reports the failure of this mapred job but 
does not terminate itself; the next mapred job is still launched.

Pig should fail immediately.

 Pig appears to continue when an underlying mapred job fails 
 

 Key: PIG-610
 URL: https://issues.apache.org/jira/browse/PIG-610
 Project: Pig
  Issue Type: Bug
Reporter: Yiping Han
Priority: Critical

 We observed that sometimes Pig appears to continue when an underlying mapred 
 job fails.
