[jira] Commented: (PIG-1829) "0" value seen in PigStat's map/reduce runtime, even when the job is successful

2011-01-28 Thread Arun C Murthy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12988344#action_12988344
 ] 

Arun C Murthy commented on PIG-1829:


bq. A short-term fix would be to poll the JobControl for finished jobs (instead
of waiting for all the jobs in the batch to complete).

+1 - this should suffice for a very long time. *smile*

> "0" value seen in PigStat's map/reduce runtime, even when the job is 
> successful
> ---
>
> Key: PIG-1829
> URL: https://issues.apache.org/jira/browse/PIG-1829
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 0.8.0
>Reporter: Thejas M Nair
> Fix For: 0.9.0
>
>
> Pig runtime calls JobClient.getMapTaskReports(jobId) and
> JobClient.getReduceTaskReports(jobId) to get statistics about the numbers of
> maps/reducers, as well as the max/min/avg time of these tasks. But from time to
> time, these calls return empty lists. When that happens, Pig reports 0
> values for the stats.
> The jobtracker keeps the stats information only for a limited duration, based
> on the configuration parameters mapred.jobtracker.completeuserjobs.maximum
> and mapred.job.tracker.retiredjobs.cache.size. Since Pig collects the stats
> after jobs have finished running, it is possible that the stats for the
> initial jobs are no longer available. To have a better chance of getting the
> stats, they should be collected as soon as the job is over.
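
For reference, a minimal sketch of the calls in question (the job id below is
illustrative; JobClient and TaskReport are the org.apache.hadoop.mapred classes
named above):

{code}
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.JobID;
import org.apache.hadoop.mapred.TaskReport;

public class TaskReportProbe {
    public static void main(String[] args) throws Exception {
        JobClient client = new JobClient(new JobConf());
        JobID id = JobID.forName("job_201101280000_0001"); // illustrative id
        // Once the JT has evicted the job, these calls return empty arrays
        // rather than failing, which is how the zeros end up in PigStats.
        TaskReport[] maps = client.getMapTaskReports(id);
        TaskReport[] reduces = client.getReduceTaskReports(id);
        System.out.println(maps.length + " map reports, "
                + reduces.length + " reduce reports");
    }
}
{code}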

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1733) java.lang.NumberFormatException as value is automatically detected as int

2011-01-28 Thread Laukik Chitnis (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12988343#action_12988343
 ] 

Laukik Chitnis commented on PIG-1733:
-

Since UDFs cannot declare types for the parameters within the input tuple,
wouldn't this be the expected behavior? I.e., if a user does not specify that
the value is a long, we will try to parse it as an int by default. Also,
expressing a long value as 21431317276L isn't really typecasting; isn't that
just how a long value is specified?

Or am I missing something?

Thanks,
Laukik 
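
For what it's worth, the failure itself is plain 32-bit overflow; a
self-contained illustration:

{code}
public class LongLiteralDemo {
    public static void main(String[] args) {
        // 21431317276 > Integer.MAX_VALUE (2147483647), so the parser's
        // default int parse throws while a long parse succeeds.
        System.out.println(Long.parseLong("21431317276"));   // fine
        System.out.println(Integer.parseInt("21431317276")); // NumberFormatException
    }
}
{code}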

> java.lang.NumberFormatException as value is automatically detected as int
> -
>
> Key: PIG-1733
> URL: https://issues.apache.org/jira/browse/PIG-1733
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.6.0, 0.7.0
>Reporter: Viraj Bhat
>Assignee: Laukik Chitnis
> Fix For: 0.9.0
>
>
> I have a Pig script which uses a custom FilterFunc "ANYIN". The parser has
> made this an "integer" by default. It should be detected as a "long". The
> following cast works:
> {code}B = filter A by ANYIN(id, 21431317276L);{code}
> {code}
> A0 = load '/projects/cookie/20101018/input' using MyLoader as s:map[];
> A = foreach A0 generate s#'cookie' as cookie, s#'rtype' as rtype, s#'id' as 
> id, s#'networkid' as networkid;
> B = filter A by ANYIN(id, 21431317276);
> C = GROUP B BY cookie parallel 10;
> D = foreach C generate group, COUNT(B) as COUNT_FIELD;
> E = filter D BY INRANGE(COUNT_FIELD, 1,1000);
> F = foreach E generate group;
> store F into '/projects/cookie/20101018/output';
> {code}
> Since the parser tries to convert the input to an int, we get the following
> error:
> {quote}
> at 
> java.lang.NumberFormatException.forInputString(NumberFormatException.java:48)
> at java.lang.Integer.parseInt(Integer.java:459)
> at java.lang.Integer.parseInt(Integer.java:497)
> at 
> org.apache.pig.impl.logicalLayer.parser.QueryParser.AtomDatum(QueryParser.java:6593)
> at 
> org.apache.pig.impl.logicalLayer.parser.QueryParser.Const(QueryParser.java:6707)
> at 
> org.apache.pig.impl.logicalLayer.parser.QueryParser.BaseEvalSpec(QueryParser.java:4868)
> at 
> org.apache.pig.impl.logicalLayer.parser.QueryParser.UnaryExpr(QueryParser.java:4774)
> at 
> org.apache.pig.impl.logicalLayer.parser.QueryParser.CastExpr(QueryParser.java:4720)
> at 
> org.apache.pig.impl.logicalLayer.parser.QueryParser.MultiplicativeExpr(QueryParser.java:4629)
> at 
> org.apache.pig.impl.logicalLayer.parser.QueryParser.AdditiveExpr(QueryParser.java:4555)
> at 
> org.apache.pig.impl.logicalLayer.parser.QueryParser.InfixExpr(QueryParser.java:4521)
> at 
> org.apache.pig.impl.logicalLayer.parser.QueryParser.EvalArgsItem(QueryParser.java:5271)
> at 
> org.apache.pig.impl.logicalLayer.parser.QueryParser.EvalArgs(QueryParser.java:5231)
> at 
> org.apache.pig.impl.logicalLayer.parser.QueryParser.EvalFuncSpec(QueryParser.java:5049)
> at 
> org.apache.pig.impl.logicalLayer.parser.QueryParser.PUnaryCond(QueryParser.java:2075)
> at 
> org.apache.pig.impl.logicalLayer.parser.QueryParser.PAndCond(QueryParser.java:1916)
> at 
> org.apache.pig.impl.logicalLayer.parser.QueryParser.POrCond(QueryParser.java:1860)
> at 
> org.apache.pig.impl.logicalLayer.parser.QueryParser.PCond(QueryParser.java:1826)
> at 
> org.apache.pig.impl.logicalLayer.parser.QueryParser.FilterClause(QueryParser.java:1661)
> at 
> org.apache.pig.impl.logicalLayer.parser.QueryParser.BaseExpr(QueryParser.java:1368)
> at 
> org.apache.pig.impl.logicalLayer.parser.QueryParser.Expr(QueryParser.java:985)
> at 
> org.apache.pig.impl.logicalLayer.parser.QueryParser.Parse(QueryParser.java:774)
> at 
> org.apache.pig.impl.logicalLayer.LogicalPlanBuilder.parse(LogicalPlanBuilder.java:63)
> at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1164)
> at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1114)
> at org.apache.pig.PigServer.registerQuery(PigServer.java:425)
> at 
> org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:737)
> at 
> org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:324)
> at 
> org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:162)
> at 
> org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:138)
> at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:89)
> at org.apache.pig.Main.main(Main.java:314)
> {quote}
> Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1829) "0" value seen in PigStat's map/reduce runtime, even when the job is successful

2011-01-28 Thread Richard Ding (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12988342#action_12988342
 ] 

Richard Ding commented on PIG-1829:
---

The current implementation gets job stats after each batch of jobs is finished
(i.e., jobs that can run in parallel). A short-term fix would be to poll the
JobControl for finished jobs (instead of waiting for all the jobs in the batch
to complete). This would reduce the time window between when a job finishes
and when its stats are queried, in cases where the running times of the jobs
in a batch vary.
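
A minimal sketch of that short-term fix (fetchTaskReports stands in for the
JobClient calls named in the issue description; the polling interval is
arbitrary):

{code}
import java.util.HashSet;
import java.util.Set;
import org.apache.hadoop.mapred.JobID;
import org.apache.hadoop.mapred.jobcontrol.Job;
import org.apache.hadoop.mapred.jobcontrol.JobControl;

public class BatchPoller {
    // Poll JobControl and fetch stats the moment each job succeeds,
    // instead of waiting for the whole batch to complete.
    static void runAndCollect(JobControl jc) throws InterruptedException {
        new Thread(jc).start();                  // JobControl is a Runnable
        Set<String> seen = new HashSet<String>();
        while (!jc.allFinished()) {
            for (Job job : jc.getSuccessfulJobs()) {
                JobID id = job.getAssignedJobID();
                if (seen.add(id.toString())) {
                    fetchTaskReports(id);        // query stats immediately
                }
            }
            Thread.sleep(1000);
        }
    }

    static void fetchTaskReports(JobID id) { /* JobClient.get*TaskReports(id) */ }
}
{code}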

> "0" value seen in PigStat's map/reduce runtime, even when the job is 
> successful
> ---
>
> Key: PIG-1829
> URL: https://issues.apache.org/jira/browse/PIG-1829
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 0.8.0
>Reporter: Thejas M Nair
> Fix For: 0.9.0
>
>
> Pig runtime calls JobClient.getMapTaskReports(jobId) and
> JobClient.getReduceTaskReports(jobId) to get statistics about the numbers of
> maps/reducers, as well as the max/min/avg time of these tasks. But from time to
> time, these calls return empty lists. When that happens, Pig reports 0
> values for the stats.
> The jobtracker keeps the stats information only for a limited duration, based
> on the configuration parameters mapred.jobtracker.completeuserjobs.maximum
> and mapred.job.tracker.retiredjobs.cache.size. Since Pig collects the stats
> after jobs have finished running, it is possible that the stats for the
> initial jobs are no longer available. To have a better chance of getting the
> stats, they should be collected as soon as the job is over.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1829) "0" value seen in PigStat's map/reduce runtime, even when the job is successful

2011-01-28 Thread Arun C Murthy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12988324#action_12988324
 ] 

Arun C Murthy commented on PIG-1829:


bq. Is there any range in practical usage: 1 minute, 5 minutes, 1 hour, etc.?
It would help to set expectations with users.

It is a number of jobs. Typically it's 1000, so the last 1k jobs are around
for querying status.
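
For context, these are the two JT-side retention knobs named in the issue
description (the fallback values passed below are illustrative only, not the
actual defaults of any particular cluster):

{code}
import org.apache.hadoop.conf.Configuration;

public class RetentionConfig {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Both knobs are counts of jobs, not durations: completed jobs kept
        // in JT memory per user, and retired jobs cached for status queries.
        int inMemory = conf.getInt("mapred.jobtracker.completeuserjobs.maximum", 100);
        int retired = conf.getInt("mapred.job.tracker.retiredjobs.cache.size", 1000);
        System.out.println("in-memory: " + inMemory + ", retired cache: " + retired);
    }
}
{code}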

> "0" value seen in PigStat's map/reduce runtime, even when the job is 
> successful
> ---
>
> Key: PIG-1829
> URL: https://issues.apache.org/jira/browse/PIG-1829
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 0.8.0
>Reporter: Thejas M Nair
> Fix For: 0.9.0
>
>
> Pig runtime calls JobClient.getMapTaskReports(jobId) and
> JobClient.getReduceTaskReports(jobId) to get statistics about the numbers of
> maps/reducers, as well as the max/min/avg time of these tasks. But from time to
> time, these calls return empty lists. When that happens, Pig reports 0
> values for the stats.
> The jobtracker keeps the stats information only for a limited duration, based
> on the configuration parameters mapred.jobtracker.completeuserjobs.maximum
> and mapred.job.tracker.retiredjobs.cache.size. Since Pig collects the stats
> after jobs have finished running, it is possible that the stats for the
> initial jobs are no longer available. To have a better chance of getting the
> stats, they should be collected as soon as the job is over.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1829) "0" value seen in PigStat's map/reduce runtime, even when the job is successful

2011-01-28 Thread Santhosh Srinivasan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12988320#action_12988320
 ] 

Santhosh Srinivasan commented on PIG-1829:
--

bq. bq. What are the bounds on the reasonable amount of time?
bq. The JT has a configured limit on #jobs in memory and disk. So, one can 
customize it per-installation.

Is there any range in practical usage: 1 minute, 5 minutes, 1 hour, etc.? It
would help to set expectations with users.

> "0" value seen in PigStat's map/reduce runtime, even when the job is 
> successful
> ---
>
> Key: PIG-1829
> URL: https://issues.apache.org/jira/browse/PIG-1829
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 0.8.0
>Reporter: Thejas M Nair
> Fix For: 0.9.0
>
>
> Pig runtime calls JobClient.getMapTaskReports(jobId) and
> JobClient.getReduceTaskReports(jobId) to get statistics about the numbers of
> maps/reducers, as well as the max/min/avg time of these tasks. But from time to
> time, these calls return empty lists. When that happens, Pig reports 0
> values for the stats.
> The jobtracker keeps the stats information only for a limited duration, based
> on the configuration parameters mapred.jobtracker.completeuserjobs.maximum
> and mapred.job.tracker.retiredjobs.cache.size. Since Pig collects the stats
> after jobs have finished running, it is possible that the stats for the
> initial jobs are no longer available. To have a better chance of getting the
> stats, they should be collected as soon as the job is over.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1829) "0" value seen in PigStat's map/reduce runtime, even when the job is successful

2011-01-28 Thread Arun C Murthy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12988315#action_12988315
 ] 

Arun C Murthy commented on PIG-1829:


bq. What are the bounds on the reasonable amount of time?
The JT has a configured limit on #jobs in memory and disk. So, one can 
customize it per-installation.

bq. Are there any plans in place to support APIs to access job history?
MAPREDUCE-1941 and Rumen.

> "0" value seen in PigStat's map/reduce runtime, even when the job is 
> successful
> ---
>
> Key: PIG-1829
> URL: https://issues.apache.org/jira/browse/PIG-1829
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 0.8.0
>Reporter: Thejas M Nair
> Fix For: 0.9.0
>
>
> Pig runtime calls JobClient.getMapTaskReports(jobId) and
> JobClient.getReduceTaskReports(jobId) to get statistics about the numbers of
> maps/reducers, as well as the max/min/avg time of these tasks. But from time to
> time, these calls return empty lists. When that happens, Pig reports 0
> values for the stats.
> The jobtracker keeps the stats information only for a limited duration, based
> on the configuration parameters mapred.jobtracker.completeuserjobs.maximum
> and mapred.job.tracker.retiredjobs.cache.size. Since Pig collects the stats
> after jobs have finished running, it is possible that the stats for the
> initial jobs are no longer available. To have a better chance of getting the
> stats, they should be collected as soon as the job is over.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



RE: Pig developer meeting in February

2011-01-28 Thread Santhosh Srinivasan
I am planning to attend. 

-Original Message-
From: Olga Natkovich [mailto:ol...@yahoo-inc.com] 
Sent: Friday, January 28, 2011 12:58 PM
To: dev@pig.apache.org
Subject: RE: Pig developer meeting in February

I believe we have critical mass so the meeting is on!

If you have not responded yet but planning to attend, please, let me know.

Thanks,

Olga

-Original Message-
From: Julien Le Dem [mailto:led...@yahoo-inc.com]
Sent: Thursday, January 27, 2011 5:21 PM
To: dev@pig.apache.org
Subject: Re: Pig developer meeting in February

Me too.
Julien


On 1/27/11 4:09 PM, "Dmitriy Ryaboy"  wrote:

Ok yeah I'll come :).



On Thu, Jan 27, 2011 at 3:17 PM, Olga Natkovich  wrote:

> While there is a lively discussion on this thread, I have not actually
> gotten any responses to having the meeting, with the exception of 1 person :).
>
> Please, let me know by the end of the week if you are planning to attend.
> If we don't get at least a few more responses I suggest we postpone 
> the meeting.
>
> Thanks,
>
> Olga
>
> -Original Message-
> From: Dmitriy Ryaboy [mailto:dvrya...@gmail.com]
> Sent: Wednesday, January 26, 2011 6:04 PM
> To: dev@pig.apache.org
> Subject: Re: Pig developer meeting in February
>
> Right, we do partition filtering, but not true predicate pushdown.
>
> On Wed, Jan 26, 2011 at 5:59 PM, Daniel Dai 
> wrote:
>
> > Are you talking about LoadMetadata.setPartitionFilter?
> > PartitionFilterOptimizer will do that.
> >
> > Daniel
> >
> >
> > Dmitriy Ryaboy wrote:
> >
> >> I may be wrong, but I think predicate pushdown is designed for, but
> >> not actually implemented in, the current LoadPushdown interface (you
> >> can only push projections). If I am wrong, that's great... but if
> >> not, that would be an important feature to add, as people are trying
> >> to connect Pig to "smart" storage systems like rdbmses, HBase, and
> >> Cassandra more and more. I think we only kind of simulate this with
> >> partition keys info, which is not always sufficient.
> >>
> >> D
> >>
> >> On Wed, Jan 26, 2011 at 2:41 PM, Julien Le Dem 
> >> 
> >> wrote:
> >>
> >>
> >>
> >>> If making Pig thread safe (i.e. two threads running different pig
> >>> scripts) is important, then we need to change some of the APIs from
> >>> static singleton access to a dependency injection pattern.
> >>> In that case, this should probably be done before 1.0. For example:
> >>> UDFContext should be passed to the UDF after construction (similar
> >>> to the ServletContext in servlets, or the way Hadoop passes the
> >>> context to tasks). Also, a clearly separated API that does not
> >>> depend on the Pig implementation would help.
> >>> For example, UDFContext is in org.apache.pig.impl.util when it
> >>> would be better in org.apache.pig.api (or at least an interface
> >>> defining it).
> >>>
> >>> Julien
> >>>
> >>> On 1/24/11 10:14 AM, "Olga Natkovich"  wrote:
> >>>
> >>> Hi Guys,
> >>>
> >>> I think it is time for us to have another meeting. Yahoo would be 
> >>> happy to host if this works for everybody. How about Wednesday, 
> >>> 2/9 4-6 pm.
> >>> Please,
> >>> let us know if you are planning to attend and if the date/time 
> >>> works
> for
> >>> you.
> >>>
> >>> Things that come to mind to discuss and as always feel free to 
> >>> suggest
> >>> others:
> >>>
> >>> -  Error handling proposal - this might be easier to finalize
> >>> face-to-face
> >>> -  Pig 0.9 plan
> >>> -  Pig Roadmap beyond 0.9
> >>> o What do we want to do in Pig.next?
> >>> o Are we ready for Pig 1.0?
> >>>
> >>> Olga
> >>>
> >>>
> >>>
> >>>
> >>
> >
>
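
A hypothetical sketch of the injection pattern Julien describes in the quoted
thread (the interface and method name are invented for illustration;
UDFContext is the real class in org.apache.pig.impl.util):

{code}
import org.apache.pig.impl.util.UDFContext;

// Instead of UDFs reaching for the static singleton
// (UDFContext.getUDFContext()), the framework would hand the context to
// the UDF after construction, the way servlets receive a ServletContext.
public interface ContextAwareUdf {
    void setUdfContext(UDFContext context);
}
{code}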



[jira] Commented: (PIG-1829) "0" value seen in PigStat's map/reduce runtime, even when the job is successful

2011-01-28 Thread Santhosh Srinivasan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12988306#action_12988306
 ] 

Santhosh Srinivasan commented on PIG-1829:
--

A couple of questions:

What are the bounds on the reasonable amount of time?
Are there any plans in place to support APIs to access job history?

> "0" value seen in PigStat's map/reduce runtime, even when the job is 
> successful
> ---
>
> Key: PIG-1829
> URL: https://issues.apache.org/jira/browse/PIG-1829
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 0.8.0
>Reporter: Thejas M Nair
> Fix For: 0.9.0
>
>
> Pig runtime calls JobClient.getMapTaskReports(jobId) and
> JobClient.getReduceTaskReports(jobId) to get statistics about the numbers of
> maps/reducers, as well as the max/min/avg time of these tasks. But from time to
> time, these calls return empty lists. When that happens, Pig reports 0
> values for the stats.
> The jobtracker keeps the stats information only for a limited duration, based
> on the configuration parameters mapred.jobtracker.completeuserjobs.maximum
> and mapred.job.tracker.retiredjobs.cache.size. Since Pig collects the stats
> after jobs have finished running, it is possible that the stats for the
> initial jobs are no longer available. To have a better chance of getting the
> stats, they should be collected as soon as the job is over.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1829) "0" value seen in PigStat's map/reduce runtime, even when the job is successful

2011-01-28 Thread Arun C Murthy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12988300#action_12988300
 ] 

Arun C Murthy commented on PIG-1829:


Reading directly from the JobHistory directory is a non-starter - permissions.

We should investigate standard APIs. For now, getting stats right after job
completion will work pretty much all the time. The JT has the ability to serve
stats for completed jobs, either from memory or disk, for a reasonable amount
of time after job-completion.

> "0" value seen in PigStat's map/reduce runtime, even when the job is 
> successful
> ---
>
> Key: PIG-1829
> URL: https://issues.apache.org/jira/browse/PIG-1829
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 0.8.0
>Reporter: Thejas M Nair
> Fix For: 0.9.0
>
>
> Pig runtime calls JobClient.getMapTaskReports(jobId) and
> JobClient.getReduceTaskReports(jobId) to get statistics about the numbers of
> maps/reducers, as well as the max/min/avg time of these tasks. But from time to
> time, these calls return empty lists. When that happens, Pig reports 0
> values for the stats.
> The jobtracker keeps the stats information only for a limited duration, based
> on the configuration parameters mapred.jobtracker.completeuserjobs.maximum
> and mapred.job.tracker.retiredjobs.cache.size. Since Pig collects the stats
> after jobs have finished running, it is possible that the stats for the
> initial jobs are no longer available. To have a better chance of getting the
> stats, they should be collected as soon as the job is over.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1829) "0" value seen in PigStat's map/reduce runtime, even when the job is successful

2011-01-28 Thread Santhosh Srinivasan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12988278#action_12988278
 ] 

Santhosh Srinivasan commented on PIG-1829:
--

+1 for both - investigate further to check for options to retrieve the job from 
history and push for standardized access mechanisms.

For now, can we document this behavior? The only concern I have is around the
duration for which the JT retains the job in memory. E.g., if more than half
the time we are unable to retrieve statistics, it will be a bit frustrating
for users.

> "0" value seen in PigStat's map/reduce runtime, even when the job is 
> successful
> ---
>
> Key: PIG-1829
> URL: https://issues.apache.org/jira/browse/PIG-1829
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 0.8.0
>Reporter: Thejas M Nair
> Fix For: 0.9.0
>
>
> Pig runtime calls JobClient.getMapTaskReports(jobId) and
> JobClient.getReduceTaskReports(jobId) to get statistics about the numbers of
> maps/reducers, as well as the max/min/avg time of these tasks. But from time to
> time, these calls return empty lists. When that happens, Pig reports 0
> values for the stats.
> The jobtracker keeps the stats information only for a limited duration, based
> on the configuration parameters mapred.jobtracker.completeuserjobs.maximum
> and mapred.job.tracker.retiredjobs.cache.size. Since Pig collects the stats
> after jobs have finished running, it is possible that the stats for the
> initial jobs are no longer available. To have a better chance of getting the
> stats, they should be collected as soon as the job is over.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1812) Problem with DID_NOT_FIND_LOAD_ONLY_MAP_PLAN

2011-01-28 Thread Thejas M Nair (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12988270#action_12988270
 ] 

Thejas M Nair commented on PIG-1812:


+1

> Problem with DID_NOT_FIND_LOAD_ONLY_MAP_PLAN
> 
>
> Key: PIG-1812
> URL: https://issues.apache.org/jira/browse/PIG-1812
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.7.0, 0.8.0
> Environment: RHEL, Pig 0.8.0
>Reporter: xianyu
>Assignee: Daniel Dai
> Attachments: PIG-1812-1.patch, PIG-1812-2.patch
>
>
> Hi, 
> I have the following input files:
> pkg.txt
> a   3   {(123,1.0),(236,2.0)}
> a   3   {(236,1.0)}
> model.txt
> a   123 2   0.33
> a   236 2   0.5
> My script is listed below:
> A = load 'pkg.txt' using PigStorage('\t') as (pkg:chararray, ts:int, 
> cat_bag:{t:(id:chararray, wht:float)});
> M = load 'model.txt' using PigStorage('\t') as (pkg:chararray, 
> cat_id:chararray, ts:int, score:double);
> B = foreach A generate ts, pkg, flatten(cat_bag.id) as (cat_id:chararray);
> B = distinct B;
> H1 = cogroup M by (pkg, cat_id) inner, B by (pkg, cat_id);
> H2 = foreach H1 {
> I = order M by ts;
> J = order B by ts;
> generate flatten(group) as (pkg:chararray, cat_id:chararray), J.ts as 
> tsorig, I.ts as tsmap;
> }
> dump H2;
> When running this script, I got a warning about "Encountered Warning
> DID_NOT_FIND_LOAD_ONLY_MAP_PLAN 1 time(s)" and a pig error log as below:
> Pig Stack Trace
> ---
> ERROR 2043: Unexpected error during execution.
> org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1066: Unable to 
> open iterator for alias H2
> at org.apache.pig.PigServer.openIterator(PigServer.java:764)
> at 
> org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:612)
> at 
> org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:303)
> at 
> org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:165)
> at 
> org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:141)
> at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:90)
> at org.apache.pig.Main.run(Main.java:500)
> at org.apache.pig.Main.main(Main.java:107)
> Caused by: org.apache.pig.PigException: ERROR 1002: Unable to store alias H2
> at org.apache.pig.PigServer.storeEx(PigServer.java:888)
> at org.apache.pig.PigServer.store(PigServer.java:826)
> at org.apache.pig.PigServer.openIterator(PigServer.java:738)
> ... 7 more
> Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 2043: 
> Unexpected error during execution.
> at 
> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.execute(HExecutionEngine.java:403)
> at 
> org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:1208)
> at org.apache.pig.PigServer.storeEx(PigServer.java:884)
> ... 9 more
> Caused by: java.lang.ClassCastException: 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLoad
>  cannot be cast to 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.SecondaryKeyOptimizer.visitMROp(SecondaryKeyOptimizer.java:352)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceOper.visit(MapReduceOper.java:246)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceOper.visit(MapReduceOper.java:41)
> at 
> org.apache.pig.impl.plan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:69)
> at 
> org.apache.pig.impl.plan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:71)
> at 
> org.apache.pig.impl.plan.DepthFirstWalker.walk(DepthFirstWalker.java:52)
> at org.apache.pig.impl.plan.PlanVisitor.visit(PlanVisitor.java:51)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.compile(MapReduceLauncher.java:498)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:117)
> at 
> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.execute(HExecutionEngine.java:378)
> ... 11 more
> But when I removed the DISTINCT statement before COGROUP (i.e. "B = distinct
> B;"), the script runs smoothly. I have also tried other reduce-side
> operations like ORDER; it seems that they also trigger the above error. This
> is really very confusing.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1748) Add load/store function AvroStorage for avro data

2011-01-28 Thread Dmitriy V. Ryaboy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12988253#action_12988253
 ] 

Dmitriy V. Ryaboy commented on PIG-1748:


The TestPigStorageSchema thing is mine, someone else just opened a ticket. Will 
fix.

> Add load/store function AvroStorage for avro data
> -
>
> Key: PIG-1748
> URL: https://issues.apache.org/jira/browse/PIG-1748
> Project: Pig
>  Issue Type: Improvement
>  Components: impl
>Reporter: lin guo
>Assignee: Jakob Homan
> Attachments: avro_storage.patch, avro_test_files.tar.gz, 
> PIG-1748-2.patch, PIG-1748-3.patch
>
>
> We want to use Pig to process arbitrary Avro data and store results as Avro 
> files. AvroStorage() extends two PigFuncs: LoadFunc and StoreFunc. 
> Due to discrepancies between the Avro and Pig data models, AvroStorage has:
> 1. Limited support for "record": we do not support recursively defined records,
> because the number of fields in such records is data dependent.
> 2. Limited support for "union": we only accept nullable unions like ["null",
> "some-type"].
> For simplicity, we also make the following assumptions:
> If the input directory is a leaf directory, then we assume Avro data files in 
> it have the same schema;
> If the input directory contains sub-directories, then we assume Avro data 
> files in all sub-directories have the same schema.
> AvroStorage takes no input parameters when used as a LoadFunc (except for 
> "debug [debug-level]"). 
> Users can provide parameters to AvroStorage when used as a StoreFunc. If they
> don't, the Avro schema of the output data is derived from its
> Pig schema.
> Detailed documentation can be found in 
> http://snaprojects.jira.com/wiki/display/HTOOLS/AvroStorage+-+Pig+support+for+Avro+data
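
A usage sketch under the assumptions above (the paths and the piggybank class
name are assumptions for illustration):

{code}
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class AvroStorageDemo {
    public static void main(String[] args) throws Exception {
        PigServer pig = new PigServer(ExecType.MAPREDUCE);
        // LoadFunc: no parameters; the Avro schema is read from the files.
        pig.registerQuery("events = LOAD '/data/avro/events' USING "
                + "org.apache.pig.piggybank.storage.avro.AvroStorage();");
        // StoreFunc with no parameters: the Avro output schema is derived
        // from the Pig schema of 'events'.
        pig.store("events", "/data/avro/events_copy",
                "org.apache.pig.piggybank.storage.avro.AvroStorage()");
    }
}
{code}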

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1834) relation-as-scalar - uses the last statement associated with the scalar alias

2011-01-28 Thread Thejas M Nair (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thejas M Nair updated PIG-1834:
---

Description: 
Pig allows a relation alias to be re-used, i.e. to refer to different
relations (/statements). I have not seen this in the documentation, but I have
seen people writing such queries.

For example -
{code}
l = load 'x' as (a,b);
l = filter l by a > 1;
l = foreach ...
store l into  'y'
{code}

At any point in the query, the alias "l" always represents the relation it was
last associated with in the portion of the pig-query above it.

But in the case of the relation-as-scalar feature, the association happens
with the last relation associated with the alias in the entire script.

For example -
{code}
 l = load 'x' as (a,b);
 A = load 'x' as (a,b); 
 B = foreach A generate a, l.a as la;
 l = foreach l generate a+1 as a;
store B into 'b';
{code}

The alias l in the relation with alias B should refer to the load, but it
refers to the foreach statement -
{code}

#--
# Map Reduce Plan
#--
MapReduce node scope-16
Map Plan
l: 
Store(file:/tmp/temp-953430379/tmp2006282146:org.apache.pig.impl.io.InterStorage)
 - scope-8
|
|---l: New For Each(false)[bag] - scope-7
|   |
|   Add[int] - scope-5
|   |
|   |---Cast[int] - scope-3
|   |   |  
|   |   |---Project[bytearray][0] - scope-2
|   |
|   |---Constant(1) - scope-4
|
|---l: 
Load(file:///Users/tejas/pig_type/trunk/x:org.apache.pig.builtin.PigStorage) - 
scope-1
Global sort: false


MapReduce node scope-17
Map Plan
B: 
Store(file:///Users/tejas/pig_type/trunk/b:org.apache.pig.builtin.PigStorage) - 
scope-15
|
|---B: New For Each(false,false)[bag] - scope-14
|   |
|   Project[bytearray][0] - scope-9
|   |
|   POUserFunc(org.apache.pig.impl.builtin.ReadScalars)[int] - scope-13
|   |
|   |---Constant(0) - scope-11
|   |
|   |---Constant(file:/tmp/temp-953430379/tmp2006282146) - scope-12
|
|---A: 
Load(file:///Users/tejas/pig_type/trunk/x:org.apache.pig.builtin.PigStorage) - 
scope-0
Global sort: false

{code}



  was:
Pig allows a relation alias to be re-used, i.e. to refer to different
relations (/statements). I have not seen this in the documentation, but I have
seen people writing such queries.

For example -
{code}
l = load 'x' as (a,b);
l = filter l by a > 1;
l = foreach ...
store l into  'y'
{code}

At any point in the query, the alias "l" always represents the relation it was
last associated with in the portion of the pig-query above it.

But in the case of the relation-as-scalar feature, the association happens
with the last relation associated with the alias in the entire script.

For example -
{code}
 l = load 'x' as (a,b);
 A = load 'x' as (a,b); 
 B = foreach A generate a, l.a as la;
 l = foreach l generate a+1 as a;
store B into 'b';
{code}

The alias l in the relation with alias B should refer to the load, but it
refers to the foreach statement -
#--
# Map Reduce Plan
#--
MapReduce node scope-16
Map Plan
l: 
Store(file:/tmp/temp-953430379/tmp2006282146:org.apache.pig.impl.io.InterStorage)
 - scope-8
|
|---l: New For Each(false)[bag] - scope-7
|   |
|   Add[int] - scope-5
|   |
|   |---Cast[int] - scope-3
|   |   |  
|   |   |---Project[bytearray][0] - scope-2
|   |
|   |---Constant(1) - scope-4
|
|---l: 
Load(file:///Users/tejas/pig_type/trunk/x:org.apache.pig.builtin.PigStorage) - 
scope-1
Global sort: false


MapReduce node scope-17
Map Plan
B: 
Store(file:///Users/tejas/pig_type/trunk/b:org.apache.pig.builtin.PigStorage) - 
scope-15
|
|---B: New For Each(false,false)[bag] - scope-14
|   |
|   Project[bytearray][0] - scope-9
|   |
|   POUserFunc(org.apache.pig.impl.builtin.ReadScalars)[int] - scope-13
|   |
|   |---Constant(0) - scope-11
|   |
|   |---Constant(file:/tmp/temp-953430379/tmp2006282146) - scope-12
|
|---A: 
Load(file:///Users/tejas/pig_type/trunk/x:org.apache.pig.builtin.PigStorage) - 
scope-0
Global sort: false





> relation-as-scalar - uses the last statement associated with the scalar alias
> -
>
> Key: PIG-1834
> URL: https://issues.apache.org/jira/browse/PIG-1834
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.8.0
>Reporter: Thejas M Nair
> Fix For: 0.8.0, 0.9.0
>
>
> Pig allows a relation alias to be re-used, i.e. to refer to different
> relations (/statements). I have not seen this in the documentation, but I have
> seen people writing such queries.
> For example -
> {code}
> l = load 'x' as (a,

[jira] Created: (PIG-1834) relation-as-scalar - uses the last statement associated with the scalar alias

2011-01-28 Thread Thejas M Nair (JIRA)
relation-as-scalar - uses the last statement associated with the scalar alias
-

 Key: PIG-1834
 URL: https://issues.apache.org/jira/browse/PIG-1834
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.8.0
Reporter: Thejas M Nair
 Fix For: 0.9.0, 0.8.0


Pig allows a relation alias to be re-used, i.e. to refer to different
relations (/statements). I have not seen this in the documentation, but I have
seen people writing such queries.

For example -
{code}
l = load 'x' as (a,b);
l = filter l by a > 1;
l = foreach ...
store l into  'y'
{code}

At any point in the query, the alias "l" always represents the relation it was
last associated with in the portion of the pig-query above it.

But in the case of the relation-as-scalar feature, the association happens
with the last relation associated with the alias in the entire script.

For example -
{code}
 l = load 'x' as (a,b);
 A = load 'x' as (a,b); 
 B = foreach A generate a, l.a as la;
 l = foreach l generate a+1 as a;
store B into 'b';
{code}

The alias l in the relation with alias B should refer to the load, but it
refers to the foreach statement -
#--
# Map Reduce Plan
#--
MapReduce node scope-16
Map Plan
l: 
Store(file:/tmp/temp-953430379/tmp2006282146:org.apache.pig.impl.io.InterStorage)
 - scope-8
|
|---l: New For Each(false)[bag] - scope-7
|   |
|   Add[int] - scope-5
|   |
|   |---Cast[int] - scope-3
|   |   |  
|   |   |---Project[bytearray][0] - scope-2
|   |
|   |---Constant(1) - scope-4
|
|---l: 
Load(file:///Users/tejas/pig_type/trunk/x:org.apache.pig.builtin.PigStorage) - 
scope-1
Global sort: false


MapReduce node scope-17
Map Plan
B: 
Store(file:///Users/tejas/pig_type/trunk/b:org.apache.pig.builtin.PigStorage) - 
scope-15
|
|---B: New For Each(false,false)[bag] - scope-14
|   |
|   Project[bytearray][0] - scope-9
|   |
|   POUserFunc(org.apache.pig.impl.builtin.ReadScalars)[int] - scope-13
|   |
|   |---Constant(0) - scope-11
|   |
|   |---Constant(file:/tmp/temp-953430379/tmp2006282146) - scope-12
|
|---A: 
Load(file:///Users/tejas/pig_type/trunk/x:org.apache.pig.builtin.PigStorage) - 
scope-0
Global sort: false




-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Assigned: (PIG-1830) Type mismatch error in key from map, when doing GROUP on PigStorageSchema() variable

2011-01-28 Thread Dmitriy V. Ryaboy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy V. Ryaboy reassigned PIG-1830:
--

Assignee: Dmitriy V. Ryaboy

> Type mismatch error in key from map, when doing GROUP on PigStorageSchema() 
> variable
> 
>
> Key: PIG-1830
> URL: https://issues.apache.org/jira/browse/PIG-1830
> Project: Pig
>  Issue Type: Bug
>Reporter: Mitesh Singh Jat
>Assignee: Dmitriy V. Ryaboy
>
> Pig fails when we try to GROUP data loaded via PigStorageSchema.
> {code}
> Events = LOAD 'input/PigStorageSchema' USING 
> org.apache.pig.piggybank.storage.PigStorageSchema();
> Sessions = GROUP Events BY name;
> DUMP Sessions;
> {code}
> Schema file '''input/PigStorageSchema/.pig_schema'''
> {code}
> {"fields":[{"name":"name","type":55,"schema":null,"description":"autogenerated
>  from Pig Field 
> Schema"},{"name":"val","type":10,"schema":null,"description":"autogenerated 
> from Pig Field Schema"}],"version":0,"sortKeys":[],"sortKeyOrders":[]}
> {code}
> Header file '''input/PigStorageSchema/.pig_header'''
> {code}
> name	val
> {code}
> Sample input file '''input/PigStorageSchema/pss.in'''
> {code}
> peter   1
> samir   3
> michael 4
> peter   2
> peter   4
> samir   1
> {code}
> On running the above pig script, the following error is received.
> {code}
> 2010-12-15 08:07:58,367 WARN org.apache.hadoop.mapred.Child: Error running 
> child
> java.io.IOException: Type mismatch in key from map: expected 
> org.apache.pig.impl.io.NullableText, recieved
> org.apache.pig.impl.io.NullableBytesWritable
> at 
> org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:898)
> at 
> org.apache.hadoop.mapred.MapTask$NewOutputCollector.write(MapTask.java:600)
> at 
> org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.collect(PigMapReduce.java:116)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:238)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:231)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:53)
> at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
> at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:674)
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:335)
> at org.apache.hadoop.mapred.Child$4.run(Child.java:242)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:396)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1062)
> at org.apache.hadoop.mapred.Child.main(Child.java:236)
> {code}
> On changing the "type" of "name" from 55 (chararray) to 50 (bytearray), the
> GROUP-BY worked.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1830) Type mismatch error in key from map, when doing GROUP on PigStorageSchema() variable

2011-01-28 Thread Dmitriy V. Ryaboy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12988239#action_12988239
 ] 

Dmitriy V. Ryaboy commented on PIG-1830:


That'd be me. I'll fix. Thanks for pointing the way, Olga.

The reason it's numbers instead of strings is that the schema is automatically
(de)serialized, and that's what the values are in a schema. I would've
preferred an enum, but that's not what it is in DataType.java.
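
For the curious, the codes in the .pig_schema above map directly to the byte
constants in org.apache.pig.data.DataType:

{code}
import org.apache.pig.data.DataType;

public class TypeCodeDemo {
    public static void main(String[] args) {
        System.out.println((int) DataType.CHARARRAY);                  // 55
        System.out.println((int) DataType.INTEGER);                    // 10
        System.out.println(DataType.findTypeName(DataType.BYTEARRAY)); // bytearray (50)
    }
}
{code}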



> Type mismatch error in key from map, when doing GROUP on PigStorageSchema() 
> variable
> 
>
> Key: PIG-1830
> URL: https://issues.apache.org/jira/browse/PIG-1830
> Project: Pig
>  Issue Type: Bug
>Reporter: Mitesh Singh Jat
>
> Pig fails when we try to GROUP data loaded via PigStorageSchema.
> {code}
> Events = LOAD 'input/PigStorageSchema' USING 
> org.apache.pig.piggybank.storage.PigStorageSchema();
> Sessions = GROUP Events BY name;
> DUMP Sessions;
> {code}
> Schema file '''input/PigStorageSchema/.pig_schema'''
> {code}
> {"fields":[{"name":"name","type":55,"schema":null,"description":"autogenerated
>  from Pig Field 
> Schema"},{"name":"val","type":10,"schema":null,"description":"autogenerated 
> from Pig Field Schema"}],"version":0,"sortKeys":[],"sortKeyOrders":[]}
> {code}
> Header file '''input/PigStorageSchema/.pig_header'''
> {code}
> name	val
> {code}
> Sample input file '''input/PigStorageSchema/pss.in'''
> {code}
> peter   1
> samir   3
> michael 4
> peter   2
> peter   4
> samir   1
> {code}
> On running the above pig script, the following error is received.
> {code}
> 2010-12-15 08:07:58,367 WARN org.apache.hadoop.mapred.Child: Error running 
> child
> java.io.IOException: Type mismatch in key from map: expected 
> org.apache.pig.impl.io.NullableText, recieved
> org.apache.pig.impl.io.NullableBytesWritable
> at 
> org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:898)
> at 
> org.apache.hadoop.mapred.MapTask$NewOutputCollector.write(MapTask.java:600)
> at 
> org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.collect(PigMapReduce.java:116)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:238)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:231)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:53)
> at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
> at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:674)
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:335)
> at org.apache.hadoop.mapred.Child$4.run(Child.java:242)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:396)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1062)
> at org.apache.hadoop.mapred.Child.main(Child.java:236)
> {code}
> On changing the "type" of "name" from 55 (chararray) to 50 (bytearray), the
> GROUP-BY worked.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-1833) Contrib's build.xml points to an invalid hadoop-conf

2011-01-28 Thread Jakob Homan (JIRA)
Contrib's build.xml points to an invalid hadoop-conf


 Key: PIG-1833
 URL: https://issues.apache.org/jira/browse/PIG-1833
 Project: Pig
  Issue Type: Bug
Reporter: Jakob Homan


As discovered in testing PIG-1748, the build.xml in the contrib/piggybank/java 
module has {{junit.hadoop.conf}}, which points to
{{"${user.home}/pigtest/conf/"}}.  In this directory is a hadoop-conf.xml that 
defines a value for {{fs.default.name}} which is valid during the regular test 
runs but not for the contrib modules.  However, any tests in contrib that try 
to access a non-fully qualified file via FileSystem will be routed to this 
value and will then fail when they can't reach it.  If, however, one runs the
tests directly from the contrib module without the pigtest directory existing,
the tests will pass.  Do any of the contrib modules actually need this variable?
If not, it should be removed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



RE: Pig developer meeting in February

2011-01-28 Thread Olga Natkovich
I believe we have critical mass so the meeting is on!

If you have not responded yet but planning to attend, please, let me know.

Thanks,

Olga

-Original Message-
From: Julien Le Dem [mailto:led...@yahoo-inc.com] 
Sent: Thursday, January 27, 2011 5:21 PM
To: dev@pig.apache.org
Subject: Re: Pig developer meeting in February

Me too.
Julien


On 1/27/11 4:09 PM, "Dmitriy Ryaboy"  wrote:

Ok yeah I'll come :).



On Thu, Jan 27, 2011 at 3:17 PM, Olga Natkovich  wrote:

> While there is a lively discussion on this thread, I have not actually
> gotten any responses to having the meeting, with the exception of 1 person :).
>
> Please, let me know by the end of the week if you are planning to attend.
> If we don't get at least a few more responses I suggest we postpone the
> meeting.
>
> Thanks,
>
> Olga
>
> -Original Message-
> From: Dmitriy Ryaboy [mailto:dvrya...@gmail.com]
> Sent: Wednesday, January 26, 2011 6:04 PM
> To: dev@pig.apache.org
> Subject: Re: Pig developer meeting in February
>
> Right, we do partition filtering, but not true predicate pushdown.
>
> On Wed, Jan 26, 2011 at 5:59 PM, Daniel Dai 
> wrote:
>
> > Are you talking about LoadMetadata.setPartitionFilter?
> > PartitionFilterOptimizer will do that.
> >
> > Daniel
> >
> >
> > Dmitriy Ryaboy wrote:
> >
> >> I may be wrong, but I think predicate pushdown is designed for, but not
> >> actually implemented in, the current LoadPushdown interface (you can only
> >> push projections). If I am wrong, that's great... but if not, that would be
> >> an important feature to add, as people are trying to connect Pig to "smart"
> >> storage systems like rdbmses, HBase, and Cassandra more and more. I think
> >> we only kind of simulate this with partition keys info, which is not
> >> always sufficient.
> >>
> >> D
> >>
> >> On Wed, Jan 26, 2011 at 2:41 PM, Julien Le Dem 
> >> wrote:
> >>
> >>
> >>
> >>> If making Pig thread safe (i.e. two threads running different pig
> >>> scripts) is important, then we need to change some of the APIs from
> >>> static singleton access to a dependency injection pattern.
> >>> In that case, this should probably be done before 1.0. For example:
> >>> UDFContext should be passed to the UDF after construction (similar
> >>> to the ServletContext in servlets, or the way Hadoop passes the
> >>> context to tasks). Also, a clearly separated API that does not
> >>> depend on the Pig implementation would help.
> >>> For example, UDFContext is in org.apache.pig.impl.util when it
> >>> would be better in org.apache.pig.api (or at least an interface
> >>> defining it).
> >>>
> >>> Julien
> >>>
> >>> On 1/24/11 10:14 AM, "Olga Natkovich"  wrote:
> >>>
> >>> Hi Guys,
> >>>
> >>> I think it is time for us to have another meeting. Yahoo would be happy
> >>> to
> >>> host if this works for everybody. How about Wednesday, 2/9 4-6 pm.
> >>> Please,
> >>> let us know if you are planning to attend and if the date/time works
> for
> >>> you.
> >>>
> >>> Things that come to mind to discuss and as always feel free to suggest
> >>> others:
> >>>
> >>> -  Error handling proposal - this might be easier to finalize
> >>> face-to-face
> >>> -  Pig 0.9 plan
> >>> -  Pig Roadmap beyond 0.9
> >>> o What do we want to do in Pig.next?
> >>> o Are we ready for Pig 1.0?
> >>>
> >>> Olga
> >>>
> >>>
> >>>
> >>>
> >>
> >
>



[jira] Commented: (PIG-1830) Type mismatch error in key from map, when doing GROUP on PigStorageSchema() variable

2011-01-28 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12988207#action_12988207
 ] 

Olga Natkovich commented on PIG-1830:
-

The problem is with the PigStorageSchema implementation. The class extends
PigStorage without overriding getNext. So, while the schema tells Pig that the
data is coming as chararray, the data is actually created (by PigStorage) as
bytearray.

The owner of the PigStorageSchema function needs to make sure that the data and 
schema types match. 
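
A minimal sketch of the kind of fix this implies, converting fields to the
declared schema types in getNext (the subclass is hypothetical, and hardcoding
field 0 is purely for illustration; a real fix would consult the stored
schema):

{code}
import java.io.IOException;
import org.apache.pig.builtin.Utf8StorageConverter;
import org.apache.pig.data.DataByteArray;
import org.apache.pig.data.Tuple;
import org.apache.pig.piggybank.storage.PigStorageSchema;

public class TypedPigStorageSchema extends PigStorageSchema {
    private final Utf8StorageConverter caster = new Utf8StorageConverter();

    @Override
    public Tuple getNext() throws IOException {
        Tuple t = super.getNext();  // PigStorage hands back DataByteArray fields
        if (t != null && t.get(0) instanceof DataByteArray) {
            byte[] raw = ((DataByteArray) t.get(0)).get();
            t.set(0, caster.bytesToCharArray(raw)); // now really a chararray
        }
        return t;
    }
}
{code}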

> Type mismatch error in key from map, when doing GROUP on PigStorageSchema() 
> variable
> 
>
> Key: PIG-1830
> URL: https://issues.apache.org/jira/browse/PIG-1830
> Project: Pig
>  Issue Type: Bug
>Reporter: Mitesh Singh Jat
>
> Pig fails when we try to GROUP data loaded via PigStorageSchema.
> {code}
> Events = LOAD 'input/PigStorageSchema' USING 
> org.apache.pig.piggybank.storage.PigStorageSchema();
> Sessions = GROUP Events BY name;
> DUMP Sessions;
> {code}
> Schema file '''input/PigStorageSchema/.pig_schema'''
> {code}
> {"fields":[{"name":"name","type":55,"schema":null,"description":"autogenerated
>  from Pig Field 
> Schema"},{"name":"val","type":10,"schema":null,"description":"autogenerated 
> from Pig Field Schema"}],"version":0,"sortKeys":[],"sortKeyOrders":[]}
> {code}
> Header file '''input/PigStorageSchema/.pig_header'''
> {code}
> name	val
> {code}
> Sample input file '''input/PigStorageSchema/pss.in'''
> {code}
> peter   1
> samir   3
> michael 4
> peter   2
> peter   4
> samir   1
> {code}
> On running the above pig script, the following error is received.
> {code}
> 2010-12-15 08:07:58,367 WARN org.apache.hadoop.mapred.Child: Error running 
> child
> java.io.IOException: Type mismatch in key from map: expected 
> org.apache.pig.impl.io.NullableText, recieved
> org.apache.pig.impl.io.NullableBytesWritable
> at 
> org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:898)
> at 
> org.apache.hadoop.mapred.MapTask$NewOutputCollector.write(MapTask.java:600)
> at 
> org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.collect(PigMapReduce.java:116)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:238)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:231)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:53)
> at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
> at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:674)
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:335)
> at org.apache.hadoop.mapred.Child$4.run(Child.java:242)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:396)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1062)
> at org.apache.hadoop.mapred.Child.main(Child.java:236)
> {code}
> On changing the "type" of "name" from 55 (chararray) to 50 (bytearray), the
> GROUP-BY worked.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1829) "0" value seen in PigStat's map/reduce runtime, even when the job is successful

2011-01-28 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12988203#action_12988203
 ] 

Olga Natkovich commented on PIG-1829:
-

I agree that having a standard API is useful. I don't think I like the idea of
reading the current structure, because this will make it difficult to run
against different versions of Hadoop.

I think we need to do further investigation and find other alternatives, or
just say that this will be resolved once we have reasonable support from
Hadoop.

> "0" value seen in PigStat's map/reduce runtime, even when the job is 
> successful
> ---
>
> Key: PIG-1829
> URL: https://issues.apache.org/jira/browse/PIG-1829
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 0.8.0
>Reporter: Thejas M Nair
> Fix For: 0.9.0
>
>
> Pig runtime calls JobClient.getMapTaskReports(jobId) and 
> JobClient.getReduceTaskReports(jobId) to get statistics about numbers of 
> maps/reducers, as well as max/min/avg time of these tasks. But from time to 
> time, these calls return empty lists. When that happens pig is reports 0 
> values for the stats. 
> The jobtracker keeps the stats information only for a limited duration based 
> on the configuration parameters  mapred.jobtracker.completeuserjobs.maximum 
> and mapred.job.tracker.retiredjobs.cache.size. Since pig collects the stats 
> after jobs have finished running, it is possible that the stats for the 
> initial jobs are no longer available. To have better chances of getting the 
> stats, it should be collected as soon as the job is over. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1748) Add load/store function AvroStorage for avro data

2011-01-28 Thread Jakob Homan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jakob Homan updated PIG-1748:
-

Attachment: PIG-1748-3.patch

Figured out the test failures.  Turns out that when one does a full run of the
unit tests (which I cannot get to succeed on my machine), the ~/pigtest
directory is left behind during the contrib tests, and within the contrib
build.xml file is a {{junit.hadoop.conf}} variable pointing those tests to the
HDFS the Pig tests had running, which is no longer up.  This conf trickles
down to the test, which ends up using it as the default filesystem and tries
to connect to it, but can't since that HDFS is gone.  This doesn't occur when
run through an IDE like IntelliJ, since the IDE doesn't use contrib's
build.xml settings.

I've fixed this by explicitly referencing the local file system in the tests, 
though this seems like a bug in the contrib build system to me.  I'll open a 
JIRA to address this.
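
The fix amounts to something like this in the test setup (the path is
illustrative):

{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LocalFsTestSetup {
    public static void main(String[] args) throws Exception {
        // Pin the test to the local file system instead of whatever
        // fs.default.name the leftover pigtest conf advertises.
        FileSystem localFs = FileSystem.getLocal(new Configuration());
        Path testData = new Path("build/test/avro-data");
        localFs.mkdirs(testData);
        System.out.println("using " + localFs.getUri() + " for " + testData);
    }
}
{code}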

@Felix - good catch.  To provide a cleaner separation between my work and 
Lin's, I would like to go ahead and fix this bug in a separate JIRA after 1748 
is committed.  How does this sound to you?

Contrib tests pass, except org.apache.pig.piggybank.test.TestPigStorageSchema,
which fails for me with or without the patch.  Version 3 of the patch is
updated to include better behavior for directories with files that should be
filtered out.

> Add load/store function AvroStorage for avro data
> -
>
> Key: PIG-1748
> URL: https://issues.apache.org/jira/browse/PIG-1748
> Project: Pig
>  Issue Type: Improvement
>  Components: impl
>Reporter: lin guo
>Assignee: Jakob Homan
> Attachments: avro_storage.patch, avro_test_files.tar.gz, 
> PIG-1748-2.patch, PIG-1748-3.patch
>
>
> We want to use Pig to process arbitrary Avro data and store results as Avro 
> files. AvroStorage() extends two PigFuncs: LoadFunc and StoreFunc. 
> Due to discrepancies between the Avro and Pig data models, AvroStorage has:
> 1. Limited support for "record": we do not support recursively defined records,
> because the number of fields in such records is data dependent.
> 2. Limited support for "union": we only accept nullable unions like ["null",
> "some-type"].
> For simplicity, we also make the following assumptions:
> If the input directory is a leaf directory, then we assume Avro data files in 
> it have the same schema;
> If the input directory contains sub-directories, then we assume Avro data 
> files in all sub-directories have the same schema.
> AvroStorage takes no input parameters when used as a LoadFunc (except for 
> "debug [debug-level]"). 
> Users can provide parameters to AvroStorage when used as a StoreFunc. If they
> don't, the Avro schema of the output data is derived from its
> Pig schema.
> Detailed documentation can be found in 
> http://snaprojects.jira.com/wiki/display/HTOOLS/AvroStorage+-+Pig+support+for+Avro+data

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1782) Add ability to load data by column family in HBaseStorage

2011-01-28 Thread Bill Graham (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12988192#action_12988192
 ] 

Bill Graham commented on PIG-1782:
--

I agree. Dmitriy, I like where you're going with new classes and deprecation, 
but maybe we could do this with just an enhanced (and backward compatible) 
HBaseStorage and a new AdvancedHBaseStorage.

* HBaseStorage
   * If you specify discrete columns, you get a tuple of values, like the 
current behavior.
   * If you specify one or more CFs (or possibly a CF with a wildcard column 
expression), you get back a tuple of maps.
   * If you specify a mix, you get a tuple with values and maps. For example, 
'cf2:foo cf1: cf2:bar' would produce ( value, { col => value }, value ); see 
the sketch just after this list.
   * This is backwards compatible and seems easiest to grok from a user's 
perspective.
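
For illustration, a hypothetical script under this proposal (table and column 
names assumed, not from any patch):

{noformat}
metrics = load 'hbase://SystemMetrics' USING 
org.apache.pig.backend.hadoop.hbase.HBaseStorage('cpu:sys.0 mem: cpu:user.0', '-loadKey');
-- would yield tuples shaped like: (rowKey, value, { column => value }, value)
{noformat}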

* AdvancedHBaseStorage
   * Somehow support multiple timestamps with a more complex data structure.
   * One possibility is to use the data structure I suggested in my previous 
comment, where everything is a map.
   * Another is to return something like the proposed HBaseStorage data 
structure, where each 'value' is replaced with ( (value, ts), ... )
   * We could hash out the specifics of AdvancedHBaseStorage in another JIRA if 
we decide to go this route

> Add ability to load data by column family in HBaseStorage
> -
>
> Key: PIG-1782
> URL: https://issues.apache.org/jira/browse/PIG-1782
> Project: Pig
>  Issue Type: New Feature
> Environment: Java 6, Mac OS X 10.6
>Reporter: Eric Yang
>Assignee: Bill Graham
>
> It would be nice to load all columns in the column family by using short hand 
> syntax like:
> {noformat}
> CpuMetrics = load 'hbase://SystemMetrics' USING 
> org.apache.pig.backend.hadoop.hbase.HBaseStorage('cpu:','-loadKey');
> {noformat}
> Assuming there are columns cpu: sys.0, cpu:sys.1, cpu:user.0, cpu:user.1,  in 
> cpu column family.
> CpuMetrics would contain something like:
> {noformat}
> (rowKey, cpu:sys.0, cpu:sys.1, cpu:user.0, cpu:user.1)
> {noformat}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1829) "0" value seen in PigStat's map/reduce runtime, even when the job is successful

2011-01-28 Thread Santhosh Srinivasan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12988191#action_12988191
 ] 

Santhosh Srinivasan commented on PIG-1829:
--

That's exactly what I state in the second half of my previous comment. We should 
work with the structure that is in place today and engage with the MapReduce 
team to standardize access via APIs rather than via directory structures. The 
directory structure has created problems in the past.

> "0" value seen in PigStat's map/reduce runtime, even when the job is 
> successful
> ---
>
> Key: PIG-1829
> URL: https://issues.apache.org/jira/browse/PIG-1829
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 0.8.0
>Reporter: Thejas M Nair
> Fix For: 0.9.0
>
>
> Pig runtime calls JobClient.getMapTaskReports(jobId) and 
> JobClient.getReduceTaskReports(jobId) to get statistics about numbers of 
> maps/reducers, as well as max/min/avg time of these tasks. But from time to 
> time, these calls return empty lists. When that happens, Pig reports 0 
> values for the stats. 
> The jobtracker keeps the stats information only for a limited duration based 
> on the configuration parameters  mapred.jobtracker.completeuserjobs.maximum 
> and mapred.job.tracker.retiredjobs.cache.size. Since pig collects the stats 
> after jobs have finished running, it is possible that the stats for the 
> initial jobs are no longer available. To have better chances of getting the 
> stats, it should be collected as soon as the job is over. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1830) Type mismatch error in key from map, when doing GROUP on PigStorageSchema() variable

2011-01-28 Thread Christopher Egner (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12988190#action_12988190
 ] 

Christopher Egner commented on PIG-1830:


The mapping from numeric values to data types can be found in 
[DataType.java|http://svn.apache.org/viewvc/pig/trunk/src/org/apache/pig/data/DataType.java?view=markup]

Excerpt:
{code}
public static final byte UNKNOWN   =   0;
public static final byte NULL  =   1;
public static final byte BOOLEAN   =   5; // internal use only
public static final byte BYTE  =   6; // internal use only
public static final byte INTEGER   =  10;
public static final byte LONG  =  15;
public static final byte FLOAT =  20;
public static final byte DOUBLE=  25;
public static final byte BYTEARRAY =  50;
public static final byte CHARARRAY =  55;
/**
 * Internal use only.
 */
public static final byte BIGCHARARRAY =  60; // internal use only; for storing/loading chararray bigger than 64K characters in BinStorage
public static final byte MAP   = 100;
public static final byte TUPLE = 110;
public static final byte BAG   = 120;
{code}


> Type mismatch error in key from map, when doing GROUP on PigStorageSchema() 
> variable
> 
>
> Key: PIG-1830
> URL: https://issues.apache.org/jira/browse/PIG-1830
> Project: Pig
>  Issue Type: Bug
>Reporter: Mitesh Singh Jat
>
> Pig fails when we try to GROUP data loaded via PigStorageSchema.
> {code}
> Events = LOAD 'input/PigStorageSchema' USING 
> org.apache.pig.piggybank.storage.PigStorageSchema();
> Sessions = GROUP Events BY name;
> DUMP Sessions;
> {code}
> Schema file '''input/PigStorageSchema/.pig_schema'''
> {code}
> {"fields":[{"name":"name","type":55,"schema":null,"description":"autogenerated
>  from Pig Field 
> Schema"},{"name":"val","type":10,"schema":null,"description":"autogenerated 
> from Pig Field Schema"}],"version":0,"sortKeys":[],"sortKeyOrders":[]}
> {code}
> Header file '''input/PigStorageSchema/.pig_header'''
> {code}
> name    val
> {code}
> Sample input file '''input/PigStorageSchema/pss.in'''
> {code}
> peter   1
> samir   3
> michael 4
> peter   2
> peter   4
> samir   1
> {code}
> On running the above pig script, the following error is received.
> {code}
> 2010-12-15 08:07:58,367 WARN org.apache.hadoop.mapred.Child: Error running 
> child
> java.io.IOException: Type mismatch in key from map: expected 
> org.apache.pig.impl.io.NullableText, recieved
> org.apache.pig.impl.io.NullableBytesWritable
> at 
> org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:898)
> at 
> org.apache.hadoop.mapred.MapTask$NewOutputCollector.write(MapTask.java:600)
> at 
> org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.collect(PigMapReduce.java:116)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:238)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:231)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:53)
> at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
> at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:674)
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:335)
> at org.apache.hadoop.mapred.Child$4.run(Child.java:242)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:396)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1062)
> at org.apache.hadoop.mapred.Child.main(Child.java:236)
> {code}
> On changing "type" of "name" from 55(chararray) to 50(bytearray), the
> GROUP-BY worked.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1829) "0" value seen in PigStat's map/reduce runtime, even when the job is successful

2011-01-28 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12988189#action_12988189
 ] 

Olga Natkovich commented on PIG-1829:
-

The directory structure is not an official interface and, in fact, it has changed 
recently.

> "0" value seen in PigStat's map/reduce runtime, even when the job is 
> successful
> ---
>
> Key: PIG-1829
> URL: https://issues.apache.org/jira/browse/PIG-1829
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 0.8.0
>Reporter: Thejas M Nair
> Fix For: 0.9.0
>
>
> Pig runtime calls JobClient.getMapTaskReports(jobId) and 
> JobClient.getReduceTaskReports(jobId) to get statistics about numbers of 
> maps/reducers, as well as max/min/avg time of these tasks. But from time to 
> time, these calls return empty lists. When that happens, Pig reports 0 
> values for the stats. 
> The jobtracker keeps the stats information only for a limited duration based 
> on the configuration parameters  mapred.jobtracker.completeuserjobs.maximum 
> and mapred.job.tracker.retiredjobs.cache.size. Since pig collects the stats 
> after jobs have finished running, it is possible that the stats for the 
> initial jobs are no longer available. To have better chances of getting the 
> stats, it should be collected as soon as the job is over. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1829) "0" value seen in PigStat's map/reduce runtime, even when the job is successful

2011-01-28 Thread Santhosh Srinivasan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12988187#action_12988187
 ] 

Santhosh Srinivasan commented on PIG-1829:
--

There are smart ways to look up the job tracker history based on the directory 
structure while ensuring that you do not bring the system down. Simultaneously, we 
can ask the MapReduce team to create a hierarchical structure to enable 
such queries.

> "0" value seen in PigStat's map/reduce runtime, even when the job is 
> successful
> ---
>
> Key: PIG-1829
> URL: https://issues.apache.org/jira/browse/PIG-1829
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 0.8.0
>Reporter: Thejas M Nair
> Fix For: 0.9.0
>
>
> Pig runtime calls JobClient.getMapTaskReports(jobId) and 
> JobClient.getReduceTaskReports(jobId) to get statistics about numbers of 
> maps/reducers, as well as max/min/avg time of these tasks. But from time to 
> time, these calls return empty lists. When that happens, Pig reports 0 
> values for the stats. 
> The jobtracker keeps the stats information only for a limited duration based 
> on the configuration parameters  mapred.jobtracker.completeuserjobs.maximum 
> and mapred.job.tracker.retiredjobs.cache.size. Since pig collects the stats 
> after jobs have finished running, it is possible that the stats for the 
> initial jobs are no longer available. To have better chances of getting the 
> stats, it should be collected as soon as the job is over. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1828) HBaseStorage has problems with processing multiregion tables

2011-01-28 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12988184#action_12988184
 ] 

Ashutosh Chauhan commented on PIG-1828:
---

Yes, setting it the way you had here, but not in Pig code; in the loader. That 
way the change is only in HBaseStorage, not in Pig, and Pig's default behavior 
is not modified.  All the loader methods are passed a job object, so just set the 
key in that job object. The trick is in which of the loader's methods: job confs 
in a few of those methods are read-only. I need to check in which of the loader's 
methods it is appropriate to do so. 
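
Something like the following sketch, assuming setLocation turns out to be one of 
the methods where the passed job conf is still writable:

{code}
// Hypothetical placement inside HBaseStorage: disable split combination
// before Pig's combination checks run, so users don't have to set it.
@Override
public void setLocation(String location, Job job) throws IOException {
    job.getConfiguration().setBoolean("pig.splitCombination", false);
    // ... existing table/scan setup ...
}
{code}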

> HBaseStorage has problems with processing multiregion tables
> 
>
> Key: PIG-1828
> URL: https://issues.apache.org/jira/browse/PIG-1828
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.8.0
> Environment: Hadoop 0.20.2, Hbase 0.20.6, Distributed mode
>Reporter: Lukas
>Assignee: Dmitriy V. Ryaboy
>
> As brought up in the pig user mailing list 
> (http://www.mail-archive.com/user%40pig.apache.org/msg00606.html), Pig 
> sometimes does not scan the full HBase table.
> It seems that HBaseStorage has problems scanning large tables. It issues just 
> one mapper job instead of one mapper job per table region.
> Ian Stevens, who brought this issue up in the mailing list, attached a script 
> to reproduce the problem (https://gist.github.com/766929).
> However, in my case, the problem only occurred after the table was split 
> into more than one region.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1782) Add ability to load data by column family in HBaseStorage

2011-01-28 Thread Dmitriy V. Ryaboy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12988167#action_12988167
 ] 

Dmitriy V. Ryaboy commented on PIG-1782:


That's certainly possible, I just don't think it's a good design from a 
usability standpoint :)

> Add ability to load data by column family in HBaseStorage
> -
>
> Key: PIG-1782
> URL: https://issues.apache.org/jira/browse/PIG-1782
> Project: Pig
>  Issue Type: New Feature
> Environment: Java 6, Mac OS X 10.6
>Reporter: Eric Yang
>Assignee: Bill Graham
>
> It would be nice to load all columns in the column family by using short hand 
> syntax like:
> {noformat}
> CpuMetrics = load 'hbase://SystemMetrics' USING 
> org.apache.pig.backend.hadoop.hbase.HBaseStorage('cpu:','-loadKey');
> {noformat}
> Assuming there are columns cpu: sys.0, cpu:sys.1, cpu:user.0, cpu:user.1,  in 
> cpu column family.
> CpuMetrics would contain something like:
> {noformat}
> (rowKey, cpu:sys.0, cpu:sys.1, cpu:user.0, cpu:user.1)
> {noformat}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1828) HBaseStorage has problems with processing multiregion tables

2011-01-28 Thread Dmitriy V. Ryaboy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12988165#action_12988165
 ] 

Dmitriy V. Ryaboy commented on PIG-1828:


Ashutosh,
Wouldn't setting the key affect all loaders, not just HBase?
Are you saying that if I add something like 

job.getConfiguration().setBoolean("pig.splitCombination", false);

at the top of setLocation, the setting will be honored? I thought there was 
some copying going on under the covers that made this not work... but I am very 
hazy on what exactly happens during initialization; you have been far deeper in 
there.

> HBaseStorage has problems with processing multiregion tables
> 
>
> Key: PIG-1828
> URL: https://issues.apache.org/jira/browse/PIG-1828
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.8.0
> Environment: Hadoop 0.20.2, Hbase 0.20.6, Distributed mode
>Reporter: Lukas
>
> As brought up in the pig user mailing list 
> (http://www.mail-archive.com/user%40pig.apache.org/msg00606.html), Pig 
> sometimes does not scan the full HBase table.
> It seems that HBaseStorage has problems scanning large tables. It issues just 
> one mapper job instead of one mapper job per table region.
> Ian Stevens, who brought this issue up in the mailing list, attached a script 
> to reproduce the problem (https://gist.github.com/766929).
> However, in my case, the problem only occurred after the table was split 
> into more than one region.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Assigned: (PIG-1828) HBaseStorage has problems with processing multiregion tables

2011-01-28 Thread Dmitriy V. Ryaboy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy V. Ryaboy reassigned PIG-1828:
--

Assignee: Dmitriy V. Ryaboy

> HBaseStorage has problems with processing multiregion tables
> 
>
> Key: PIG-1828
> URL: https://issues.apache.org/jira/browse/PIG-1828
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.8.0
> Environment: Hadoop 0.20.2, Hbase 0.20.6, Distributed mode
>Reporter: Lukas
>Assignee: Dmitriy V. Ryaboy
>
> As brought up in the pig user mailing list 
> (http://www.mail-archive.com/user%40pig.apache.org/msg00606.html), Pig 
> sometimes does not scan the full HBase table.
> It seems that HBaseStorage has problems scanning large tables. It issues just 
> one mapper job instead of one mapper job per table region.
> Ian Stevens, who brought this issue up in the mailing list, attached a script 
> to reproduce the problem (https://gist.github.com/766929).
> However, in my case, the problem only occurred after the table was split 
> into more than one region.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-1832) Support timestamp in HBaseStorage

2011-01-28 Thread Eric Yang (JIRA)
Support timestamp in HBaseStorage
-

 Key: PIG-1832
 URL: https://issues.apache.org/jira/browse/PIG-1832
 Project: Pig
  Issue Type: Improvement
 Environment: Java 6, Mac OS X 10.6
Reporter: Eric Yang


When storing data into HBase using 
org.apache.pig.backend.hadoop.hbase.HBaseStorage, the HBase timestamp field is 
stored with the insertion time of the MapReduce job.  It would be nice to have a 
way to populate the timestamp from user data.
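
For illustration only, a hypothetical invocation (the -timestampField option 
does not exist; it merely sketches the request, assuming the relation carries 
a ts field):

{noformat}
STORE metrics INTO 'hbase://SystemMetrics' USING 
org.apache.pig.backend.hadoop.hbase.HBaseStorage('cpu:sys.0 cpu:user.0', '-timestampField ts');
{noformat}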

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1828) HBaseStorage has problems with processing multiregion tables

2011-01-28 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12988159#action_12988159
 ] 

Ashutosh Chauhan commented on PIG-1828:
---

Thanks, Lukas, for checking. This indicates that TableSplits are not combinable 
after all. Thinking more about it, I think Pig's basic assumption, that splits can 
in general be combined and that only in special cases (which Pig checks itself) we 
won't combine, is not correct. The question of combinability should really be asked 
of the loader, not assumed. Also, this OLF thing is too complicated.  The 
condition imposed by OLF is one possibility, but I assume there exist other 
scenarios where a loader is not OLF but is still not combinable. I would propose 
adding a new method to LoadFunc, asking the loader directly, and dropping all the 
logic for determining whether splits are combinable or not.
{code}
// By default, splits generated by a loader are considered combinable,
// to preserve current behavior.
public boolean isCombinable() {
    return true;
}
{code}

The good thing is that LoadFunc is an abstract class, so this won't break backward 
compatibility.

@Dmitriy,
As I pointed out above, adding OLF to HBaseStorage will not help, though it won't 
hurt either. A quick fix for the HBaseStorage loader for now is to set the key to 
false somewhere early. I think setLocation() or setSchema() is one of the 
first methods called on a LoadFunc, and since the checks for determining combination 
happen much later, the loader setting that key to false will be seen and 
combination won't happen. That will avoid the need to tell the users of 
HBaseStorage to set the key themselves. 


> HBaseStorage has problems with processing multiregion tables
> 
>
> Key: PIG-1828
> URL: https://issues.apache.org/jira/browse/PIG-1828
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.8.0
> Environment: Hadoop 0.20.2, Hbase 0.20.6, Distributed mode
>Reporter: Lukas
>
> As brought up in the pig user mailing list 
> (http://www.mail-archive.com/user%40pig.apache.org/msg00606.html), Pig 
> sometimes does not scan the full HBase table.
> It seems that HBaseStorage has problems scanning large tables. It issues just 
> one mapper job instead of one mapper job per table region.
> Ian Stevens, who brought this issue up in the mailing list, attached a script 
> to reproduce the problem (https://gist.github.com/766929).
> However, in my case, the problem only occurred after the table was split 
> into more than one region.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1828) HBaseStorage has problems with processing multiregion tables

2011-01-28 Thread Dmitriy V. Ryaboy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12988160#action_12988160
 ] 

Dmitriy V. Ryaboy commented on PIG-1828:


Ashutosh,
HBase stores records ordered by their keys, and splits the keyspace into 
regions as needed (unlike something like Cassandra, which by default uses hash 
partitioning and can be *made* to use total order partitioning; total order is 
the *only* thing HBase does).

Indeed, implementing OLF didn't solve my problem, as the splits were still 
combined. I don't know if TableSplits are stateful.

> HBaseStorage has problems with processing multiregion tables
> 
>
> Key: PIG-1828
> URL: https://issues.apache.org/jira/browse/PIG-1828
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.8.0
> Environment: Hadoop 0.20.2, Hbase 0.20.6, Distributed mode
>Reporter: Lukas
>
> As brought up in the pig user mailing list 
> (http://www.mail-archive.com/user%40pig.apache.org/msg00606.html), Pig 
> sometimes does not scan the full HBase table.
> It seems that HBaseStorage has problems scanning large tables. It issues just 
> one mapper job instead of one mapper job per table region.
> Ian Stevens, who brought this issue up in the mailing list, attached a script 
> to reproduce the problem (https://gist.github.com/766929).
> However, in my case, the problem only occurred after the table was split 
> into more than one region.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1782) Add ability to load data by column family in HBaseStorage

2011-01-28 Thread Eric Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12988158#action_12988158
 ] 

Eric Yang commented on PIG-1782:


@Bill, agree.  I filed a separate JIRA for supporting timestamps.
@Dmitriy, would it be possible to add a parameter to switch between the return 
types?

Suggested flags:
- -returnMap (default)
- -returnTuple

Example for Map:

{noformat}
CpuMetrics = load 'hbase://SystemMetrics' USING 
org.apache.pig.backend.hadoop.hbase.HBaseStorage('cpu:','-loadKey');
{noformat}

Example for Tuple:

{noformat}
CpuMetrics = load 'hbase://SystemMetrics' USING 
org.apache.pig.backend.hadoop.hbase.HBaseStorage('cpu:','-loadKey 
-returnTuple');
{noformat}

> Add ability to load data by column family in HBaseStorage
> -
>
> Key: PIG-1782
> URL: https://issues.apache.org/jira/browse/PIG-1782
> Project: Pig
>  Issue Type: New Feature
> Environment: Java 6, Mac OS X 10.6
>Reporter: Eric Yang
>Assignee: Bill Graham
>
> It would be nice to load all columns in the column family by using short hand 
> syntax like:
> {noformat}
> CpuMetrics = load 'hbase://SystemMetrics' USING 
> org.apache.pig.backend.hadoop.hbase.HBaseStorage('cpu:','-loadKey');
> {noformat}
> Assuming there are columns cpu: sys.0, cpu:sys.1, cpu:user.0, cpu:user.1,  in 
> cpu column family.
> CpuMetrics would contain something like:
> {noformat}
> (rowKey, cpu:sys.0, cpu:sys.1, cpu:user.0, cpu:user.1)
> {noformat}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1830) Type mismatch error in key from map, when doing GROUP on PigStorageSchema() variable

2011-01-28 Thread David Ciemiewicz (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12988146#action_12988146
 ] 

David Ciemiewicz commented on PIG-1830:
---

I certainly hope the solution is not to require users to cast chararray to 
bytearray.

Also, why are the .pig_schema type values numeric IDs and not name strings 
(e.g., chararray, int)?

> Type mismatch error in key from map, when doing GROUP on PigStorageSchema() 
> variable
> 
>
> Key: PIG-1830
> URL: https://issues.apache.org/jira/browse/PIG-1830
> Project: Pig
>  Issue Type: Bug
>Reporter: Mitesh Singh Jat
>
> Pig fails when we try to GROUP data loaded via PigStorageSchema.
> {code}
> Events = LOAD 'input/PigStorageSchema' USING 
> org.apache.pig.piggybank.storage.PigStorageSchema();
> Sessions = GROUP Events BY name;
> DUMP Sessions;
> {code}
> Schema file '''input/PigStorageSchema/.pig_schema'''
> {code}
> {"fields":[{"name":"name","type":55,"schema":null,"description":"autogenerated
>  from Pig Field 
> Schema"},{"name":"val","type":10,"schema":null,"description":"autogenerated 
> from Pig Field Schema"}],"version":0,"sortKeys":[],"sortKeyOrders":[]}
> {code}
> Header file '''input/PigStorageSchema/.pig_header'''
> {code}
> name    val
> {code}
> Sample input file '''input/PigStorageSchema/pss.in'''
> {code}
> peter   1
> samir   3
> michael 4
> peter   2
> peter   4
> samir   1
> {code}
> On running the above pig script, the following error is received.
> {code}
> 2010-12-15 08:07:58,367 WARN org.apache.hadoop.mapred.Child: Error running 
> child
> java.io.IOException: Type mismatch in key from map: expected 
> org.apache.pig.impl.io.NullableText, recieved
> org.apache.pig.impl.io.NullableBytesWritable
> at 
> org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:898)
> at 
> org.apache.hadoop.mapred.MapTask$NewOutputCollector.write(MapTask.java:600)
> at 
> org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.collect(PigMapReduce.java:116)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:238)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:231)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:53)
> at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
> at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:674)
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:335)
> at org.apache.hadoop.mapred.Child$4.run(Child.java:242)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:396)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1062)
> at org.apache.hadoop.mapred.Child.main(Child.java:236)
> {code}
> On changing "type" of "name" from 55(chararray) to 50(bytearray), the
> GROUP-BY worked.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: Pig developer meeting in February

2011-01-28 Thread Ashutosh Chauhan
> Are you saying that as long as one claims every column as a partition, all 
> filters will be pushed
> down?

Exactly. The javadocs are heavily worded toward partition pruning,
since that was the primary use case for predicate pushdown at the
time. But you will get all the filter expressions if you claim
all the columns are partition columns. Partition columns have no
special semantics in Pig apart from this.
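
For what it's worth, a rough sketch of that trick (class and column names are
hypothetical; only the two relevant LoadMetadata methods are shown):

import java.io.IOException;
import org.apache.hadoop.mapreduce.Job;
import org.apache.pig.Expression;
import org.apache.pig.LoadFunc;
import org.apache.pig.LoadMetadata;

public abstract class PushdownLoader extends LoadFunc implements LoadMetadata {
    private Expression pushedFilter;

    @Override
    public String[] getPartitionKeys(String location, Job job) throws IOException {
        // Claim every column as a partition column so Pig offers us
        // all pushable filter expressions, not just partition predicates.
        return new String[] { "name", "val" };
    }

    @Override
    public void setPartitionFilter(Expression filter) throws IOException {
        // Pig hands over the filter; the loader would evaluate it
        // in the backing store when reading.
        this.pushedFilter = filter;
    }
}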

> Will the filters also be applied to the data the loader returns, even if the 
> loader accepts the
> expression?

I think the filter will be deleted from the logical plan if it is pushed down,
so it won't be applied in the pipeline later on. Daniel can confirm
whether that's the case with the new logical plan or not.

Ashutosh

On Thu, Jan 27, 2011 at 17:21, Julien Le Dem  wrote:
> Me too.
> Julien
>
>
> On 1/27/11 4:09 PM, "Dmitriy Ryaboy"  wrote:
>
> Ok yeah I'll come :).
>
>
>
> On Thu, Jan 27, 2011 at 3:17 PM, Olga Natkovich  wrote:
>
>> While there is a lively discussion on this thread, I have not actually
>> gotten any responses to having the meeting, with the exception of 1 person :).
>>
>> Please, let me know by the end of the week if you are planning to attend.
>> If we don't get at least a few more responses I suggest we postpone the
>> meeting.
>>
>> Thanks,
>>
>> Olga
>>
>> -Original Message-
>> From: Dmitriy Ryaboy [mailto:dvrya...@gmail.com]
>> Sent: Wednesday, January 26, 2011 6:04 PM
>> To: dev@pig.apache.org
>> Subject: Re: Pig developer meeting in February
>>
>> Right, we do partition filtering, but not true predicate pushdown.
>>
>> On Wed, Jan 26, 2011 at 5:59 PM, Daniel Dai 
>> wrote:
>>
>> > Are you talking about LoadMetadata.setPartitionFilter?
>> > PartitionFilterOptimizer will do that.
>> >
>> > Daniel
>> >
>> >
>> > Dmitriy Ryaboy wrote:
>> >
>> >> I may be wrong but I think predicate pushdown is designed for, but not
>> >> actually implemented in the current LoadPushdown interface (you can only
>> >> push projections). If I am wrong, that's great.. but if not, that would
>> be
>> >> an important feature to add, as people are trying to connect Pig to
>> >> "smart"
>> >> storage systems like rdbmses, HBase, and Cassandra more and more.  I
>> think
>> >> we only kind of simulate this with partition keys info, which is not
>> >> always
>> >> sufficient
>> >>
>> >> D
>> >>
>> >> On Wed, Jan 26, 2011 at 2:41 PM, Julien Le Dem 
>> >> wrote:
>> >>
>> >>
>> >>
>> >>> If making Pig Thread safe (i.e.: two threads running a different pig
>> >>> script) is important then we need to change some of the APIs from
>> static
>> >>> singleton access to a dependency injection pattern.
>> >>> In that case, this should probably be done before 1.0
>> >>> For example: UDFContext should be passed to the UDF after construction
>> >>> (similar to the ServletContext in Servlets or the way Hadoop passes the 
>> >>> context to tasks)
>> >>> Also a clearly separated API that does not depend on the Pig
>> >>> implementation
>> >>> would help.
>> >>> For example UDFContext is in org.apache.pig.impl.util when it would be
>> >>> better in org.apache.pig.api (Or at least an interface defining it)
>> >>>
>> >>> Julien
>> >>>
>> >>> On 1/24/11 10:14 AM, "Olga Natkovich"  wrote:
>> >>>
>> >>> Hi Guys,
>> >>>
>> >>> I think it is time for us to have another meeting. Yahoo would be happy
>> >>> to
>> >>> host if this works for everybody. How about Wednesday, 2/9 4-6 pm.
>> >>> Please,
>> >>> let us know if you are planning to attend and if the date/time works
>> for
>> >>> you.
>> >>>
>> >>> Things that come to mind to discuss and as always feel free to suggest
>> >>> others:
>> >>>
>> >>> -          Error handling proposal - this might be easier to finalize
>> >>> face-to-face
>> >>> -          Pig 0.9 plan
>> >>> -          Pig Roadmap beyond 0.9
>> >>> o        What do we want to do in Pig.next?
>> >>> o        Are we ready for Pig 1.0
>> >>>
>> >>> Olga
>> >>>
>> >>>
>> >>>
>> >>>
>> >>
>> >
>>
>
>


[jira] Commented: (PIG-1828) HBaseStorage has problems with processing multiregion tables

2011-01-28 Thread Lukas (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12988040#action_12988040
 ] 

Lukas commented on PIG-1828:


Hi there,
I set pig.splitCombination to false in pig.properties and now the table is 
fully processed; the bug went away. Pig issued one map job for each region.
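
For anyone else hitting this, that is the whole change (a one-line edit to 
pig.properties, which typically sits in Pig's conf/ directory):

{noformat}
# workaround for PIG-1828: disable split combination
pig.splitCombination=false
{noformat}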

> HBaseStorage has problems with processing multiregion tables
> 
>
> Key: PIG-1828
> URL: https://issues.apache.org/jira/browse/PIG-1828
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.8.0
> Environment: Hadoop 0.20.2, Hbase 0.20.6, Distributed mode
>Reporter: Lukas
>
> As brought up in the pig user mailing list 
> (http://www.mail-archive.com/user%40pig.apache.org/msg00606.html), Pig 
> sometimes does not scan the full HBase table.
> It seems that HBaseStorage has problems scanning large tables. It issues just 
> one mapper job instead of one mapper job per table region.
> Ian Stevens, who brought this issue up in the mailing list, attached a script 
> to reproduce the problem (https://gist.github.com/766929).
> However, in my case, the problem only occurred after the table was split 
> into more than one region.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1828) HBaseStorage has problems with processing multiregion tables

2011-01-28 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12988025#action_12988025
 ] 

Ashutosh Chauhan commented on PIG-1828:
---

Oh, I just saw that you stated TableSplits are comparable. Can you explain a 
bit how 2 TableSplits are compared? Do they define any property on keys? If 
TableSplit can faithfully implement OLF, then split combination may not be 
safe. The fix then is to stop combining straight away when a loader implements OLF, 
and not to check further whether the loader is used for a merge join later on or not. 

> HBaseStorage has problems with processing multiregion tables
> 
>
> Key: PIG-1828
> URL: https://issues.apache.org/jira/browse/PIG-1828
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.8.0
> Environment: Hadoop 0.20.2, Hbase 0.20.6, Distributed mode
>Reporter: Lukas
>
> As brought up in the pig user mailing list 
> (http://www.mail-archive.com/user%40pig.apache.org/msg00606.html), Pig 
> sometimes does not scan the full HBase table.
> It seems that HBaseStorage has problems scanning large tables. It issues just 
> one mapper job instead of one mapper job per table region.
> Ian Stevens, who brought this issue up in the mailing list, attached a script 
> to reproduce the problem (https://gist.github.com/766929).
> However, in my case, the problem only occurred after the table was split 
> into more than one region.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1828) HBaseStorage has problems with processing multiregion tables

2011-01-28 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12988022#action_12988022
 ] 

Ashutosh Chauhan commented on PIG-1828:
---

I don't think we have sufficient evidence yet to point the finger at split 
combination for this bug. Theoretically, combining multiple TableSplits 
into one split within Pig should not result in any problem, if you honor the 
semantics of InputFormat imposed by the MR framework, which is that each split is 
stateless, in the sense that it doesn't maintain any state: one TableSplit should 
know nothing about another. I don't know enough about TableSplit, but I would 
assume they are indeed stateless. 

OrderedLoadFunc tries to impose this restriction by defining an order on 
splits. It dictates that all keys in one split are smaller than those in the next. 
Thus, ideally Pig should *not* combine splits from loaders implementing it. But for 
reasons discussed in PIG-1518 it was eventually decided that, for the feature to be 
useful, Pig wouldn't combine OrderedLoadFunc loaders *only* if the loader is also 
used for a merge join or map-side cogroup in the script. So adding OLF won't turn 
off combination in all cases. If you suspect combination is causing the bug 
(potentially because TableSplits are stateful w.r.t. each other), then only 
setting the flag to false will ensure no combination. But I doubt that 
TableSplits have state and that split combination is causing the bug. Ian, Lukas, 
can you confirm whether setting pig.splitCombination to false makes the bug go 
away?  


> HBaseStorage has problems with processing multiregion tables
> 
>
> Key: PIG-1828
> URL: https://issues.apache.org/jira/browse/PIG-1828
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.8.0
> Environment: Hadoop 0.20.2, Hbase 0.20.6, Distributed mode
>Reporter: Lukas
>
> As brought up in the pig user mailing list 
> (http://www.mail-archive.com/user%40pig.apache.org/msg00606.html), Pig 
> sometimes does not scan the full HBase table.
> It seems that HBaseStorage has problems scanning large tables. It issues just 
> one mapper job instead of one mapper job per table region.
> Ian Stevens, who brought this issue up in the mailing list, attached a script 
> to reproduce the problem (https://gist.github.com/766929).
> However, in my case, the problem only occurred after the table was split 
> into more than one region.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1831) Variation in output while using streaming udfs in local mode

2011-01-28 Thread Vivek Padmanabhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vivek Padmanabhan updated PIG-1831:
---

Description: 
The script below, when run in local mode, gives me a different output. It looks 
like in local mode I have to store a relation obtained through streaming in 
order to use it afterwards.

 For example, consider the script below: 

DEFINE MySTREAMUDF `test.sh`;
A  = LOAD 'myinput' USING PigStorage() AS (myId:chararray, data2, data3,data4 );
B = STREAM A THROUGH MySTREAMUDF AS (wId:chararray, num:int);
--STORE B into 'output.B';
C = JOIN B by wId LEFT OUTER, A by myId;
D = FOREACH C GENERATE B::wId,B::num,data4 ;
D = STREAM D THROUGH MySTREAMUDF AS (f1:chararray,f2:int);
--STORE D into 'output.D';
E = foreach B GENERATE wId,num;
F = DISTINCT E;
G = GROUP F ALL;
H = FOREACH G GENERATE COUNT_STAR(F) as TotalCount;
I = CROSS D,H;
STORE I  into 'output.I';


test.sh
-
#!/bin/bash
cut -f1,3


And input is 
abcd    label1  11  feature1
acbd    label2  22  feature2
adbc    label3  33  feature3


Here, if I store relations B and D, then every time I get this result:
acbd    3
abcd    3
adbc    3

But if I don't store relations B and D, then I get an empty output. Here again I 
have observed that this behaviour is random, i.e., sometimes (like 1 out of 5 runs) 
there will be output. 


  was:
The script below, when run in local mode, gives me a different output. It looks 
like in local mode I have to store a relation obtained through streaming in 
order to use it afterwards.

 For example, consider the script below: 

DEFINE MySTREAMUDF `test.sh`;
A  = LOAD 'myinput' USING PigStorage() AS (myId:chararray, data2, data3,data4 );
B = STREAM A THROUGH MySTREAMUDF AS (wId:chararray, num:int);

--STORE B into 'output.B';

C = JOIN B by wId LEFT OUTER, A by myId;
D = FOREACH C GENERATE B::wId,B::num,data4 ;
D = STREAM D THROUGH MySTREAMUDF AS (f1:chararray,f2:int);

--STORE D into 'output.D';

E = foreach B GENERATE wId,num;
F = DISTINCT E;
G = GROUP F ALL;
H = FOREACH G GENERATE COUNT_STAR(F) as TotalCount;
I = CROSS D,H;
STORE I  into 'output.I';


#!/bin/bash
cut -f1,3


And input is 
abcd    label1  11  feature1
acbd    label2  22  feature2
adbc    label3  33  feature3


Here, if I store relations B and D, then every time I get this result:
acbd    3
abcd    3
adbc    3

But if I don't store relations B and D, then I get an empty output.  





> Variation in output while using streaming udfs in local mode
> 
>
> Key: PIG-1831
> URL: https://issues.apache.org/jira/browse/PIG-1831
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.8.0
>Reporter: Vivek Padmanabhan
>
> The script below, when run in local mode, gives me a different output. It looks 
> like in local mode I have to store a relation obtained through streaming in 
> order to use it afterwards.
>  For example, consider the script below: 
> DEFINE MySTREAMUDF `test.sh`;
> A  = LOAD 'myinput' USING PigStorage() AS (myId:chararray, data2, data3,data4 
> );
> B = STREAM A THROUGH MySTREAMUDF AS (wId:chararray, num:int);
> --STORE B into 'output.B';
> C = JOIN B by wId LEFT OUTER, A by myId;
> D = FOREACH C GENERATE B::wId,B::num,data4 ;
> D = STREAM D THROUGH MySTREAMUDF AS (f1:chararray,f2:int);
> --STORE D into 'output.D';
> E = foreach B GENERATE wId,num;
> F = DISTINCT E;
> G = GROUP F ALL;
> H = FOREACH G GENERATE COUNT_STAR(F) as TotalCount;
> I = CROSS D,H;
> STORE I  into 'output.I';
> test.sh
> -
> #!/bin/bash
> cut -f1,3
> And input is 
> abcd    label1  11  feature1
> acbd    label2  22  feature2
> adbc    label3  33  feature3
> Here, if I store relations B and D, then every time I get this result:
> acbd    3
> abcd    3
> adbc    3
> But if I don't store relations B and D, then I get an empty output. Here again 
> I have observed that this behaviour is random, i.e., sometimes (like 1 out of 
> 5 runs) there will be output. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1831) Variation in output while using streaming udfs in local mode

2011-01-28 Thread Vivek Padmanabhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vivek Padmanabhan updated PIG-1831:
---

Description: 
The script below, when run in local mode, gives me a different output. It looks 
like in local mode I have to store a relation obtained through streaming in 
order to use it afterwards.

 For example, consider the script below: 

DEFINE MySTREAMUDF `test.sh`;
A  = LOAD 'myinput' USING PigStorage() AS (myId:chararray, data2, data3,data4 );
B = STREAM A THROUGH MySTREAMUDF AS (wId:chararray, num:int);

--STORE B into 'output.B';

C = JOIN B by wId LEFT OUTER, A by myId;
D = FOREACH C GENERATE B::wId,B::num,data4 ;
D = STREAM D THROUGH MySTREAMUDF AS (f1:chararray,f2:int);

--STORE D into 'output.D';

E = foreach B GENERATE wId,num;
F = DISTINCT E;
G = GROUP F ALL;
H = FOREACH G GENERATE COUNT_STAR(F) as TotalCount;
I = CROSS D,H;
STORE I  into 'output.I';


#!/bin/bash
cut -f1,3


And input is 
abcd    label1  11  feature1
acbd    label2  22  feature2
adbc    label3  33  feature3


Here, if I store relations B and D, then every time I get this result:
acbd    3
abcd    3
adbc    3

But if I don't store relations B and D, then I get an empty output.  




  was:
The script below, when run in local mode, gives me a different output. It looks 
like in local mode I have to store a relation obtained through streaming in 
order to use it afterwards.

 For example, consider the script below: 
{code:lang=scala|title=} 
DEFINE MySTREAMUDF `test.sh`;

A  = LOAD 'myinput' USING PigStorage() AS (myId:chararray, data2, data3,data4 );
B = STREAM A THROUGH MySTREAMUDF AS (wId:chararray, num:int);

--STORE B into 'output.B';

C = JOIN B by wId LEFT OUTER, A by myId;
D = FOREACH C GENERATE B::wId,B::num,data4 ;
D = STREAM D THROUGH MySTREAMUDF AS (f1:chararray,f2:int);

--STORE D into 'output.D';

E = foreach B GENERATE wId,num;
F = DISTINCT E;
G = GROUP F ALL;
H = FOREACH G GENERATE COUNT_STAR(F) as TotalCount;
I = CROSS D,H;
STORE I  into 'output.I';
{code}

{code:lang=scala|title=test.sh}
#!/bin/bash
cut -f1,3
{code}

And input is 
>abcd    label1  11  feature1
>acbd    label2  22  feature2
>adbc    label3  33  feature3


Here, if I store relations B and D, then every time I get this result:
acbd    3
abcd    3
adbc    3

But if I don't store relations B and D, then I get an empty output.  





> Variation in output while using streaming udfs in local mode
> 
>
> Key: PIG-1831
> URL: https://issues.apache.org/jira/browse/PIG-1831
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.8.0
>Reporter: Vivek Padmanabhan
>
> The script below, when run in local mode, gives me a different output. It looks 
> like in local mode I have to store a relation obtained through streaming in 
> order to use it afterwards.
>  For example, consider the script below: 
> DEFINE MySTREAMUDF `test.sh`;
> A  = LOAD 'myinput' USING PigStorage() AS (myId:chararray, data2, data3,data4 
> );
> B = STREAM A THROUGH MySTREAMUDF AS (wId:chararray, num:int);
> --STORE B into 'output.B';
> C = JOIN B by wId LEFT OUTER, A by myId;
> D = FOREACH C GENERATE B::wId,B::num,data4 ;
> D = STREAM D THROUGH MySTREAMUDF AS (f1:chararray,f2:int);
> --STORE D into 'output.D';
> E = foreach B GENERATE wId,num;
> F = DISTINCT E;
> G = GROUP F ALL;
> H = FOREACH G GENERATE COUNT_STAR(F) as TotalCount;
> I = CROSS D,H;
> STORE I  into 'output.I';
> #!/bin/bash
> cut -f1,3
> And input is 
> abcd    label1  11  feature1
> acbd    label2  22  feature2
> adbc    label3  33  feature3
> Here, if I store relations B and D, then every time I get this result:
> acbd    3
> abcd    3
> adbc    3
> But if I don't store relations B and D, then I get an empty output.  

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-1831) Variation in output while using streaming udfs in local mode

2011-01-28 Thread Vivek Padmanabhan (JIRA)
Variation in output while using streaming udfs in local mode


 Key: PIG-1831
 URL: https://issues.apache.org/jira/browse/PIG-1831
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.8.0
Reporter: Vivek Padmanabhan


The script below, when run in local mode, gives me a different output. It looks 
like in local mode I have to store a relation obtained through streaming in 
order to use it afterwards.

 For example, consider the script below: 
{code:lang=scala|title=} 
DEFINE MySTREAMUDF `test.sh`;

A  = LOAD 'myinput' USING PigStorage() AS (myId:chararray, data2, data3,data4 );
B = STREAM A THROUGH MySTREAMUDF AS (wId:chararray, num:int);

--STORE B into 'output.B';

C = JOIN B by wId LEFT OUTER, A by myId;
D = FOREACH C GENERATE B::wId,B::num,data4 ;
D = STREAM D THROUGH MySTREAMUDF AS (f1:chararray,f2:int);

--STORE D into 'output.D';

E = foreach B GENERATE wId,num;
F = DISTINCT E;
G = GROUP F ALL;
H = FOREACH G GENERATE COUNT_STAR(F) as TotalCount;
I = CROSS D,H;
STORE I  into 'output.I';
{code}

{code:lang=scala|title=test.sh}
#!/bin/bash
cut -f1,3
{code}

And input is 
>abcd    label1  11  feature1
>acbd    label2  22  feature2
>adbc    label3  33  feature3


Here, if I store relations B and D, then every time I get this result:
acbd    3
abcd    3
adbc    3

But if I don't store relations B and D, then I get an empty output.  




-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1782) Add ability to load data by column family in HBaseStorage

2011-01-28 Thread Dmitriy V. Ryaboy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12987991#action_12987991
 ] 

Dmitriy V. Ryaboy commented on PIG-1782:


Bill, I think what you are suggesting is the "correct" way, but I'd prefer not 
to break people's existing scripts, which is what would happen under your proposal 
if we changed what we return when a schema like 'cf2:foo cf2:bar' is 
specified...

There are also usability benefits to the flat return schema you get from 
HBaseStorage now -- it looks exactly like loading from PigStorage, so no 
surprises. You ask for 2 columns and get 2 values in a tuple; it's sort of 
what you'd expect.

Perhaps we take your suggestion, put that into builtins.AdvancedHBaseStorage, 
deprecate the current HBaseStorage, and move the current code to 
builtins.SimpleHBaseStorage?

> Add ability to load data by column family in HBaseStorage
> -
>
> Key: PIG-1782
> URL: https://issues.apache.org/jira/browse/PIG-1782
> Project: Pig
>  Issue Type: New Feature
> Environment: Java 6, Mac OS X 10.6
>Reporter: Eric Yang
>Assignee: Bill Graham
>
> It would be nice to load all columns in the column family by using short hand 
> syntax like:
> {noformat}
> CpuMetrics = load 'hbase://SystemMetrics' USING 
> org.apache.pig.backend.hadoop.hbase.HBaseStorage('cpu:','-loadKey');
> {noformat}
> Assuming there are columns cpu: sys.0, cpu:sys.1, cpu:user.0, cpu:user.1,  in 
> cpu column family.
> CpuMetrics would contain something like:
> {noformat}
> (rowKey, cpu:sys.0, cpu:sys.1, cpu:user.0, cpu:user.1)
> {noformat}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1782) Add ability to load data by column family in HBaseStorage

2011-01-28 Thread Bill Graham (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12987981#action_12987981
 ] 

Bill Graham commented on PIG-1782:
--

I was also thinking about a map, but I thought we might want to preserve the 
ordering of the fields specified when explicit fields are requested, as well as 
CFs, as in Dmitriy's example. We'd also get the CF fields in the natural ordering 
that HBase stores them in. The more I think about it, though, I don't think 
this is that useful, and a map approach seems the way to go. 

@Eric: Yes, Pig doesn't have any ts control on writes currently (and that 
should be improved), but that shouldn't rule out the ability to read them. I 
can see many use cases where some non-Pig process is populating HBase, but Pig 
is used for queries.

@Dmitriy: I prototyped that exact use case using tuples of tuples, but ran into 
the downsides you point out. Also, each row read has a variable number of 
tuples, which would seem really difficult to work with. 

I like this approach when reading all columns in a family:

{code}
( rowKey, { col1 => ((val1, ts), ..), col2 => ((val2, ts), ..) } ) 
{code}

For Dmitriy's use case, having the same schema returned (always a map) 
regardless of how the column families are specified (i.e., 'cf1: cf2:foo' vs 
'cf1:' vs 'cf2:foo cf2:bar') is one option. Another is to return a map for CFs 
and a ((val1, ts), ..) for explicit columns. I'm not sure which approach would 
make life easier on the script writer.


> Add ability to load data by column family in HBaseStorage
> -
>
> Key: PIG-1782
> URL: https://issues.apache.org/jira/browse/PIG-1782
> Project: Pig
>  Issue Type: New Feature
> Environment: Java 6, Mac OS X 10.6
>Reporter: Eric Yang
>Assignee: Bill Graham
>
> It would be nice to load all columns in the column family by using short hand 
> syntax like:
> {noformat}
> CpuMetrics = load 'hbase://SystemMetrics' USING 
> org.apache.pig.backend.hadoop.hbase.HBaseStorage('cpu:','-loadKey');
> {noformat}
> Assuming there are columns cpu: sys.0, cpu:sys.1, cpu:user.0, cpu:user.1,  in 
> cpu column family.
> CpuMetrics would contain something like:
> {noformat}
> (rowKey, cpu:sys.0, cpu:sys.1, cpu:user.0, cpu:user.1)
> {noformat}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1828) HBaseStorage has problems with processing multiregion tables

2011-01-28 Thread Dmitriy V. Ryaboy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12987979#action_12987979
 ] 

Dmitriy V. Ryaboy commented on PIG-1828:


Found the issue!

Turns out HBaseStorage is doing the right thing and returning the correct set 
of splits, but PIG-1518 is merging the splits back into a single split! No 
wonder I wasn't seeing it; I was running with combination turned off.

Short term fix: set pig.splitCombination to false.

Long term fix: I added an OrderedLoadFunc implementation to the loader, so that 
PIG-1518 doesn't apply. I think this is correct, since TableSplits are in fact 
comparable, but I am not sure what exact consequences implementing this 
interface will have with regard to merge joins and such.  Ashutosh, can you 
comment? A sketch of the change is below.

For the folks using the EB version -- you are not affected, since this is only 
a 0.8 problem.
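
The long-term change amounts to something like the following (a sketch, assuming 
TableSplit's start row is a usable sort key; not the exact patch):

{code}
// In HBaseStorage: expose an ordering over splits, per OrderedLoadFunc.
// Regions cover disjoint, ordered key ranges, so a split's start row
// orders it relative to its peers.
@Override
public WritableComparable<?> getSplitComparable(InputSplit split) throws IOException {
    return new BytesWritable(((TableSplit) split).getStartRow());
}
{code}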

> HBaseStorage has problems with processing multiregion tables
> 
>
> Key: PIG-1828
> URL: https://issues.apache.org/jira/browse/PIG-1828
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.8.0
> Environment: Hadoop 0.20.2, Hbase 0.20.6, Distributed mode
>Reporter: Lukas
>
> As brought up in the pig user mailing list 
> (http://www.mail-archive.com/user%40pig.apache.org/msg00606.html), Pig 
> sometimes does not scan the full HBase table.
> It seems that HBaseStorage has problems scanning large tables. It issues just 
> one mapper job instead of one mapper job per table region.
> Ian Stevens, who brought this issue up in the mailing list, attached a script 
> to reproduce the problem (https://gist.github.com/766929).
> However, in my case, the problem only occurred after the table was split 
> into more than one region.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.