[jira] [Commented] (PIG-2103) Support for Squid logs parsing/loading (Loader UDF)

2011-05-31 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-2103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13041976#comment-13041976
 ] 

Ashutosh Chauhan commented on PIG-2103:
---

Did you accidentally forget to attach the patch, or did you accidentally change 
the status to Patch Available?

> Support for Squid logs parsing/loading (Loader UDF)
> ---
>
> Key: PIG-2103
> URL: https://issues.apache.org/jira/browse/PIG-2103
> Project: Pig
>  Issue Type: New Feature
>Reporter: Julian Gutierrez Oschmann
>Priority: Minor
>
> As proposed in the development list, a LoadFunc UDF for parsing/loading of 
> Squid logs (default, common, squidmime and combined log formats)

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (PIG-2103) Support for Squid logs parsing/loading (Loader UDF)

2011-05-31 Thread Julian Gutierrez Oschmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-2103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julian Gutierrez Oschmann updated PIG-2103:
---

Status: Patch Available  (was: Open)

> Support for Squid logs parsing/loading (Loader UDF)
> ---
>
> Key: PIG-2103
> URL: https://issues.apache.org/jira/browse/PIG-2103
> Project: Pig
>  Issue Type: New Feature
>Reporter: Julian Gutierrez Oschmann
>Priority: Minor
>
> As proposed in the development list, a LoadFunc UDF for parsing/loading of 
> Squid logs (default, common, squidmime and combined log formats)



[jira] [Updated] (PIG-2103) Support for Squid logs parsing/loading (Loader UDF)

2011-05-31 Thread Julian Gutierrez Oschmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-2103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julian Gutierrez Oschmann updated PIG-2103:
---

Summary: Support for Squid logs parsing/loading (Loader UDF)  (was: support 
for Squid logs parsing/loading (Loader UDF))

> Support for Squid logs parsing/loading (Loader UDF)
> ---
>
> Key: PIG-2103
> URL: https://issues.apache.org/jira/browse/PIG-2103
> Project: Pig
>  Issue Type: New Feature
>Reporter: Julian Gutierrez Oschmann
>Priority: Minor
>
> As proposed in the development list, a LoadFunc UDF for parsing/loading of 
> Squid logs (default, common, squidmime and combined log formats)



[jira] [Created] (PIG-2103) support for Squid logs parsing/loading (Loader UDF)

2011-05-31 Thread Julian Gutierrez Oschmann (JIRA)
support for Squid logs parsing/loading (Loader UDF)
---

 Key: PIG-2103
 URL: https://issues.apache.org/jira/browse/PIG-2103
 Project: Pig
  Issue Type: New Feature
Reporter: Julian Gutierrez Oschmann
Priority: Minor


As proposed in the development list, a LoadFunc UDF for parsing/loading of Squid 
logs (default, common, squidmime and combined log formats)



[jira] [Commented] (PIG-1890) Fix piggybank unit test TestAvroStorage

2011-05-31 Thread Ken Goodhope (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13041915#comment-13041915
 ] 

Ken Goodhope commented on PIG-1890:
---

I need some clarification on the contract for POProject.getNext(Tuple).  Right 
now, if it receives a tuple with a single element, it extracts that element, 
casts it to a tuple, and returns it.  This breaks for any single-element tuple 
whose lone element is not itself a tuple.  The code could be modified to not 
extract non-tuple elements.
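To make the behavior concrete, here is a stand-alone model of the extraction logic described above. This is a simplified sketch, not the actual POProject or Tuple code; TupleModel and the method names are hypothetical stand-ins.

```java
import java.util.Arrays;
import java.util.List;

// Simplified stand-in for Pig's Tuple: just a list of fields.
class TupleModel {
    final List<Object> fields;
    TupleModel(Object... fields) { this.fields = Arrays.asList(fields); }
}

public class ProjectExtraction {
    // Current behavior (as described): a single-element tuple is unwrapped
    // and the element is cast to a tuple -- this throws ClassCastException
    // when the lone element is, say, a Float.
    static TupleModel getNextCurrent(TupleModel in) {
        if (in.fields.size() == 1) {
            return (TupleModel) in.fields.get(0);   // breaks on non-tuple elements
        }
        return in;
    }

    // Proposed change: only unwrap when the lone element really is a tuple.
    static TupleModel getNextFixed(TupleModel in) {
        if (in.fields.size() == 1 && in.fields.get(0) instanceof TupleModel) {
            return (TupleModel) in.fields.get(0);
        }
        return in;
    }
}
```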

> Fix piggybank unit test TestAvroStorage
> ---
>
> Key: PIG-1890
> URL: https://issues.apache.org/jira/browse/PIG-1890
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.9.0
>Reporter: Daniel Dai
>Assignee: Jakob Homan
> Fix For: 0.9.0
>
> Attachments: PIG-1890-1.patch
>
>
> TestAvroStorage fails on trunk. There are two reasons:
> 1. After PIG-1680, we call LoadFunc.setLocation one more time.
> 2. The schema for AvroStorage seems to be wrong. For example, in the first 
> test case, testArrayDefault, the schema for "in" is set to "PIG_WRAPPER: (FIELD: 
> {PIG_WRAPPER: (ARRAY_ELEM: float)})". It seems PIG_WRAPPER is redundant. This 
> issue was hidden until PIG-1188 was checked in.



[jira] [Updated] (PIG-1904) Default split destination

2011-05-31 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-1904:


Fix Version/s: 0.10

> Default split destination
> -
>
> Key: PIG-1904
> URL: https://issues.apache.org/jira/browse/PIG-1904
> Project: Pig
>  Issue Type: New Feature
>Reporter: Daniel Dai
>  Labels: gsoc2011
> Fix For: 0.10
>
>
> It would be better for the "split" statement to have a default destination, e.g.:
> {code}
> SPLIT A INTO X IF f1<7, Y IF f2==5, Z IF (f3<6 OR f3>6), OTHERS otherwise; -- 
> OTHERS has all tuples with f1>=7 && f2!=5 && f3==6
> {code}
> This is a candidate project for Google Summer of Code 2011. More information 
> about the program can be found at http://wiki.apache.org/pig/GSoc2011



[jira] [Updated] (PIG-2090) re-enable TestGrunt test cases

2011-05-31 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-2090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-2090:


Fix Version/s: 0.10

> re-enable TestGrunt test cases
> --
>
> Key: PIG-2090
> URL: https://issues.apache.org/jira/browse/PIG-2090
> Project: Pig
>  Issue Type: Task
>Affects Versions: 0.8.0, 0.9.0
>Reporter: Thejas M Nair
>Assignee: Thejas M Nair
> Fix For: 0.10
>
>
> Some test cases in TestGrunt.java were commented out in PIG-928, seemingly by 
> mistake. I re-enabled a few of the working ones as part of the changes in 
> PIG-2084. The rest should either be fixed or, if what they test is no longer 
> valid, removed from the test file.



[jira] [Updated] (PIG-2090) re-enable TestGrunt test cases

2011-05-31 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-2090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-2090:


Fix Version/s: (was: 0.9.0)

> re-enable TestGrunt test cases
> --
>
> Key: PIG-2090
> URL: https://issues.apache.org/jira/browse/PIG-2090
> Project: Pig
>  Issue Type: Task
>Affects Versions: 0.8.0, 0.9.0
>Reporter: Thejas M Nair
>Assignee: Thejas M Nair
> Fix For: 0.10
>
>
> Some test cases in TestGrunt.java were commented out in PIG-928, seemingly by 
> mistake. I re-enabled a few of the working ones as part of the changes in 
> PIG-2084. The rest should either be fixed or, if what they test is no longer 
> valid, removed from the test file.



Re: Review Request: PIG-1702. Fix for task output logs for streaming jobs containing null input-split information.

2011-05-31 Thread Adam Warrington

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/547/#review383
---



trunk/src/org/apache/pig/backend/hadoop/streaming/HadoopExecutableManager.java


Referencing PigMapReduce.sJobContext may cause a race condition in local 
Pig jobs, similar to what is described in PIG-1831. Should a similar fix be 
applied where the context in PigMapReduce is in thread local storage?
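As a sketch of the thread-local approach being suggested (a simplified model, not the actual PigMapReduce code; JobContextModel and ContextHolder are hypothetical names standing in for the real Hadoop context plumbing):

```java
// Sketch of the PIG-1831-style fix: move the shared static context into
// thread-local storage so concurrent local-mode threads don't race on it.
class JobContextModel {
    final String jobName;
    JobContextModel(String jobName) { this.jobName = jobName; }
}

public class ContextHolder {
    // Racy version: one static field shared by every local-mode thread.
    static JobContextModel sJobContext;

    // Fixed version: each thread sees only the context it set itself.
    static final ThreadLocal<JobContextModel> tJobContext = new ThreadLocal<>();

    static void setContext(JobContextModel ctx) { tJobContext.set(ctx); }
    static JobContextModel getContext() { return tJobContext.get(); }

    // Demonstrates the isolation: a context set on this thread is not
    // visible from another thread.
    static boolean isIsolated() {
        setContext(new JobContextModel("main-job"));
        final JobContextModel[] seen = new JobContextModel[1];
        Thread other = new Thread(() -> seen[0] = getContext());
        other.start();
        try { other.join(); } catch (InterruptedException e) { return false; }
        return seen[0] == null && "main-job".equals(getContext().jobName);
    }
}
```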


- Adam


On 2011-05-19 16:27:22, Adam Warrington wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/547/
> ---
> 
> (Updated 2011-05-19 16:27:22)
> 
> 
> Review request for pig.
> 
> 
> Summary
> ---
> 
> This is a patch for PIG-1702, which describes an issue where the task output 
> logs for Pig streaming jobs contain null input-split information. The 
> ability to query the input-split information through the JobConf went away 
> with the new MR API. We must now obtain a reference to the underlying 
> FileSplit and query it for that information.
> 
> 
> Diffs
> -
> 
>   
> trunk/src/org/apache/pig/backend/hadoop/streaming/HadoopExecutableManager.java
>  1088692 
> 
> Diff: https://reviews.apache.org/r/547/diff
> 
> 
> Testing
> ---
> 
> To test this, I wrote a very simple python script to pass data through using 
> PIG. After checking the task logs of the completed task, the stderr logs now 
> contain valid input split information. Below are the scripts and test data 
> used.
> 
> ### PIG commands run ###
> DEFINE testpy `test.py` SHIP ('test.py');
> raw_records = LOAD '/test.txt2'; 
> T1 = STREAM raw_records THROUGH testpy;
> dump T1;
> 
> ### test.py ###
> #!/usr/bin/python
> import sys
> 
> cnt = 0
> for line in sys.stdin:
> print line.strip() + " " + str(cnt)
> cnt += 1
> 
> ### contents of /test.txt on hdfs ###
> one line
> two line
> three line
> four line
> 
> 
> Thanks,
> 
> Adam
> 
>



[jira] [Created] (PIG-2102) MonitoredUDF does not work

2011-05-31 Thread Alan Gates (JIRA)
MonitoredUDF does not work
--

 Key: PIG-2102
 URL: https://issues.apache.org/jira/browse/PIG-2102
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.8.1, 0.8.0, 0.9.0
Reporter: Alan Gates


The MonitoredUDF feature doesn't work.  When a UDF is annotated with it, job 
setup fails with an internal error.  The stack is long, but the salient line 
appears to be:

{code}
Caused by: java.io.IOException: Serialization error: 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.util.MonitoredUDFExecutor
{code}

I think making this class implement Serializable would solve the issue.
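A minimal stand-alone model of the failure and the proposed one-line fix. BrokenExecutor and FixedExecutor are hypothetical stand-ins for MonitoredUDFExecutor, which is what the real patch would touch.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;

// Model of the failure: a class that is shipped to the backend but does
// not implement Serializable cannot survive Java serialization.
class BrokenExecutor {
    int timeoutSecs = 10;
}

// The suggested fix: make the class Serializable.
class FixedExecutor implements java.io.Serializable {
    private static final long serialVersionUID = 1L;
    int timeoutSecs = 10;
}

public class SerializationDemo {
    // Returns true if obj survives a serialization round trip.
    static boolean serializes(Object obj) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            new ObjectOutputStream(bos).writeObject(obj);
            return true;
        } catch (NotSerializableException e) {
            return false;   // the "Serialization error" case from the stack trace
        } catch (IOException e) {
            return false;
        }
    }
}
```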



[jira] [Updated] (PIG-2100) 'explain -script' does not perform parameter substitution for parameters specified on commandline

2011-05-31 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-2100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-2100:


Fix Version/s: (was: 0.9.0)

> 'explain -script' does not perform parameter substitution for parameters 
> specified on commandline
> -
>
> Key: PIG-2100
> URL: https://issues.apache.org/jira/browse/PIG-2100
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.8.0, 0.8.1, 0.9.0
>Reporter: Thejas M Nair
>
> {code}
> # the file
> $  cat t.pig
> a = load '$file' as (a0, a1);
> dump a;
> # parameter on commandline gets substituted 
> $ java -Xmx500m  -classpath pig.jar org.apache.pig.Main -x local  -dryrun -p 
> file=x t.pig
> 2011-05-31 14:00:24,999 [main] INFO  org.apache.pig.Main - Logging error 
> messages to: /Users/tejas/pig_lpgen_2083/trunk/pig_1306875624997.log
> 2011-05-31 14:00:25,321 [main] INFO  org.apache.pig.Main - Dry run completed. 
> Substituted pig script is at t.pig.substituted
> $ cat t.pig.substituted 
> a = load 'x' as (a0, a1);
> dump a;
> # but param in commandline does not get used for explain command, and it fails
> java -Xmx500m  -classpath pig.jar org.apache.pig.Main -x local -p file=x  
> -e 'explain -script t.pig;'
> 2011-05-31 14:01:07,217 [main] INFO  org.apache.pig.Main - Logging error 
> messages to: /Users/tejas/pig_lpgen_2083/trunk/pig_1306875667215.log
> 2011-05-31 14:01:07,364 [main] INFO  
> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting 
> to hadoop file system at: file:///
> 2011-05-31 14:01:07,547 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
> 2999: Unexpected internal error. Undefined parameter : file
> # parameter gets substituted when specified using %declare statement.
> cat t2.pig
> %declare file x
> a = load '$file' as (a0, a1);
> dump a;
> java -Xmx500m  -classpath pig.jar org.apache.pig.Main -x local -p file=x  
> -e 'explain -script t2.pig;'
> ..
> 2011-05-31 14:01:44,059 [main] WARN  org.apache.pig.tools.grunt.GruntParser - 
> 'dump' statement is ignored while processing 'explain -script' or '-check'
> Logical plan is empty.
> Physical plan is empty.
> Execution plan is empty.
> {code}



[jira] [Created] (PIG-2101) Registering a Python function in a directory other than the current working directory fails

2011-05-31 Thread Alan Gates (JIRA)
Registering a Python function in a directory other than the current working 
directory fails
---

 Key: PIG-2101
 URL: https://issues.apache.org/jira/browse/PIG-2101
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.8.1
Reporter: Alan Gates


In MapReduce mode, if the register command references a directory other than 
the current one, executing the Python UDF on the backend fails with:   
Deserialization error: could not instantiate 
'org.apache.pig.scripting.jython.JythonFunction' with arguments 
'[../udfs/python/production.py, production]'

I assume it is using the path on the backend to try to locate the UDF.

The script is:

{code}
register '../udfs/python/production.py' using jython as bballudfs;
players  = load 'baseball' as (name:chararray, team:chararray,
pos:bag{t:(p:chararray)}, bat:map[]);
nonnull  = filter players by bat#'slugging_percentage' is not null and
bat#'on_base_percentage' is not null;
calcprod = foreach nonnull generate name, bballudfs.production(
(float)bat#'slugging_percentage',
(float)bat#'on_base_percentage');
dump calcprod;
{code}



[jira] [Created] (PIG-2100) 'explain -script' does not perform parameter substitution for parameters specified on commandline

2011-05-31 Thread Thejas M Nair (JIRA)
'explain -script' does not perform parameter substitution for parameters 
specified on commandline
-

 Key: PIG-2100
 URL: https://issues.apache.org/jira/browse/PIG-2100
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.8.1, 0.8.0, 0.9.0
Reporter: Thejas M Nair
 Fix For: 0.9.0


{code}
# the file
$  cat t.pig
a = load '$file' as (a0, a1);
dump a;

# parameter on commandline gets substituted 
$ java -Xmx500m  -classpath pig.jar org.apache.pig.Main -x local  -dryrun -p 
file=x t.pig
2011-05-31 14:00:24,999 [main] INFO  org.apache.pig.Main - Logging error 
messages to: /Users/tejas/pig_lpgen_2083/trunk/pig_1306875624997.log
2011-05-31 14:00:25,321 [main] INFO  org.apache.pig.Main - Dry run completed. 
Substituted pig script is at t.pig.substituted

$ cat t.pig.substituted 
a = load 'x' as (a0, a1);
dump a;

# but param in commandline does not get used for explain command, and it fails

java -Xmx500m  -classpath pig.jar org.apache.pig.Main -x local -p file=x  -e 
'explain -script t.pig;'
2011-05-31 14:01:07,217 [main] INFO  org.apache.pig.Main - Logging error 
messages to: /Users/tejas/pig_lpgen_2083/trunk/pig_1306875667215.log
2011-05-31 14:01:07,364 [main] INFO  
org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to 
hadoop file system at: file:///
2011-05-31 14:01:07,547 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
2999: Unexpected internal error. Undefined parameter : file

# parameter gets substituted when specified using %declare statement.
cat t2.pig
%declare file x
a = load '$file' as (a0, a1);
dump a;

java -Xmx500m  -classpath pig.jar org.apache.pig.Main -x local -p file=x  -e 
'explain -script t2.pig;'
..
2011-05-31 14:01:44,059 [main] WARN  org.apache.pig.tools.grunt.GruntParser - 
'dump' statement is ignored while processing 'explain -script' or '-check'
Logical plan is empty.
Physical plan is empty.
Execution plan is empty.

{code}



Re: No of reducers

2011-05-31 Thread Thejas M Nair
In Pig 0.8, the default number of reducers changed from 1 to a value computed 
from the input data size - 
http://pig.apache.org/docs/r0.8.1/cookbook.html#Use+the+Parallel+Features
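A rough sketch of that size-based heuristic, assuming the documented defaults of roughly 1 GB per reducer (pig.exec.reducers.bytes.per.reducer) and a cap of 999 (pig.exec.reducers.max); the cookbook link above is the authoritative description.

```java
public class ReducerEstimate {
    // Approximation of the Pig 0.8 default-parallelism heuristic:
    // one reducer per bytesPerReducer of input, at least 1, at most
    // maxReducers. The exact rounding in Pig may differ slightly.
    static int estimateReducers(long totalInputBytes, long bytesPerReducer,
                                int maxReducers) {
        int reducers = (int) Math.ceil((double) totalInputBytes / bytesPerReducer);
        reducers = Math.max(reducers, 1);        // never fewer than one reducer
        return Math.min(reducers, maxReducers);  // respect the configured cap
    }
}
```

Setting default_parallel (or PARALLEL on an operator) still overrides this estimate.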

-Thejas


On 5/27/11 6:46 AM, "Jonathan Coveney"  wrote:

SET default_parallel X; will set the PARALLEL keyword for all parallel
functions (ie set the reducers for the job)

I am not sure how the default is calculated...for a while it was set to 1 I
believe, ostensibly to force people to set it to something more reasonable.

2011/5/27 Harsh J 

> The PARALLEL keyword controls the number of reducers used in the job.
> If unspecified, a default number is applied. Is this what you're
> looking for?
>
> On Fri, May 27, 2011 at 3:46 PM, Sudharsan Sampath
>  wrote:
> >
> > Hi,
> >
> > Is there a reference on how the number of reducers required for a job is
> calculated?
> >
> > Thanks
> > Sudharsan S
> >
> >
>
>
>
> --
> Harsh J
>




A question about types

2011-05-31 Thread Jonathan Coveney
Disclaimer: I'm still learning my way around the Pig and Hadoop internals,
so this question is aimed at better understanding them and some of the Pig
design choices...

Is there a reason why in Pig we are restricted to a fixed set of types (roughly
corresponding to types in Java), instead of having an abstract type like
Hadoop's Writable or WritableComparable? I got to thinking about this while
looking at the Algebraic interface... in Hadoop, if you want some crazy
intermediate objects, you can use them easily as long as they are serializable
(i.e. Writable, or WritableComparable if they are going to the reducer in the
shuffle). In fact, in Hadoop there is no notion of a special class of objects
we work with -- everything is simply Writable or WritableComparable. In Pig we
are more limited, and I was just thinking about why that needs to be the case.
Is there any reason why we can't have abstract types at the same level as
String or Integer? My guess would be that it has to do with how these objects
are treated internally, but beyond that I am not sure.

Thanks for helping me think about this
Jon