[jira] Subscription: PIG patch available

2014-05-02 Thread jira
Issue Subscription
Filter: PIG patch available (13 issues)

Subscriber: pigdaily

Key         Summary
PIG-3902    PigServer creates cycle
            https://issues.apache.org/jira/browse/PIG-3902
PIG-3901    Organize the Pig properties file and document all properties
            https://issues.apache.org/jira/browse/PIG-3901
PIG-3877    Getting Geo Latitude/Longitude from Address Lines
            https://issues.apache.org/jira/browse/PIG-3877
PIG-3874    FileLocalizer temp path can sometimes be non-unique
            https://issues.apache.org/jira/browse/PIG-3874
PIG-3873    Geo distance calculation using Haversine
            https://issues.apache.org/jira/browse/PIG-3873
PIG-3867    Added hadoop home to build classpath for build pig with unit test on windows
            https://issues.apache.org/jira/browse/PIG-3867
PIG-3866    Create ThreadLocal classloader per PigContext
            https://issues.apache.org/jira/browse/PIG-3866
PIG-3861    duplicate jars get added to distributed cache
            https://issues.apache.org/jira/browse/PIG-3861
PIG-3825    Stats collection needs to be changed for hadoop2 (with auto local mode)
            https://issues.apache.org/jira/browse/PIG-3825
PIG-3668    COR built-in function when atleast one of the coefficient values is NaN
            https://issues.apache.org/jira/browse/PIG-3668
PIG-3635    Fix e2e tests for Hadoop 2.X on Windows
            https://issues.apache.org/jira/browse/PIG-3635
PIG-3587    add functionality for rolling over dates
            https://issues.apache.org/jira/browse/PIG-3587
PIG-3441    Allow Pig to use default resources from Configuration objects
            https://issues.apache.org/jira/browse/PIG-3441

You may edit this subscription at:
https://issues.apache.org/jira/secure/FilterSubscription!default.jspa?subId=13225&filterId=12322384


[jira] [Resolved] (PIG-3880) After compiling trunk, I am seeing ClassLoaderObjectInputStream ClassNotFoundException.

2014-05-02 Thread David Medinets (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Medinets resolved PIG-3880.
-

Resolution: Fixed

I feel the consensus is: don't run old versions of Hadoop. And I can 
live with that.

> After compiling trunk, I am seeing ClassLoaderObjectInputStream 
> ClassNotFoundException.
> ---
>
> Key: PIG-3880
> URL: https://issues.apache.org/jira/browse/PIG-3880
> Project: Pig
>  Issue Type: Bug
>  Components: grunt
>Affects Versions: 0.13.0
>Reporter: David Medinets
>
> I pulled trunk from subversion using the following commands:
> mkdir pig
> cd pig
> svn co http://svn.apache.org/repos/asf/pig/trunk
> cd trunk
> ant
> export PATH=$PATH:$HOME/pig/trunk/bin
> export ACCUMULO_HOME=/opt/accumulo
> export HADOOP_HOME=/opt/hadoop
> export PIG_HOME=$HOME/pig/trunk
> export PIG_CLASSPATH="$HOME/pig/trunk/build/ivy/lib/Pig/*"
> export PIG_CLASSPATH="$ACCUMULO_HOME/lib/*:$PIG_CLASSPATH"
> cd ~
> pig
> Then I ran into this error:
> java.lang.NoClassDefFoundError: 
> org/apache/commons/io/input/ClassLoaderObjectInputStream
>   at org.apache.pig.Main.run(Main.java:399)
> When I changed PIG_JAR to use the fat jar, I was able to run the pig command 
> without getting the exception.
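One way to investigate a NoClassDefFoundError like the one above is to look for the jar that actually provides the missing class. The following is only a diagnostic sketch in Python, not part of Pig; the helper name jars_containing is made up for illustration, and it assumes Java-style classpath entries in which a trailing * stands for every jar in a directory.

```python
import glob
import os
import zipfile


def jars_containing(class_name, classpath):
    """Return the jars on `classpath` that contain `class_name`.

    `classpath` uses the platform separator (os.pathsep); an entry ending
    in '*' is expanded to the files in that directory.
    """
    entry = class_name.replace(".", "/") + ".class"
    hits = []
    for element in classpath.split(os.pathsep):
        candidates = glob.glob(element) if element.endswith("*") else [element]
        for path in sorted(candidates):
            # Only inspect readable jar (zip) files; skip everything else.
            if path.lower().endswith(".jar") and zipfile.is_zipfile(path):
                with zipfile.ZipFile(path) as jar:
                    if entry in jar.namelist():
                        hits.append(path)
    return hits
```

Called with the value of PIG_CLASSPATH and the class name from the stack trace, an empty result would suggest that no jar on that path ships the class.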



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (PIG-3373) XMLLoader returns non-matching nodes when a tag name spans through the block boundary

2014-05-02 Thread Cheolsoo Park (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13988142#comment-13988142
 ] 

Cheolsoo Park commented on PIG-3373:


Isn't this addressed as part of PIG-3865 (rewrite of XMLLoader)? If so, can we 
close this jira? 

> XMLLoader returns non-matching nodes when a tag name spans through the block 
> boundary
> -
>
> Key: PIG-3373
> URL: https://issues.apache.org/jira/browse/PIG-3373
> Project: Pig
>  Issue Type: Bug
>  Components: piggybank
>Affects Versions: site
>Reporter: Ahmed Eldawy
>Assignee: Ahmed Eldawy
>  Labels: patch
> Attachments: PIG3373.patch, PIG3373_1.patch, PIG3373_2.patch, 
> PIG3373_3.patch, bad-file.xml.bz2, test-file-2.xml.bz2
>
>
> When a node's start tag spans two blocks, the tag is returned even if it is 
> not of the requested type.
> Example: For the following input file
> 
>   BLOCK BOUNDARY
> entually id="dfasd">
> XMLLoader with tag type 'event' should return only the first one, but it 
> actually returns both of them.
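The kind of mismatch described above can be illustrated with a small Python sketch. It is not the XMLLoader code, and the sample input is invented for illustration: it assumes a second tag named 'eventually' whose name is split across the block boundary. The point is that a start tag only matches when the full tag name is followed by a delimiter, which requires looking past the boundary.

```python
def starts_tag(text, pos, tag):
    """True if text[pos:] begins a start tag whose full name is `tag`."""
    prefix = "<" + tag
    if not text.startswith(prefix, pos):
        return False
    end = pos + len(prefix)
    # The name must end here: whitespace, '/' or '>' has to follow.
    return end < len(text) and text[end] in " \t\r\n/>"


def naive_matches(blocks, tag):
    """Prefix-only check: also accepts longer names such as 'eventually'."""
    text = "".join(blocks)
    return [i for i in range(len(text)) if text.startswith("<" + tag, i)]


def strict_matches(blocks, tag):
    """Checks the complete tag name, even when it spans two blocks."""
    text = "".join(blocks)
    return [i for i in range(len(text)) if starts_tag(text, i, tag)]


# Hypothetical input: the second start tag is cut at the block boundary.
BLOCKS = ['<event id="1">a</event><ev', 'entually id="2">b</eventually>']
```

Here naive_matches(BLOCKS, "event") reports two offsets, while strict_matches(BLOCKS, "event") reports only the first tag. A real loader would not concatenate whole blocks; it would read just far enough into the next block to finish the tag name.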





[jira] [Updated] (PIG-3373) XMLLoader returns non-matching nodes when a tag name spans through the block boundary

2014-05-02 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates updated PIG-3373:


Status: Open  (was: Patch Available)

Sorry, but the patch no longer applies and I couldn't figure out how to apply 
it manually.

> XMLLoader returns non-matching nodes when a tag name spans through the block 
> boundary
> -
>
> Key: PIG-3373
> URL: https://issues.apache.org/jira/browse/PIG-3373
> Project: Pig
>  Issue Type: Bug
>  Components: piggybank
>Affects Versions: site
>Reporter: Ahmed Eldawy
>Assignee: Ahmed Eldawy
>  Labels: patch
> Attachments: PIG3373.patch, PIG3373_1.patch, PIG3373_2.patch, 
> PIG3373_3.patch, bad-file.xml.bz2, test-file-2.xml.bz2
>
>
> When a node's start tag spans two blocks, the tag is returned even if it is 
> not of the requested type.
> Example: For the following input file
> 
>   BLOCK BOUNDARY
> entually id="dfasd">
> XMLLoader with tag type 'event' should return only the first one, but it 
> actually returns both of them.





[jira] [Updated] (PIG-3916) isEmpty should not be early terminating

2014-05-02 Thread Rohini Palaniswamy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohini Palaniswamy updated PIG-3916:


  Resolution: Fixed
Hadoop Flags: Reviewed
  Status: Resolved  (was: Patch Available)

Committed to branch-0.12 and trunk. Thanks Daniel for the review.

> isEmpty should not be early terminating
> ---
>
> Key: PIG-3916
> URL: https://issues.apache.org/jira/browse/PIG-3916
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.11
>Reporter: Rohini Palaniswamy
>Assignee: Rohini Palaniswamy
>Priority: Critical
> Fix For: 0.13.0, 0.12.2
>
> Attachments: PIG-3916-1.patch
>
>
> PIG-2066 makes isEmpty early terminating which is very wrong. When there is a 
> binary condition with isEmpty() followed by something like SUM, it skips 
> records leading to wrong results. 





[jira] [Created] (PIG-3918) Implement PPNL for Tez mode (Pig side changes)

2014-05-02 Thread Cheolsoo Park (JIRA)
Cheolsoo Park created PIG-3918:
--

 Summary: Implement PPNL for Tez mode (Pig side changes)
 Key: PIG-3918
 URL: https://issues.apache.org/jira/browse/PIG-3918
 Project: Pig
  Issue Type: Sub-task
  Components: tez
Affects Versions: tez-branch
Reporter: Cheolsoo Park
 Fix For: tez-branch


Currently, no event notifications are sent from Pig to PPNL in Tez mode. For 
example,
{code}
emitInitialPlanNotification
emitLaunchStartedNotification
emitJobsSubmittedNotification
etc.
{code}

In addition, Pig should expose progress status information to PPNL.





[jira] [Commented] (PIG-3916) isEmpty should not be early terminating

2014-05-02 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13987909#comment-13987909
 ] 

Daniel Dai commented on PIG-3916:
-

+1 for the patch. As I mentioned before, let's revert isEmpty first. 

> isEmpty should not be early terminating
> ---
>
> Key: PIG-3916
> URL: https://issues.apache.org/jira/browse/PIG-3916
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.11
>Reporter: Rohini Palaniswamy
>Assignee: Rohini Palaniswamy
>Priority: Critical
> Fix For: 0.13.0, 0.12.2
>
> Attachments: PIG-3916-1.patch
>
>
> PIG-2066 makes isEmpty early terminating which is very wrong. When there is a 
> binary condition with isEmpty() followed by something like SUM, it skips 
> records leading to wrong results. 





[jira] [Updated] (PIG-3902) PigServer creates cycle

2014-05-02 Thread Jacob Perkins (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jacob Perkins updated PIG-3902:
---

Fix Version/s: 0.11.1
 Assignee: Cheolsoo Park
Affects Version/s: 0.11.1
   Status: Patch Available  (was: Open)

This goes deep. I updated the LogicalPlanBuilder such that, when building a 
load op, it checks the plan that's been generated "so far" for stores that 
match the location the load references. If so, it links them. 

Some of the multiquery compilation code exploited the bug that 
plan.getSinks() always returned the stores. You could get away with that before 
the "postProcess" method on PigServer got called, because loads were not yet 
linked to the stores they depended on; hence, all the stores were leaves of the 
plan. With this patch, that exploit no longer works. To get the stores, you now 
have to get the operators of the plan explicitly and build a list of stores. 
Presumably the same assumption is made elsewhere, but I only ran the multiquery 
tests with my patch.
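
The effect on getSinks() can be seen in a toy model. This is a Python sketch under stated assumptions, not the LogicalPlanBuilder code: the Op and Plan classes and all method names are hypothetical, and locations are compared as plain strings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    kind: str          # "load", "store", or "other"
    name: str
    location: str = ""


class Plan:
    def __init__(self):
        self.ops = []
        self.successors = {}

    def add(self, op, after=None):
        """Add an operator, optionally connected after an existing one."""
        self.ops.append(op)
        self.successors[op] = []
        if after is not None:
            self.successors[after].append(op)
        return op

    def add_load(self, name, location):
        """Add a load and link it to any earlier store with the same location."""
        load = self.add(Op("load", name, location))
        for store in self.stores():
            if store.location == location:
                self.successors[store].append(load)
        return load

    def sinks(self):
        """Operators without successors (leaves of the plan)."""
        return [op for op in self.ops if not self.successors[op]]

    def stores(self):
        """All store operators, collected explicitly."""
        return [op for op in self.ops if op.kind == "store"]


def build_example():
    plan = Plan()
    a = plan.add_load("A", "in")
    store_tmp = plan.add(Op("store", "store_tmp", "tmp"), after=a)
    c = plan.add_load("C", "tmp")      # linked to store_tmp on creation
    store_out = plan.add(Op("store", "store_out", "out"), after=c)
    return plan, store_tmp, store_out
```

In this model, once the load of 'tmp' is linked to the store that writes 'tmp', that store is no longer a leaf, so sinks() returns only the final store while stores() still returns both.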

> PigServer creates cycle
> ---
>
> Key: PIG-3902
> URL: https://issues.apache.org/jira/browse/PIG-3902
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.11.1
>Reporter: Jacob Perkins
>Assignee: Cheolsoo Park
> Fix For: 0.11.1
>
> Attachments: broken-plan.png, cycle.diff, multiquery-cycle.diff
>
>
> Under certain conditions PigServer creates a cycle in the logical plan. 
> Consider the following pseudocode:
> {code}
> A = load from 'A' using F1;
> ...process...
> B = store X into 'B' using F2;
> C = load from 'B' using F3;
> ...process...
> D = store Y into 'A' using F1;
> {code}
> PigServer will, in ordinary cases, notice that an output path is equal to an 
> input path, and, if there's no path from the input to the output, make the 
> input a dependency of the output. However, PigServer orders the loads and 
> stores arbitrarily during that logic. Sometimes, in the code above, C is 
> correctly wired as a dependency of B and, since that creates a path from A to 
> D, A won't be made a dependency of D and we're good. On occasion though, the 
> ordering being arbitrary, A is wired as a dependency of D. That's no good. To 
> be fair, it's not actually a cycle, since when A is wired to D, there's a 
> path between C and B so the cycle won't actually get created. But it's still 
> a broken plan.
> The offending PigServer code: 
> https://github.com/apache/pig/blob/branch-0.11/src/org/apache/pig/PigServer.java#L1678-L1693
> And here's some actual pig code that should reproduce the broken plan. Notice 
> I had to use a store function that wouldn't check the output. If you're just 
> using PigStorage this won't be reproducible since you can't write to the same 
> location you read from in that case.
> {code}
> A = load '$A' as (line:chararray);
> A = foreach A generate flatten(TOKENIZE(LOWER(line))) as token;
> store A into '$B';
> B = load '$B' as (token:chararray);
> B = filter B by SIZE(token) > 3;
> store B into '$A' using 
> org.apache.pig.piggybank.storage.DBStorage('com.mysql.jdbc.Driver', 
> 'jdbc:mysql://localhost/test', 'INSERT INTO foobar (token) VALUES(?)');
> {code}
> As far as a fix goes... I'd love some input. I've got some workarounds in 
> mind for the specific use case that brought this up, but the general problem 
> is more difficult. 
> As an aside, there's other issues with the PigServer code referenced above. 
> For example, it should almost certainly be using the full path (after 
> LoadFunc/StoreFunc.relativeToAbsolutePath) no? Try storing to a relative path 
> then loading from the absolute representation of that path in the same 
> script... Also, why isn't it checking the FuncSpec as well as the location? 
> Just trying to open up the discussion.
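
The order dependence described in the issue can be reproduced with a toy model. This is a Python sketch of the behaviour as described, not the PigServer code; the edge representation and the function names are assumptions made for illustration. Operators A..D follow the pseudocode above: A feeds store B, C feeds store D, B writes the location C reads, and D writes the location A reads.

```python
def has_path(edges, src, dst):
    """Depth-first search: is dst reachable from src?"""
    stack, seen = [src], set()
    while stack:
        node = stack.pop()
        if node == dst:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(edges.get(node, ()))
    return False


def wire(pairs):
    """Process (store, load) pairs that share a location, in the given order.

    A store is wired in front of the load unless the load already
    reaches the store.
    """
    edges = {"A": ["B"], "C": ["D"]}
    for store, load in pairs:
        if not has_path(edges, load, store):
            edges.setdefault(store, []).append(load)
    return edges


# B -> C is wired first, so A already reaches D and nothing else is added.
good = wire([("B", "C"), ("D", "A")])
# D -> A is wired first; C then already reaches B, so B -> C is skipped.
bad = wire([("D", "A"), ("B", "C")])
```

In the second ordering the plan runs C, D, A, B, so C reads location 'B' before B has written it. No cycle is created, but the plan is broken, which matches the description above.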





[jira] [Updated] (PIG-3902) PigServer creates cycle

2014-05-02 Thread Jacob Perkins (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jacob Perkins updated PIG-3902:
---

Attachment: multiquery-cycle.diff

> PigServer creates cycle
> ---
>
> Key: PIG-3902
> URL: https://issues.apache.org/jira/browse/PIG-3902
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.11.1
>Reporter: Jacob Perkins
>Assignee: Cheolsoo Park
> Fix For: 0.11.1
>
> Attachments: broken-plan.png, cycle.diff, multiquery-cycle.diff
>
>
> Under certain conditions PigServer creates a cycle in the logical plan. 
> Consider the following pseudocode:
> {code}
> A = load from 'A' using F1;
> ...process...
> B = store X into 'B' using F2;
> C = load from 'B' using F3;
> ...process...
> D = store Y into 'A' using F1;
> {code}
> PigServer will, in ordinary cases, notice that an output path is equal to an 
> input path, and, if there's no path from the input to the output, make the 
> input a dependency of the output. However, PigServer orders the loads and 
> stores arbitrarily during that logic. Sometimes, in the code above, C is 
> correctly wired as a dependency of B and, since that creates a path from A to 
> D, A won't be made a dependency of D and we're good. On occasion though, the 
> ordering being arbitrary, A is wired as a dependency of D. That's no good. To 
> be fair, it's not actually a cycle, since when A is wired to D, there's a 
> path between C and B so the cycle won't actually get created. But it's still 
> a broken plan.
> The offending PigServer code: 
> https://github.com/apache/pig/blob/branch-0.11/src/org/apache/pig/PigServer.java#L1678-L1693
> And here's some actual pig code that should reproduce the broken plan. Notice 
> I had to use a store function that wouldn't check the output. If you're just 
> using PigStorage this won't be reproducible since you can't write to the same 
> location you read from in that case.
> {code}
> A = load '$A' as (line:chararray);
> A = foreach A generate flatten(TOKENIZE(LOWER(line))) as token;
> store A into '$B';
> B = load '$B' as (token:chararray);
> B = filter B by SIZE(token) > 3;
> store B into '$A' using 
> org.apache.pig.piggybank.storage.DBStorage('com.mysql.jdbc.Driver', 
> 'jdbc:mysql://localhost/test', 'INSERT INTO foobar (token) VALUES(?)');
> {code}
> As far as a fix goes... I'd love some input. I've got some workarounds in 
> mind for the specific use case that brought this up, but the general problem 
> is more difficult. 
> As an aside, there's other issues with the PigServer code referenced above. 
> For example, it should almost certainly be using the full path (after 
> LoadFunc/StoreFunc.relativeToAbsolutePath) no? Try storing to a relative path 
> then loading from the absolute representation of that path in the same 
> script... Also, why isn't it checking the FuncSpec as well as the location? 
> Just trying to open up the discussion.





[jira] [Commented] (PIG-3916) isEmpty should not be early terminating

2014-05-02 Thread Rohini Palaniswamy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13987788#comment-13987788
 ] 

Rohini Palaniswamy commented on PIG-3916:
-

Tried to see if terminating only isEmpty() was possible, but it is a lot more 
work. For now, reverting isEmpty() as it causes data loss. 

A more detailed description of the impact, so users can assess the damage.
Impact:
The isEmpty() function in a FOREACH clause causes wrong results when used
after a GROUP or COGROUP clause when accumulative mode is used (i.e., there are
no non-Accumulator UDFs in the FOREACH clause). Also saw isEmpty() being 
evaluated to true when it was not.

For example:
   - If the FOREACH had only isEmpty() and SUM() functions in a binary
condition, which are both Accumulator UDFs, then wrong results are produced for
the SUM(): it sums up only the first batch and terminates after that.
SUM, COUNT, MIN, and MAX are all Accumulator UDFs.
   - If the FOREACH had any non-Accumulator UDF (for example, STRSPLIT()),
then it is not affected, as accumulative mode would not be used. 
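
The first case can be illustrated with a toy model of accumulative mode. This is a Python sketch, not Pig's implementation: the class names, the is_finished() method, and the batching loop are assumptions made for illustration. It shows how stopping the shared loop as soon as one UDF reports it is finished starves the other UDFs of later batches.

```python
class Sum:
    """Needs every batch to produce a correct total."""

    def __init__(self):
        self.total = 0

    def accumulate(self, batch):
        self.total += sum(batch)

    def is_finished(self):
        return False


class IsEmpty:
    """Knows its answer as soon as it has seen one record."""

    def __init__(self):
        self.empty = True

    def accumulate(self, batch):
        if batch:
            self.empty = False

    def is_finished(self):
        return not self.empty


def run(batches, udfs, stop_when_any_finished):
    """Feed each batch to every UDF, optionally stopping early."""
    for batch in batches:
        for udf in udfs:
            udf.accumulate(batch)
        if stop_when_any_finished and any(u.is_finished() for u in udfs):
            break
    return udfs


BATCHES = [[1, 2], [3, 4], [5]]
early_sum = run(BATCHES, [IsEmpty(), Sum()], stop_when_any_finished=True)[1].total
full_sum = run(BATCHES, [IsEmpty(), Sum()], stop_when_any_finished=False)[1].total
```

With early termination the sum covers only the first batch (3 here), whereas the full run gives 15.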

> isEmpty should not be early terminating
> ---
>
> Key: PIG-3916
> URL: https://issues.apache.org/jira/browse/PIG-3916
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.11
>Reporter: Rohini Palaniswamy
>Assignee: Rohini Palaniswamy
>Priority: Critical
> Fix For: 0.13.0, 0.12.2
>
> Attachments: PIG-3916-1.patch
>
>
> PIG-2066 makes isEmpty early terminating which is very wrong. When there is a 
> binary condition with isEmpty() followed by something like SUM, it skips 
> records leading to wrong results. 





[jira] [Created] (PIG-3917) Task process exit with nonzero status of -1. how to solve this?

2014-05-02 Thread Devendran s (JIRA)
Devendran s created PIG-3917:


 Summary:  Task process exit with nonzero status of -1. how to 
solve this?
 Key: PIG-3917
 URL: https://issues.apache.org/jira/browse/PIG-3917
 Project: Pig
  Issue Type: Bug
  Components: build, grunt
Affects Versions: 0.12.0
 Environment: windows 8
Reporter: Devendran s
 Fix For: 0.12.0


I tried to run Pig scripts on a Windows 8 system using Cygwin. When I run 
Pig scripts in local mode it works, but in mapreduce mode it shows the following 
error.

2014-05-02 14:47:35,345 INFO org.apache.hadoop.mapred.TaskTracker: addFreeSlot : current free slots : 6
2014-05-02 14:47:35,591 INFO org.apache.hadoop.mapred.TaskTracker: LaunchTaskAction (registerTask): attempt_201405021138_0010_m_01_2 task's state:UNASSIGNED
2014-05-02 14:47:35,591 INFO org.apache.hadoop.mapred.TaskTracker: Trying to launch : attempt_201405021138_0010_m_01_2 which needs 1 slots
2014-05-02 14:47:35,591 INFO org.apache.hadoop.mapred.TaskTracker: In TaskLauncher, current free slots : 6 and trying to launch attempt_201405021138_0010_m_01_2 which needs 1 slots
2014-05-02 14:47:35,679 INFO org.apache.hadoop.mapred.JvmManager: In JvmRunner constructed JVM ID: jvm_201405021138_0010_m_949588628
2014-05-02 14:47:35,680 INFO org.apache.hadoop.mapred.JvmManager: JVM Runner jvm_201405021138_0010_m_949588628 spawned.
2014-05-02 14:47:35,685 INFO org.apache.hadoop.mapred.JvmManager: JVM Not killed jvm_201405021138_0010_m_949588628 but just removed
2014-05-02 14:47:35,685 INFO org.apache.hadoop.mapred.JvmManager: JVM : jvm_201405021138_0010_m_949588628 exited with exit code -1. Number of tasks it ran: 0
2014-05-02 14:47:35,685 WARN org.apache.hadoop.mapred.TaskRunner: attempt_201405021138_0010_m_01_2 : Child Error
java.io.IOException: Task process exit with nonzero status of -1.
    at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:258).

Please help to find out the solution for the above.





[jira] [Commented] (PIG-3558) ORC support for Pig

2014-05-02 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13987469#comment-13987469
 ] 

Daniel Dai commented on PIG-3558:
-

[~chitnis] Not sure if I understand correctly. Hive BINARY has a clear meaning, 
but Pig bytearray does not. It means an unknown datatype, where the user does 
not explicitly declare the datatype. The real data can be anything, not just 
DataByteArray. So it should be safe to convert Hive BINARY to Pig bytearray, but 
not vice versa.

> ORC support for Pig
> ---
>
> Key: PIG-3558
> URL: https://issues.apache.org/jira/browse/PIG-3558
> Project: Pig
>  Issue Type: Improvement
>  Components: impl
>Reporter: Daniel Dai
>Assignee: Daniel Dai
>  Labels: porc
> Fix For: 0.13.0
>
> Attachments: PIG-3558-1.patch, PIG-3558-2.patch, PIG-3558-3.patch, 
> PIG-3558-4.patch, PIG-3558-5.patch
>
>
> Adding LoadFunc and StoreFunc for ORC.





[jira] [Commented] (PIG-3916) isEmpty should not be early terminating

2014-05-02 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13987461#comment-13987461
 ] 

Daniel Dai commented on PIG-3916:
-

It seems isEmpty is the main use case for PIG-2066. I am fine with committing 
the patch as an immediate remedy, but we also need to either roll back PIG-2066 
completely (other early-terminating UDFs will share the same issue) or redo 
PIG-2066 the right way (terminate only IsEmpty, but not SUM).

> isEmpty should not be early terminating
> ---
>
> Key: PIG-3916
> URL: https://issues.apache.org/jira/browse/PIG-3916
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.11
>Reporter: Rohini Palaniswamy
>Assignee: Rohini Palaniswamy
>Priority: Critical
> Fix For: 0.13.0, 0.12.2
>
> Attachments: PIG-3916-1.patch
>
>
> PIG-2066 makes isEmpty early terminating which is very wrong. When there is a 
> binary condition with isEmpty() followed by something like SUM, it skips 
> records leading to wrong results. 


