[VOTE] Release Pig 0.16.0 (candidate 0)

2016-06-01 Thread Daniel Dai
Hi,

I have created a candidate build for Pig 0.16.0.

Keys used to sign the release are available at
http://svn.apache.org/viewvc/pig/trunk/KEYS?view=markup.

Please download, test, and try it out:
http://people.apache.org/~daijy/pig-0.16.0-rc0/

Release notes and the rat report are available at the same location.

Should we release this? Vote closes on Monday EOD, June 6th 2016.

Thanks,
Daniel


[jira] Subscription: PIG patch available

2016-06-01 Thread jira
Issue Subscription
Filter: PIG patch available (31 issues)

Subscriber: pigdaily

Key Summary
PIG-4918  Pig on Tez cannot switch pig.temp.dir to another fs
https://issues.apache.org/jira/browse/PIG-4918
PIG-4916  Pig on Tez fail to remove temporary HDFS files in some cases
https://issues.apache.org/jira/browse/PIG-4916
PIG-4906  Add Bigdecimal functions in Over function
https://issues.apache.org/jira/browse/PIG-4906
PIG-4897  Scope of param substitution for run/exec commands
https://issues.apache.org/jira/browse/PIG-4897
PIG-4896  Param substitution ignored when redefined
https://issues.apache.org/jira/browse/PIG-4896
PIG-4886  Add PigSplit#getLocationInfo to fix the NPE found in log in spark mode
https://issues.apache.org/jira/browse/PIG-4886
PIG-4871  Not use OperatorPlan#forceConnect in MultiQueryOptimizationSpark
https://issues.apache.org/jira/browse/PIG-4871
PIG-4854  Merge spark branch to trunk
https://issues.apache.org/jira/browse/PIG-4854
PIG-4849  pig on tez will cause tez-ui to crash,because the content from timeline server is too long.
https://issues.apache.org/jira/browse/PIG-4849
PIG-4797  Analyze JOIN performance and improve the same.
https://issues.apache.org/jira/browse/PIG-4797
PIG-4788  the value BytesRead metric info always returns 0 even the length of input file is not 0 in spark engine
https://issues.apache.org/jira/browse/PIG-4788
PIG-4745  DataBag should protect content of passed list of tuples
https://issues.apache.org/jira/browse/PIG-4745
PIG-4684  Exception should be changed to warning when job diagnostics cannot be fetched
https://issues.apache.org/jira/browse/PIG-4684
PIG-4656  Improve String serialization and comparator performance in BinInterSedes
https://issues.apache.org/jira/browse/PIG-4656
PIG-4598  Allow user defined plan optimizer rules
https://issues.apache.org/jira/browse/PIG-4598
PIG-4551  Partition filter is not pushed down in case of SPLIT
https://issues.apache.org/jira/browse/PIG-4551
PIG-4539  New PigUnit
https://issues.apache.org/jira/browse/PIG-4539
PIG-4515  org.apache.pig.builtin.Distinct throws ClassCastException
https://issues.apache.org/jira/browse/PIG-4515
PIG-4323  PackageConverter hanging in Spark
https://issues.apache.org/jira/browse/PIG-4323
PIG-4313  StackOverflowError in LIMIT operation on Spark
https://issues.apache.org/jira/browse/PIG-4313
PIG-4251  Pig on Storm
https://issues.apache.org/jira/browse/PIG-4251
PIG-4002  Disable combiner when map-side aggregation is used
https://issues.apache.org/jira/browse/PIG-4002
PIG-3952  PigStorage accepts '-tagSplit' to return full split information
https://issues.apache.org/jira/browse/PIG-3952
PIG-3911  Define unique fields with @OutputSchema
https://issues.apache.org/jira/browse/PIG-3911
PIG-3877  Getting Geo Latitude/Longitude from Address Lines
https://issues.apache.org/jira/browse/PIG-3877
PIG-3873  Geo distance calculation using Haversine
https://issues.apache.org/jira/browse/PIG-3873
PIG-3864  ToDate(userstring, format, timezone) computes DateTime with strange handling of Daylight Saving Time with location based timezones
https://issues.apache.org/jira/browse/PIG-3864
PIG-3851  Upgrade jline to 2.11
https://issues.apache.org/jira/browse/PIG-3851
PIG-3668  COR built-in function when atleast one of the coefficient values is NaN
https://issues.apache.org/jira/browse/PIG-3668
PIG-3587  add functionality for rolling over dates
https://issues.apache.org/jira/browse/PIG-3587
PIG-2315  Make as clause work in generate
https://issues.apache.org/jira/browse/PIG-2315

You may edit this subscription at:
https://issues.apache.org/jira/secure/FilterSubscription!default.jspa?subId=16328&filterId=12322384


[jira] [Updated] (PIG-4903) Avoid add all spark dependency jars to SPARK_YARN_DIST_FILES and SPARK_DIST_CLASSPATH

2016-06-01 Thread liyunzhang_intel (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liyunzhang_intel updated PIG-4903:
--
Attachment: PIG-4903_1.patch

[~rohini]:
 I have submitted PIG-4903_1.patch. This patch requires end-users to set 
SPARK_JAR (the HDFS location of spark-assembly*.jar); otherwise Pig prints an 
error message and exits.
[~sriksun] and [~rohini]:
 As both of you are familiar with this part, can you help review? I will 
include this change in the final patch of Pig on Spark (PIG-4854) after review.
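As a rough sketch, the SPARK_JAR requirement described above could look like the following (the variable name comes from the comment; the function name, error text, and example path are illustrative, not taken from the patch):

```shell
# Hypothetical sketch of the SPARK_JAR check: fail fast when the variable is
# unset, otherwise report the assembly location. Names and messages are
# illustrative only.
require_spark_jar() {
    if [ -z "${SPARK_JAR}" ]; then
        echo "Error: SPARK_JAR must be set to the HDFS location of spark-assembly*.jar" >&2
        return 1
    fi
    echo "Using spark assembly at ${SPARK_JAR}"
}

# Example usage (path is made up):
SPARK_JAR="hdfs:///apps/spark/spark-assembly-1.6.1-hadoop2.6.0.jar"
require_spark_jar
```

The point of failing fast is that a missing assembly location would otherwise surface only later, as an obscure classpath error on the cluster.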

> Avoid add all spark dependency jars to  SPARK_YARN_DIST_FILES and 
> SPARK_DIST_CLASSPATH
> --
>
> Key: PIG-4903
> URL: https://issues.apache.org/jira/browse/PIG-4903
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: liyunzhang_intel
> Attachments: PIG-4903.patch, PIG-4903_1.patch
>
>
> There are some comments about bin/pig on 
> https://reviews.apache.org/r/45667/#comment198955.
> {code}
> # ADDING SPARK DEPENDENCIES ##
> # Spark typically works with a single assembly file. However this
> # assembly isn't available as a artifact to pull in via ivy.
> # To work around this short coming, we add all the jars barring
> # spark-yarn to DIST through dist-files and then add them to classpath
> # of the executors through an independent env variable. The reason
> # for excluding spark-yarn is because spark-yarn is already being added
> # by the spark-yarn-client via jarOf(Client.Class)
> for f in $PIG_HOME/lib/*.jar; do
> if [[ $f == $PIG_HOME/lib/spark-assembly* ]]; then
> # Exclude spark-assembly.jar from shipped jars, but retain in 
> classpath
> SPARK_JARS=${SPARK_JARS}:$f;
> else
> SPARK_JARS=${SPARK_JARS}:$f;
> SPARK_YARN_DIST_FILES=${SPARK_YARN_DIST_FILES},file://$f;
> SPARK_DIST_CLASSPATH=${SPARK_DIST_CLASSPATH}:\${PWD}/`basename $f`
> fi
> done
> CLASSPATH=${CLASSPATH}:${SPARK_JARS}
> export SPARK_YARN_DIST_FILES=`echo ${SPARK_YARN_DIST_FILES} | sed 's/^,//g'`
> export SPARK_JARS=${SPARK_YARN_DIST_FILES}
> export SPARK_DIST_CLASSPATH
> {code}
> Here we first copy all Spark dependency jars (e.g. 
> spark-network-shuffle_2.10-1.6.1.jar) to the distributed cache 
> (SPARK_YARN_DIST_FILES), then add them to the executor classpath 
> (SPARK_DIST_CLASSPATH). Actually we need not ship all these dependency jars, 
> because they are already included in spark-assembly.jar, and 
> spark-assembly.jar is uploaded with the Spark job.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: Is there any code style check tool in pig project?

2016-06-01 Thread Daniel Dai
There is an ant target “checkstyle”, but we haven’t enforced it for a long time. 
The only formatting rule we consistently check is to use spaces instead of tabs.
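As an illustration, the spaces-instead-of-tabs rule can be checked with a small script like the following (the helper name, source path, and file glob are assumptions, not part of Pig's build; `--include` assumes GNU grep):

```shell
# Hypothetical helper for the "spaces, not tabs" rule: list files under a
# directory that contain a literal tab character. Requires GNU grep for
# --include; the path and glob are illustrative.
find_tabs() {
    dir="$1"
    grep -rl "$(printf '\t')" --include='*.java' "$dir" 2>/dev/null
}

# Example usage: report whether any offending files exist.
find_tabs src && echo "tab characters found" || echo "no tabs"
```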

Thanks,
Daniel




On 6/1/16, 7:59 PM, "Zhang, Liyun"  wrote:

>
>Hi all:
>  Now I'm merging the spark branch, and I found that there is a lot of code 
> that needs to be reformatted. Is there any code style check tool in Pig?
>
>
>
>
>Kelly Zhang/Zhang,Liyun
>Best Regards
>


[jira] [Updated] (PIG-4919) Upgrade spark.version to 1.6.1

2016-06-01 Thread liyunzhang_intel (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liyunzhang_intel updated PIG-4919:
--
Attachment: PIG-4919.patch

Changes in PIG-4919.patch:
1. upgrade spark.version to 1.6.1
2. upgrade snappy-java to 1.1.1.3
3. upgrade the scala version of spark-core and spark-yarn to 2.11
4. add a scala-xml dependency, since the Spark UI depends on scala-xml
5. modify GlobalRearrangeConverter and JobMetricsListener due to Spark API 
changes
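The version bumps in the first three items amount to build-property changes along these lines (the file name and property keys below are assumptions inferred from the summary, not the actual diff):

```properties
# Hypothetical property changes; keys inferred from the patch summary above.
spark.version=1.6.1
snappy-java.version=1.1.1.3
scala.version=2.11
```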


> Upgrade spark.version to 1.6.1
> --
>
> Key: PIG-4919
> URL: https://issues.apache.org/jira/browse/PIG-4919
> Project: Pig
>  Issue Type: Sub-task
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
> Attachments: PIG-4919.patch
>
>






[jira] [Created] (PIG-4919) Upgrade spark.version to 1.6.1

2016-06-01 Thread liyunzhang_intel (JIRA)
liyunzhang_intel created PIG-4919:
-

 Summary: Upgrade spark.version to 1.6.1
 Key: PIG-4919
 URL: https://issues.apache.org/jira/browse/PIG-4919
 Project: Pig
  Issue Type: Sub-task
Reporter: liyunzhang_intel
Assignee: liyunzhang_intel








Is there any code style check tool in pig project?

2016-06-01 Thread Zhang, Liyun

Hi all:
  Now I'm merging the spark branch, and I found that there is a lot of code 
that needs to be reformatted. Is there any code style check tool in Pig?




Kelly Zhang/Zhang,Liyun
Best Regards



[jira] [Commented] (PIG-4916) Pig on Tez fail to remove temporary HDFS files in some cases

2016-06-01 Thread Chris Nauroth (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15311391#comment-15311391
 ] 

Chris Nauroth commented on PIG-4916:


Hello [~daijy].  Thank you for the patch.  +1 (non-binding) from me.  I agree 
that it isn't feasible to write a unit test for this.  We have confirmation 
from your manual testing that it works, though.

> Pig on Tez fail to remove temporary HDFS files in some cases
> 
>
> Key: PIG-4916
> URL: https://issues.apache.org/jira/browse/PIG-4916
> Project: Pig
>  Issue Type: Bug
>Reporter: Daniel Dai
>Assignee: Daniel Dai
> Fix For: 0.17.0, 0.16.1
>
> Attachments: PIG-4916-1.patch
>
>
> We saw the following stack trace when running Pig on S3:
> {code}
> 2016-06-01 22:04:22,714 [Thread-19] INFO  
> org.apache.hadoop.service.AbstractService - Service 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl failed in state 
> STOPPED; cause: java.io.IOException: Filesystem closed
> java.io.IOException: Filesystem closed
>   at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:808)
>   at 
> org.apache.hadoop.hdfs.DFSOutputStream.flushOrSync(DFSOutputStream.java:2034)
>   at 
> org.apache.hadoop.hdfs.DFSOutputStream.hflush(DFSOutputStream.java:1980)
>   at 
> org.apache.hadoop.fs.FSDataOutputStream.hflush(FSDataOutputStream.java:130)
>   at 
> org.apache.hadoop.yarn.client.api.impl.FileSystemTimelineWriter$LogFD.flush(FileSystemTimelineWriter.java:370)
>   at 
> org.apache.hadoop.yarn.client.api.impl.FileSystemTimelineWriter$LogFDsCache.flush(FileSystemTimelineWriter.java:485)
>   at 
> org.apache.hadoop.yarn.client.api.impl.FileSystemTimelineWriter.close(FileSystemTimelineWriter.java:271)
>   at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.serviceStop(TimelineClientImpl.java:326)
>   at 
> org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
>   at 
> org.apache.tez.dag.history.ats.acls.ATSV15HistoryACLPolicyManager.close(ATSV15HistoryACLPolicyManager.java:259)
>   at org.apache.tez.client.TezClient.stop(TezClient.java:582)
>   at 
> org.apache.pig.backend.hadoop.executionengine.tez.TezSessionManager.shutdown(TezSessionManager.java:308)
>   at 
> org.apache.pig.backend.hadoop.executionengine.tez.TezSessionManager$1.run(TezSessionManager.java:53)
> 2016-06-01 22:04:22,718 [Thread-19] ERROR 
> org.apache.pig.backend.hadoop.executionengine.tez.TezSessionManager - Error 
> shutting down Tez session org.apache.tez.client.TezClient@48bf833a
> org.apache.hadoop.service.ServiceStateException: java.io.IOException: 
> Filesystem closed
>   at 
> org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59)
>   at 
> org.apache.hadoop.service.AbstractService.stop(AbstractService.java:225)
>   at 
> org.apache.tez.dag.history.ats.acls.ATSV15HistoryACLPolicyManager.close(ATSV15HistoryACLPolicyManager.java:259)
>   at org.apache.tez.client.TezClient.stop(TezClient.java:582)
>   at 
> org.apache.pig.backend.hadoop.executionengine.tez.TezSessionManager.shutdown(TezSessionManager.java:308)
>   at 
> org.apache.pig.backend.hadoop.executionengine.tez.TezSessionManager$1.run(TezSessionManager.java:53)
> Caused by: java.io.IOException: Filesystem closed
>   at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:808)
>   at 
> org.apache.hadoop.hdfs.DFSOutputStream.flushOrSync(DFSOutputStream.java:2034)
>   at 
> org.apache.hadoop.hdfs.DFSOutputStream.hflush(DFSOutputStream.java:1980)
>   at 
> org.apache.hadoop.fs.FSDataOutputStream.hflush(FSDataOutputStream.java:130)
>   at 
> org.apache.hadoop.yarn.client.api.impl.FileSystemTimelineWriter$LogFD.flush(FileSystemTimelineWriter.java:370)
>   at 
> org.apache.hadoop.yarn.client.api.impl.FileSystemTimelineWriter$LogFDsCache.flush(FileSystemTimelineWriter.java:485)
>   at 
> org.apache.hadoop.yarn.client.api.impl.FileSystemTimelineWriter.close(FileSystemTimelineWriter.java:271)
>   at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.serviceStop(TimelineClientImpl.java:326)
>   at 
> org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
>   ... 4 more
> {code}
> The job runs successfully, but the temporary HDFS files are not removed.
> [~cnauroth] points out that FileSystem also uses a shutdown hook to close 
> FileSystem instances, and it might run before Pig's shutdown hook in Main. By 
> switching to Hadoop's ShutdownHookManager, we can impose an order on shutdown 
> hooks.
> This has been verified by testing the following code in Main:
> {code}
> ShutdownHookManager.get().addShutdownHook(new Runnable() {
> @Override
> public void run() {
> FileLocalizer.deleteTempResourceFiles();
> }
> }, priority);
> {code}

[jira] [Updated] (PIG-4918) Pig on Tez cannot switch pig.temp.dir to another fs

2016-06-01 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-4918:

Status: Patch Available  (was: Open)

I don't see an easy way to write a test, since reproducing this requires two 
different filesystems.

> Pig on Tez cannot switch pig.temp.dir to another fs
> ---
>
> Key: PIG-4918
> URL: https://issues.apache.org/jira/browse/PIG-4918
> Project: Pig
>  Issue Type: Bug
>Reporter: Daniel Dai
>Assignee: Daniel Dai
> Fix For: 0.17.0, 0.16.1
>
> Attachments: PIG-4918-1.patch
>
>
> If pig.temp.dir points to another fs, Pig fails. One such case is when 
> defaultFS is set to S3 but HDFS is used as the temp dir. Error message:
> {code}
> org.apache.pig.backend.hadoop.executionengine.JobCreationException: ERROR 
> 2017: Internal error creating job configuration.
> at 
> org.apache.pig.backend.hadoop.executionengine.tez.TezJobCompiler.getJob(TezJobCompiler.java:141)
> at 
> org.apache.pig.backend.hadoop.executionengine.tez.TezJobCompiler.compile(TezJobCompiler.java:79)
> at 
> org.apache.pig.backend.hadoop.executionengine.tez.TezLauncher.launchPig(TezLauncher.java:194)
> at 
> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.launchPig(HExecutionEngine.java:304)
> at org.apache.pig.PigServer.launchPlan(PigServer.java:1431)
> at 
> org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:1416)
> at org.apache.pig.PigServer.execute(PigServer.java:1405)
> at org.apache.pig.PigServer.executeBatch(PigServer.java:456)
> at org.apache.pig.PigServer.executeBatch(PigServer.java:439)
> at 
> org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:171)
> at 
> org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:234)
> at 
> org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:205)
> at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:81)
> at org.apache.pig.Main.run(Main.java:631)
> at org.apache.pig.Main.main(Main.java:177)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
> at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
> Caused by: java.lang.IllegalArgumentException: Wrong FS: 
> hdfs://pig-aws-devenv-5.openstacklocal:8020/tmp/daijy/temp-265134702/automaton-1.11-8.jar,
>  expected: s3a://pig-aws-devenv
> at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:658)
> at org.apache.hadoop.fs.FileSystem.makeQualified(FileSystem.java:478)
> at 
> org.apache.pig.backend.hadoop.executionengine.tez.TezResourceManager.addTezResource(TezResourceManager.java:82)
> at 
> org.apache.pig.backend.hadoop.executionengine.tez.TezResourceManager.addTezResources(TezResourceManager.java:106)
> at 
> org.apache.pig.backend.hadoop.executionengine.tez.plan.TezPlanContainer.getLocalResources(TezPlanContainer.java:107)
> at 
> org.apache.pig.backend.hadoop.executionengine.tez.TezJobCompiler.getJob(TezJobCompiler.java:95)
> ... 20 more
> {code}





[jira] [Updated] (PIG-4918) Pig on Tez cannot switch pig.temp.dir to another fs

2016-06-01 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-4918:

Attachment: PIG-4918-1.patch

> Pig on Tez cannot switch pig.temp.dir to another fs
> ---
>
> Key: PIG-4918
> URL: https://issues.apache.org/jira/browse/PIG-4918
> Project: Pig
>  Issue Type: Bug
>Reporter: Daniel Dai
>Assignee: Daniel Dai
> Fix For: 0.17.0, 0.16.1
>
> Attachments: PIG-4918-1.patch
>
>
> If pig.temp.dir points to another fs, Pig fails. One such case is when 
> defaultFS is set to S3 but HDFS is used as the temp dir. Error message:
> {code}
> org.apache.pig.backend.hadoop.executionengine.JobCreationException: ERROR 
> 2017: Internal error creating job configuration.
> at 
> org.apache.pig.backend.hadoop.executionengine.tez.TezJobCompiler.getJob(TezJobCompiler.java:141)
> at 
> org.apache.pig.backend.hadoop.executionengine.tez.TezJobCompiler.compile(TezJobCompiler.java:79)
> at 
> org.apache.pig.backend.hadoop.executionengine.tez.TezLauncher.launchPig(TezLauncher.java:194)
> at 
> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.launchPig(HExecutionEngine.java:304)
> at org.apache.pig.PigServer.launchPlan(PigServer.java:1431)
> at 
> org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:1416)
> at org.apache.pig.PigServer.execute(PigServer.java:1405)
> at org.apache.pig.PigServer.executeBatch(PigServer.java:456)
> at org.apache.pig.PigServer.executeBatch(PigServer.java:439)
> at 
> org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:171)
> at 
> org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:234)
> at 
> org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:205)
> at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:81)
> at org.apache.pig.Main.run(Main.java:631)
> at org.apache.pig.Main.main(Main.java:177)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
> at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
> Caused by: java.lang.IllegalArgumentException: Wrong FS: 
> hdfs://pig-aws-devenv-5.openstacklocal:8020/tmp/daijy/temp-265134702/automaton-1.11-8.jar,
>  expected: s3a://pig-aws-devenv
> at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:658)
> at org.apache.hadoop.fs.FileSystem.makeQualified(FileSystem.java:478)
> at 
> org.apache.pig.backend.hadoop.executionengine.tez.TezResourceManager.addTezResource(TezResourceManager.java:82)
> at 
> org.apache.pig.backend.hadoop.executionengine.tez.TezResourceManager.addTezResources(TezResourceManager.java:106)
> at 
> org.apache.pig.backend.hadoop.executionengine.tez.plan.TezPlanContainer.getLocalResources(TezPlanContainer.java:107)
> at 
> org.apache.pig.backend.hadoop.executionengine.tez.TezJobCompiler.getJob(TezJobCompiler.java:95)
> ... 20 more
> {code}





[jira] [Created] (PIG-4917) Pig on Tez cannot switch pig.temp.dir to another fs

2016-06-01 Thread Daniel Dai (JIRA)
Daniel Dai created PIG-4917:
---

 Summary: Pig on Tez cannot switch pig.temp.dir to another fs
 Key: PIG-4917
 URL: https://issues.apache.org/jira/browse/PIG-4917
 Project: Pig
  Issue Type: Bug
Reporter: Daniel Dai
Assignee: Daniel Dai
 Fix For: 0.17.0, 0.16.1


If pig.temp.dir points to another fs, Pig fails. One such case is when the 
defaultFS is set to S3 but HDFS is used as the temp dir. Error message:
{code}
org.apache.pig.backend.hadoop.executionengine.JobCreationException: ERROR 2017: 
Internal error creating job configuration.
at 
org.apache.pig.backend.hadoop.executionengine.tez.TezJobCompiler.getJob(TezJobCompiler.java:141)
at 
org.apache.pig.backend.hadoop.executionengine.tez.TezJobCompiler.compile(TezJobCompiler.java:79)
at 
org.apache.pig.backend.hadoop.executionengine.tez.TezLauncher.launchPig(TezLauncher.java:194)
at 
org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.launchPig(HExecutionEngine.java:304)
at org.apache.pig.PigServer.launchPlan(PigServer.java:1431)
at 
org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:1416)
at org.apache.pig.PigServer.execute(PigServer.java:1405)
at org.apache.pig.PigServer.executeBatch(PigServer.java:456)
at org.apache.pig.PigServer.executeBatch(PigServer.java:439)
at 
org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:171)
at 
org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:234)
at 
org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:205)
at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:81)
at org.apache.pig.Main.run(Main.java:631)
at org.apache.pig.Main.main(Main.java:177)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
Caused by: java.lang.IllegalArgumentException: Wrong FS: 
hdfs://pig-aws-devenv-5.openstacklocal:8020/tmp/daijy/temp-265134702/automaton-1.11-8.jar,
 expected: s3a://pig-aws-devenv
at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:658)
at org.apache.hadoop.fs.FileSystem.makeQualified(FileSystem.java:478)
at 
org.apache.pig.backend.hadoop.executionengine.tez.TezResourceManager.addTezResource(TezResourceManager.java:82)
at 
org.apache.pig.backend.hadoop.executionengine.tez.TezResourceManager.addTezResources(TezResourceManager.java:106)
at 
org.apache.pig.backend.hadoop.executionengine.tez.plan.TezPlanContainer.getLocalResources(TezPlanContainer.java:107)
at 
org.apache.pig.backend.hadoop.executionengine.tez.TezJobCompiler.getJob(TezJobCompiler.java:95)
... 20 more
{code}





[jira] [Created] (PIG-4918) Pig on Tez cannot switch pig.temp.dir to another fs

2016-06-01 Thread Daniel Dai (JIRA)
Daniel Dai created PIG-4918:
---

 Summary: Pig on Tez cannot switch pig.temp.dir to another fs
 Key: PIG-4918
 URL: https://issues.apache.org/jira/browse/PIG-4918
 Project: Pig
  Issue Type: Bug
Reporter: Daniel Dai
Assignee: Daniel Dai
 Fix For: 0.17.0, 0.16.1


If pig.temp.dir points to another fs, Pig fails. One such case is when the 
defaultFS is set to S3 but HDFS is used as the temp dir. Error message:
{code}
org.apache.pig.backend.hadoop.executionengine.JobCreationException: ERROR 2017: 
Internal error creating job configuration.
at 
org.apache.pig.backend.hadoop.executionengine.tez.TezJobCompiler.getJob(TezJobCompiler.java:141)
at 
org.apache.pig.backend.hadoop.executionengine.tez.TezJobCompiler.compile(TezJobCompiler.java:79)
at 
org.apache.pig.backend.hadoop.executionengine.tez.TezLauncher.launchPig(TezLauncher.java:194)
at 
org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.launchPig(HExecutionEngine.java:304)
at org.apache.pig.PigServer.launchPlan(PigServer.java:1431)
at 
org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:1416)
at org.apache.pig.PigServer.execute(PigServer.java:1405)
at org.apache.pig.PigServer.executeBatch(PigServer.java:456)
at org.apache.pig.PigServer.executeBatch(PigServer.java:439)
at 
org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:171)
at 
org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:234)
at 
org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:205)
at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:81)
at org.apache.pig.Main.run(Main.java:631)
at org.apache.pig.Main.main(Main.java:177)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
Caused by: java.lang.IllegalArgumentException: Wrong FS: 
hdfs://pig-aws-devenv-5.openstacklocal:8020/tmp/daijy/temp-265134702/automaton-1.11-8.jar,
 expected: s3a://pig-aws-devenv
at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:658)
at org.apache.hadoop.fs.FileSystem.makeQualified(FileSystem.java:478)
at 
org.apache.pig.backend.hadoop.executionengine.tez.TezResourceManager.addTezResource(TezResourceManager.java:82)
at 
org.apache.pig.backend.hadoop.executionengine.tez.TezResourceManager.addTezResources(TezResourceManager.java:106)
at 
org.apache.pig.backend.hadoop.executionengine.tez.plan.TezPlanContainer.getLocalResources(TezPlanContainer.java:107)
at 
org.apache.pig.backend.hadoop.executionengine.tez.TezJobCompiler.getJob(TezJobCompiler.java:95)
... 20 more
{code}





[jira] [Updated] (PIG-4916) Pig on Tez fail to remove temporary HDFS files in some cases

2016-06-01 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-4916:

Status: Patch Available  (was: Open)

I don't see a way to write a test case since this is non-deterministic in 
nature.

> Pig on Tez fail to remove temporary HDFS files in some cases
> 
>
> Key: PIG-4916
> URL: https://issues.apache.org/jira/browse/PIG-4916
> Project: Pig
>  Issue Type: Bug
>Reporter: Daniel Dai
>Assignee: Daniel Dai
> Fix For: 0.17.0, 0.16.1
>
> Attachments: PIG-4916-1.patch
>
>
> We saw the following stack trace when running Pig on S3:
> {code}
> 2016-06-01 22:04:22,714 [Thread-19] INFO  
> org.apache.hadoop.service.AbstractService - Service 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl failed in state 
> STOPPED; cause: java.io.IOException: Filesystem closed
> java.io.IOException: Filesystem closed
>   at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:808)
>   at 
> org.apache.hadoop.hdfs.DFSOutputStream.flushOrSync(DFSOutputStream.java:2034)
>   at 
> org.apache.hadoop.hdfs.DFSOutputStream.hflush(DFSOutputStream.java:1980)
>   at 
> org.apache.hadoop.fs.FSDataOutputStream.hflush(FSDataOutputStream.java:130)
>   at 
> org.apache.hadoop.yarn.client.api.impl.FileSystemTimelineWriter$LogFD.flush(FileSystemTimelineWriter.java:370)
>   at 
> org.apache.hadoop.yarn.client.api.impl.FileSystemTimelineWriter$LogFDsCache.flush(FileSystemTimelineWriter.java:485)
>   at 
> org.apache.hadoop.yarn.client.api.impl.FileSystemTimelineWriter.close(FileSystemTimelineWriter.java:271)
>   at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.serviceStop(TimelineClientImpl.java:326)
>   at 
> org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
>   at 
> org.apache.tez.dag.history.ats.acls.ATSV15HistoryACLPolicyManager.close(ATSV15HistoryACLPolicyManager.java:259)
>   at org.apache.tez.client.TezClient.stop(TezClient.java:582)
>   at 
> org.apache.pig.backend.hadoop.executionengine.tez.TezSessionManager.shutdown(TezSessionManager.java:308)
>   at 
> org.apache.pig.backend.hadoop.executionengine.tez.TezSessionManager$1.run(TezSessionManager.java:53)
> 2016-06-01 22:04:22,718 [Thread-19] ERROR 
> org.apache.pig.backend.hadoop.executionengine.tez.TezSessionManager - Error 
> shutting down Tez session org.apache.tez.client.TezClient@48bf833a
> org.apache.hadoop.service.ServiceStateException: java.io.IOException: 
> Filesystem closed
>   at 
> org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59)
>   at 
> org.apache.hadoop.service.AbstractService.stop(AbstractService.java:225)
>   at 
> org.apache.tez.dag.history.ats.acls.ATSV15HistoryACLPolicyManager.close(ATSV15HistoryACLPolicyManager.java:259)
>   at org.apache.tez.client.TezClient.stop(TezClient.java:582)
>   at 
> org.apache.pig.backend.hadoop.executionengine.tez.TezSessionManager.shutdown(TezSessionManager.java:308)
>   at 
> org.apache.pig.backend.hadoop.executionengine.tez.TezSessionManager$1.run(TezSessionManager.java:53)
> Caused by: java.io.IOException: Filesystem closed
>   at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:808)
>   at 
> org.apache.hadoop.hdfs.DFSOutputStream.flushOrSync(DFSOutputStream.java:2034)
>   at 
> org.apache.hadoop.hdfs.DFSOutputStream.hflush(DFSOutputStream.java:1980)
>   at 
> org.apache.hadoop.fs.FSDataOutputStream.hflush(FSDataOutputStream.java:130)
>   at 
> org.apache.hadoop.yarn.client.api.impl.FileSystemTimelineWriter$LogFD.flush(FileSystemTimelineWriter.java:370)
>   at 
> org.apache.hadoop.yarn.client.api.impl.FileSystemTimelineWriter$LogFDsCache.flush(FileSystemTimelineWriter.java:485)
>   at 
> org.apache.hadoop.yarn.client.api.impl.FileSystemTimelineWriter.close(FileSystemTimelineWriter.java:271)
>   at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.serviceStop(TimelineClientImpl.java:326)
>   at 
> org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
>   ... 4 more
> {code}
> The job runs successfully, but the temporary HDFS files are not removed.
> [~cnauroth] points out that FileSystem also uses a shutdown hook to close 
> FileSystem instances, and it might run before Pig's shutdown hook in Main. By 
> switching to Hadoop's ShutdownHookManager, we can impose an order on shutdown 
> hooks.
> This has been verified by testing the following code in Main:
> {code}
> ShutdownHookManager.get().addShutdownHook(new Runnable() {
> @Override
> public void run() {
> FileLocalizer.deleteTempResourceFiles();
> }
> }, priority);
> {code}
> Notice FileSystem.SHUTDOWN_HOOK_PRIORITY=10. Whe

[jira] [Updated] (PIG-4916) Pig on Tez fail to remove temporary HDFS files in some cases

2016-06-01 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-4916:

Attachment: PIG-4916-1.patch

> Pig on Tez fail to remove temporary HDFS files in some cases
> 
>
> Key: PIG-4916
> URL: https://issues.apache.org/jira/browse/PIG-4916
> Project: Pig
>  Issue Type: Bug
>Reporter: Daniel Dai
>Assignee: Daniel Dai
> Fix For: 0.17.0, 0.16.1
>
> Attachments: PIG-4916-1.patch
>
>
> We saw the following stack trace when running Pig on S3:
> {code}
> 2016-06-01 22:04:22,714 [Thread-19] INFO  
> org.apache.hadoop.service.AbstractService - Service 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl failed in state 
> STOPPED; cause: java.io.IOException: Filesystem closed
> java.io.IOException: Filesystem closed
>   at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:808)
>   at 
> org.apache.hadoop.hdfs.DFSOutputStream.flushOrSync(DFSOutputStream.java:2034)
>   at 
> org.apache.hadoop.hdfs.DFSOutputStream.hflush(DFSOutputStream.java:1980)
>   at 
> org.apache.hadoop.fs.FSDataOutputStream.hflush(FSDataOutputStream.java:130)
>   at 
> org.apache.hadoop.yarn.client.api.impl.FileSystemTimelineWriter$LogFD.flush(FileSystemTimelineWriter.java:370)
>   at 
> org.apache.hadoop.yarn.client.api.impl.FileSystemTimelineWriter$LogFDsCache.flush(FileSystemTimelineWriter.java:485)
>   at 
> org.apache.hadoop.yarn.client.api.impl.FileSystemTimelineWriter.close(FileSystemTimelineWriter.java:271)
>   at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.serviceStop(TimelineClientImpl.java:326)
>   at 
> org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
>   at 
> org.apache.tez.dag.history.ats.acls.ATSV15HistoryACLPolicyManager.close(ATSV15HistoryACLPolicyManager.java:259)
>   at org.apache.tez.client.TezClient.stop(TezClient.java:582)
>   at 
> org.apache.pig.backend.hadoop.executionengine.tez.TezSessionManager.shutdown(TezSessionManager.java:308)
>   at 
> org.apache.pig.backend.hadoop.executionengine.tez.TezSessionManager$1.run(TezSessionManager.java:53)
> 2016-06-01 22:04:22,718 [Thread-19] ERROR 
> org.apache.pig.backend.hadoop.executionengine.tez.TezSessionManager - Error 
> shutting down Tez session org.apache.tez.client.TezClient@48bf833a
> org.apache.hadoop.service.ServiceStateException: java.io.IOException: 
> Filesystem closed
>   at 
> org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59)
>   at 
> org.apache.hadoop.service.AbstractService.stop(AbstractService.java:225)
>   at 
> org.apache.tez.dag.history.ats.acls.ATSV15HistoryACLPolicyManager.close(ATSV15HistoryACLPolicyManager.java:259)
>   at org.apache.tez.client.TezClient.stop(TezClient.java:582)
>   at 
> org.apache.pig.backend.hadoop.executionengine.tez.TezSessionManager.shutdown(TezSessionManager.java:308)
>   at 
> org.apache.pig.backend.hadoop.executionengine.tez.TezSessionManager$1.run(TezSessionManager.java:53)
> Caused by: java.io.IOException: Filesystem closed
>   at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:808)
>   at 
> org.apache.hadoop.hdfs.DFSOutputStream.flushOrSync(DFSOutputStream.java:2034)
>   at 
> org.apache.hadoop.hdfs.DFSOutputStream.hflush(DFSOutputStream.java:1980)
>   at 
> org.apache.hadoop.fs.FSDataOutputStream.hflush(FSDataOutputStream.java:130)
>   at 
> org.apache.hadoop.yarn.client.api.impl.FileSystemTimelineWriter$LogFD.flush(FileSystemTimelineWriter.java:370)
>   at 
> org.apache.hadoop.yarn.client.api.impl.FileSystemTimelineWriter$LogFDsCache.flush(FileSystemTimelineWriter.java:485)
>   at 
> org.apache.hadoop.yarn.client.api.impl.FileSystemTimelineWriter.close(FileSystemTimelineWriter.java:271)
>   at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.serviceStop(TimelineClientImpl.java:326)
>   at 
> org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
>   ... 4 more
> {code}
> The job runs successfully, but the temporary HDFS files are not removed.
> [~cnauroth] points out that FileSystem also uses a shutdown hook to close 
> FileSystem instances, and it might run before Pig's shutdown hook in Main. By 
> switching to Hadoop's ShutdownHookManager, we can impose an order on the 
> shutdown hooks.
> This has been verified by testing the following code in Main:
> {code}
> ShutdownHookManager.get().addShutdownHook(new Runnable() {
> @Override
> public void run() {
> FileLocalizer.deleteTempResourceFiles();
> }
> }, priority);
> {code}
> Notice FileSystem.SHUTDOWN_HOOK_PRIORITY=10. When priority=9, Pig fails; when 
> priority=11, Pig succeeds.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Created] (PIG-4916) Pig on Tez fail to remove temporary HDFS files in some cases

2016-06-01 Thread Daniel Dai (JIRA)
Daniel Dai created PIG-4916:
---

 Summary: Pig on Tez fail to remove temporary HDFS files in some 
cases
 Key: PIG-4916
 URL: https://issues.apache.org/jira/browse/PIG-4916
 Project: Pig
  Issue Type: Bug
Reporter: Daniel Dai
Assignee: Daniel Dai
 Fix For: 0.16.1, 0.17.0


We saw the following stack trace when running Pig on S3:
{code}
2016-06-01 22:04:22,714 [Thread-19] INFO  
org.apache.hadoop.service.AbstractService - Service 
org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl failed in state 
STOPPED; cause: java.io.IOException: Filesystem closed
java.io.IOException: Filesystem closed
at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:808)
at 
org.apache.hadoop.hdfs.DFSOutputStream.flushOrSync(DFSOutputStream.java:2034)
at 
org.apache.hadoop.hdfs.DFSOutputStream.hflush(DFSOutputStream.java:1980)
at 
org.apache.hadoop.fs.FSDataOutputStream.hflush(FSDataOutputStream.java:130)
at 
org.apache.hadoop.yarn.client.api.impl.FileSystemTimelineWriter$LogFD.flush(FileSystemTimelineWriter.java:370)
at 
org.apache.hadoop.yarn.client.api.impl.FileSystemTimelineWriter$LogFDsCache.flush(FileSystemTimelineWriter.java:485)
at 
org.apache.hadoop.yarn.client.api.impl.FileSystemTimelineWriter.close(FileSystemTimelineWriter.java:271)
at 
org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.serviceStop(TimelineClientImpl.java:326)
at 
org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
at 
org.apache.tez.dag.history.ats.acls.ATSV15HistoryACLPolicyManager.close(ATSV15HistoryACLPolicyManager.java:259)
at org.apache.tez.client.TezClient.stop(TezClient.java:582)
at 
org.apache.pig.backend.hadoop.executionengine.tez.TezSessionManager.shutdown(TezSessionManager.java:308)
at 
org.apache.pig.backend.hadoop.executionengine.tez.TezSessionManager$1.run(TezSessionManager.java:53)
2016-06-01 22:04:22,718 [Thread-19] ERROR 
org.apache.pig.backend.hadoop.executionengine.tez.TezSessionManager - Error 
shutting down Tez session org.apache.tez.client.TezClient@48bf833a
org.apache.hadoop.service.ServiceStateException: java.io.IOException: 
Filesystem closed
at 
org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59)
at 
org.apache.hadoop.service.AbstractService.stop(AbstractService.java:225)
at 
org.apache.tez.dag.history.ats.acls.ATSV15HistoryACLPolicyManager.close(ATSV15HistoryACLPolicyManager.java:259)
at org.apache.tez.client.TezClient.stop(TezClient.java:582)
at 
org.apache.pig.backend.hadoop.executionengine.tez.TezSessionManager.shutdown(TezSessionManager.java:308)
at 
org.apache.pig.backend.hadoop.executionengine.tez.TezSessionManager$1.run(TezSessionManager.java:53)
Caused by: java.io.IOException: Filesystem closed
at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:808)
at 
org.apache.hadoop.hdfs.DFSOutputStream.flushOrSync(DFSOutputStream.java:2034)
at 
org.apache.hadoop.hdfs.DFSOutputStream.hflush(DFSOutputStream.java:1980)
at 
org.apache.hadoop.fs.FSDataOutputStream.hflush(FSDataOutputStream.java:130)
at 
org.apache.hadoop.yarn.client.api.impl.FileSystemTimelineWriter$LogFD.flush(FileSystemTimelineWriter.java:370)
at 
org.apache.hadoop.yarn.client.api.impl.FileSystemTimelineWriter$LogFDsCache.flush(FileSystemTimelineWriter.java:485)
at 
org.apache.hadoop.yarn.client.api.impl.FileSystemTimelineWriter.close(FileSystemTimelineWriter.java:271)
at 
org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.serviceStop(TimelineClientImpl.java:326)
at 
org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
... 4 more
{code}
The job runs successfully, but the temporary HDFS files are not removed.

[~cnauroth] points out that FileSystem also uses a shutdown hook to close 
FileSystem instances, and it might run before Pig's shutdown hook in Main. By 
switching to Hadoop's ShutdownHookManager, we can impose an order on the 
shutdown hooks.

This has been verified by testing the following code in Main:
{code}
ShutdownHookManager.get().addShutdownHook(new Runnable() {
@Override
public void run() {
FileLocalizer.deleteTempResourceFiles();
}
}, priority);
{code}

Notice FileSystem.SHUTDOWN_HOOK_PRIORITY=10. When priority=9, Pig fails; when 
priority=11, Pig succeeds.
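The ordering the patch relies on can be illustrated with a small self-contained sketch (plain Java, no Hadoop dependency; the class and hook names here are hypothetical stand-ins): in Hadoop's ShutdownHookManager, hooks with a higher priority run first, so registering Pig's temp-file cleanup at a priority above FileSystem.SHUTDOWN_HOOK_PRIORITY (10) runs it before the FileSystem instances are closed.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class HookOrdering {
    // Minimal stand-in for Hadoop's ShutdownHookManager: each hook carries a
    // priority, and higher-priority hooks run first at shutdown.
    static class Hook {
        final String name;
        final int priority;
        Hook(String name, int priority) { this.name = name; this.priority = priority; }
    }

    // Return the order in which the hooks would execute (higher priority first)
    static List<String> runOrder(List<Hook> hooks) {
        List<Hook> sorted = new ArrayList<>(hooks);
        sorted.sort(Comparator.comparingInt((Hook h) -> h.priority).reversed());
        List<String> order = new ArrayList<>();
        for (Hook h : sorted) {
            order.add(h.name);
        }
        return order;
    }

    public static void main(String[] args) {
        List<Hook> hooks = new ArrayList<>();
        // FileSystem.SHUTDOWN_HOOK_PRIORITY is 10 in Hadoop
        hooks.add(new Hook("closeFileSystems", 10));
        // Pig's cleanup needs a priority greater than 10 (e.g. 11) so it
        // runs while the FileSystem instances are still open
        hooks.add(new Hook("deleteTempResourceFiles", 11));
        System.out.println(runOrder(hooks));
        // → [deleteTempResourceFiles, closeFileSystems]
    }
}
```

This mirrors why priority=9 fails (cleanup runs after the FileSystems are closed) and priority=11 succeeds.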






[jira] [Commented] (PIG-2315) Make as clause work in generate

2016-06-01 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-2315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15311296#comment-15311296
 ] 

Daniel Dai commented on PIG-2315:
-

+1.

Also note there is a performance regression in some cases. For example:
{code}
crawl = load 'webcrawl' as (url, pageid);
extracted = foreach crawl generate flatten(REGEX_EXTRACT_ALL(url, 
'(http|https)://(.*?)/(.*)')) as (protocol:chararray, host:chararray, 
path:chararray);
{code}

Here the user is just trying to give Pig additional type information, since 
REGEX_EXTRACT_ALL does not declare the types inside its output tuple; no cast 
is intended. With this change, Pig forces a cast and there is no way to avoid 
it. The performance hit should be small, and I believe it is worth it to 
clarify the syntax.

> Make as clause work in generate
> ---
>
> Key: PIG-2315
> URL: https://issues.apache.org/jira/browse/PIG-2315
> Project: Pig
>  Issue Type: Bug
>Reporter: Olga Natkovich
>Assignee: Daniel Dai
> Fix For: 0.17.0
>
> Attachments: PIG-2315-1-rebase.patch, PIG-2315-1.patch, 
> PIG-2315-1.patch, pig-2315-2-after-rebase.patch, pig-2315-3-merged.patch
>
>
> Currently, the following syntax is supported but ignored, causing confusion 
> for users:
> A1 = foreach A1 generate a as a:chararray ;
> After this statement, a just retains its previous type.





[jira] [Commented] (PIG-4903) Avoid add all spark dependency jars to SPARK_YARN_DIST_FILES and SPARK_DIST_CLASSPATH

2016-06-01 Thread Rohini Palaniswamy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15311028#comment-15311028
 ] 

Rohini Palaniswamy commented on PIG-4903:
-

Please refer to my comment in 

https://issues.apache.org/jira/browse/PIG-4893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15311013#comment-15311013

You should avoid shipping Spark jars altogether, similar to MapReduce and Tez. 
Please take the HDFS location of the Spark jar(s) as input from the user. This 
will be an extra setup step for the user, but that should be fine, as the same 
thing is already required for Tez and MapReduce. For MapReduce, the tarball 
approach is required for smooth rolling upgrades. The older method of using the 
MapReduce installation on the node manager also works, and most still do that, 
but the recommended approach with YARN is to use HDFS tarballs for the 
application master dependencies.

> Avoid add all spark dependency jars to  SPARK_YARN_DIST_FILES and 
> SPARK_DIST_CLASSPATH
> --
>
> Key: PIG-4903
> URL: https://issues.apache.org/jira/browse/PIG-4903
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: liyunzhang_intel
> Attachments: PIG-4903.patch
>
>
> There are some comments about bin/pig on 
> https://reviews.apache.org/r/45667/#comment198955.
> {code}
> # ADDING SPARK DEPENDENCIES ##
> # Spark typically works with a single assembly file. However this
> # assembly isn't available as a artifact to pull in via ivy.
> # To work around this short coming, we add all the jars barring
> # spark-yarn to DIST through dist-files and then add them to classpath
> # of the executors through an independent env variable. The reason
> # for excluding spark-yarn is because spark-yarn is already being added
> # by the spark-yarn-client via jarOf(Client.Class)
> for f in $PIG_HOME/lib/*.jar; do
> if [[ $f == $PIG_HOME/lib/spark-assembly* ]]; then
> # Exclude spark-assembly.jar from shipped jars, but retain it in the classpath
> SPARK_JARS=${SPARK_JARS}:$f;
> else
> SPARK_JARS=${SPARK_JARS}:$f;
> SPARK_YARN_DIST_FILES=${SPARK_YARN_DIST_FILES},file://$f;
> SPARK_DIST_CLASSPATH=${SPARK_DIST_CLASSPATH}:\${PWD}/`basename $f`
> fi
> done
> CLASSPATH=${CLASSPATH}:${SPARK_JARS}
> export SPARK_YARN_DIST_FILES=`echo ${SPARK_YARN_DIST_FILES} | sed 's/^,//g'`
> export SPARK_JARS=${SPARK_YARN_DIST_FILES}
> export SPARK_DIST_CLASSPATH
> {code}
> Here we first copy all Spark dependency jars, like 
> spark-network-shuffle_2.10-1.6.1.jar, to the distributed cache 
> (SPARK_YARN_DIST_FILES) and then add them to the classpath of the executor 
> (SPARK_DIST_CLASSPATH). Actually we need not copy all these dependency jars 
> to SPARK_DIST_CLASSPATH, because they are all included in spark-assembly.jar, 
> and spark-assembly.jar is uploaded with the Spark job.





Jenkins build is still unstable: Pig-trunk-commit #2339

2016-06-01 Thread Apache Jenkins Server
See 



[jira] [Commented] (PIG-4893) Task deserialization time is too long for spark on yarn mode

2016-06-01 Thread Rohini Palaniswamy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15311013#comment-15311013
 ] 

Rohini Palaniswamy commented on PIG-4893:
-

You should put the Spark jars in a global HDFS location (similar to the 
MapReduce and Tez tarballs) and reference that instead of shipping them every 
time. This ensures they are downloaded to a node only once, no matter how many 
times different users run scripts.

You should not be shipping everything under the lib directory. Refer to the 
MapReduce and Tez distributed cache setup code. Only the default essential jars 
- JarManager.getDefaultJars() - are shipped. The jython and jruby jars are 
added by their ScriptEngines if they are part of the script. The rest come from 
the MapReduce (mapreduce.application.framework.path) and Tez (tez.lib.uris) 
tarballs in HDFS.

> Task deserialization time is too long for spark on yarn mode
> 
>
> Key: PIG-4893
> URL: https://issues.apache.org/jira/browse/PIG-4893
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: liyunzhang_intel
> Fix For: spark-branch
>
> Attachments: time.PNG
>
>
> I found the task deserialization time is a bit long when I run any of the 
> PigMix scripts in Spark on YARN mode; see the attached picture. The duration 
> is 3s but the task deserialization is 13s.
> My env is hadoop2.6+spark1.6.





[jira] [Created] (PIG-4915) Eliminate duplicate split calculation for Order by and Skewed Join

2016-06-01 Thread Rohini Palaniswamy (JIRA)
Rohini Palaniswamy created PIG-4915:
---

 Summary: Eliminate duplicate split calculation for Order by and 
Skewed Join
 Key: PIG-4915
 URL: https://issues.apache.org/jira/browse/PIG-4915
 Project: Pig
  Issue Type: Sub-task
Reporter: Rohini Palaniswamy


  Currently we calculate splits and do combining of splits twice - once for the 
Sampler Vertex and once for Partitioner Vertex for the same LOAD statement ( 
case of no 1-1 edge). 





Re: please unsubscribe

2016-06-01 Thread Alan Gates
To unsubscribe send email to dev-unsubscr...@pig.apache.org

Alan.

> On Jun 1, 2016, at 07:40, asser dennis  wrote:
> 
> 



please unsubscribe

2016-06-01 Thread asser dennis



[jira] [Resolved] (PIG-4898) Fix unit test failure after PIG-4771's patch was checked in

2016-06-01 Thread Xuefu Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuefu Zhang resolved PIG-4898.
--
Resolution: Fixed

Committed to Spark branch. Thanks, Liyun!

> Fix unit test failure after PIG-4771's patch was checked in
> ---
>
> Key: PIG-4898
> URL: https://issues.apache.org/jira/browse/PIG-4898
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
> Fix For: spark-branch
>
> Attachments: PIG-4898.patch
>
>
> Now in the [latest jenkins|https://builds.apache.org/job/Pig-spark/#328], it 
> shows that the following unit test cases fail:
>  org.apache.pig.test.TestFRJoin.testDistinctFRJoin
>  org.apache.pig.test.TestPigRunner.simpleMultiQueryTest3





[jira] [Commented] (PIG-4893) Task deserialization time is too long for spark on yarn mode

2016-06-01 Thread Pallavi Rao (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15310108#comment-15310108
 ] 

Pallavi Rao commented on PIG-4893:
--

+1 for addressing this. 
When I noticed this problem, one way I thought we could solve it was by:
1. Excluding certain jars we know for certain are not needed.
2. Providing an option for the user to specify an environment variable 
containing the list of jars that need to be loaded into the distributed cache. 
We should default to our own list if this env variable is not specified. 
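The override-with-default idea in option 2 can be sketched as follows (plain Java; the environment-variable name and the default jar list below are made up for illustration and are not actual Pig configuration):

```java
import java.util.Arrays;
import java.util.List;

public class ShipJarList {
    // Hypothetical default list of essential jars to ship
    static final List<String> DEFAULT_JARS =
            Arrays.asList("pig-core.jar", "antlr-runtime.jar", "joda-time.jar");

    // If the user set the env variable (comma-separated jar list), use it;
    // otherwise fall back to the default list.
    static List<String> jarsToShip(String envValue) {
        if (envValue == null || envValue.trim().isEmpty()) {
            return DEFAULT_JARS;
        }
        return Arrays.asList(envValue.split(","));
    }

    public static void main(String[] args) {
        // PIG_SPARK_SHIP_JARS is a hypothetical variable name
        System.out.println(jarsToShip(System.getenv("PIG_SPARK_SHIP_JARS")));
    }
}
```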

> Task deserialization time is too long for spark on yarn mode
> 
>
> Key: PIG-4893
> URL: https://issues.apache.org/jira/browse/PIG-4893
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: liyunzhang_intel
> Fix For: spark-branch
>
> Attachments: time.PNG
>
>
> I found the task deserialization time is a bit long when I run any of the 
> PigMix scripts in Spark on YARN mode; see the attached picture. The duration 
> is 3s but the task deserialization is 13s.
> My env is hadoop2.6+spark1.6.





Build failed in Jenkins: Pig-trunk #1917

2016-06-01 Thread Apache Jenkins Server
See 

Changes:

[daijy] PIG-4719: Documentation for PIG-4704: Customizable Error Handling for 
Storers in Pig

[rohini] PIG-4821: Pig chararray field with special UTF-8 chars as part of 
tuple join key produces wrong results in Tez (rohini)

--
[...truncated 59 lines...]
[ivy:configure] :: Ivy 2.2.0 - 20100923230623 :: http://ant.apache.org/ivy/ ::
[ivy:configure] :: loading settings :: file = 


ivy-resolve:

ivy-compile:
[ivy:cachepath] DEPRECATED: 'ivy.conf.file' is deprecated, use 
'ivy.settings.file' instead
[ivy:cachepath] :: loading settings :: file = 


init:
[mkdir] Created dir: 

[mkdir] Created dir: 

[mkdir] Created dir: 

[mkdir] Created dir: 

[mkdir] Created dir: 

[mkdir] Created dir: 

[mkdir] Created dir: 

 [move] Moving 1 file to 


cc-compile:
   [javacc] Java Compiler Compiler Version 4.2 (Parser Generator)
   [javacc] (type "javacc" with no arguments for help)
   [javacc] Reading from file 

 . . .
   [javacc] File "TokenMgrError.java" does not exist.  Will create one.
   [javacc] File "ParseException.java" does not exist.  Will create one.
   [javacc] File "Token.java" does not exist.  Will create one.
   [javacc] File "JavaCharStream.java" does not exist.  Will create one.
   [javacc] Parser generated successfully.
   [javacc] Java Compiler Compiler Version 4.2 (Parser Generator)
   [javacc] (type "javacc" with no arguments for help)
   [javacc] Reading from file 

 . . .
   [javacc] Warning: Lookahead adequacy checking not being performed since 
option LOOKAHEAD is more than 1.  Set option FORCE_LA_CHECK to true to force 
checking.
   [javacc] File "TokenMgrError.java" does not exist.  Will create one.
   [javacc] File "ParseException.java" does not exist.  Will create one.
   [javacc] File "Token.java" does not exist.  Will create one.
   [javacc] File "JavaCharStream.java" does not exist.  Will create one.
   [javacc] Parser generated with 0 errors and 1 warnings.
   [javacc] Java Compiler Compiler Version 4.2 (Parser Generator)
   [javacc] (type "javacc" with no arguments for help)
   [javacc] Reading from file 

 . . .
   [javacc] File "TokenMgrError.java" is being rebuilt.
   [javacc] File "ParseException.java" is being rebuilt.
   [javacc] File "Token.java" is being rebuilt.
   [javacc] File "JavaCharStream.java" is being rebuilt.
   [javacc] Parser generated successfully.
   [jjtree] Java Compiler Compiler Version 4.2 (Tree Builder)
   [jjtree] (type "jjtree" with no arguments for help)
   [jjtree] Reading from file 

 . . .
   [jjtree] File "Node.java" does not exist.  Will create one.
   [jjtree] File "SimpleNode.java" does not exist.  Will create one.
   [jjtree] File "DOTParserTreeConstants.java" does not exist.  Will create one.
   [jjtree] File "JJTDOTParserState.java" does not exist.  Will create one.
   [jjtree] Annotated grammar generated successfully in 

   [javacc] Java Compiler Compiler Version 4.2 (Parser Generator)
   [javacc] (type "javacc" with no arguments for help)
   [javacc] Reading from file 

 . . .
   [javacc] File "TokenMgrError.java" does not exist.  Will create one.
   [javacc] File "ParseException.java" does not exist.  Will create one.
   [javacc] File "Token.java" does not exist.  Will create one.
   [javacc] File "SimpleCharStream.java" does not exist.  Will create one.
   [javacc] Parser generated successfully.

prepare:
[mkdir] Created dir: 


genLexer:

genParser:

genTreeParser:

gen:

compile:
 [echo] *** Build

[jira] [Commented] (PIG-4893) Task deserialization time is too long for spark on yarn mode

2016-06-01 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15309940#comment-15309940
 ] 

liyunzhang_intel commented on PIG-4893:
---

Here is a summary of why the task deserialization time is too long:
We add all dependency jars under $PIG_HOME/lib/ and $PIG_HOME/lib/spark/ to 
$SPARK_JARS, and Spark ships all of them to the Hadoop distributed cache. The 
YARN container then downloads all of these jars when deserializing a job 
([org.apache.spark.executor.Executor#updateDependencies|https://github.com/apache/spark/blob/d6dc12ef0146ae409834c78737c116050961f350/core/src/main/scala/org/apache/spark/executor/Executor.scala#L424]).

After removing some big dependencies from $PIG_HOME/lib/ (such as 
jython-standalone-2.5.3.jar, jruby-complete-1.6.7.jar, and so on; we don't need 
these jars when running a simple Pig script), the deserialization time dropped 
from 12s to 4s. So do we need to ship all the jars under $PIG_HOME/lib/* every 
time, even though some of them are actually not needed? 
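The exclusion idea can be sketched like this (plain Java; the exclusion prefixes are hypothetical examples based on the jars mentioned above, not a vetted list):

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class JarFilter {
    // Hypothetical prefixes of jars not needed for a simple script
    static final List<String> EXCLUDE_PREFIXES =
            Arrays.asList("jython-standalone-", "jruby-complete-");

    // Keep only jars whose file name does not start with an excluded prefix
    static List<String> filter(List<String> jars) {
        return jars.stream()
                .filter(j -> EXCLUDE_PREFIXES.stream().noneMatch(j::startsWith))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> libs = Arrays.asList(
                "pig-0.16.0-core.jar",
                "jython-standalone-2.5.3.jar",
                "jruby-complete-1.6.7.jar");
        System.out.println(filter(libs));
        // → [pig-0.16.0-core.jar]
    }
}
```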



> Task deserialization time is too long for spark on yarn mode
> 
>
> Key: PIG-4893
> URL: https://issues.apache.org/jira/browse/PIG-4893
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: liyunzhang_intel
> Fix For: spark-branch
>
> Attachments: time.PNG
>
>
> I found the task deserialization time is a bit long when I run any of the 
> PigMix scripts in Spark on YARN mode; see the attached picture. The duration 
> is 3s but the task deserialization is 13s.
> My env is hadoop2.6+spark1.6.





Jenkins build is still unstable: Pig-trunk-commit #2338

2016-06-01 Thread Apache Jenkins Server
See 



unsuscribe

2016-06-01 Thread GUILLERMO GARRIDO YUSTE
unsuscribe