[jira] [Commented] (PIG-4903) Avoid add all spark dependency jars to SPARK_YARN_DIST_FILES and SPARK_DIST_CLASSPATH
[ https://issues.apache.org/jira/browse/PIG-4903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15303400#comment-15303400 ] Srikanth Sundarrajan commented on PIG-4903:
---
If spark-assembly is present, then nothing else is needed. However, if spark-assembly isn't available and you need to materialize the dependencies through all direct and transitive dependencies of spark-core and spark-yarn, I think you will need all of those dependencies on the YARN container classpath. SPARK_YARN_DIST_FILES and SPARK_DIST_CLASSPATH help achieve this. (org/apache/pig/backend/hadoop/executionengine/spark/SparkLauncher.java uses the env var SPARK_JARS to figure out the jars while launching.) I don't recall the exact reason for excluding spark-yarn* explicitly, but I vaguely remember it causing a duplicate spark-yarn*.jar in the dist cache, and that causing issues. I can dig that up and revert.

> Avoid add all spark dependency jars to SPARK_YARN_DIST_FILES and
> SPARK_DIST_CLASSPATH
> --
>
> Key: PIG-4903
> URL: https://issues.apache.org/jira/browse/PIG-4903
> Project: Pig
> Issue Type: Sub-task
> Components: spark
> Reporter: liyunzhang_intel
>
> There are some comments about bin/pig on
> https://reviews.apache.org/r/45667/#comment198955.
> {code}
> # ADDING SPARK DEPENDENCIES ##
> # Spark typically works with a single assembly file. However this
> # assembly isn't available as an artifact to pull in via ivy.
> # To work around this shortcoming, we add all the jars barring
> # spark-yarn to DIST through dist-files and then add them to the classpath
> # of the executors through an independent env variable. The reason
> # for excluding spark-yarn is that spark-yarn is already being added
> # by the spark-yarn-client via jarOf(Client.Class)
> for f in $PIG_HOME/lib/*.jar; do
>     if [[ $f == $PIG_HOME/lib/spark-assembly* ]]; then
>         # Exclude spark-assembly.jar from shipped jars, but retain in classpath
>         SPARK_JARS=${SPARK_JARS}:$f;
>     else
>         SPARK_JARS=${SPARK_JARS}:$f;
>         SPARK_YARN_DIST_FILES=${SPARK_YARN_DIST_FILES},file://$f;
>         SPARK_DIST_CLASSPATH=${SPARK_DIST_CLASSPATH}:\${PWD}/`basename $f`
>     fi
> done
> CLASSPATH=${CLASSPATH}:${SPARK_JARS}
> export SPARK_YARN_DIST_FILES=`echo ${SPARK_YARN_DIST_FILES} | sed 's/^,//g'`
> export SPARK_JARS=${SPARK_YARN_DIST_FILES}
> export SPARK_DIST_CLASSPATH
> {code}
> Here we first copy all spark dependency jars like
> spark-network-shuffle_2.10-1.6.1.jar to the distcache (SPARK_YARN_DIST_FILES), then
> add them to the classpath of the executors (SPARK_DIST_CLASSPATH). Actually we need
> not copy all these dependency jars to SPARK_DIST_CLASSPATH, because all these
> dependency jars are included in spark-assembly.jar, and spark-assembly.jar is
> uploaded with the spark job.

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
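The loop quoted above can be condensed into a standalone sketch. This is not the actual bin/pig code: PIG_LIB and the two jar names below are hypothetical stand-ins, used only to show how the comma-separated dist-files list and the colon-separated classpaths are built, and how the leading separators left by the first iteration get stripped.

```shell
# Hedged sketch of the jar-shipping loop discussed above; NOT the real
# bin/pig. PIG_LIB and the jar names are hypothetical stand-ins.
PIG_LIB="/opt/pig/lib"
SPARK_JARS=""
SPARK_YARN_DIST_FILES=""
SPARK_DIST_CLASSPATH=""

for f in "$PIG_LIB/guava.jar" "$PIG_LIB/spark-assembly-1.6.1.jar"; do
    case "$(basename "$f")" in
        spark-assembly*)
            # keep the assembly on the local classpath only; it is not shipped
            SPARK_JARS="${SPARK_JARS}:$f"
            ;;
        *)
            SPARK_JARS="${SPARK_JARS}:$f"
            # comma-separated list for the YARN distributed cache
            SPARK_YARN_DIST_FILES="${SPARK_YARN_DIST_FILES},file://$f"
            # \${PWD} stays literal so it expands in the container, not here
            SPARK_DIST_CLASSPATH="${SPARK_DIST_CLASSPATH}:\${PWD}/$(basename "$f")"
            ;;
    esac
done

# strip the leading separators left by the first loop iteration
SPARK_JARS="${SPARK_JARS#:}"
SPARK_YARN_DIST_FILES="${SPARK_YARN_DIST_FILES#,}"
SPARK_DIST_CLASSPATH="${SPARK_DIST_CLASSPATH#:}"
```

The original snippet strips the leading comma with `sed 's/^,//g'`; the `${var#,}` parameter expansion used here is a builtin equivalent.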
[jira] [Comment Edited] (PIG-4903) Avoid add all spark dependency jars to SPARK_YARN_DIST_FILES and SPARK_DIST_CLASSPATH
[ https://issues.apache.org/jira/browse/PIG-4903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15303337#comment-15303337 ] liyunzhang_intel edited comment on PIG-4903 at 5/27/16 2:08 AM:
---
[~sriksun]: thanks for your reply. Here is my understanding of the code you provided:
1. SPARK_JARS includes all the dependency jars in $PIG_HOME/lib/ and $PIG_HOME/lib/spark/; we need to add those jars to the classpath of Pig.
2. SPARK_YARN_DIST_FILES includes all the dependency jars that need to be shipped.
3. SPARK_DIST_CLASSPATH includes all the dependency jars the executors later need in spark-on-yarn mode.
In the code you provided, I don't understand one point:
1. Why do we need to exclude spark-yarn.jar from the shipped jars? Can you explain this in detail? I'm now investigating the Spark code to understand it.
I found that we only need to ship the jars under $PIG_HOME/lib/ and add spark-assembly.jar to the classpath of Pig to make it run successfully:
{code}
if [ -n "$SPARK_HOME" ]; then
    echo "Using Spark Home: " ${SPARK_HOME}
    SPARK_JARS=`ls ${SPARK_HOME}/lib/spark-assembly*`
fi
for f in $PIG_HOME/lib/*.jar; do
    SPARK_JARS=${SPARK_JARS}:$f;
    SPARK_YARN_DIST_FILES=${SPARK_YARN_DIST_FILES},file://$f;
    SPARK_DIST_CLASSPATH=${SPARK_DIST_CLASSPATH}:\${PWD}/`basename $f`
done
{code}
It is very strange that spark-assembly.jar is automatically uploaded by this code, while only spark-yarn.jar is uploaded in PIG-4667. If spark-assembly.jar is automatically uploaded, we need not ship the jars under $PIG_HOME/lib/spark/.
[jira] [Commented] (PIG-4903) Avoid add all spark dependency jars to SPARK_YARN_DIST_FILES and SPARK_DIST_CLASSPATH
[ https://issues.apache.org/jira/browse/PIG-4903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15303337#comment-15303337 ] liyunzhang_intel commented on PIG-4903:
---
[~sriksun]: thanks for your reply. Here is my understanding of the code you provided:
1. SPARK_JARS includes all the dependency jars in $PIG_HOME/lib/ and $PIG_HOME/lib/spark/; we need to add those jars to the classpath of Pig.
2. SPARK_YARN_DIST_FILES includes all the dependency jars that need to be shipped.
3. SPARK_DIST_CLASSPATH includes all the dependency jars the executors later need in spark-on-yarn mode.
In the code you provided, I don't understand the following 2 points:
1. Why do we need to exclude spark-yarn.jar from the shipped jars? Can you explain this in detail? I'm now investigating the Spark code to understand it.
2. I found that we only need to ship the jars under $PIG_HOME/lib/ and add spark-assembly.jar to the classpath of Pig to make it run successfully:
{code}
if [ -n "$SPARK_HOME" ]; then
    echo "Using Spark Home: " ${SPARK_HOME}
    SPARK_JARS=`ls ${SPARK_HOME}/lib/spark-assembly*`
fi
for f in $PIG_HOME/lib/*.jar; do
    SPARK_JARS=${SPARK_JARS}:$f;
    SPARK_YARN_DIST_FILES=${SPARK_YARN_DIST_FILES},file://$f;
    SPARK_DIST_CLASSPATH=${SPARK_DIST_CLASSPATH}:\${PWD}/`basename $f`
done
{code}
It is very strange that spark-assembly.jar is automatically uploaded by this code, while only spark-yarn.jar is uploaded in PIG-4667. If spark-assembly.jar is automatically uploaded, we need not ship the jars under $PIG_HOME/lib/spark/.
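The decision the thread converges on — ship the individual dependency jars only when no spark-assembly is available — can be sketched as a small guard. This is a hypothetical illustration, not the committed fix; SPARK_LIB is a stand-in for $SPARK_HOME/lib.

```shell
# Hedged sketch: decide whether individual Spark dependency jars must be
# shipped, based on whether a spark-assembly jar is present.
# SPARK_LIB is a hypothetical stand-in for $SPARK_HOME/lib.
SPARK_LIB="/opt/spark/lib"

# POSIX-safe glob check: if nothing matches, $1 is the literal pattern
set -- "$SPARK_LIB"/spark-assembly-*.jar
if [ -e "$1" ]; then
    # the assembly bundles spark-core, spark-yarn and their transitive
    # deps, and is uploaded with the job, so nothing else needs shipping
    SHIP_SPARK_DEPS=no
else
    # no assembly: every direct and transitive dependency must reach the
    # YARN container via SPARK_YARN_DIST_FILES / SPARK_DIST_CLASSPATH
    SHIP_SPARK_DEPS=yes
fi
```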
[jira] [Commented] (PIG-3911) Define unique fields with @OutputSchema
[ https://issues.apache.org/jira/browse/PIG-3911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15303097#comment-15303097 ] Daniel Dai commented on PIG-3911:
---
I think it introduces some backward-incompatible changes, right?

> Define unique fields with @OutputSchema
> ---
>
> Key: PIG-3911
> URL: https://issues.apache.org/jira/browse/PIG-3911
> Project: Pig
> Issue Type: Improvement
> Affects Versions: 0.11, 0.12.0, 0.11.1, 0.12.1, 0.13.0
> Reporter: Lorand Bendig
> Assignee: Lorand Bendig
> Fix For: 0.17.0
>
> Attachments: PIG-3911.patch
>
> Based on PIG-2361, I took the liberty of extending {{@OutputSchema}} so that
> more flexible output schemas can be defined through annotations. As a result,
> the repeating patterns of {{EvalFunc#outputSchema()}} can be eliminated from
> most of the UDFs.
> Examples:
> {code}
> @OutputSchema("bytearray")
> {code}
> => equivalent to:
> {code}
> @Override
> public Schema outputSchema(Schema input) {
>     return new Schema(new Schema.FieldSchema(null, DataType.BYTEARRAY));
> }
> {code}
> {code}
> @OutputSchema("chararray")
> @Unique
> {code}
> => equivalent to:
> {code}
> @Override
> public Schema outputSchema(Schema input) {
>     return new Schema(new Schema.FieldSchema(getSchemaName(this.getClass().getName().toLowerCase(), input), DataType.CHARARRAY));
> }
> {code}
> {code}
> @OutputSchema(value = "dimensions:bag", useInputSchema = true)
> {code}
> => equivalent to:
> {code}
> @Override
> public Schema outputSchema(Schema input) {
>     return new Schema(new FieldSchema("dimensions", input, DataType.BAG));
> }
> {code}
> {code}
> @OutputSchema(value = "${0}:bag", useInputSchema = true)
> @Unique("${0}")
> {code}
> => equivalent to:
> {code}
> @Override
> public Schema outputSchema(Schema input) {
>     return new Schema(new Schema.FieldSchema(getSchemaName(this.getClass().getName().toLowerCase(), input), input, DataType.BAG));
> }
> {code}
> If the {{useInputSchema}} attribute is set, the input schema will be applied to
> the output schema, provided that:
> * the output schema is "simple", i.e. \[name\]\[:type\] or '()', '{}', '[]', and
> * it has a complex field type (tuple, bag, map)
> @Unique: this annotation defines which fields should be unique in the schema
> * if no parameters are provided, all fields will be unique
> * otherwise it takes a string array of field names
> Unique field generation:
> A unique field name is generated in the same manner as {{EvalFunc#getSchemaName}}:
> * if the field has an alias:
> ** if it's a placeholder ($\{i\}, i=0..n): fieldName -> com_myfunc_\[input_alias\]\_\[nextSchemaId\]
> ** otherwise: fieldName -> fieldName\_\[nextSchemaId\]
> * otherwise: com\_myfunc\_\[input_alias\]\_\[nextSchemaId\]
> Supported scripting UDFs: Python, Jython, Groovy, JRuby
[jira] [Commented] (PIG-4906) Add Bigdecimal functions in Over function
[ https://issues.apache.org/jira/browse/PIG-4906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15303071#comment-15303071 ] Daniel Dai commented on PIG-4906:
---
I remember you once had a test case in your patch, is that right?

> Add Bigdecimal functions in Over function
> -
>
> Key: PIG-4906
> URL: https://issues.apache.org/jira/browse/PIG-4906
> Project: Pig
> Issue Type: Improvement
> Components: piggybank
> Affects Versions: 0.15.0
> Environment: Hortonworks 2.4.2
> Reporter: Cristian Galán
> Priority: Minor
> Fix For: 0.17.0
>
> Attachments: over.patch
>
> Original Estimate: 0.25h
> Remaining Estimate: 0.25h
>
> In the piggybank, the Over function class doesn't include the BigDecimal
> methods for sum, max, min and avg.
> -I attach Over.class with these changes.- I attach the patch with these changes,
> and I also add a new output schema. If anybody can do the PR, I'd appreciate it.
[jira] [Created] (PIG-4911) Provide option to disable DAG recovery
Rohini Palaniswamy created PIG-4911:
---
Summary: Provide option to disable DAG recovery
Key: PIG-4911
URL: https://issues.apache.org/jira/browse/PIG-4911
Project: Pig
Issue Type: Improvement
Reporter: Rohini Palaniswamy
Assignee: Rohini Palaniswamy
Fix For: 0.16.0

Tez 0.7 has a lot of issues with DAG recovery when auto parallelism is in use, causing hung DAGs in many cases, because it was not writing auto-parallelism decisions to the recovery history. A rewrite was done in Tez 0.8 to handle that. Code was also added to Tez to automatically disable recovery when auto parallelism is present, so that both Pig and Tez would benefit. That works fine, and the second AM attempt fails with a "DAG cannot be recovered" error when it sees vertices with auto parallelism. But the problem is that it is hard for users to see what the actual problem was, and it is hard to debug as well, since the whole UI state is overwritten with the partial recovery information. Disabling recovery in Pig itself, by setting tez.dag.recovery.enabled=false, makes the job skip the second attempt entirely, since that attempt would eventually fail anyway. It also makes it easy to debug the original failure.
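Until Pig sets this internally, the property can be passed on the command line. A hedged sketch follows: only the property name tez.dag.recovery.enabled comes from the issue; the script name and invocation form are illustrative placeholders.

```shell
# Hedged sketch: pre-disable Tez DAG recovery for a Pig-on-Tez run so a
# failed AM never attempts a recovery that is doomed to fail.
TEZ_RECOVERY_OPT="-Dtez.dag.recovery.enabled=false"

# Illustrative invocation (not executed here; script name is a placeholder):
#   pig -x tez $TEZ_RECOVERY_OPT myscript.pig
```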
[jira] [Updated] (PIG-4906) Add Bigdecimal functions in Over function
[ https://issues.apache.org/jira/browse/PIG-4906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cristian Galán updated PIG-4906:
Attachment: over.patch

The final patch.
[jira] [Issue Comment Deleted] (PIG-4906) Add Bigdecimal functions in Over function
[ https://issues.apache.org/jira/browse/PIG-4906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cristian Galán updated PIG-4906:
Comment: was deleted
(was: The final patch)
[jira] [Updated] (PIG-4906) Add Bigdecimal functions in Over function
[ https://issues.apache.org/jira/browse/PIG-4906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cristian Galán updated PIG-4906:
Attachment: (was: over.patch)