[jira] [Commented] (PIG-4903) Avoid add all spark dependency jars to SPARK_YARN_DIST_FILES and SPARK_DIST_CLASSPATH

2016-05-26 Thread Srikanth Sundarrajan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15303400#comment-15303400
 ] 

Srikanth Sundarrajan commented on PIG-4903:
---

If spark-assembly is present, then nothing else is needed. However, if 
spark-assembly isn't available and you need to materialize the dependencies 
through all direct & transitive dependencies of spark-core & spark-yarn, I 
think you will need all of those dependencies in the YARN container classpath. 
SPARK_YARN_DIST_FILES & SPARK_DIST_CLASSPATH will help achieve this. 
(org/apache/pig/backend/hadoop/executionengine/spark/SparkLauncher.java uses 
the env var SPARK_JARS to figure out the jar list while launching.)
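
As a minimal sketch (illustrative only, not the actual SparkLauncher code), 
this is how a launcher could consume the colon-separated SPARK_JARS value 
that bin/pig exports:

{code}
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: turn the colon-separated SPARK_JARS env var
// built by bin/pig into a list of jar paths.
public class SparkJarsEnv {
    public static List<String> sparkJars() {
        List<String> jars = new ArrayList<String>();
        String env = System.getenv("SPARK_JARS");
        if (env != null) {
            for (String jar : env.split(":")) {
                // bin/pig may leave a leading ':', producing an empty entry
                if (!jar.isEmpty()) {
                    jars.add(jar);
                }
            }
        }
        return jars;
    }
}
{code}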

I don't recall the exact reason for excluding spark-yarn* explicitly, but I 
vaguely remember it causing a duplicate spark-yarn*.jar in the dist cache, and 
that caused issues. I can dig that up and get back to you.

> Avoid add all spark dependency jars to  SPARK_YARN_DIST_FILES and 
> SPARK_DIST_CLASSPATH
> --
>
> Key: PIG-4903
> URL: https://issues.apache.org/jira/browse/PIG-4903
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: liyunzhang_intel
>
> There are some comments about bin/pig on 
> https://reviews.apache.org/r/45667/#comment198955.
> {code}
> # ADDING SPARK DEPENDENCIES ##
> # Spark typically works with a single assembly file. However this
> # assembly isn't available as an artifact to pull in via ivy.
> # To work around this shortcoming, we add all the jars barring
> # spark-yarn to DIST through dist-files and then add them to the classpath
> # of the executors through an independent env variable. The reason
> # for excluding spark-yarn is that spark-yarn is already being added
> # by the spark-yarn-client via jarOf(Client.Class)
> for f in $PIG_HOME/lib/*.jar; do
> if [[ $f == $PIG_HOME/lib/spark-assembly* ]]; then
> # Exclude spark-assembly.jar from shipped jars, but retain in classpath
> SPARK_JARS=${SPARK_JARS}:$f;
> else
> SPARK_JARS=${SPARK_JARS}:$f;
> SPARK_YARN_DIST_FILES=${SPARK_YARN_DIST_FILES},file://$f;
> SPARK_DIST_CLASSPATH=${SPARK_DIST_CLASSPATH}:\${PWD}/`basename $f`
> fi
> done
> CLASSPATH=${CLASSPATH}:${SPARK_JARS}
> export SPARK_YARN_DIST_FILES=`echo ${SPARK_YARN_DIST_FILES} | sed 's/^,//g'`
> export SPARK_JARS=${SPARK_YARN_DIST_FILES}
> export SPARK_DIST_CLASSPATH
> {code}
> Here we first copy all the Spark dependency jars, like 
> spark-network-shuffle_2.10-1.6.1.jar, to the distcache (SPARK_YARN_DIST_FILES) 
> and then add them to the classpath of the executors (SPARK_DIST_CLASSPATH). 
> Actually we need not copy all these dependency jars to SPARK_DIST_CLASSPATH, 
> because they are all included in spark-assembly.jar, and spark-assembly.jar is 
> uploaded with the Spark job.





[jira] [Comment Edited] (PIG-4903) Avoid add all spark dependency jars to SPARK_YARN_DIST_FILES and SPARK_DIST_CLASSPATH

2016-05-26 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15303337#comment-15303337
 ] 

liyunzhang_intel edited comment on PIG-4903 at 5/27/16 2:08 AM:


[~sriksun]: thanks for your reply. Here is my understanding of the code you 
provided:
1. SPARK_JARS includes all the dependency jars in $PIG_HOME/lib/ and 
$PIG_HOME/lib/spark/, and we need to add those jars to the classpath of Pig.
2. SPARK_YARN_DIST_FILES includes all the dependency jars that need to be shipped.
3. SPARK_DIST_CLASSPATH includes all the dependency jars that the executors 
will later need in Spark on YARN mode.


In the code above, there is one point I don't understand:
1. Why do we need to exclude spark-yarn.jar from the shipped jars? Can you 
explain this in detail? I'm currently investigating the Spark code to 
understand it.

I found that we only need to ship the jars under $PIG_HOME/lib/ and add 
spark-assembly.jar to the classpath of Pig to make it run successfully:
  
{code}
if [ -n "$SPARK_HOME" ]; then
echo "Using Spark Home: " ${SPARK_HOME}
SPARK_JARS=`ls ${SPARK_HOME}/lib/spark-assembly*`
fi

for f in $PIG_HOME/lib/*.jar; do
SPARK_JARS=${SPARK_JARS}:$f;
SPARK_YARN_DIST_FILES=${SPARK_YARN_DIST_FILES},file://$f;
SPARK_DIST_CLASSPATH=${SPARK_DIST_CLASSPATH}:\${PWD}/`basename $f`
done
{code}
It is very strange that spark-assembly.jar is automatically uploaded with 
this code, while only spark-yarn.jar is uploaded in PIG-4667. If 
spark-assembly.jar is automatically uploaded, we need not ship the jars under 
$PIG_HOME/lib/spark/.
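
As an aside, the jarOf(Client.Class) mechanism mentioned in the bin/pig 
comment is, to my understanding, the usual trick of asking the classloader 
which jar a class was loaded from; a minimal sketch of that technique (the 
actual Spark implementation may differ):

{code}
// Sketch of the standard "jar of a class" technique; not Spark's code.
public class JarOf {
    // Returns the location of the jar that contains the given class,
    // or null if the class was not loaded from a code source.
    public static String jarOf(Class<?> klass) {
        java.security.CodeSource src =
                klass.getProtectionDomain().getCodeSource();
        return src == null ? null : src.getLocation().toString();
    }
}
{code}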


was (Author: kellyzly):
[~sriksun]: thanks for your reply. Here is my understanding of the code you 
provided:
1. SPARK_JARS includes all the dependency jars in $PIG_HOME/lib/ and 
$PIG_HOME/lib/spark/, and we need to add those jars to the classpath of Pig.
2. SPARK_YARN_DIST_FILES includes all the dependency jars that need to be shipped.
3. SPARK_DIST_CLASSPATH includes all the dependency jars that the executors 
will later need in Spark on YARN mode.


In the code above, there are two points I don't understand:
1. Why do we need to exclude spark-yarn.jar from the shipped jars? Can you 
explain this in detail? I'm currently investigating the Spark code to 
understand it.
2. I found that we only need to ship the jars under $PIG_HOME/lib/ and add 
spark-assembly.jar to the classpath of Pig to make it run successfully:
  
{code}
if [ -n "$SPARK_HOME" ]; then
echo "Using Spark Home: " ${SPARK_HOME}
SPARK_JARS=`ls ${SPARK_HOME}/lib/spark-assembly*`
fi

for f in $PIG_HOME/lib/*.jar; do
SPARK_JARS=${SPARK_JARS}:$f;
SPARK_YARN_DIST_FILES=${SPARK_YARN_DIST_FILES},file://$f;
SPARK_DIST_CLASSPATH=${SPARK_DIST_CLASSPATH}:\${PWD}/`basename $f`
done
{code}
It is very strange that spark-assembly.jar is automatically uploaded with 
this code, while only spark-yarn.jar is uploaded in PIG-4667. If 
spark-assembly.jar is automatically uploaded, we need not ship the jars under 
$PIG_HOME/lib/spark/.




[jira] [Commented] (PIG-3911) Define unique fields with @OutputSchema

2016-05-26 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15303097#comment-15303097
 ] 

Daniel Dai commented on PIG-3911:
-

I think it introduces some backward-incompatible changes, right?

> Define unique fields with @OutputSchema
> ---
>
> Key: PIG-3911
> URL: https://issues.apache.org/jira/browse/PIG-3911
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 0.11, 0.12.0, 0.11.1, 0.12.1, 0.13.0
>Reporter: Lorand Bendig
>Assignee: Lorand Bendig
> Fix For: 0.17.0
>
> Attachments: PIG-3911.patch
>
>
> Based on PIG-2361, I took the liberty of extending {{@OutputSchema}} so that 
> a more flexible output schema can be defined through annotations. As a result, 
> the repeating patterns of {{EvalFunc#outputSchema()}} can be eliminated from 
> most of the UDFs.
> Examples:
> {code}
> @OutputSchema("bytearray")
> {code}
> => equivalent to:
> {code}
> @Override
> public Schema outputSchema(Schema input) {
>   return new Schema(new Schema.FieldSchema(null, DataType.BYTEARRAY));
> }
> {code}
> {code}
> @OutputSchema("chararray")
> @Unique
> {code}
> => equivalent to:
> {code}
> @Override
> public Schema outputSchema(Schema input) {
>   return new Schema(new Schema.FieldSchema(
>       getSchemaName(this.getClass().getName().toLowerCase(), input),
>       DataType.CHARARRAY));
> }
> {code}
> {code}
> @OutputSchema(value = "dimensions:bag", useInputSchema = true)
> {code}
> => equivalent to:
> {code}
> @Override
> public Schema outputSchema(Schema input) {
>   return new Schema(new FieldSchema("dimensions", input, DataType.BAG));
> }
> {code}
> {code}
> @OutputSchema(value = "${0}:bag", useInputSchema = true)
> @Unique("${0}")
> {code}
> => equivalent to:
> {code}
> @Override
> public Schema outputSchema(Schema input) {
>     return new Schema(new Schema.FieldSchema(
>         getSchemaName(this.getClass().getName().toLowerCase(), input),
>         input, DataType.BAG));
> }
> {code}
> If the {{useInputSchema}} attribute is set, the input schema will be applied 
> to the output schema, provided that:
> * the output schema is "simple", i.e.: \[name\]\[:type\] or '()', '{}', '[]', and
> * it has a complex field type (tuple, bag, map)
> @Unique: this annotation defines which fields should be unique in the schema
> * if no parameters are provided, all fields will be unique
> * otherwise it takes a string array of field names
> Unique field generation:
> A unique field name is generated in the same manner as 
> {{EvalFunc#getSchemaName}} does it.
> * if the field has an alias:
>   ** if it's a placeholder ($\{i\}, i=0..n): fieldName -> 
> com_myfunc_\[input_alias\]\_\[nextSchemaId\]
>   ** otherwise: fieldName -> fieldName\_\[nextSchemaId\]
> * otherwise: com\_myfunc\_\[input_alias\]\_\[nextSchemaId\]
> Supported scripting UDFs: Python, Jython, Groovy, JRuby
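> To make the usage concrete, here is a hypothetical UDF combining the 
> annotations above (class and field names are made up; {{@Unique}} and the 
> {{useInputSchema}} attribute come from this patch):
> {code}
> import java.io.IOException;
> import org.apache.pig.EvalFunc;
> import org.apache.pig.builtin.OutputSchema;
> import org.apache.pig.data.DataBag;
> import org.apache.pig.data.Tuple;
> // The @Unique import is omitted: the annotation is added by this patch.
>
> // Hypothetical example: pass the first input field (a bag) through
> // unchanged; the annotations replace the usual outputSchema() override.
> @OutputSchema(value = "${0}:bag", useInputSchema = true)
> @Unique("${0}")
> public class WrapBag extends EvalFunc<DataBag> {
>     @Override
>     public DataBag exec(Tuple input) throws IOException {
>         if (input == null || input.size() == 0) {
>             return null;
>         }
>         return (DataBag) input.get(0);
>     }
> }
> {code}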





[jira] [Commented] (PIG-4906) Add Bigdecimal functions in Over function

2016-05-26 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15303071#comment-15303071
 ] 

Daniel Dai commented on PIG-4906:
-

I remember you once had a test case in your patch, is that right?

> Add Bigdecimal functions in Over function
> -
>
> Key: PIG-4906
> URL: https://issues.apache.org/jira/browse/PIG-4906
> Project: Pig
>  Issue Type: Improvement
>  Components: piggybank
>Affects Versions: 0.15.0
> Environment: Hortonworks 2.4.2
>Reporter: Cristian Galán
>Priority: Minor
> Fix For: 0.17.0
>
> Attachments: over.patch
>
>   Original Estimate: 0.25h
>  Remaining Estimate: 0.25h
>
> In the piggybank, the Over function class doesn't include the BigDecimal 
> methods for sum, max, min and avg.
> -I attach Over.class with these changes.- I attach the patch with these 
> changes, and I also add a new output schema. If anybody can do the PR, I'd 
> appreciate it.
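> For illustration, here is the kind of logic such methods add (a standalone 
> sketch of a windowed BigDecimal sum, not the patch itself):
> {code}
> import java.math.BigDecimal;
> import java.util.List;
>
> public class BigDecimalSumSketch {
>     // Sums values[0..upTo] inclusive, skipping nulls, the way a
>     // "sum over the preceding rows" window aggregate would.
>     public static BigDecimal sumPreceding(List<BigDecimal> values, int upTo) {
>         BigDecimal sum = BigDecimal.ZERO;
>         for (int i = 0; i <= upTo && i < values.size(); i++) {
>             BigDecimal v = values.get(i);
>             if (v != null) {
>                 sum = sum.add(v);
>             }
>         }
>         return sum;
>     }
> }
> {code}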





[jira] [Created] (PIG-4911) Provide option to disable DAG recovery

2016-05-26 Thread Rohini Palaniswamy (JIRA)
Rohini Palaniswamy created PIG-4911:
---

 Summary: Provide option to disable DAG recovery
 Key: PIG-4911
 URL: https://issues.apache.org/jira/browse/PIG-4911
 Project: Pig
  Issue Type: Improvement
Reporter: Rohini Palaniswamy
Assignee: Rohini Palaniswamy
 Fix For: 0.16.0


  Tez 0.7 has a lot of issues with DAG recovery when auto parallelism is 
involved, causing hung DAGs in many cases, as it was not writing auto 
parallelism decisions to the recovery history. A rewrite was done in Tez 0.8 
to handle that.
  Code was added to Tez to automatically disable recovery when auto 
parallelism is used, so that both Pig and Tez would benefit. It works fine: 
the second AM attempt fails with a "DAG cannot be recovered" error when it 
sees there are vertices with auto parallelism. But the problem is that it is 
hard for users to see what the actual problem is, and it is hard to debug as 
well, since the whole UI state is overwritten with the partial recovery 
information.
Disabling recovery in Pig itself, by setting tez.dag.recovery.enabled=false, 
means the second attempt (which would eventually fail anyway) is never started 
at all. It also makes it easy to debug the original failure.
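
Until Pig sets this automatically, the property can be set by hand. A minimal 
sketch (the class name is made up; assumes Pig 0.14+ with Tez support; the 
same property can equally go into pig.properties):

{code}
import java.util.Properties;

import org.apache.pig.ExecTypeProvider;
import org.apache.pig.PigServer;

public class DisableDagRecovery {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Turn off Tez DAG recovery so a failed AM is not retried
        // against partial recovery history.
        props.setProperty("tez.dag.recovery.enabled", "false");
        PigServer pig = new PigServer(ExecTypeProvider.fromString("tez"), props);
        pig.registerQuery("A = LOAD 'input' AS (line:chararray);");
        pig.store("A", "output");
    }
}
{code}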





[jira] [Updated] (PIG-4906) Add Bigdecimal functions in Over function

2016-05-26 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/PIG-4906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cristian Galán updated PIG-4906:

Attachment: over.patch

The final patch.






[jira] [Issue Comment Deleted] (PIG-4906) Add Bigdecimal functions in Over function

2016-05-26 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/PIG-4906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cristian Galán updated PIG-4906:

Comment: was deleted

(was: The final patch)






[jira] [Updated] (PIG-4906) Add Bigdecimal functions in Over function

2016-05-26 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/PIG-4906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cristian Galán updated PIG-4906:

Attachment: (was: over.patch)



