[jira] [Comment Edited] (SPARK-12607) spark-class produced null command strings for "exec"

2016-01-03 Thread SM Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15080499#comment-15080499
 ] 

SM Wang edited comment on SPARK-12607 at 1/3/16 8:29 PM:
-

Sure.  I think the problem is with the while loop's delimiter setting (-d '')  
or the launcher class' behavior in the MSYS64 environment.

Here is the section of the script in version 1.4.0 where I added some echo commands (marked with the +++ prefix).

{quote}
{{
CMD=()
while IFS= read -d '' -r ARG; do
echo "+++ Parsed Arguments in while loop: $ARG"
  CMD+=("$ARG")
done < <("$RUNNER" -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main "$@")
echo "+++ Launcher Command" "$RUNNER" -cp "$LAUNCH_CLASSPATH" 
org.apache.spark.launcher.Main "$@"

echo "+++ First Element: ${CMD[0]}"
echo "+++ Command Array: ${CMD[@]}"

if [ "${CMD[0]}" = "usage" ]; then
  "${CMD[@]}"
else
  exec "${CMD[@]}"
fi
}}
{quote}

The output from "run-example SparkPi" is as follows:

+++ Launcher Command /apps/jdk1.7.0_80/bin/java -cp 
/apps/tmp/spark-1.4.0-bin-hadoop2.4/lib/spark-assembly-1.4.0-hadoop2.4.0.jar 
org.apache.spark.launcher.Main org.apache.spark.deploy.SparkSubmit --master 
local[*] --class org.apache.spark.examples.SparkPi 
/apps/tmp/spark-1.4.0-bin-hadoop2.4/lib/spark-examples-1.4.0-hadoop2.4.0.jar
+++ First Element:
+++ Command Array:

As you can see the command array is empty.

However, when running the launcher command manually I got the following:

C:/msys64/apps/jdk1.7.0_80\bin\java -cp 
"C:/msys64/apps/tmp/spark-1.4.0-bin-hadoop2.4\conf\;C:\msys64\apps\tmp\spark-1.4.0-bin-hadoop2.4\lib\spark-assembly-1.4.0-hadoop2.4.0.jar;C:\msys64\apps\tmp\spark-1.4.0-bin-hadoop2.4\lib\datanucleus-api-jdo-3.2.6.jar;C:\msys64\apps\tmp\spark-1.4.0-bin-hadoop2.4\lib\datanucleus-core-3.2.10.jar;C:\msys64\apps\tmp\spark-1.4.0-bin-hadoop2.4\lib\datanucleus-rdbms-3.2.9.jar"
 -Xms512m -Xmx512m "-XX:MaxPermSize=128m" org.apache.spark.deploy.SparkSubmit 
--master local[*] --class org.apache.spark.examples.SparkPi 
C:/msys64/apps/tmp/spark-1.4.0-bin-hadoop2.4/lib/spark-examples-1.4.0-hadoop2.4.0.jar

When I changed the delimiter to "-d ' '" I was able to get a non-empty command array.

This is why I think the issue is either with the delimiter setting or with the launcher not producing the command string with the expected delimiter.
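For illustration, here is a minimal, self-contained sketch (with a made-up command string, not the real launcher output) of why the array stays empty when no NUL separators arrive:

{panel}
# read -d '' consumes NUL-delimited tokens. If the input contains no NUL,
# read hits end-of-file, returns a non-zero status, and the loop body never
# runs, so CMD stays empty -- exactly the symptom above.
CMD=()
while IFS= read -d '' -r ARG; do
  CMD+=("$ARG")
done < <(printf '%s' 'java -cp foo.jar Main arg1')
echo "no NULs:   ${#CMD[@]} elements"   # prints 0

# With NUL-separated tokens (what the launcher is expected to emit for bash),
# every read succeeds and the array is filled.
CMD=()
while IFS= read -d '' -r ARG; do
  CMD+=("$ARG")
done < <(printf '%s\0' java -cp foo.jar Main arg1)
echo "with NULs: ${#CMD[@]} elements"   # prints 5
{panel}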

Hope this helps.

Thank you for looking into this.


was (Author: swang):
Sure.  I think the problem is with the while loop's delimiter setting (-d '')  
or the launcher class' behavior in the MSYS64 environment.

Here is the section of the script in version 1.4.0 where I added some echo commands (marked with the +++ prefix).

{{
CMD=()
while IFS= read -d '' -r ARG; do
echo "+++ Parsed Arguments in while loop: $ARG"
  CMD+=("$ARG")
done < <("$RUNNER" -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main "$@")
echo "+++ Launcher Command" "$RUNNER" -cp "$LAUNCH_CLASSPATH" 
org.apache.spark.launcher.Main "$@"

echo "+++ First Element: ${CMD[0]}"
echo "+++ Command Array: ${CMD[@]}"

if [ "${CMD[0]}" = "usage" ]; then
  "${CMD[@]}"
else
  exec "${CMD[@]}"
fi
}}

The output from "run-example SparkPi" is as follows:

+++ Launcher Command /apps/jdk1.7.0_80/bin/java -cp 
/apps/tmp/spark-1.4.0-bin-hadoop2.4/lib/spark-assembly-1.4.0-hadoop2.4.0.jar 
org.apache.spark.launcher.Main org.apache.spark.deploy.SparkSubmit --master 
local[*] --class org.apache.spark.examples.SparkPi 
/apps/tmp/spark-1.4.0-bin-hadoop2.4/lib/spark-examples-1.4.0-hadoop2.4.0.jar
+++ First Element:
+++ Command Array:

As you can see the command array is empty.

However, when running the launcher command manually I got the following:

C:/msys64/apps/jdk1.7.0_80\bin\java -cp 
"C:/msys64/apps/tmp/spark-1.4.0-bin-hadoop2.4\conf\;C:\msys64\apps\tmp\spark-1.4.0-bin-hadoop2.4\lib\spark-assembly-1.4.0-hadoop2.4.0.jar;C:\msys64\apps\tmp\spark-1.4.0-bin-hadoop2.4\lib\datanucleus-api-jdo-3.2.6.jar;C:\msys64\apps\tmp\spark-1.4.0-bin-hadoop2.4\lib\datanucleus-core-3.2.10.jar;C:\msys64\apps\tmp\spark-1.4.0-bin-hadoop2.4\lib\datanucleus-rdbms-3.2.9.jar"
 -Xms512m -Xmx512m "-XX:MaxPermSize=128m" org.apache.spark.deploy.SparkSubmit 
--master local[*] --class org.apache.spark.examples.SparkPi 
C:/msys64/apps/tmp/spark-1.4.0-bin-hadoop2.4/lib/spark-examples-1.4.0-hadoop2.4.0.jar

When I changed the delimiter to "-d ' '" I was able to get a non-empty command array.

This is why I think the issue is either with the delimiter setting or with the launcher not producing the command string with the expected delimiter.

Hope this helps.

Thank you for looking into this.

> spark-class produced null command strings for "exec"
> 
>
> Key: SPARK-12607
> URL: https://issues.apache.org/jira/browse/SPARK-12607
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>

[jira] [Comment Edited] (SPARK-12607) spark-class produced null command strings for "exec"

2016-01-03 Thread SM Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15080499#comment-15080499
 ] 

SM Wang edited comment on SPARK-12607 at 1/3/16 8:30 PM:
-

Sure.  I think the problem is with the while loop's delimiter setting (-d '')  
or the launcher class' behavior in the MSYS64 environment.

Here is the section of the script in version 1.4.0 where I added some echo commands (marked with the +++ prefix).

{quote}
CMD=()
while IFS= read -d '' -r ARG; do
echo "+++ Parsed Arguments in while loop: $ARG"
  CMD+=("$ARG")
done < <("$RUNNER" -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main "$@")
echo "+++ Launcher Command" "$RUNNER" -cp "$LAUNCH_CLASSPATH" 
org.apache.spark.launcher.Main "$@"

echo "+++ First Element: ${CMD[0]}"
echo "+++ Command Array: ${CMD[@]}"

if [ "${CMD[0]}" = "usage" ]; then
  "${CMD[@]}"
else
  exec "${CMD[@]}"
fi
{quote}

The output from "run-example SparkPi" is as follows:

{{+++ Launcher Command /apps/jdk1.7.0_80/bin/java -cp 
/apps/tmp/spark-1.4.0-bin-hadoop2.4/lib/spark-assembly-1.4.0-hadoop2.4.0.jar 
org.apache.spark.launcher.Main org.apache.spark.deploy.SparkSubmit --master 
local[*] --class org.apache.spark.examples.SparkPi 
/apps/tmp/spark-1.4.0-bin-hadoop2.4/lib/spark-examples-1.4.0-hadoop2.4.0.jar
+++ First Element:
+++ Command Array:}}

As you can see the command array is empty.

However, when running the launcher command manually I got the following:

C:/msys64/apps/jdk1.7.0_80\bin\java -cp 
"C:/msys64/apps/tmp/spark-1.4.0-bin-hadoop2.4\conf\;C:\msys64\apps\tmp\spark-1.4.0-bin-hadoop2.4\lib\spark-assembly-1.4.0-hadoop2.4.0.jar;C:\msys64\apps\tmp\spark-1.4.0-bin-hadoop2.4\lib\datanucleus-api-jdo-3.2.6.jar;C:\msys64\apps\tmp\spark-1.4.0-bin-hadoop2.4\lib\datanucleus-core-3.2.10.jar;C:\msys64\apps\tmp\spark-1.4.0-bin-hadoop2.4\lib\datanucleus-rdbms-3.2.9.jar"
 -Xms512m -Xmx512m "-XX:MaxPermSize=128m" org.apache.spark.deploy.SparkSubmit 
--master local[*] --class org.apache.spark.examples.SparkPi 
C:/msys64/apps/tmp/spark-1.4.0-bin-hadoop2.4/lib/spark-examples-1.4.0-hadoop2.4.0.jar

When I changed the delimiter to "-d ' '" I was able to get a non-empty command array.

This is why I think the issue is either with the delimiter setting or with the launcher not producing the command string with the expected delimiter.

Hope this helps.

Thank you for looking into this.


was (Author: swang):
Sure.  I think the problem is with the while loop's delimiter setting (-d '')  
or the launcher class' behavior in the MSYS64 environment.

Here is the section of the script in version 1.4.0 where I added some echo commands (marked with the +++ prefix).

{quote}
{{
CMD=()
while IFS= read -d '' -r ARG; do
echo "+++ Parsed Arguments in while loop: $ARG"
  CMD+=("$ARG")
done < <("$RUNNER" -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main "$@")
echo "+++ Launcher Command" "$RUNNER" -cp "$LAUNCH_CLASSPATH" 
org.apache.spark.launcher.Main "$@"

echo "+++ First Element: ${CMD[0]}"
echo "+++ Command Array: ${CMD[@]}"

if [ "${CMD[0]}" = "usage" ]; then
  "${CMD[@]}"
else
  exec "${CMD[@]}"
fi
}}
{quote}

The output from "run-example SparkPi" is as follows:

+++ Launcher Command /apps/jdk1.7.0_80/bin/java -cp 
/apps/tmp/spark-1.4.0-bin-hadoop2.4/lib/spark-assembly-1.4.0-hadoop2.4.0.jar 
org.apache.spark.launcher.Main org.apache.spark.deploy.SparkSubmit --master 
local[*] --class org.apache.spark.examples.SparkPi 
/apps/tmp/spark-1.4.0-bin-hadoop2.4/lib/spark-examples-1.4.0-hadoop2.4.0.jar
+++ First Element:
+++ Command Array:

As you can see the command array is empty.

However, when running the launcher command manually I got the following:

C:/msys64/apps/jdk1.7.0_80\bin\java -cp 
"C:/msys64/apps/tmp/spark-1.4.0-bin-hadoop2.4\conf\;C:\msys64\apps\tmp\spark-1.4.0-bin-hadoop2.4\lib\spark-assembly-1.4.0-hadoop2.4.0.jar;C:\msys64\apps\tmp\spark-1.4.0-bin-hadoop2.4\lib\datanucleus-api-jdo-3.2.6.jar;C:\msys64\apps\tmp\spark-1.4.0-bin-hadoop2.4\lib\datanucleus-core-3.2.10.jar;C:\msys64\apps\tmp\spark-1.4.0-bin-hadoop2.4\lib\datanucleus-rdbms-3.2.9.jar"
 -Xms512m -Xmx512m "-XX:MaxPermSize=128m" org.apache.spark.deploy.SparkSubmit 
--master local[*] --class org.apache.spark.examples.SparkPi 
C:/msys64/apps/tmp/spark-1.4.0-bin-hadoop2.4/lib/spark-examples-1.4.0-hadoop2.4.0.jar

When I changed the delimiter to "-d ' '" I was able to get a non-empty command array.

This is why I think the issue is either with the delimiter setting or with the launcher not producing the command string with the expected delimiter.

Hope this helps.

Thank you for looking into this.

> spark-class produced null command strings for "exec"
> 
>
> Key: SPARK-12607
> URL: https://issues.apache.org/jira/browse/SPARK-12607
> Project: Spark
>  Issue Type: Bug
>  Components: 

[jira] [Updated] (SPARK-12594) Outer Join Elimination by Filter Condition

2016-01-03 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-12594:

Summary: Outer Join Elimination by Filter Condition  (was: Outer Join 
Elimination by Local Predicates)

> Outer Join Elimination by Filter Condition
> --
>
> Key: SPARK-12594
> URL: https://issues.apache.org/jira/browse/SPARK-12594
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Xiao Li
>Priority: Critical
>
> Elimination of outer joins, if the local predicates can restrict the result 
> sets so that all null-supplying rows are eliminated. 
> - full outer -> inner if both sides have such local predicates
> - left outer -> inner if the right side has such local predicates
> - right outer -> inner if the left side has such local predicates
> - full outer -> left outer if only the left side has such local predicates
> - full outer -> right outer if only the right side has such local predicates
> If applicable, this can greatly improve the performance. 






[jira] [Updated] (SPARK-12594) Outer Join Elimination by Filter Condition

2016-01-03 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-12594:

Description: 
Elimination of outer joins, if the predicates in the filter condition can 
restrict the result sets so that all null-supplying rows are eliminated. 

- full outer -> inner if both sides have such predicates
- left outer -> inner if the right side has such predicates
- right outer -> inner if the left side has such predicates
- full outer -> left outer if only the left side has such predicates
- full outer -> right outer if only the right side has such predicates

If applicable, this can greatly improve the performance. 
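For example (an illustrative query against hypothetical tables t1 and t2, run through the spark-sql shell), a null-filtering predicate on the null-supplying side makes the outer join equivalent to an inner join:

{panel}
# t2 is the null-supplying side of the left outer join; the filter t2.v > 0
# can never be true for a null-supplied row, so the optimizer may safely
# rewrite the left outer join as an inner join.
bin/spark-sql -e "
  SELECT t1.k, t2.v
  FROM   t1 LEFT OUTER JOIN t2 ON t1.k = t2.k
  WHERE  t2.v > 0"
{panel}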

  was:
Elimination of outer joins, if the local predicates can restrict the result 
sets so that all null-supplying rows are eliminated. 

- full outer -> inner if both sides have such local predicates
- left outer -> inner if the right side has such local predicates
- right outer -> inner if the left side has such local predicates
- full outer -> left outer if only the left side has such local predicates
- full outer -> right outer if only the right side has such local predicates

If applicable, this can greatly improve the performance. 


> Outer Join Elimination by Filter Condition
> --
>
> Key: SPARK-12594
> URL: https://issues.apache.org/jira/browse/SPARK-12594
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Xiao Li
>Priority: Critical
>
> Elimination of outer joins, if the predicates in the filter condition can 
> restrict the result sets so that all null-supplying rows are eliminated. 
> - full outer -> inner if both sides have such predicates
> - left outer -> inner if the right side has such predicates
> - right outer -> inner if the left side has such predicates
> - full outer -> left outer if only the left side has such predicates
> - full outer -> right outer if only the right side has such predicates
> If applicable, this can greatly improve the performance. 






[jira] [Created] (SPARK-12613) Elimination of Outer Join by Parent Join Condition

2016-01-03 Thread Xiao Li (JIRA)
Xiao Li created SPARK-12613:
---

 Summary: Elimination of Outer Join by Parent Join Condition
 Key: SPARK-12613
 URL: https://issues.apache.org/jira/browse/SPARK-12613
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.6.0
Reporter: Xiao Li
Priority: Critical


Given an outer join that is involved in another join (called the parent join), when the join type of the parent join is inner, left-semi, left-outer, or right-outer, check whether the join condition of the parent join satisfies the following two conditions:
 1) there exist null-filtering predicates against the columns in the null-supplying side of the parent join;
 2) these columns come from the child join.

 If such join predicates exist, apply the elimination rules:
 - full outer -> inner if both sides of the child join have such predicates
 - left outer -> inner if the right side of the child join has such predicates
 - right outer -> inner if the left side of the child join has such predicates
 - full outer -> left outer if only the left side of the child join has such predicates
 - full outer -> right outer if only the right side of the child join has such predicates
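For illustration (hypothetical tables t1, t2, and t3, run through the spark-sql shell), here is a parent inner join whose join condition null-filters a column coming from the null-supplying side of the child join, so the child left outer join could be rewritten as an inner join by the rules above:

{panel}
# child.v2 comes from t2, the null-supplying side of the child left outer join.
# The parent join condition t3.k = child.v2 can never match a null v2, so the
# null-supplied rows are eliminated and the child join may become an inner join.
bin/spark-sql -e "
  SELECT *
  FROM   t3
  JOIN  (SELECT t1.k AS k1, t2.v AS v2
         FROM   t1 LEFT OUTER JOIN t2 ON t1.k = t2.k) child
    ON   t3.k = child.v2"
{panel}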






[jira] [Comment Edited] (SPARK-12607) spark-class produced null command strings for "exec"

2016-01-03 Thread SM Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15080499#comment-15080499
 ] 

SM Wang edited comment on SPARK-12607 at 1/3/16 8:42 PM:
-

Sure.  I think the problem is with the while loop's delimiter setting (-d '')  
or the launcher class' behavior in the MSYS64 environment.

Here is the section of the script in version 1.4.0 where I added some echo commands (marked with the +++ prefix).

{panel}
CMD=()
while IFS= read -d '' -r ARG; do
echo "+++ Parsed Arguments in while loop: $ARG"
  CMD+=("$ARG")
done < <("$RUNNER" -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main "$@")
echo "+++ Launcher Command" "$RUNNER" -cp "$LAUNCH_CLASSPATH" 
org.apache.spark.launcher.Main "$@"

echo "+++ First Element: ${CMD[0]}"
echo "+++ Command Array: ${CMD[@]}"

if [ "${CMD[0]}" = "usage" ]; then
  "${CMD[@]}"
else
  exec "${CMD[@]}"
fi
{panel}

The output from "run-example SparkPi" is as follows:

{panel}
+++ Launcher Command /apps/jdk1.7.0_80/bin/java -cp 
/apps/tmp/spark-1.4.0-bin-hadoop2.4/lib/spark-assembly-1.4.0-hadoop2.4.0.jar 
org.apache.spark.launcher.Main org.apache.spark.deploy.SparkSubmit --master 
local[*] --class org.apache.spark.examples.SparkPi 
/apps/tmp/spark-1.4.0-bin-hadoop2.4/lib/spark-examples-1.4.0-hadoop2.4.0.jar
+++ First Element:
+++ Command Array:
{panel}

As you can see the command array is empty.

However, when running the launcher command manually I got the following:
{panel}
C:/msys64/apps/jdk1.7.0_80\bin\java -cp 
"C:/msys64/apps/tmp/spark-1.4.0-bin-hadoop2.4\conf\;C:\msys64\apps\tmp\spark-1.4.0-bin-hadoop2.4\lib\spark-assembly-1.4.0-hadoop2.4.0.jar;C:\msys64\apps\tmp\spark-1.4.0-bin-hadoop2.4\lib\datanucleus-api-jdo-3.2.6.jar;C:\msys64\apps\tmp\spark-1.4.0-bin-hadoop2.4\lib\datanucleus-core-3.2.10.jar;C:\msys64\apps\tmp\spark-1.4.0-bin-hadoop2.4\lib\datanucleus-rdbms-3.2.9.jar"
 -Xms512m -Xmx512m "-XX:MaxPermSize=128m" org.apache.spark.deploy.SparkSubmit 
--master local[*] --class org.apache.spark.examples.SparkPi 
C:/msys64/apps/tmp/spark-1.4.0-bin-hadoop2.4/lib/spark-examples-1.4.0-hadoop2.4.0.jar
{panel}

When I changed the delimiter to *-d ' '* (a space between quotes) I was able to get a non-empty command array.
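That workaround has a caveat, though. Here is a quick sketch (made-up values, not the real command) showing that splitting on spaces only looks right as long as no single argument contains a space, and that the token after the last space is silently dropped:

{panel}
# With -d ' ' the loop splits the producer's output on spaces, so the array
# is no longer empty. But an argument that itself contains a space is broken
# into pieces, and the final token ("Main" here) is lost because no trailing
# space follows it.
CMD=()
while IFS= read -d ' ' -r ARG; do
  CMD+=("$ARG")
done < <(printf '%s' 'java -cp C:\Program Files\foo.jar Main')
printf '%s\n' "${CMD[@]}"   # java / -cp / C:\Program / Files\foo.jar
{panel}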

This is why I think the issue is either with the delimiter setting or with the launcher not producing the command string with the delimiter expected by the read command in the while loop.

Hope this helps.

Thank you for looking into this.


was (Author: swang):
Sure.  I think the problem is with the while loop's delimiter setting (-d '')  
or the launcher class' behavior in the MSYS64 environment.

Here is the section of the script in version 1.4.0 where I added some echo commands (marked with the +++ prefix).

{panel}
CMD=()
while IFS= read -d '' -r ARG; do
echo "+++ Parsed Arguments in while loop: $ARG"
  CMD+=("$ARG")
done < <("$RUNNER" -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main "$@")
echo "+++ Launcher Command" "$RUNNER" -cp "$LAUNCH_CLASSPATH" 
org.apache.spark.launcher.Main "$@"

echo "+++ First Element: ${CMD[0]}"
echo "+++ Command Array: ${CMD[@]}"

if [ "${CMD[0]}" = "usage" ]; then
  "${CMD[@]}"
else
  exec "${CMD[@]}"
fi
{panel}

The output from "run-example SparkPi" is as follows:

{panel}
+++ Launcher Command /apps/jdk1.7.0_80/bin/java -cp 
/apps/tmp/spark-1.4.0-bin-hadoop2.4/lib/spark-assembly-1.4.0-hadoop2.4.0.jar 
org.apache.spark.launcher.Main org.apache.spark.deploy.SparkSubmit --master 
local[*] --class org.apache.spark.examples.SparkPi 
/apps/tmp/spark-1.4.0-bin-hadoop2.4/lib/spark-examples-1.4.0-hadoop2.4.0.jar
+++ First Element:
+++ Command Array:
{panel}

As you can see the command array is empty.

However, when running the launcher command manually I got the following:
{panel}
C:/msys64/apps/jdk1.7.0_80\bin\java -cp 
"C:/msys64/apps/tmp/spark-1.4.0-bin-hadoop2.4\conf\;C:\msys64\apps\tmp\spark-1.4.0-bin-hadoop2.4\lib\spark-assembly-1.4.0-hadoop2.4.0.jar;C:\msys64\apps\tmp\spark-1.4.0-bin-hadoop2.4\lib\datanucleus-api-jdo-3.2.6.jar;C:\msys64\apps\tmp\spark-1.4.0-bin-hadoop2.4\lib\datanucleus-core-3.2.10.jar;C:\msys64\apps\tmp\spark-1.4.0-bin-hadoop2.4\lib\datanucleus-rdbms-3.2.9.jar"
 -Xms512m -Xmx512m "-XX:MaxPermSize=128m" org.apache.spark.deploy.SparkSubmit 
--master local[*] --class org.apache.spark.examples.SparkPi 
C:/msys64/apps/tmp/spark-1.4.0-bin-hadoop2.4/lib/spark-examples-1.4.0-hadoop2.4.0.jar
{panel}

When I changed the delimiter to *-d ' '* (a space between quotes) I was able to get a non-empty command array.

This is why I think the issue is either with the delimiter setting or with the launcher not producing the command string with the expected delimiter.

Hope this helps.

Thank you for looking into this.

> spark-class produced null command strings for "exec"
> 
>
> Key: SPARK-12607
>  

[jira] [Comment Edited] (SPARK-12607) spark-class produced null command strings for "exec"

2016-01-03 Thread SM Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15080499#comment-15080499
 ] 

SM Wang edited comment on SPARK-12607 at 1/3/16 9:51 PM:
-

Sure.  I think the problem is with the while loop's delimiter setting (-d '')  
or the launcher class' behavior in the MSYS64 environment.

Here is the section of the script in version 1.4.0 where I added some echo commands (marked with the +++ prefix).

{panel}
{quote}
CMD=()
while IFS= read -d '' -r ARG; do
echo "+++ Parsed Arguments in while loop: $ARG"
  CMD+=("$ARG")
done < <("$RUNNER" -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main "$@")
echo "+++ Launcher Command" "$RUNNER" -cp "$LAUNCH_CLASSPATH" 
org.apache.spark.launcher.Main "$@"

echo "+++ First Element: ${CMD[0]}"
echo "+++ Command Array: ${CMD[@]}"

if [ "${CMD[0]}" = "usage" ]; then
  "${CMD[@]}"
else
  exec "${CMD[@]}"
fi
{quote}
{panel}

The output from "run-example SparkPi" is as follows:

{panel}
+++ Launcher Command /apps/jdk1.7.0_80/bin/java -cp 
/apps/tmp/spark-1.4.0-bin-hadoop2.4/lib/spark-assembly-1.4.0-hadoop2.4.0.jar 
org.apache.spark.launcher.Main org.apache.spark.deploy.SparkSubmit --master 
local[*] --class org.apache.spark.examples.SparkPi 
/apps/tmp/spark-1.4.0-bin-hadoop2.4/lib/spark-examples-1.4.0-hadoop2.4.0.jar
+++ First Element:
+++ Command Array:
{panel}

As you can see the command array is empty.

However, when running the launcher command manually I got the following:
{panel}
C:/msys64/apps/jdk1.7.0_80\bin\java -cp 
"C:/msys64/apps/tmp/spark-1.4.0-bin-hadoop2.4\conf\;C:\msys64\apps\tmp\spark-1.4.0-bin-hadoop2.4\lib\spark-assembly-1.4.0-hadoop2.4.0.jar;C:\msys64\apps\tmp\spark-1.4.0-bin-hadoop2.4\lib\datanucleus-api-jdo-3.2.6.jar;C:\msys64\apps\tmp\spark-1.4.0-bin-hadoop2.4\lib\datanucleus-core-3.2.10.jar;C:\msys64\apps\tmp\spark-1.4.0-bin-hadoop2.4\lib\datanucleus-rdbms-3.2.9.jar"
 -Xms512m -Xmx512m "-XX:MaxPermSize=128m" org.apache.spark.deploy.SparkSubmit 
--master local[*] --class org.apache.spark.examples.SparkPi 
C:/msys64/apps/tmp/spark-1.4.0-bin-hadoop2.4/lib/spark-examples-1.4.0-hadoop2.4.0.jar
{panel}

When I changed the delimiter to *-d ' '* (a space between quotes) I was able to get a non-empty command array.

This is why I think the issue is either with the delimiter setting or with the launcher not producing the command string with the delimiter expected by the read command in the while loop.

Hope this helps.

Thank you for looking into this.


was (Author: swang):
Sure.  I think the problem is with the while loop's delimiter setting (-d '')  
or the launcher class' behavior in the MSYS64 environment.

Here is the section of the script in version 1.4.0 where I added some echo commands (marked with the +++ prefix).

{panel}
CMD=()
while IFS= read -d '' -r ARG; do
echo "+++ Parsed Arguments in while loop: $ARG"
  CMD+=("$ARG")
done < <("$RUNNER" -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main "$@")
echo "+++ Launcher Command" "$RUNNER" -cp "$LAUNCH_CLASSPATH" 
org.apache.spark.launcher.Main "$@"

echo "+++ First Element: ${CMD[0]}"
echo "+++ Command Array: ${CMD[@]}"

if [ "${CMD[0]}" = "usage" ]; then
  "${CMD[@]}"
else
  exec "${CMD[@]}"
fi
{panel}

The output from "run-example SparkPi" is as follows:

{panel}
+++ Launcher Command /apps/jdk1.7.0_80/bin/java -cp 
/apps/tmp/spark-1.4.0-bin-hadoop2.4/lib/spark-assembly-1.4.0-hadoop2.4.0.jar 
org.apache.spark.launcher.Main org.apache.spark.deploy.SparkSubmit --master 
local[*] --class org.apache.spark.examples.SparkPi 
/apps/tmp/spark-1.4.0-bin-hadoop2.4/lib/spark-examples-1.4.0-hadoop2.4.0.jar
+++ First Element:
+++ Command Array:
{panel}

As you can see the command array is empty.

However, when running the launcher command manually I got the following:
{panel}
C:/msys64/apps/jdk1.7.0_80\bin\java -cp 
"C:/msys64/apps/tmp/spark-1.4.0-bin-hadoop2.4\conf\;C:\msys64\apps\tmp\spark-1.4.0-bin-hadoop2.4\lib\spark-assembly-1.4.0-hadoop2.4.0.jar;C:\msys64\apps\tmp\spark-1.4.0-bin-hadoop2.4\lib\datanucleus-api-jdo-3.2.6.jar;C:\msys64\apps\tmp\spark-1.4.0-bin-hadoop2.4\lib\datanucleus-core-3.2.10.jar;C:\msys64\apps\tmp\spark-1.4.0-bin-hadoop2.4\lib\datanucleus-rdbms-3.2.9.jar"
 -Xms512m -Xmx512m "-XX:MaxPermSize=128m" org.apache.spark.deploy.SparkSubmit 
--master local[*] --class org.apache.spark.examples.SparkPi 
C:/msys64/apps/tmp/spark-1.4.0-bin-hadoop2.4/lib/spark-examples-1.4.0-hadoop2.4.0.jar
{panel}

When I changed the delimiter to *-d ' '* (a space between quotes) I was able to get a non-empty command array.

This is why I think the issue is either with the delimiter setting or with the launcher not producing the command string with the delimiter expected by the read command in the while loop.

Hope this helps.

Thank you for looking into this.

> spark-class produced null command strings for "exec"
> 

[jira] [Assigned] (SPARK-12611) test_infer_schema_to_local depended on old handling of missing value in row

2016-01-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12611:


Assignee: Apache Spark

> test_infer_schema_to_local depended on old handling of missing value in row
> ---
>
> Key: SPARK-12611
> URL: https://issues.apache.org/jira/browse/SPARK-12611
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Reporter: holdenk
>Assignee: Apache Spark
>Priority: Trivial
>
> test_infer_schema_to_local depended on the old handling of missing values in 
> row objects.






[jira] [Commented] (SPARK-12611) test_infer_schema_to_local depended on old handling of missing value in row

2016-01-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15080544#comment-15080544
 ] 

Apache Spark commented on SPARK-12611:
--

User 'holdenk' has created a pull request for this issue:
https://github.com/apache/spark/pull/10564

> test_infer_schema_to_local depended on old handling of missing value in row
> ---
>
> Key: SPARK-12611
> URL: https://issues.apache.org/jira/browse/SPARK-12611
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Reporter: holdenk
>Priority: Trivial
>
> test_infer_schema_to_local depended on the old handling of missing values in 
> row objects.






[jira] [Assigned] (SPARK-12611) test_infer_schema_to_local depended on old handling of missing value in row

2016-01-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12611:


Assignee: (was: Apache Spark)

> test_infer_schema_to_local depended on old handling of missing value in row
> ---
>
> Key: SPARK-12611
> URL: https://issues.apache.org/jira/browse/SPARK-12611
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Reporter: holdenk
>Priority: Trivial
>
> test_infer_schema_to_local depended on the old handling of missing values in 
> row objects.






[jira] [Commented] (SPARK-12347) Write script to run all MLlib examples for testing

2016-01-03 Thread Jun Zheng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15080564#comment-15080564
 ] 

Jun Zheng commented on SPARK-12347:
---

1. How can we programmatically detect whether a test requires input? I can see that a ".required()" keyword in OptionParser indicates the test needs input, but not all tests that need input have this keyword.

2. How can input file names be set without hard-coding them?

> Write script to run all MLlib examples for testing
> --
>
> Key: SPARK-12347
> URL: https://issues.apache.org/jira/browse/SPARK-12347
> Project: Spark
>  Issue Type: Test
>  Components: ML, MLlib, PySpark, SparkR, Tests
>Reporter: Joseph K. Bradley
>
> It would facilitate testing to have a script which runs all MLlib examples 
> for all languages.
> Design sketch to ensure all examples are run:
> * Generate a list of examples to run programmatically (not from a fixed list).
> * Use a list of special examples to handle examples which require command 
> line arguments.
> * Make sure data, etc. used are small to keep the tests quick.
> This could be broken into subtasks for each language, though it would be nice 
> to provide a single script.
> Not sure where the script should live; perhaps in {{bin/}}?






[jira] [Commented] (SPARK-12612) Add missing Hadoop profiles to dev/run-tests-*.py scripts

2016-01-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15080588#comment-15080588
 ] 

Apache Spark commented on SPARK-12612:
--

User 'JoshRosen' has created a pull request for this issue:
https://github.com/apache/spark/pull/10565

> Add missing Hadoop profiles to dev/run-tests-*.py scripts
> -
>
> Key: SPARK-12612
> URL: https://issues.apache.org/jira/browse/SPARK-12612
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Project Infra
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>
> There are a couple of places in the dev/run-tests-*.py scripts which deal 
> with Hadoop profiles, but the set of profiles that they handle does not 
> include all Hadoop profiles defined in our POM.






[jira] [Commented] (SPARK-9835) Iteratively reweighted least squares solver for GLMs

2016-01-03 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15080618#comment-15080618
 ] 

Yanbo Liang commented on SPARK-9835:


[~mengxr] Are you working on this issue? If you are not working on it, I can 
send a PR in a few days.

> Iteratively reweighted least squares solver for GLMs
> 
>
> Key: SPARK-9835
> URL: https://issues.apache.org/jira/browse/SPARK-9835
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, MLlib
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Critical
>
> After SPARK-9834, we can implement iteratively reweighted least squares 
> (IRLS) solver for GLMs with other families and link functions. It could 
> provide R-like summary statistics after training, but the number of features 
> cannot be very large, e.g. more than 4096.






[jira] [Commented] (SPARK-3785) Support off-loading computations to a GPU

2016-01-03 Thread Kazuaki Ishizaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15080653#comment-15080653
 ] 

Kazuaki Ishizaki commented on SPARK-3785:
-

Let us reopen this thread :)

We are working on effectively and easily exploiting GPUs in Spark at [http://github.com/kiszk/spark-gpu]. Our project page is [http://kiszk.github.io/spark-gpu/]. A design document is [here|https://docs.google.com/document/d/1bo1hbQ7ikdUA9LYtYh6kU_TwjFK2ebkHsH66QlmbYP8/edit?usp=sharing].

Our ideas for exploiting GPUs are
# adding a new format for a partition in an RDD, which is a column-based 
structure in an array format, in addition to the current Iterator\[T\] format 
with Seq\[T\]
# generating parallelized GPU native code to access data in the new format from 
a Spark application program by using an optimizer and code generator (this is 
similar to [Project 
Tungsten|https://databricks.com/blog/2015/04/28/project-tungsten-bringing-spark-closer-to-bare-metal.html])
 and pre-compiled library

The motivation for idea 1 is to reduce the overhead of serializing/deserializing partition data when copying between CPU and GPU. The motivation for idea 2 is to spare application programmers from writing hardware-dependent code. At first, we are working on idea 1 (for idea 2, we need to write [CUDA|https://en.wikipedia.org/wiki/CUDA] code for now).

This prototype achieved a [3.15x performance improvement|https://github.com/kiszk/spark-gpu/wiki/Benchmark] for logistic regression ([SparkGPULR|https://github.com/kiszk/spark-gpu/blob/dev/examples/src/main/scala/org/apache/spark/examples/SparkGPULR.scala]) in the examples on a 16-thread IvyBridge box with an NVIDIA K40 GPU card, compared with the same box without a GPU card.

You can download the pre-built binary for x86_64 and ppc64le from [here|https://github.com/kiszk/spark-gpu/wiki/Downloads]. You can also run it on Amazon EC2 by following [the procedure|https://github.com/kiszk/spark-gpu/wiki/How-to-run-%28local-or-AWS-EC2%29].


> Support off-loading computations to a GPU
> -
>
> Key: SPARK-3785
> URL: https://issues.apache.org/jira/browse/SPARK-3785
> Project: Spark
>  Issue Type: Brainstorming
>  Components: MLlib
>Reporter: Thomas Darimont
>Priority: Minor
>
> Are there any plans to adding support for off-loading computations to the 
> GPU, e.g. via an open-cl binding? 
> http://www.jocl.org/
> https://code.google.com/p/javacl/
> http://lwjgl.org/wiki/index.php?title=OpenCL_in_LWJGL






[jira] [Reopened] (SPARK-3785) Support off-loading computations to a GPU

2016-01-03 Thread Kazuaki Ishizaki (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kazuaki Ishizaki reopened SPARK-3785:
-

Added a comment about our prototype for offloading computation to the GPU

> Support off-loading computations to a GPU
> -
>
> Key: SPARK-3785
> URL: https://issues.apache.org/jira/browse/SPARK-3785
> Project: Spark
>  Issue Type: Brainstorming
>  Components: MLlib
>Reporter: Thomas Darimont
>Priority: Minor
>
> Are there any plans to adding support for off-loading computations to the 
> GPU, e.g. via an open-cl binding? 
> http://www.jocl.org/
> https://code.google.com/p/javacl/
> http://lwjgl.org/wiki/index.php?title=OpenCL_in_LWJGL






[jira] [Created] (SPARK-12611) test_infer_schema_to_local depended on old handling of missing value in row

2016-01-03 Thread holdenk (JIRA)
holdenk created SPARK-12611:
---

 Summary: test_infer_schema_to_local depended on old handling of 
missing value in row
 Key: SPARK-12611
 URL: https://issues.apache.org/jira/browse/SPARK-12611
 Project: Spark
  Issue Type: Bug
  Components: SQL, Tests
Reporter: holdenk
Priority: Trivial


test_infer_schema_to_local depended on the old handling of missing values in 
row objects.






[jira] [Issue Comment Deleted] (SPARK-3785) Support off-loading computations to a GPU

2016-01-03 Thread Kazuaki Ishizaki (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kazuaki Ishizaki updated SPARK-3785:

Comment: was deleted

(was: Add a comment of our prototype to offload to GPU)

> Support off-loading computations to a GPU
> -
>
> Key: SPARK-3785
> URL: https://issues.apache.org/jira/browse/SPARK-3785
> Project: Spark
>  Issue Type: Brainstorming
>  Components: MLlib
>Reporter: Thomas Darimont
>Priority: Minor
>
> Are there any plans to adding support for off-loading computations to the 
> GPU, e.g. via an open-cl binding? 
> http://www.jocl.org/
> https://code.google.com/p/javacl/
> http://lwjgl.org/wiki/index.php?title=OpenCL_in_LWJGL






[jira] [Assigned] (SPARK-12613) Elimination of Outer Join by Parent Join Condition

2016-01-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12613:


Assignee: Apache Spark

> Elimination of Outer Join by Parent Join Condition
> --
>
> Key: SPARK-12613
> URL: https://issues.apache.org/jira/browse/SPARK-12613
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Xiao Li
>Assignee: Apache Spark
>Priority: Critical
>
> Given an outer join is involved in another join (called parent join), when 
> the join type of the parent join is inner, left-semi, left-outer and 
> right-outer, checking if the join condition of the parent join satisfies the 
> following two conditions:
>  1) there exist null filtering predicates against the columns in the 
> null-supplying side of parent join.
>  2) these columns are from the child join.
>  If having such join predicates, execute the elimination rules:
>  - full outer -> inner if both sides of the child join have such predicates
>  - left outer -> inner if the right side of the child join has such predicates
>  - right outer -> inner if the left side of the child join has such predicates
>  - full outer -> left outer if only the left side of the child join has such 
> predicates
>  - full outer -> right outer if only the right side of the child join has 
> such predicates






[jira] [Comment Edited] (SPARK-12607) spark-class produced null command strings for "exec"

2016-01-03 Thread SM Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15080499#comment-15080499
 ] 

SM Wang edited comment on SPARK-12607 at 1/3/16 9:55 PM:
-

Sure.  I think the problem is with the while loop's delimiter setting (-d '')  
or the launcher class' behavior in the MSYS64 environment.

Here is the section of the script in version 1.4.0 where I added some echo commands (marked with the +++ prefix).

{quote}
CMD=()
while IFS= read -d '' -r ARG; do
echo "+++ Parsed Arguments in while loop: $ARG"
  CMD+=("$ARG")
done < <("$RUNNER" -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main "$@")
echo "+++ Launcher Command" "$RUNNER" -cp "$LAUNCH_CLASSPATH" 
org.apache.spark.launcher.Main "$@"

echo "+++ First Element: ${CMD[0]}"
echo "+++ Command Array: ${CMD[@]}"

if [ "${CMD[0]}" = "usage" ]; then
  "${CMD[@]}"
else
  exec "${CMD[@]}"
fi
{quote}

The output from "run-example SparkPi" is as follows:

{panel}
+++ Launcher Command /apps/jdk1.7.0_80/bin/java -cp 
/apps/tmp/spark-1.4.0-bin-hadoop2.4/lib/spark-assembly-1.4.0-hadoop2.4.0.jar 
org.apache.spark.launcher.Main org.apache.spark.deploy.SparkSubmit --master 
local[*] --class org.apache.spark.examples.SparkPi 
/apps/tmp/spark-1.4.0-bin-hadoop2.4/lib/spark-examples-1.4.0-hadoop2.4.0.jar
+++ First Element:
+++ Command Array:
{panel}

As you can see the command array is empty.

However, when running the launcher command manually I got the following:
{panel}
C:/msys64/apps/jdk1.7.0_80\bin\java -cp 
"C:/msys64/apps/tmp/spark-1.4.0-bin-hadoop2.4\conf\;C:\msys64\apps\tmp\spark-1.4.0-bin-hadoop2.4\lib\spark-assembly-1.4.0-hadoop2.4.0.jar;C:\msys64\apps\tmp\spark-1.4.0-bin-hadoop2.4\lib\datanucleus-api-jdo-3.2.6.jar;C:\msys64\apps\tmp\spark-1.4.0-bin-hadoop2.4\lib\datanucleus-core-3.2.10.jar;C:\msys64\apps\tmp\spark-1.4.0-bin-hadoop2.4\lib\datanucleus-rdbms-3.2.9.jar"
 -Xms512m -Xmx512m "-XX:MaxPermSize=128m" org.apache.spark.deploy.SparkSubmit 
--master local[*] --class org.apache.spark.examples.SparkPi 
C:/msys64/apps/tmp/spark-1.4.0-bin-hadoop2.4/lib/spark-examples-1.4.0-hadoop2.4.0.jar
{panel}

When I changed the delimiter to *-d ' '* (a space between quotes) I was able to get a non-empty command array.

This is why I think the issue is either with the delimiter setting or with the launcher not producing the command string with the delimiter expected by the read command in the while loop.
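One generic way to narrow down which side is at fault (using the same $RUNNER, $LAUNCH_CLASSPATH, and arguments the script already builds; od is a standard coreutils tool) is to dump the raw bytes the launcher writes and check whether any NUL separators survive under MSYS64:

{panel}
# Each argument emitted by the launcher should be followed by a literal \0
# in this dump. If no \0 appears, the launcher side (or the MSYS64 pipe) is
# dropping the delimiter; if \0 is there, the read loop itself is at fault.
"$RUNNER" -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main "$@" | od -c | head -n 20
{panel}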

Hope this helps.

Thank you for looking into this.


was (Author: swang):
Sure.  I think the problem is with the while loop's delimiter setting (-d '')  
or the launcher class' behavior in the MSYS64 environment.

Here is the section of the script in version 1.4.0 where I added some echo commands (marked with the +++ prefix).

{panel}
{quote}
CMD=()
while IFS= read -d '' -r ARG; do
echo "+++ Parsed Arguments in while loop: $ARG"
  CMD+=("$ARG")
done < <("$RUNNER" -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main "$@")
echo "+++ Launcher Command" "$RUNNER" -cp "$LAUNCH_CLASSPATH" 
org.apache.spark.launcher.Main "$@"

echo "+++ First Element: ${CMD[0]}"
echo "+++ Command Array: ${CMD[@]}"

if [ "${CMD[0]}" = "usage" ]; then
  "${CMD[@]}"
else
  exec "${CMD[@]}"
fi
{quote}
{panel}

The output from "run-example SparkPi" is as follows:

{panel}
+++ Launcher Command /apps/jdk1.7.0_80/bin/java -cp 
/apps/tmp/spark-1.4.0-bin-hadoop2.4/lib/spark-assembly-1.4.0-hadoop2.4.0.jar 
org.apache.spark.launcher.Main org.apache.spark.deploy.SparkSubmit --master 
local[*] --class org.apache.spark.examples.SparkPi 
/apps/tmp/spark-1.4.0-bin-hadoop2.4/lib/spark-examples-1.4.0-hadoop2.4.0.jar
+++ First Element:
+++ Command Array:
{panel}

As you can see the command array is empty.

However, when running the launcher command manually I got the following:
{panel}
C:/msys64/apps/jdk1.7.0_80\bin\java -cp 
"C:/msys64/apps/tmp/spark-1.4.0-bin-hadoop2.4\conf\;C:\msys64\apps\tmp\spark-1.4.0-bin-hadoop2.4\lib\spark-assembly-1.4.0-hadoop2.4.0.jar;C:\msys64\apps\tmp\spark-1.4.0-bin-hadoop2.4\lib\datanucleus-api-jdo-3.2.6.jar;C:\msys64\apps\tmp\spark-1.4.0-bin-hadoop2.4\lib\datanucleus-core-3.2.10.jar;C:\msys64\apps\tmp\spark-1.4.0-bin-hadoop2.4\lib\datanucleus-rdbms-3.2.9.jar"
 -Xms512m -Xmx512m "-XX:MaxPermSize=128m" org.apache.spark.deploy.SparkSubmit 
--master local[*] --class org.apache.spark.examples.SparkPi 
C:/msys64/apps/tmp/spark-1.4.0-bin-hadoop2.4/lib/spark-examples-1.4.0-hadoop2.4.0.jar
{panel}

When I changed the delimiter to *-d ' '* (a space between quotes) I was able to get a non-empty command array.

This is why I think the issue is either with the delimiter setting or with the launcher not producing the command string with the delimiter expected by the read command in the while loop.

Hope this helps.

Thank you for looking into this.

> spark-class produced null command strings for "exec"
> 

[jira] [Comment Edited] (SPARK-12607) spark-class produced null command strings for "exec"

2016-01-03 Thread SM Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15080499#comment-15080499
 ] 

SM Wang edited comment on SPARK-12607 at 1/3/16 8:31 PM:
-

Sure.  I think the problem is with the while loop's delimiter setting (-d '')  
or the launcher class' behavior in the MSYS64 environment.

Here is the section of the script in version 1.4.0 where I added some echo commands (marked with the +++ prefix).

{quote}
CMD=()
while IFS= read -d '' -r ARG; do
echo "+++ Parsed Arguments in while loop: $ARG"
  CMD+=("$ARG")
done < <("$RUNNER" -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main "$@")
echo "+++ Launcher Command" "$RUNNER" -cp "$LAUNCH_CLASSPATH" 
org.apache.spark.launcher.Main "$@"

echo "+++ First Element: ${CMD[0]}"
echo "+++ Command Array: ${CMD[@]}"

if [ "${CMD[0]}" = "usage" ]; then
  "${CMD[@]}"
else
  exec "${CMD[@]}"
fi
{quote}

The output from "run-example SparkPi" is as follows:

{quote}
+++ Launcher Command /apps/jdk1.7.0_80/bin/java -cp 
/apps/tmp/spark-1.4.0-bin-hadoop2.4/lib/spark-assembly-1.4.0-hadoop2.4.0.jar 
org.apache.spark.launcher.Main org.apache.spark.deploy.SparkSubmit --master 
local[*] --class org.apache.spark.examples.SparkPi 
/apps/tmp/spark-1.4.0-bin-hadoop2.4/lib/spark-examples-1.4.0-hadoop2.4.0.jar
+++ First Element:
+++ Command Array:
{quote}

As you can see the command array is empty.

However, when running the launcher command manually I got the following:
{quote}
C:/msys64/apps/jdk1.7.0_80\bin\java -cp 
"C:/msys64/apps/tmp/spark-1.4.0-bin-hadoop2.4\conf\;C:\msys64\apps\tmp\spark-1.4.0-bin-hadoop2.4\lib\spark-assembly-1.4.0-hadoop2.4.0.jar;C:\msys64\apps\tmp\spark-1.4.0-bin-hadoop2.4\lib\datanucleus-api-jdo-3.2.6.jar;C:\msys64\apps\tmp\spark-1.4.0-bin-hadoop2.4\lib\datanucleus-core-3.2.10.jar;C:\msys64\apps\tmp\spark-1.4.0-bin-hadoop2.4\lib\datanucleus-rdbms-3.2.9.jar"
 -Xms512m -Xmx512m "-XX:MaxPermSize=128m" org.apache.spark.deploy.SparkSubmit 
--master local[*] --class org.apache.spark.examples.SparkPi 
C:/msys64/apps/tmp/spark-1.4.0-bin-hadoop2.4/lib/spark-examples-1.4.0-hadoop2.4.0.jar
{quote}

When I changed the delimiter to "-d ' '" I was able to get a non-empty command array.

This is why I think the issue is either with the delimiter setting or with the launcher not producing the command string with the expected delimiter.

Hope this helps.

Thank you for looking into this.


was (Author: swang):
Sure.  I think the problem is with the while loop's delimiter setting (-d '')  
or the launcher class' behavior in the MSYS64 environment.

Here is the section of the script in version 1.4.0 where I added some echo commands (marked with the +++ prefix).

{quote}
CMD=()
while IFS= read -d '' -r ARG; do
echo "+++ Parsed Arguments in while loop: $ARG"
  CMD+=("$ARG")
done < <("$RUNNER" -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main "$@")
echo "+++ Launcher Command" "$RUNNER" -cp "$LAUNCH_CLASSPATH" 
org.apache.spark.launcher.Main "$@"

echo "+++ First Element: ${CMD[0]}"
echo "+++ Command Array: ${CMD[@]}"

if [ "${CMD[0]}" = "usage" ]; then
  "${CMD[@]}"
else
  exec "${CMD[@]}"
fi
{quote}

The output from "run-example SparkPi" is as follows:

{{+++ Launcher Command /apps/jdk1.7.0_80/bin/java -cp 
/apps/tmp/spark-1.4.0-bin-hadoop2.4/lib/spark-assembly-1.4.0-hadoop2.4.0.jar 
org.apache.spark.launcher.Main org.apache.spark.deploy.SparkSubmit --master 
local[*] --class org.apache.spark.examples.SparkPi 
/apps/tmp/spark-1.4.0-bin-hadoop2.4/lib/spark-examples-1.4.0-hadoop2.4.0.jar
+++ First Element:
+++ Command Array:}}

As you can see the command array is empty.

However, when running the launcher command manually I got the following:

C:/msys64/apps/jdk1.7.0_80\bin\java -cp 
"C:/msys64/apps/tmp/spark-1.4.0-bin-hadoop2.4\conf\;C:\msys64\apps\tmp\spark-1.4.0-bin-hadoop2.4\lib\spark-assembly-1.4.0-hadoop2.4.0.jar;C:\msys64\apps\tmp\spark-1.4.0-bin-hadoop2.4\lib\datanucleus-api-jdo-3.2.6.jar;C:\msys64\apps\tmp\spark-1.4.0-bin-hadoop2.4\lib\datanucleus-core-3.2.10.jar;C:\msys64\apps\tmp\spark-1.4.0-bin-hadoop2.4\lib\datanucleus-rdbms-3.2.9.jar"
 -Xms512m -Xmx512m "-XX:MaxPermSize=128m" org.apache.spark.deploy.SparkSubmit 
--master local[*] --class org.apache.spark.examples.SparkPi 
C:/msys64/apps/tmp/spark-1.4.0-bin-hadoop2.4/lib/spark-examples-1.4.0-hadoop2.4.0.jar

When I changed the delimiter to "-d ' '" I was able to get a non-empty command array.

This is why I think the issue is either with the delimiter setting or with the launcher not producing the command string with the expected delimiter.

Hope this helps.

Thank you for looking into this.

> spark-class produced null command strings for "exec"
> 
>
> Key: SPARK-12607
> URL: https://issues.apache.org/jira/browse/SPARK-12607
> Project: Spark
>  Issue Type: Bug

[jira] [Commented] (SPARK-12613) Elimination of Outer Join by Parent Join Condition

2016-01-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15080665#comment-15080665
 ] 

Apache Spark commented on SPARK-12613:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/10566

> Elimination of Outer Join by Parent Join Condition
> --
>
> Key: SPARK-12613
> URL: https://issues.apache.org/jira/browse/SPARK-12613
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Xiao Li
>Priority: Critical
>
> Given an outer join is involved in another join (called parent join), when 
> the join type of the parent join is inner, left-semi, left-outer and 
> right-outer, checking if the join condition of the parent join satisfies the 
> following two conditions:
>  1) there exist null filtering predicates against the columns in the 
> null-supplying side of parent join.
>  2) these columns are from the child join.
>  If having such join predicates, execute the elimination rules:
>  - full outer -> inner if both sides of the child join have such predicates
>  - left outer -> inner if the right side of the child join has such predicates
>  - right outer -> inner if the left side of the child join has such predicates
>  - full outer -> left outer if only the left side of the child join has such 
> predicates
>  - full outer -> right outer if only the right side of the child join has 
> such predicates






[jira] [Assigned] (SPARK-12613) Elimination of Outer Join by Parent Join Condition

2016-01-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12613:


Assignee: (was: Apache Spark)

> Elimination of Outer Join by Parent Join Condition
> --
>
> Key: SPARK-12613
> URL: https://issues.apache.org/jira/browse/SPARK-12613
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Xiao Li
>Priority: Critical
>
> Given an outer join is involved in another join (called parent join), when 
> the join type of the parent join is inner, left-semi, left-outer and 
> right-outer, checking if the join condition of the parent join satisfies the 
> following two conditions:
>  1) there exist null filtering predicates against the columns in the 
> null-supplying side of parent join.
>  2) these columns are from the child join.
>  If having such join predicates, execute the elimination rules:
>  - full outer -> inner if both sides of the child join have such predicates
>  - left outer -> inner if the right side of the child join has such predicates
>  - right outer -> inner if the left side of the child join has such predicates
>  - full outer -> left outer if only the left side of the child join has such 
> predicates
>  - full outer -> right outer if only the right side of the child join has 
> such predicates






[jira] [Commented] (SPARK-12594) Outer Join Elimination by Filter Condition

2016-01-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15080664#comment-15080664
 ] 

Apache Spark commented on SPARK-12594:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/10567

> Outer Join Elimination by Filter Condition
> --
>
> Key: SPARK-12594
> URL: https://issues.apache.org/jira/browse/SPARK-12594
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Xiao Li
>Priority: Critical
>
> Elimination of outer joins, if the predicates in the filter condition can 
> restrict the result sets so that all null-supplying rows are eliminated. 
> - full outer -> inner if both sides have such predicates
> - left outer -> inner if the right side has such predicates
> - right outer -> inner if the left side has such predicates
> - full outer -> left outer if only the left side has such predicates
> - full outer -> right outer if only the right side has such predicates
> If applicable, this can greatly improve the performance. 






[jira] [Comment Edited] (SPARK-12607) spark-class produced null command strings for "exec"

2016-01-03 Thread SM Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15080499#comment-15080499
 ] 

SM Wang edited comment on SPARK-12607 at 1/3/16 8:37 PM:
-

Sure.  I think the problem is with the while loop's delimiter setting (-d '')  
or the launcher class' behavior in the MSYS64 environment.

Here is the section of the script in version 1.4.0 where I added some echo commands (marked with the +++ prefix).

{panel}
CMD=()
while IFS= read -d '' -r ARG; do
echo "+++ Parsed Arguments in while loop: $ARG"
  CMD+=("$ARG")
done < <("$RUNNER" -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main "$@")
echo "+++ Launcher Command" "$RUNNER" -cp "$LAUNCH_CLASSPATH" 
org.apache.spark.launcher.Main "$@"

echo "+++ First Element: ${CMD[0]}"
echo "+++ Command Array: ${CMD[@]}"

if [ "${CMD[0]}" = "usage" ]; then
  "${CMD[@]}"
else
  exec "${CMD[@]}"
fi
{panel}

The output from "run-example SparkPi" is as follows:

{panel}
+++ Launcher Command /apps/jdk1.7.0_80/bin/java -cp 
/apps/tmp/spark-1.4.0-bin-hadoop2.4/lib/spark-assembly-1.4.0-hadoop2.4.0.jar 
org.apache.spark.launcher.Main org.apache.spark.deploy.SparkSubmit --master 
local[*] --class org.apache.spark.examples.SparkPi 
/apps/tmp/spark-1.4.0-bin-hadoop2.4/lib/spark-examples-1.4.0-hadoop2.4.0.jar
+++ First Element:
+++ Command Array:
{panel}

As you can see the command array is empty.

However, when running the launcher command manually I got the following:
{panel}
C:/msys64/apps/jdk1.7.0_80\bin\java -cp 
"C:/msys64/apps/tmp/spark-1.4.0-bin-hadoop2.4\conf\;C:\msys64\apps\tmp\spark-1.4.0-bin-hadoop2.4\lib\spark-assembly-1.4.0-hadoop2.4.0.jar;C:\msys64\apps\tmp\spark-1.4.0-bin-hadoop2.4\lib\datanucleus-api-jdo-3.2.6.jar;C:\msys64\apps\tmp\spark-1.4.0-bin-hadoop2.4\lib\datanucleus-core-3.2.10.jar;C:\msys64\apps\tmp\spark-1.4.0-bin-hadoop2.4\lib\datanucleus-rdbms-3.2.9.jar"
 -Xms512m -Xmx512m "-XX:MaxPermSize=128m" org.apache.spark.deploy.SparkSubmit 
--master local[*] --class org.apache.spark.examples.SparkPi 
C:/msys64/apps/tmp/spark-1.4.0-bin-hadoop2.4/lib/spark-examples-1.4.0-hadoop2.4.0.jar
{panel}

When I changed the delimiter to "-d ' '" I was able to get a non-empty command array.

This is why I think the issue is either with the delimiter setting or with the launcher not producing the command string with the expected delimiter.

Hope this helps.

Thank you for looking into this.


was (Author: swang):
Sure.  I think the problem is with the while loop's delimiter setting (-d '')  
or the launcher class' behavior in the MSYS64 environment.

Here is the section of the script in version 1.4.0 where I added some echo commands (marked with the +++ prefix).

{quote}
CMD=()
while IFS= read -d '' -r ARG; do
echo "+++ Parsed Arguments in while loop: $ARG"
  CMD+=("$ARG")
done < <("$RUNNER" -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main "$@")
echo "+++ Launcher Command" "$RUNNER" -cp "$LAUNCH_CLASSPATH" 
org.apache.spark.launcher.Main "$@"

echo "+++ First Element: ${CMD[0]}"
echo "+++ Command Array: ${CMD[@]}"

if [ "${CMD[0]}" = "usage" ]; then
  "${CMD[@]}"
else
  exec "${CMD[@]}"
fi
{quote}

The output from "run-example SparkPi" is as follows:

{quote}
+++ Launcher Command /apps/jdk1.7.0_80/bin/java -cp 
/apps/tmp/spark-1.4.0-bin-hadoop2.4/lib/spark-assembly-1.4.0-hadoop2.4.0.jar 
org.apache.spark.launcher.Main org.apache.spark.deploy.SparkSubmit --master 
local[*] --class org.apache.spark.examples.SparkPi 
/apps/tmp/spark-1.4.0-bin-hadoop2.4/lib/spark-examples-1.4.0-hadoop2.4.0.jar
+++ First Element:
+++ Command Array:
{quote}

As you can see the command array is empty.

However, when running the launcher command manually I got the following:
{quote}
C:/msys64/apps/jdk1.7.0_80\bin\java -cp 
"C:/msys64/apps/tmp/spark-1.4.0-bin-hadoop2.4\conf\;C:\msys64\apps\tmp\spark-1.4.0-bin-hadoop2.4\lib\spark-assembly-1.4.0-hadoop2.4.0.jar;C:\msys64\apps\tmp\spark-1.4.0-bin-hadoop2.4\lib\datanucleus-api-jdo-3.2.6.jar;C:\msys64\apps\tmp\spark-1.4.0-bin-hadoop2.4\lib\datanucleus-core-3.2.10.jar;C:\msys64\apps\tmp\spark-1.4.0-bin-hadoop2.4\lib\datanucleus-rdbms-3.2.9.jar"
 -Xms512m -Xmx512m "-XX:MaxPermSize=128m" org.apache.spark.deploy.SparkSubmit 
--master local[*] --class org.apache.spark.examples.SparkPi 
C:/msys64/apps/tmp/spark-1.4.0-bin-hadoop2.4/lib/spark-examples-1.4.0-hadoop2.4.0.jar
{quote}

When I change the delimiter to "-d ' '" I was able to get an non-empty command 
array.

This is why I think the issue is either with the delimiter setting or the 
launcher that does not produce the command staring with the expected delimiter.

Hope this helps.

Thank you for looking into this.

> spark-class produced null command strings for "exec"
> 
>
> Key: SPARK-12607
> URL: https://issues.apache.org/jira/browse/SPARK-12607
> Project: Spark

[jira] [Comment Edited] (SPARK-11714) Make Spark on Mesos honor port restrictions

2016-01-03 Thread Stavros Kontopoulos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15076309#comment-15076309
 ] 

Stavros Kontopoulos edited comment on SPARK-11714 at 1/4/16 12:04 AM:
--

Before moving on with a PR, I was thinking of the following concept:

Check whether spark.executor.port is set; if it is, check that the port is within 
the offered port range, otherwise refuse the offer.
If spark.executor.port is empty, that means a random port (default 0), which is an 
OS facility rather than a Spark convention. In this case, pick a random port 
within the offered range.
MesosSchedulerBackend passes the offered resources to ExecutorInfo, so 
MesosExecutorBackend could use that information to initialize its port to the 
specified value.
I am working on this.
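As a rough sketch of that check in shell form (illustration only; the real change would live in MesosSchedulerBackend, and the port numbers below are made up):

{code}
#!/usr/bin/env bash
# Usage: ./port_check.sh [requested_port]
# Offered port range from a Mesos offer (made-up example values).
offered_lo=31000
offered_hi=32000

# Stand-in for the value of spark.executor.port; empty means "random port".
executor_port="${1:-}"

if [ -n "$executor_port" ]; then
  if [ "$executor_port" -ge "$offered_lo" ] && [ "$executor_port" -le "$offered_hi" ]; then
    echo "accept offer: port $executor_port is inside $offered_lo-$offered_hi"
  else
    echo "refuse offer: port $executor_port is outside $offered_lo-$offered_hi"
  fi
else
  # No port requested: pick a random one inside the offered range instead of
  # letting the OS choose an arbitrary one.
  chosen=$(( offered_lo + RANDOM % (offered_hi - offered_lo + 1) ))
  echo "accept offer: no port requested, picked $chosen from the offered range"
fi
{code}

Running it as ./port_check.sh 31500 takes the accept path; with no argument it picks a random in-range port.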



was (Author: skonto):
Before moving on with a PR, I was thinking of the following concept:

Check whether spark.executor.port is set; if it is, check that the port is within 
the offered port range, otherwise refuse the offer.
If spark.executor.port is empty, that means a random port (default 0), which is an 
OS facility rather than a Spark convention. In this case, pick a random port 
within the offered range.
MesosSchedulerBackend passes the offered resources to ExecutorInfo, so 
MesosExecutorBackend could use that information to initialize its port to the 
specified value.



> Make Spark on Mesos honor port restrictions
> ---
>
> Key: SPARK-11714
> URL: https://issues.apache.org/jira/browse/SPARK-11714
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Reporter: Charles Allen
>
> Currently the MesosSchedulerBackend does not make any effort to honor "ports" 
> as a resource offer in Mesos. This ask is to have the ports which the 
> executor binds to honor the limits of the "ports" resource of an offer.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-11714) Make Spark on Mesos honor port restrictions

2016-01-03 Thread Stavros Kontopoulos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15076309#comment-15076309
 ] 

Stavros Kontopoulos edited comment on SPARK-11714 at 1/4/16 12:03 AM:
--

Before moving on with a PR, I was thinking of the following concept:

Check whether spark.executor.port is set; if it is, check that the port is within 
the offered port range, otherwise refuse the offer.
If spark.executor.port is empty, that means a random port (default 0), which is an 
OS facility rather than a Spark convention. In this case, pick a random port 
within the offered range.
MesosSchedulerBackend passes the offered resources to ExecutorInfo, so 
MesosExecutorBackend could use that information to initialize its port to the 
specified value.




was (Author: skonto):
Before moving on with a PR, I was thinking of the following concept:

Check whether spark.executor.port is set; if it is, check that the port is within 
the offered port range, otherwise refuse the offer.
If spark.executor.port is empty, that means a random port (default 0), which is an 
OS facility rather than a Spark convention. In this case, pick a random port 
within the offered range.
MesosSchedulerBackend could pass, along with the other ExecutorInfo contents, some 
information about the allowed ports so that MesosExecutorBackend can initialize 
its port to the specified value.
We could pass that value (the offered range of ports) in the data field (protobuf) 
of the ExecutorInfo structure, where the exec command-line arguments for the 
executor are passed (not so clean), but then we could use it (for the actual 
initialization of the port, which deep down is used by NettyRpcEnv) and remove it 
from that list of arguments.


> Make Spark on Mesos honor port restrictions
> ---
>
> Key: SPARK-11714
> URL: https://issues.apache.org/jira/browse/SPARK-11714
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Reporter: Charles Allen
>
> Currently the MesosSchedulerBackend does not make any effort to honor "ports" 
> as a resource offer in Mesos. This ask is to have the ports which the 
> executor binds to honor the limits of the "ports" resource of an offer.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11661) We should still pushdown filters returned by a data source's unhandledFilters

2016-01-03 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15080605#comment-15080605
 ] 

Yin Huai commented on SPARK-11661:
--

Can you create a JIRA? Do you also want to create a PR to fix it 
(DataSourceStrategy is the file that needs to be updated)?

> We should still pushdown filters returned by a data source's unhandledFilters
> -
>
> Key: SPARK-11661
> URL: https://issues.apache.org/jira/browse/SPARK-11661
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Yin Huai
>Priority: Blocker
> Fix For: 1.6.0
>
>
> We added the unhandledFilters interface in SPARK-10978. It gives a data source a 
> chance to let Spark SQL know that, for the filters it returns, the data source 
> may not apply them to every row, so Spark SQL should use a Filter operator to 
> evaluate those filters. However, even if a filter is part of the returned 
> unhandledFilters, we should still push it down. For example, our internal data 
> sources do not override this method; if we do not push down those filters, we 
> are actually turning off the filter pushdown feature.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12609) Make R to JVM timeout configurable

2016-01-03 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15080659#comment-15080659
 ] 

Shivaram Venkataraman commented on SPARK-12609:
---

It was from a user running long jobs and getting a failure after 100 minutes. We 
should also make the worker.R timeout configurable if necessary.
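Once such a setting exists, using it would presumably look something like the sketch below; the configuration key and the script name are hypothetical, since today the 6000-second value is hardcoded in client.R.

{code}
# Hypothetical sketch only -- spark.r.backendConnectionTimeout is NOT an existing
# key; it just illustrates what "configurable through SparkConf" could look like.
# 28800 seconds = 8 hours, for jobs that run well past the current ~100-minute limit.
./bin/spark-submit \
  --conf spark.r.backendConnectionTimeout=28800 \
  my_long_running_job.R
{code}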

> Make R to JVM timeout configurable 
> ---
>
> Key: SPARK-12609
> URL: https://issues.apache.org/jira/browse/SPARK-12609
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Reporter: Shivaram Venkataraman
>
> The timeout from R to the JVM is hardcoded at 6000 seconds in 
> https://github.com/apache/spark/blob/6c5bbd628aaedb6efb44c15f816fea8fb600decc/R/pkg/R/client.R#L22
> This results in Spark jobs that take more than 100 minutes to always fail. We 
> should make this timeout configurable through SparkConf.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12196) Store/retrieve blocks in different speed storage devices by hierarchy way

2016-01-03 Thread wei wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15080635#comment-15080635
 ] 

wei wu commented on SPARK-12196:


Yes, yucai. I went over the PR and the related code on GitHub, and I understand 
your idea now.
The two questions are:
1. Different applications may compete for the SSD resource (whether for cached 
RDDs or shuffle data). If the SSD capacity is small, shuffle data may occupy all 
the SSD space, but the user may want to give priority to caching RDDs on the SSD 
rather than to shuffle data, much like RDD.persist(StorageLevel.SSD). Maybe we can 
add a configuration flag for disabling the SSD for shuffle data (see the sketch 
below).

2. We mainly revised StorageLevel, CacheManager, and BlockManager to add some 
StorageLevel APIs for SSD: StorageLevel.SSD_ONLY, 
StorageLevel.MEMORY_AND_SSD_AND_DISK, StorageLevel.SSD_AND_DISK.
How about adding some fine-grained approach or settings for 
spark.storage.hierarchyStore?
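For illustration, a sketch of what such a setting could look like; spark.storage.hierarchyStore comes from this proposal, while the shuffleToSSD flag, the application class, and the jar name are hypothetical.

{code}
# Hypothetical sketch -- only spark.storage.hierarchyStore is from the proposal;
# the shuffleToSSD flag merely illustrates a fine-grained "keep shuffle data off
# the SSD" switch, and org.example.MyApp / my-app.jar are placeholders.
./bin/spark-submit \
  --conf "spark.storage.hierarchyStore=nvm 40GB,ssd 20GB" \
  --conf "spark.storage.hierarchyStore.shuffleToSSD=false" \
  --class org.example.MyApp my-app.jar
{code}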

> Store/retrieve blocks in different speed storage devices by hierarchy way
> -
>
> Key: SPARK-12196
> URL: https://issues.apache.org/jira/browse/SPARK-12196
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Reporter: yucai
>
> *Motivation*
> Nowadays, customers have both SSDs (SATA SSD/PCIe SSD) and HDDs. 
> SSDs have great performance, but small capacity. 
> HDDs have good capacity, but are much slower than SSDs (x2-x3 slower than SATA 
> SSD, x20 slower than PCIe SSD).
> How can we get the best of both?
> *Proposal*
> One solution is to build hierarchy store: use SSDs as cache and HDDs as 
> backup storage. 
> When Spark core allocates blocks (either for shuffle or RDD cache), it gets 
> blocks from SSDs first, and when the SSDs' usable space is less than some 
> threshold, it gets blocks from HDDs.
> In our implementation, we actually go further: we support building a hierarchy 
> store with any number of levels across various storage media (MEM, NVM, SSD, 
> HDD, etc.).
> *Performance*
> 1. At the best case, our solution performs the same as all SSDs.
> 2. At the worst case, like all data are spilled to HDDs, no performance 
> regression.
> 3. Compared with all HDDs, hierarchy store improves more than *_x1.86_* (it 
> could be higher, CPU reaches bottleneck in our test environment).
> 4. Compared with Tachyon, our hierarchy store still *_x1.3_* faster. Because 
> we support both RDD cache and shuffle and no extra inter process 
> communication.
> *Test Environment*
> 1. 4 IVB boxes (40 cores, 192GB memory, 10GB NIC, 11 HDDs/11 SATA SSDs/PCIe SSD) 
> 2. A real customer case, NWeight (graph analysis), which computes associations 
> between two vertices that are n hops away (e.g., friend-to-friend or 
> video-to-video relationships for recommendation). 
> 3. Data size: 22GB, vertices: 41 million, edges: 1.4 billion.
> *Usage*
> 1. Set the priority and threshold for each layer in 
> spark.storage.hierarchyStore.
> {code}
> spark.storage.hierarchyStore='nvm 40GB,ssd 20GB'
> {code}
> It builds a 3-layer hierarchy store: the 1st is "nvm", the 2nd is "ssd", and all 
> the rest form the last layer.
> 2. Configure each layer's location: the user just needs to put a keyword like 
> "nvm" or "ssd", as specified in step 1, into the local dirs, like 
> spark.local.dir or yarn.nodemanager.local-dirs.
> {code}
> spark.local.dir=/mnt/nvm1,/mnt/ssd1,/mnt/ssd2,/mnt/ssd3,/mnt/disk1,/mnt/disk2,/mnt/disk3,/mnt/disk4,/mnt/others
> {code}
> After that, restart your Spark application; it will allocate blocks from nvm 
> first.
> When nvm's usable space is less than 40GB, it starts to allocate from ssd.
> When ssd's usable space is less than 20GB, it starts to allocate from the 
> last layer.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12609) Make R to JVM timeout configurable

2016-01-03 Thread Sun Rui (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15080647#comment-15080647
 ] 

Sun Rui commented on SPARK-12609:
-

[~shivaram], did you hit a real case, or is this just from reviewing the code? Do 
we need to make the timeout for socket connections of R workers configurable as 
well (3600 seconds in daemon.R, and the default timeout is used in worker.R)?
I am not sure whether a timeout value of 0 means infinity.


> Make R to JVM timeout configurable 
> ---
>
> Key: SPARK-12609
> URL: https://issues.apache.org/jira/browse/SPARK-12609
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Reporter: Shivaram Venkataraman
>
> The timeout from R to the JVM is hardcoded at 6000 seconds in 
> https://github.com/apache/spark/blob/6c5bbd628aaedb6efb44c15f816fea8fb600decc/R/pkg/R/client.R#L22
> This results in Spark jobs that take more than 100 minutes to always fail. We 
> should make this timeout configurable through SparkConf.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-3785) Support off-loading computations to a GPU

2016-01-03 Thread Kazuaki Ishizaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15080653#comment-15080653
 ] 

Kazuaki Ishizaki edited comment on SPARK-3785 at 1/4/16 3:44 AM:
-

Let us reopen this thread :)

We are working to exploit GPUs on Spark effectively and easily at 
[http://github.com/kiszk/spark-gpu]. Our project page is 
[http://kiszk.github.io/spark-gpu/], and a design document is 
[here|https://docs.google.com/document/d/1bo1hbQ7ikdUA9LYtYh6kU_TwjFK2ebkHsH66QlmbYP8/edit?usp=sharing].

Our ideas for exploiting GPUs are
# adding a new format for a partition in an RDD, which is a column-based 
structure in an array format, in addition to the current Iterator\[T\] format 
with Seq\[T\]
# generating parallelized GPU native code to access data in the new format from 
a Spark application program by using an optimizer and code generator (this is 
similar to [Project 
Tungsten|https://databricks.com/blog/2015/04/28/project-tungsten-bringing-spark-closer-to-bare-metal.html])
 and pre-compiled library

The motivation of idea 1 is to reduce the overhead of serializing/deserializing 
partition data when copying between the CPU and GPU. The motivation of idea 2 is 
to avoid having application programmers write hardware-dependent code. At first, 
we are working on idea A (for idea B, we need to write 
[CUDA|https://en.wikipedia.org/wiki/CUDA] code for now). 

This prototype achieved a [3.15x performance 
improvement|https://github.com/kiszk/spark-gpu/wiki/Benchmark] for the logistic 
regression example 
([SparkGPULR|https://github.com/kiszk/spark-gpu/blob/dev/examples/src/main/scala/org/apache/spark/examples/SparkGPULR.scala])
 on a 16-thread IvyBridge box with an NVIDIA K40 GPU card, compared with running 
without the GPU card.

You can download pre-built binaries for x86_64 and ppc64le from 
[here|https://github.com/kiszk/spark-gpu/wiki/Downloads]. You can also run this on 
Amazon EC2 by following [this 
procedure|https://github.com/kiszk/spark-gpu/wiki/How-to-run-%28local-or-AWS-EC2%29].



was (Author: kiszk):
Let us reopen this thread :)

We are working to exploit GPUs on Spark effectively and easily at 
[http://github.com/kiszk/spark-gpu]. Our project page is 
[http://kiszk.github.io/spark-gpu/], and a design document is 
[here|https://docs.google.com/document/d/1bo1hbQ7ikdUA9LYtYh6kU_TwjFK2ebkHsH66QlmbYP8/edit?usp=sharing].

Our ideas for exploiting GPUs are
# adding a new format for a partition in an RDD, which is a column-based 
structure in an array format, in addition to the current Iterator\[T\] format 
with Seq\[T\]
# generating parallelized GPU native code to access data in the new format from 
a Spark application program by using an optimizer and code generator (this is 
similar to [Project 
Tungsten|https://databricks.com/blog/2015/04/28/project-tungsten-bringing-spark-closer-to-bare-metal.html])
 and pre-compiled library

The motivation of idea 1 is to reduce the overhead of serializing/deserializing 
partition data when copying between the CPU and GPU. The motivation of idea 2 is 
to avoid having application programmers write hardware-dependent code. At first, 
we are working on idea A (for idea B, we need to write 
[CUDA|https://en.wikipedia.org/wiki/CUDA] code for now). 

This prototype achieved a [3.15x performance 
improvement|https://github.com/kiszk/spark-gpu/wiki/Benchmark] for the logistic 
regression example 
([SparkGPULR|https://github.com/kiszk/spark-gpu/blob/dev/examples/src/main/scala/org/apache/spark/examples/SparkGPULR.scala])
 on a 16-thread IvyBridge box with an NVIDIA K40 GPU card, compared with running 
without the GPU card.

You can download pre-built binaries for x86_64 and ppc64le from 
[here|https://github.com/kiszk/spark-gpu/wiki/Downloads]. You can also run this on 
Amazon EC2 by following [this 
procedure|https://github.com/kiszk/spark-gpu/wiki/How-to-run-%28local-or-AWS-EC2%29].


> Support off-loading computations to a GPU
> -
>
> Key: SPARK-3785
> URL: https://issues.apache.org/jira/browse/SPARK-3785
> Project: Spark
>  Issue Type: Brainstorming
>  Components: MLlib
>Reporter: Thomas Darimont
>Priority: Minor
>
> Are there any plans to adding support for off-loading computations to the 
> GPU, e.g. via an open-cl binding? 
> http://www.jocl.org/
> https://code.google.com/p/javacl/
> http://lwjgl.org/wiki/index.php?title=OpenCL_in_LWJGL



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-12607) spark-class produced null command strings for "exec"

2016-01-03 Thread SM Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15080499#comment-15080499
 ] 

SM Wang edited comment on SPARK-12607 at 1/3/16 8:40 PM:
-

Sure. I think the problem is with the while loop's delimiter setting (-d '') or 
the launcher class's behavior in the MSYS64 environment.

Here is the section of the script in version 1.4.0 where I added some echo 
commands (marked with the +++ prefix).

{panel}
CMD=()
while IFS= read -d '' -r ARG; do
echo "+++ Parsed Arguments in while loop: $ARG"
  CMD+=("$ARG")
done < <("$RUNNER" -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main "$@")
echo "+++ Launcher Command" "$RUNNER" -cp "$LAUNCH_CLASSPATH" 
org.apache.spark.launcher.Main "$@"

echo "+++ First Element: ${CMD[0]}"
echo "+++ Command Array: ${CMD[@]}"

if [ "${CMD[0]}" = "usage" ]; then
  "${CMD[@]}"
else
  exec "${CMD[@]}"
fi
{panel}

The output from "run-example SparkPi" is as follows:

{panel}
+++ Launcher Command /apps/jdk1.7.0_80/bin/java -cp 
/apps/tmp/spark-1.4.0-bin-hadoop2.4/lib/spark-assembly-1.4.0-hadoop2.4.0.jar 
org.apache.spark.launcher.Main org.apache.spark.deploy.SparkSubmit --master 
local[*] --class org.apache.spark.examples.SparkPi 
/apps/tmp/spark-1.4.0-bin-hadoop2.4/lib/spark-examples-1.4.0-hadoop2.4.0.jar
+++ First Element:
+++ Command Array:
{panel}

As you can see the command array is empty.

However, when running the launcher command manually I got the following:
{panel}
C:/msys64/apps/jdk1.7.0_80\bin\java -cp 
"C:/msys64/apps/tmp/spark-1.4.0-bin-hadoop2.4\conf\;C:\msys64\apps\tmp\spark-1.4.0-bin-hadoop2.4\lib\spark-assembly-1.4.0-hadoop2.4.0.jar;C:\msys64\apps\tmp\spark-1.4.0-bin-hadoop2.4\lib\datanucleus-api-jdo-3.2.6.jar;C:\msys64\apps\tmp\spark-1.4.0-bin-hadoop2.4\lib\datanucleus-core-3.2.10.jar;C:\msys64\apps\tmp\spark-1.4.0-bin-hadoop2.4\lib\datanucleus-rdbms-3.2.9.jar"
 -Xms512m -Xmx512m "-XX:MaxPermSize=128m" org.apache.spark.deploy.SparkSubmit 
--master local[*] --class org.apache.spark.examples.SparkPi 
C:/msys64/apps/tmp/spark-1.4.0-bin-hadoop2.4/lib/spark-examples-1.4.0-hadoop2.4.0.jar
{panel}

When I changed the delimiter to *-d ' '* (a space between quotes) I was able to 
get a non-empty command array.

This is why I think the issue is either with the delimiter setting or the 
launcher not producing the command string with the expected delimiter.

Hope this helps.

Thank you for looking into this.


was (Author: swang):
Sure. I think the problem is with the while loop's delimiter setting (-d '') or 
the launcher class's behavior in the MSYS64 environment.

Here is the section of the script in version 1.4.0 where I added some echo 
commands (marked with the +++ prefix).

{panel}
CMD=()
while IFS= read -d '' -r ARG; do
echo "+++ Parsed Arguments in while loop: $ARG"
  CMD+=("$ARG")
done < <("$RUNNER" -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main "$@")
echo "+++ Launcher Command" "$RUNNER" -cp "$LAUNCH_CLASSPATH" 
org.apache.spark.launcher.Main "$@"

echo "+++ First Element: ${CMD[0]}"
echo "+++ Command Array: ${CMD[@]}"

if [ "${CMD[0]}" = "usage" ]; then
  "${CMD[@]}"
else
  exec "${CMD[@]}"
fi
{panel}

The output from "run-example SparkPi" is as follows:

{panel}
+++ Launcher Command /apps/jdk1.7.0_80/bin/java -cp 
/apps/tmp/spark-1.4.0-bin-hadoop2.4/lib/spark-assembly-1.4.0-hadoop2.4.0.jar 
org.apache.spark.launcher.Main org.apache.spark.deploy.SparkSubmit --master 
local[*] --class org.apache.spark.examples.SparkPi 
/apps/tmp/spark-1.4.0-bin-hadoop2.4/lib/spark-examples-1.4.0-hadoop2.4.0.jar
+++ First Element:
+++ Command Array:
{panel}

As you can see the command array is empty.

However, when running the launcher command manually I got the following:
{panel}
C:/msys64/apps/jdk1.7.0_80\bin\java -cp 
"C:/msys64/apps/tmp/spark-1.4.0-bin-hadoop2.4\conf\;C:\msys64\apps\tmp\spark-1.4.0-bin-hadoop2.4\lib\spark-assembly-1.4.0-hadoop2.4.0.jar;C:\msys64\apps\tmp\spark-1.4.0-bin-hadoop2.4\lib\datanucleus-api-jdo-3.2.6.jar;C:\msys64\apps\tmp\spark-1.4.0-bin-hadoop2.4\lib\datanucleus-core-3.2.10.jar;C:\msys64\apps\tmp\spark-1.4.0-bin-hadoop2.4\lib\datanucleus-rdbms-3.2.9.jar"
 -Xms512m -Xmx512m "-XX:MaxPermSize=128m" org.apache.spark.deploy.SparkSubmit 
--master local[*] --class org.apache.spark.examples.SparkPi 
C:/msys64/apps/tmp/spark-1.4.0-bin-hadoop2.4/lib/spark-examples-1.4.0-hadoop2.4.0.jar
{panel}

When I change the delimiter to "-d ' '" I was able to get an non-empty command 
array.

This is why I think the issue is either with the delimiter setting or the 
launcher that does not produce the command staring with the expected delimiter.

Hope this helps.

Thank you for looking into this.

> spark-class produced null command strings for "exec"
> 
>
> Key: SPARK-12607
> URL: https://issues.apache.org/jira/browse/SPARK-12607
>  

[jira] [Comment Edited] (SPARK-12607) spark-class produced null command strings for "exec"

2016-01-03 Thread SM Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15080499#comment-15080499
 ] 

SM Wang edited comment on SPARK-12607 at 1/3/16 9:57 PM:
-

Sure. I think the problem is with the while loop's delimiter setting (-d '') or 
the launcher class's behavior in the MSYS64 environment.

Here is the section of the script in version 1.4.0 where I added some echo 
commands (marked with the +++ prefix).

{noformat}
CMD=()
while IFS= read -d '' -r ARG; do
echo "+++ Parsed Arguments in while loop: $ARG"
  CMD+=("$ARG")
done < <("$RUNNER" -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main "$@")
echo "+++ Launcher Command" "$RUNNER" -cp "$LAUNCH_CLASSPATH" 
org.apache.spark.launcher.Main "$@"

echo "+++ First Element: ${CMD[0]}"
echo "+++ Command Array: ${CMD[@]}"

if [ "${CMD[0]}" = "usage" ]; then
  "${CMD[@]}"
else
  exec "${CMD[@]}"
fi
{noformat}

The output from "run-example SparkPi" is as follows:

{panel}
+++ Launcher Command /apps/jdk1.7.0_80/bin/java -cp 
/apps/tmp/spark-1.4.0-bin-hadoop2.4/lib/spark-assembly-1.4.0-hadoop2.4.0.jar 
org.apache.spark.launcher.Main org.apache.spark.deploy.SparkSubmit --master 
local[*] --class org.apache.spark.examples.SparkPi 
/apps/tmp/spark-1.4.0-bin-hadoop2.4/lib/spark-examples-1.4.0-hadoop2.4.0.jar
+++ First Element:
+++ Command Array:
{panel}

As you can see the command array is empty.

However, when running the launcher command manually I got the following:
{panel}
C:/msys64/apps/jdk1.7.0_80\bin\java -cp 
"C:/msys64/apps/tmp/spark-1.4.0-bin-hadoop2.4\conf\;C:\msys64\apps\tmp\spark-1.4.0-bin-hadoop2.4\lib\spark-assembly-1.4.0-hadoop2.4.0.jar;C:\msys64\apps\tmp\spark-1.4.0-bin-hadoop2.4\lib\datanucleus-api-jdo-3.2.6.jar;C:\msys64\apps\tmp\spark-1.4.0-bin-hadoop2.4\lib\datanucleus-core-3.2.10.jar;C:\msys64\apps\tmp\spark-1.4.0-bin-hadoop2.4\lib\datanucleus-rdbms-3.2.9.jar"
 -Xms512m -Xmx512m "-XX:MaxPermSize=128m" org.apache.spark.deploy.SparkSubmit 
--master local[*] --class org.apache.spark.examples.SparkPi 
C:/msys64/apps/tmp/spark-1.4.0-bin-hadoop2.4/lib/spark-examples-1.4.0-hadoop2.4.0.jar
{panel}

When I changed the delimiter to *-d ' '* (a space between quotes) I was able to 
get a non-empty command array.

This is why I think the issue is either with the delimiter setting or the 
launcher not producing the command string with the delimiter expected by the 
while loop's read.

Hope this helps.

Thank you for looking into this.


was (Author: swang):
Sure. I think the problem is with the while loop's delimiter setting (-d '') or 
the launcher class's behavior in the MSYS64 environment.

Here is the section of the script in version 1.4.0 where I added some echo 
commands (marked with the +++ prefix).

{quote}
CMD=()
while IFS= read -d '' -r ARG; do
echo "+++ Parsed Arguments in while loop: $ARG"
  CMD+=("$ARG")
done < <("$RUNNER" -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main "$@")
echo "+++ Launcher Command" "$RUNNER" -cp "$LAUNCH_CLASSPATH" 
org.apache.spark.launcher.Main "$@"

echo "+++ First Element: ${CMD[0]}"
echo "+++ Command Array: ${CMD[@]}"

if [ "${CMD[0]}" = "usage" ]; then
  "${CMD[@]}"
else
  exec "${CMD[@]}"
fi
{quote}

The output from "run-example SparkPi" is as follows:

{panel}
+++ Launcher Command /apps/jdk1.7.0_80/bin/java -cp 
/apps/tmp/spark-1.4.0-bin-hadoop2.4/lib/spark-assembly-1.4.0-hadoop2.4.0.jar 
org.apache.spark.launcher.Main org.apache.spark.deploy.SparkSubmit --master 
local[*] --class org.apache.spark.examples.SparkPi 
/apps/tmp/spark-1.4.0-bin-hadoop2.4/lib/spark-examples-1.4.0-hadoop2.4.0.jar
+++ First Element:
+++ Command Array:
{panel}

As you can see the command array is empty.

However, when running the launcher command manually I got the following:
{panel}
C:/msys64/apps/jdk1.7.0_80\bin\java -cp 
"C:/msys64/apps/tmp/spark-1.4.0-bin-hadoop2.4\conf\;C:\msys64\apps\tmp\spark-1.4.0-bin-hadoop2.4\lib\spark-assembly-1.4.0-hadoop2.4.0.jar;C:\msys64\apps\tmp\spark-1.4.0-bin-hadoop2.4\lib\datanucleus-api-jdo-3.2.6.jar;C:\msys64\apps\tmp\spark-1.4.0-bin-hadoop2.4\lib\datanucleus-core-3.2.10.jar;C:\msys64\apps\tmp\spark-1.4.0-bin-hadoop2.4\lib\datanucleus-rdbms-3.2.9.jar"
 -Xms512m -Xmx512m "-XX:MaxPermSize=128m" org.apache.spark.deploy.SparkSubmit 
--master local[*] --class org.apache.spark.examples.SparkPi 
C:/msys64/apps/tmp/spark-1.4.0-bin-hadoop2.4/lib/spark-examples-1.4.0-hadoop2.4.0.jar
{panel}

When I changed the delimiter to *-d ' '* (a space between quotes) I was able to 
get a non-empty command array.

This is why I think the issue is either with the delimiter setting or the 
launcher not producing the command string with the delimiter expected by the 
while loop's read.

Hope this helps.

Thank you for looking into this.

> spark-class produced null command strings for "exec"
> 
>
>   

[jira] [Comment Edited] (SPARK-12607) spark-class produced null command strings for "exec"

2016-01-03 Thread SM Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15080499#comment-15080499
 ] 

SM Wang edited comment on SPARK-12607 at 1/3/16 9:59 PM:
-

Sure. I think the problem is with the while loop's delimiter setting (-d '') or 
the launcher class's behavior in the MSYS64 environment.

Here is the section of the script in version 1.4.0 where I added some echo 
commands (marked with the +++ prefix).

{noformat}
CMD=()
while IFS= read -d '' -r ARG; do
echo "+++ Parsed Arguments in while loop: $ARG"
  CMD+=("$ARG")
done < <("$RUNNER" -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main "$@")
echo "+++ Launcher Command" "$RUNNER" -cp "$LAUNCH_CLASSPATH" 
org.apache.spark.launcher.Main "$@"

echo "+++ First Element: ${CMD[0]}"
echo "+++ Command Array: ${CMD[@]}"

if [ "${CMD[0]}" = "usage" ]; then
  "${CMD[@]}"
else
  exec "${CMD[@]}"
fi
{noformat}

The output from "run-example SparkPi" is as follows:

{noformat}
+++ Launcher Command /apps/jdk1.7.0_80/bin/java -cp 
/apps/tmp/spark-1.4.0-bin-hadoop2.4/lib/spark-assembly-1.4.0-hadoop2.4.0.jar 
org.apache.spark.launcher.Main org.apache.spark.deploy.SparkSubmit --master 
local[*] --class org.apache.spark.examples.SparkPi 
/apps/tmp/spark-1.4.0-bin-hadoop2.4/lib/spark-examples-1.4.0-hadoop2.4.0.jar
+++ First Element:
+++ Command Array:
{noformat}

As you can see the command array is empty.

However, when running the launcher command manually I got the following:

{noformat}
C:/msys64/apps/jdk1.7.0_80\bin\java -cp 
"C:/msys64/apps/tmp/spark-1.4.0-bin-hadoop2.4\conf\;C:\msys64\apps\tmp\spark-1.4.0-bin-hadoop2.4\lib\spark-assembly-1.4.0-hadoop2.4.0.jar;C:\msys64\apps\tmp\spark-1.4.0-bin-hadoop2.4\lib\datanucleus-api-jdo-3.2.6.jar;C:\msys64\apps\tmp\spark-1.4.0-bin-hadoop2.4\lib\datanucleus-core-3.2.10.jar;C:\msys64\apps\tmp\spark-1.4.0-bin-hadoop2.4\lib\datanucleus-rdbms-3.2.9.jar"
 -Xms512m -Xmx512m "-XX:MaxPermSize=128m" org.apache.spark.deploy.SparkSubmit 
--master local[*] --class org.apache.spark.examples.SparkPi 
C:/msys64/apps/tmp/spark-1.4.0-bin-hadoop2.4/lib/spark-examples-1.4.0-hadoop2.4.0.jar
{noformat}

When I changed the delimiter to *-d ' '* (a space between quotes) I was able to 
get a non-empty command array.

This is why I think the issue is either with the delimiter setting or the 
launcher not producing the command string with the delimiter expected by the 
while loop's read.

Hope this helps.

Thank you for looking into this.


was (Author: swang):
Sure. I think the problem is with the while loop's delimiter setting (-d '') or 
the launcher class's behavior in the MSYS64 environment.

Here is the section of the script in version 1.4.0 where I added some echo 
commands (marked with the +++ prefix).

{noformat}
CMD=()
while IFS= read -d '' -r ARG; do
echo "+++ Parsed Arguments in while loop: $ARG"
  CMD+=("$ARG")
done < <("$RUNNER" -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main "$@")
echo "+++ Launcher Command" "$RUNNER" -cp "$LAUNCH_CLASSPATH" 
org.apache.spark.launcher.Main "$@"

echo "+++ First Element: ${CMD[0]}"
echo "+++ Command Array: ${CMD[@]}"

if [ "${CMD[0]}" = "usage" ]; then
  "${CMD[@]}"
else
  exec "${CMD[@]}"
fi
{noformat}

The output from "run-example SparkPi" is as follows:

{panel}
+++ Launcher Command /apps/jdk1.7.0_80/bin/java -cp 
/apps/tmp/spark-1.4.0-bin-hadoop2.4/lib/spark-assembly-1.4.0-hadoop2.4.0.jar 
org.apache.spark.launcher.Main org.apache.spark.deploy.SparkSubmit --master 
local[*] --class org.apache.spark.examples.SparkPi 
/apps/tmp/spark-1.4.0-bin-hadoop2.4/lib/spark-examples-1.4.0-hadoop2.4.0.jar
+++ First Element:
+++ Command Array:
{panel}

As you can see the command array is empty.

However, when running the launcher command manually I got the following:
{panel}
C:/msys64/apps/jdk1.7.0_80\bin\java -cp 
"C:/msys64/apps/tmp/spark-1.4.0-bin-hadoop2.4\conf\;C:\msys64\apps\tmp\spark-1.4.0-bin-hadoop2.4\lib\spark-assembly-1.4.0-hadoop2.4.0.jar;C:\msys64\apps\tmp\spark-1.4.0-bin-hadoop2.4\lib\datanucleus-api-jdo-3.2.6.jar;C:\msys64\apps\tmp\spark-1.4.0-bin-hadoop2.4\lib\datanucleus-core-3.2.10.jar;C:\msys64\apps\tmp\spark-1.4.0-bin-hadoop2.4\lib\datanucleus-rdbms-3.2.9.jar"
 -Xms512m -Xmx512m "-XX:MaxPermSize=128m" org.apache.spark.deploy.SparkSubmit 
--master local[*] --class org.apache.spark.examples.SparkPi 
C:/msys64/apps/tmp/spark-1.4.0-bin-hadoop2.4/lib/spark-examples-1.4.0-hadoop2.4.0.jar
{panel}

When I changed the delimiter to *-d ' '* (a space between quotes) I was able to 
get a non-empty command array.

This is why I think the issue is either with the delimiter setting or the 
launcher not producing the command string with the delimiter expected by the 
while loop's read.

Hope this helps.

Thank you for looking into this.

> spark-class produced null command strings for "exec"
> 

[jira] [Resolved] (SPARK-12611) test_infer_schema_to_local depended on old handling of missing value in row

2016-01-03 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-12611.
-
   Resolution: Fixed
 Assignee: holdenk
Fix Version/s: 2.0.0

> test_infer_schema_to_local depended on old handling of missing value in row
> ---
>
> Key: SPARK-12611
> URL: https://issues.apache.org/jira/browse/SPARK-12611
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Reporter: holdenk
>Assignee: holdenk
>Priority: Trivial
> Fix For: 2.0.0
>
>
> test_infer_schema_to_local depended on the old handling of missing values in 
> row objects.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10817) ML abstraction umbrella

2016-01-03 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15080625#comment-15080625
 ] 

Jeff Zhang commented on SPARK-10817:


Hi, considering that Spark 2.0 may be the next major release, should we stabilize 
the ML abstractions in Spark 2.0?

> ML abstraction umbrella
> ---
>
> Key: SPARK-10817
> URL: https://issues.apache.org/jira/browse/SPARK-10817
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Critical
>
> This is an umbrella for discussing and creating ML abstractions.  This was 
> originally handled under [SPARK-1856] and [SPARK-3702], under which we 
> created the Pipelines API and some Developer APIs for classification and 
> regression.
> This umbrella is for future work, including:
> * Stabilizing the classification and regression APIs
> * Discussing traits vs. abstract classes for abstraction APIs
> * Creating other abstractions not yet covered (clustering, multilabel 
> prediction, etc.)
> Note that [SPARK-3702] still has useful discussion and design docs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12612) Add missing Hadoop profiles to dev/run-tests-*.py scripts

2016-01-03 Thread Josh Rosen (JIRA)
Josh Rosen created SPARK-12612:
--

 Summary: Add missing Hadoop profiles to dev/run-tests-*.py scripts
 Key: SPARK-12612
 URL: https://issues.apache.org/jira/browse/SPARK-12612
 Project: Spark
  Issue Type: Bug
  Components: Build, Project Infra
Reporter: Josh Rosen
Assignee: Josh Rosen


There are a couple of places in the dev/run-tests-*.py scripts which deal with 
Hadoop profiles, but the set of profiles that they handle does not include all 
Hadoop profiles defined in our POM.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12537) Add option to accept quoting of all character backslash quoting mechanism

2016-01-03 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-12537.
-
   Resolution: Fixed
 Assignee: Cazen Lee  (was: Apache Spark)
Fix Version/s: 2.0.0

> Add option to accept quoting of all character backslash quoting mechanism
> -
>
> Key: SPARK-12537
> URL: https://issues.apache.org/jira/browse/SPARK-12537
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: Cazen Lee
>Assignee: Cazen Lee
> Fix For: 2.0.0
>
>
> We can provide an option to choose whether the JSON parser accepts quoting of 
> all characters or not.
> For example, if a JSON file includes an escape that is not listed in the JSON 
> backslash-quoting specification, it returns a corrupt_record:
> {code:title=JSON File|borderStyle=solid}
> {"name": "Cazen Lee", "price": "$10"}
> {"name": "John Doe", "price": "\$20"}
> {"name": "Tracy", "price": "$10"}
> {code}
> corrupt_record(returns null)
> {code}
> scala> df.show
> +--------------------+---------+-----+
> |     _corrupt_record|     name|price|
> +--------------------+---------+-----+
> |                null|Cazen Lee|  $10|
> |{"name": "John Do...|     null| null|
> |                null|    Tracy|  $10|
> +--------------------+---------+-----+
> {code}
> And after applying this patch, we can enable the 
> allowBackslashEscapingAnyCharacter option like below:
> {code}
> scala> val df = sqlContext.read.option("allowBackslashEscapingAnyCharacter", 
> "true").json("/user/Cazen/test/test2.txt")
> df: org.apache.spark.sql.DataFrame = [name: string, price: string]
> scala> df.show
> +---------+-----+
> |     name|price|
> +---------+-----+
> |Cazen Lee|  $10|
> | John Doe|  $20|
> |    Tracy|  $10|
> +---------+-----+
> {code}
> This issue similar to HIVE-11825, HIVE-12717.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3785) Support off-loading computations to a GPU

2016-01-03 Thread Tycho Grouwstra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15080669#comment-15080669
 ] 

Tycho Grouwstra commented on SPARK-3785:


>> For idea B, we need to write CUDA code

>> The motivation of idea 2 is to avoid writing hardware-dependent code

I thought that, unlike CUDA, OpenCL could run on a GPU as well as a CPU?
(That said, I'm under the impression that CUDA is generally at the bleeding 
edge of innovation, so no objections here regardless.)


> Support off-loading computations to a GPU
> -
>
> Key: SPARK-3785
> URL: https://issues.apache.org/jira/browse/SPARK-3785
> Project: Spark
>  Issue Type: Brainstorming
>  Components: MLlib
>Reporter: Thomas Darimont
>Priority: Minor
>
> Are there any plans to adding support for off-loading computations to the 
> GPU, e.g. via an open-cl binding? 
> http://www.jocl.org/
> https://code.google.com/p/javacl/
> http://lwjgl.org/wiki/index.php?title=OpenCL_in_LWJGL



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12612) Add missing Hadoop profiles to dev/run-tests-*.py scripts

2016-01-03 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-12612.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 10565
[https://github.com/apache/spark/pull/10565]

> Add missing Hadoop profiles to dev/run-tests-*.py scripts
> -
>
> Key: SPARK-12612
> URL: https://issues.apache.org/jira/browse/SPARK-12612
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Project Infra
>Reporter: Josh Rosen
>Assignee: Josh Rosen
> Fix For: 2.0.0
>
>
> There are a couple of places in the dev/run-tests-*.py scripts which deal 
> with Hadoop profiles, but the set of profiles that they handle does not 
> include all Hadoop profiles defined in our POM.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12562) DataFrame.write.format("text") requires the column name to be called value

2016-01-03 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-12562.
-
   Resolution: Fixed
Fix Version/s: 2.0.0
   1.6.1

> DataFrame.write.format("text") requires the column name to be called value
> --
>
> Key: SPARK-12562
> URL: https://issues.apache.org/jira/browse/SPARK-12562
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Michael Armbrust
>Assignee: Apache Spark
> Fix For: 1.6.1, 2.0.0
>
>
> We should support writing any DataFrame that has a single string column, 
> independent of the name.
> {code}
> wiki.select("text")
>   .limit(1)
>   .write
>   .format("text")
>   .mode("overwrite")
>   .save("/home/michael/wiki.txt")
> {code}
> {code}
> org.apache.spark.sql.AnalysisException: cannot resolve 'value' given input 
> columns text;
>   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:60)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:57)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:319)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:319)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:53)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:318)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:107)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:117)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(QueryPlan.scala:121)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:121)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:125)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
>   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
>   at scala.collection.AbstractIterator.to(Iterator.scala:1157)
>   at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
>   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
>   at 
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
>   at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:125)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:57)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:50)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:105)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:50)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44)
>   at 
> org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:34)
>   at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:133)
>   at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:52)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:106)
>   at 
> 

[jira] [Commented] (SPARK-12616) Union logical plan should support arbitrary number of children (rather than binary)

2016-01-03 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15080753#comment-15080753
 ] 

Xiao Li commented on SPARK-12616:
-

I am working on it. Thanks!

> Union logical plan should support arbitrary number of children (rather than 
> binary)
> ---
>
> Key: SPARK-12616
> URL: https://issues.apache.org/jira/browse/SPARK-12616
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>
> Union logical plan is a binary node. However, a typical use case for union is 
> to union a very large number of input sources (DataFrames, RDDs, or files). 
> It is not uncommon to union hundreds of thousands of files. In this case, our 
> optimizer can become very slow due to the large number of logical unions. We 
> should change the Union logical plan to support an arbitrary number of 
> children, and add a single rule in the optimizer (or analyzer?) to collapse 
> all adjacent Unions into one.
> Note that this problem doesn't exist in physical plan, because the physical 
> Union already supports arbitrary number of children.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12616) Union logical plan should support arbitrary number of children (rather than binary)

2016-01-03 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-12616:

Description: 
Union logical plan is a binary node. However, a typical use case for union is 
to union a very large number of input sources (DataFrames, RDDs, or files). It 
is not uncommon to union hundreds of thousands of files. In this case, our 
optimizer can become very slow due to the large number of logical unions. We 
should change the Union logical plan to support an arbitrary number of 
children, and add a single rule in the optimizer (or analyzer?) to collapse all 
adjacent Unions into one.

Note that this problem doesn't exist in physical plan, because the physical 
Union already supports arbitrary number of children.




  was:
Union logical plan is a binary node. However, a typical use case for union is 
to union a very large number of input sources (DataFrames, RDDs, or files). In 
this case, our optimizer can become very slow due to the large number of 
logical unions. We should change the Union logical plan to support an arbitrary 
number of children, and add a single rule in the optimizer (or analyzer?) to 
collapse all adjacent Unions into one.

Note that this problem doesn't exist in physical plan, because the physical 
Union already supports arbitrary number of children.





> Union logical plan should support arbitrary number of children (rather than 
> binary)
> ---
>
> Key: SPARK-12616
> URL: https://issues.apache.org/jira/browse/SPARK-12616
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>
> Union logical plan is a binary node. However, a typical use case for union is 
> to union a very large number of input sources (DataFrames, RDDs, or files). 
> It is not uncommon to union hundreds of thousands of files. In this case, our 
> optimizer can become very slow due to the large number of logical unions. We 
> should change the Union logical plan to support an arbitrary number of 
> children, and add a single rule in the optimizer (or analyzer?) to collapse 
> all adjacent Unions into one.
> Note that this problem doesn't exist in physical plan, because the physical 
> Union already supports arbitrary number of children.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12614) Don't throw non fatal exception from RpcEndpointRef.send/ask

2016-01-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12614:


Assignee: Apache Spark

> Don't throw non fatal exception from RpcEndpointRef.send/ask
> 
>
> Key: SPARK-12614
> URL: https://issues.apache.org/jira/browse/SPARK-12614
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: Shixiong Zhu
>Assignee: Apache Spark
>
> Right now RpcEndpointRef.send/ask may throw exceptions in some corner cases, 
> such as calling send/ask after stopping the RpcEnv. It's better to avoid throwing 
> exceptions from RpcEndpointRef.send/ask. We can log a warning for `send`, and 
> send the exception to the future for `ask`.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12614) Don't throw non fatal exception from RpcEndpointRef.send/ask

2016-01-03 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-12614:


 Summary: Don't throw non fatal exception from 
RpcEndpointRef.send/ask
 Key: SPARK-12614
 URL: https://issues.apache.org/jira/browse/SPARK-12614
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.6.0
Reporter: Shixiong Zhu


Right now RpcEndpointRef.send/ask may throw exceptions in some corner cases, 
such as calling send/ask after stopping the RpcEnv. It's better to avoid throwing 
exceptions from RpcEndpointRef.send/ask. We can log a warning for `send`, and 
send the exception to the future for `ask`.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12614) Don't throw non fatal exception from RpcEndpointRef.send/ask

2016-01-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15080691#comment-15080691
 ] 

Apache Spark commented on SPARK-12614:
--

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/10568

> Don't throw non fatal exception from RpcEndpointRef.send/ask
> 
>
> Key: SPARK-12614
> URL: https://issues.apache.org/jira/browse/SPARK-12614
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: Shixiong Zhu
>
> Right now RpcEndpointRef.send/ask may throw exceptions in some corner cases, 
> such as calling send/ask after stopping the RpcEnv. It's better to avoid throwing 
> exceptions from RpcEndpointRef.send/ask. We can log a warning for `send`, and 
> send the exception to the future for `ask`.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12614) Don't throw non fatal exception from RpcEndpointRef.send/ask

2016-01-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12614:


Assignee: (was: Apache Spark)

> Don't throw non fatal exception from RpcEndpointRef.send/ask
> 
>
> Key: SPARK-12614
> URL: https://issues.apache.org/jira/browse/SPARK-12614
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: Shixiong Zhu
>
> Right now RpcEndpointRef.send/ask may throw exceptions in some corner cases, 
> such as calling send/ask after stopping the RpcEnv. It's better to avoid throwing 
> exceptions from RpcEndpointRef.send/ask. We can log a warning for `send`, and 
> send the exception to the future for `ask`.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12615) Remove some deprecated APIs in RDD/SparkContext

2016-01-03 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-12615:
---

 Summary: Remove some deprecated APIs in RDD/SparkContext
 Key: SPARK-12615
 URL: https://issues.apache.org/jira/browse/SPARK-12615
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Reporter: Reynold Xin
Assignee: Reynold Xin






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12616) Union logical plan should support arbitrary number of children (rather than binary)

2016-01-03 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-12616:

Summary: Union logical plan should support arbitrary number of children 
(rather than binary)  (was: Improve union logical plan efficiency)

> Union logical plan should support arbitrary number of children (rather than 
> binary)
> ---
>
> Key: SPARK-12616
> URL: https://issues.apache.org/jira/browse/SPARK-12616
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>
> Union logical plan is a binary node. However, a typical use case for union is 
> to union a very large number of input sources (DataFrames, RDDs, or files). 
> In this case, our optimizer can become very slow due to the large number of 
> logical unions. We should change the Union logical plan to support an 
> arbitrary number of children, and add a single rule in the optimizer (or 
> analyzer?) to collapse all adjacent Unions into one.
> Note that this problem doesn't exist in physical plan, because the physical 
> Union already supports arbitrary number of children.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12616) Improve union logical plan efficiency

2016-01-03 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-12616:

Description: 
Union logical plan is a binary node. However, a typical use case for union is 
to union a very large number of input sources (DataFrames, RDDs, or files). In 
this case, our optimizer can become very slow due to the large number of 
logical unions. We should change the Union logical plan to support an arbitrary 
number of children, and add a single rule in the optimizer (or analyzer?) to 
collapse all adjacent Unions into one.

Note that this problem doesn't exist in physical plan, because the physical 
Union already supports arbitrary number of children.




  was:
Union logical plan is a binary node. However, a typical use case for union is 
to union a very large number of input sources (DataFrames, RDDs, or files). In 
this case, our optimizer can become very slow due to the large number of 
logical unions. We should change the Union logical plan to support an arbitrary 
number of children, and add a single rule in the optimizer (or analyzer?) to 
collapse all adjacent Unions into one.





> Improve union logical plan efficiency
> -
>
> Key: SPARK-12616
> URL: https://issues.apache.org/jira/browse/SPARK-12616
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>
> Union logical plan is a binary node. However, a typical use case for union is 
> to union a very large number of input sources (DataFrames, RDDs, or files). 
> In this case, our optimizer can become very slow due to the large number of 
> logical unions. We should change the Union logical plan to support an 
> arbitrary number of children, and add a single rule in the optimizer (or 
> analyzer?) to collapse all adjacent Unions into one.
> Note that this problem doesn't exist in physical plan, because the physical 
> Union already supports arbitrary number of children.






[jira] [Commented] (SPARK-12589) result row size is wrong in UnsafeRowParquetRecordReader

2016-01-03 Thread Jayadevan M (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15080768#comment-15080768
 ] 

Jayadevan M commented on SPARK-12589:
-

@Wenchen Fan

Can you please elaborate? How can I replicate this issue?

Thanks in advance.

> result row size is wrong in UnsafeRowParquetRecordReader
> 
>
> Key: SPARK-12589
> URL: https://issues.apache.org/jira/browse/SPARK-12589
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Wenchen Fan
>
> When we write rows in UnsafeRowParquetRecordReader, we call `row.pointTo` at 
> first, which assigns a wrong row size. We should reset the row size after 
> writing all columns.






[jira] [Updated] (SPARK-12562) DataFrame.write.format("text") requires the column name to be called value

2016-01-03 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-12562:

Assignee: Xiu (Joe) Guo  (was: Apache Spark)

> DataFrame.write.format("text") requires the column name to be called value
> --
>
> Key: SPARK-12562
> URL: https://issues.apache.org/jira/browse/SPARK-12562
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Michael Armbrust
>Assignee: Xiu (Joe) Guo
> Fix For: 1.6.1, 2.0.0
>
>
> We should support writing any DataFrame that has a single string column, 
> independent of the name.
> {code}
> wiki.select("text")
>   .limit(1)
>   .write
>   .format("text")
>   .mode("overwrite")
>   .save("/home/michael/wiki.txt")
> {code}
> {code}
> org.apache.spark.sql.AnalysisException: cannot resolve 'value' given input 
> columns text;
>   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:60)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:57)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:319)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:319)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:53)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:318)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:107)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:117)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(QueryPlan.scala:121)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:121)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:125)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
>   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
>   at scala.collection.AbstractIterator.to(Iterator.scala:1157)
>   at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
>   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
>   at 
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
>   at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:125)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:57)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:50)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:105)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:50)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44)
>   at 
> org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:34)
>   at org.apache.spark.sql.DataFrame.(DataFrame.scala:133)
>   at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:52)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:106)
>   at 
> 
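On versions without the fix, one workaround sketch (assuming the same {{wiki}} DataFrame as in the snippet above) is to rename the single string column to {{value}} before writing:

{code}
// Workaround sketch only: alias the column to the name the text source expects.
import org.apache.spark.sql.functions.col

wiki.select(col("text").as("value"))
  .limit(1)
  .write
  .format("text")
  .mode("overwrite")
  .save("/home/michael/wiki.txt")
{code}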

[jira] [Assigned] (SPARK-12615) Remove some deprecated APIs in RDD/SparkContext

2016-01-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12615:


Assignee: Apache Spark  (was: Reynold Xin)

> Remove some deprecated APIs in RDD/SparkContext
> ---
>
> Key: SPARK-12615
> URL: https://issues.apache.org/jira/browse/SPARK-12615
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Reynold Xin
>Assignee: Apache Spark
>







[jira] [Commented] (SPARK-12615) Remove some deprecated APIs in RDD/SparkContext

2016-01-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15080739#comment-15080739
 ] 

Apache Spark commented on SPARK-12615:
--

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/10569

> Remove some deprecated APIs in RDD/SparkContext
> ---
>
> Key: SPARK-12615
> URL: https://issues.apache.org/jira/browse/SPARK-12615
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>







[jira] [Assigned] (SPARK-12615) Remove some deprecated APIs in RDD/SparkContext

2016-01-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12615:


Assignee: Reynold Xin  (was: Apache Spark)

> Remove some deprecated APIs in RDD/SparkContext
> ---
>
> Key: SPARK-12615
> URL: https://issues.apache.org/jira/browse/SPARK-12615
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>







[jira] [Updated] (SPARK-12616) Improve union logical plan efficiency

2016-01-03 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-12616:

Description: 
Union logical plan is a binary node. However, a typical use case for union is 
to union a very large number of input sources (DataFrames, RDDs, or files). In 
this case, our optimizer can become very slow due to the large number of 
logical unions. We should change the Union logical plan to support an arbitrary 
number of children, and add a single rule in the optimizer (or analyzer?) to 
collapse all adjacent Unions into one.




> Improve union logical plan efficiency
> -
>
> Key: SPARK-12616
> URL: https://issues.apache.org/jira/browse/SPARK-12616
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>
> Union logical plan is a binary node. However, a typical use case for union is 
> to union a very large number of input sources (DataFrames, RDDs, or files). 
> In this case, our optimizer can become very slow due to the large number of 
> logical unions. We should change the Union logical plan to support an 
> arbitrary number of children, and add a single rule in the optimizer (or 
> analyzer?) to collapse all adjacent Unions into one.






[jira] [Created] (SPARK-12616) Improve union logical plan efficiency

2016-01-03 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-12616:
---

 Summary: Improve union logical plan efficiency
 Key: SPARK-12616
 URL: https://issues.apache.org/jira/browse/SPARK-12616
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Reynold Xin









[jira] [Commented] (SPARK-12537) Add option to accept quoting of all character backslash quoting mechanism

2016-01-03 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15080366#comment-15080366
 ] 

Sean Owen commented on SPARK-12537:
---

+1 yes, there I was arguing for a default of 'false' and not against the option

> Add option to accept quoting of all character backslash quoting mechanism
> -
>
> Key: SPARK-12537
> URL: https://issues.apache.org/jira/browse/SPARK-12537
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: Cazen Lee
>Assignee: Apache Spark
>
> We can provide an option to choose whether the JSON parser accepts backslash 
> quoting of all characters or not.
> For example, if a JSON file includes an escape sequence not covered by the JSON 
> backslash-quoting specification, it returns corrupt_record:
> {code:title=JSON File|borderStyle=solid}
> {"name": "Cazen Lee", "price": "$10"}
> {"name": "John Doe", "price": "\$20"}
> {"name": "Tracy", "price": "$10"}
> {code}
> corrupt_record(returns null)
> {code}
> scala> df.show
> ++-+-+
> | _corrupt_record| name|price|
> ++-+-+
> |null|Cazen Lee|  $10|
> |{"name": "John Do...| null| null|
> |null|Tracy|  $10|
> ++-+-+
> {code}
> And after applying this patch, we can enable the allowBackslashEscapingAnyCharacter 
> option like below:
> {code}
> scala> val df = sqlContext.read.option("allowBackslashEscapingAnyCharacter", 
> "true").json("/user/Cazen/test/test2.txt")
> df: org.apache.spark.sql.DataFrame = [name: string, price: string]
> scala> df.show
> +-+-+
> | name|price|
> +-+-+
> |Cazen Lee|  $10|
> | John Doe|  $20|
> |Tracy|  $10|
> +-+-+
> {code}
> This issue is similar to HIVE-11825 and HIVE-12717.






[jira] [Resolved] (SPARK-6416) RDD.fold() requires the operator to be commutative

2016-01-03 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-6416.
--
  Resolution: Fixed
Assignee: Sean Owen
   Fix Version/s: 1.5.0
Target Version/s:   (was: 2.0.0)

Hm! I should have tried that myself. I think that's a good argument that at 
least it's not inconsistent. The behavior is documented by an earlier change, so 
I'm calling this resolved retroactively.

> RDD.fold() requires the operator to be commutative
> --
>
> Key: SPARK-6416
> URL: https://issues.apache.org/jira/browse/SPARK-6416
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, Spark Core
>Affects Versions: 1.4.0
>Reporter: Josh Rosen
>Assignee: Sean Owen
>Priority: Critical
> Fix For: 1.5.0
>
>
> Spark's {{RDD.fold}} operation has some confusing behaviors when a 
> non-commutative reduce function is used.
> Here's an example, which was originally reported on StackOverflow 
> (https://stackoverflow.com/questions/29150202/pyspark-fold-method-output):
> {code}
> sc.parallelize([1,25,8,4,2]).fold(0,lambda a,b:a+1 )
> 8
> {code}
> To understand what's going on here, let's look at the definition of Spark's 
> `fold` operation.  
> I'm going to show the Python version of the code, but the Scala version 
> exhibits the exact same behavior (you can also [browse the source on 
> GitHub|https://github.com/apache/spark/blob/8cb23a1f9a3ed08e57865bcb6cc1cc7902881073/python/pyspark/rdd.py#L780]):
> {code}
> def fold(self, zeroValue, op):
> """
> Aggregate the elements of each partition, and then the results for all
> the partitions, using a given associative function and a neutral "zero
> value."
> The function C{op(t1, t2)} is allowed to modify C{t1} and return it
> as its result value to avoid object allocation; however, it should not
> modify C{t2}.
> >>> from operator import add
> >>> sc.parallelize([1, 2, 3, 4, 5]).fold(0, add)
> 15
> """
> def func(iterator):
> acc = zeroValue
> for obj in iterator:
> acc = op(obj, acc)
> yield acc
> vals = self.mapPartitions(func).collect()
> return reduce(op, vals, zeroValue)
> {code}
> (For comparison, see the [Scala implementation of 
> `RDD.fold`|https://github.com/apache/spark/blob/8cb23a1f9a3ed08e57865bcb6cc1cc7902881073/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L943]).
> Spark's `fold` operates by first folding each partition and then folding the 
> results.  The problem is that an empty partition gets folded down to the zero 
> element, so the final driver-side fold ends up folding one value for _every_ 
> partition rather than one value for each _non-empty_ partition.  This means 
> that the result of `fold` is sensitive to the number of partitions:
> {code}
> >>> sc.parallelize([1,25,8,4,2], 100).fold(0,lambda a,b:a+1 )
> 100
> >>> sc.parallelize([1,25,8,4,2], 50).fold(0,lambda a,b:a+1 )
> 50
> >>> sc.parallelize([1,25,8,4,2], 1).fold(0,lambda a,b:a+1 )
> 1
> {code}
> In this last case, what's happening is that the single partition is being 
> folded down to the correct value, then that value is folded with the 
> zero-value at the driver to yield 1.
> I think the underlying problem here is that our fold() operation implicitly 
> requires the operator to be commutative in addition to associative, but this 
> isn't documented anywhere.  Due to ordering non-determinism elsewhere in 
> Spark, such as SPARK-5750, I don't think there's an easy way to fix this.  
> Therefore, I think we should update the documentation and examples to clarify 
> this requirement and explain that our fold acts more like a reduce with a 
> default value than the type of ordering-sensitive fold() that users may 
> expect in functional languages.
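For reference, a Scala analogue of the PySpark examples above (assumes a Spark shell with an active SparkContext {{sc}}); the op ignores its second argument, so the result ends up tracking the number of partitions folded at the driver:

{code}
// Non-commutative op: counts folds rather than summing values.
sc.parallelize(Seq(1, 25, 8, 4, 2), 100).fold(0)((a, _) => a + 1)  // 100
sc.parallelize(Seq(1, 25, 8, 4, 2), 50).fold(0)((a, _) => a + 1)   // 50
sc.parallelize(Seq(1, 25, 8, 4, 2), 1).fold(0)((a, _) => a + 1)    // 1
{code}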






[jira] [Updated] (SPARK-6416) Document that RDD.fold() requires the operator to be commutative

2016-01-03 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-6416:
-
Affects Version/s: 1.4.0
 Priority: Minor  (was: Critical)
  Summary: Document that RDD.fold() requires the operator to be 
commutative  (was: RDD.fold() requires the operator to be commutative)

> Document that RDD.fold() requires the operator to be commutative
> 
>
> Key: SPARK-6416
> URL: https://issues.apache.org/jira/browse/SPARK-6416
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, Spark Core
>Affects Versions: 1.4.0
>Reporter: Josh Rosen
>Assignee: Sean Owen
>Priority: Minor
> Fix For: 1.5.0
>
>
> Spark's {{RDD.fold}} operation has some confusing behaviors when a 
> non-commutative reduce function is used.
> Here's an example, which was originally reported on StackOverflow 
> (https://stackoverflow.com/questions/29150202/pyspark-fold-method-output):
> {code}
> sc.parallelize([1,25,8,4,2]).fold(0,lambda a,b:a+1 )
> 8
> {code}
> To understand what's going on here, let's look at the definition of Spark's 
> `fold` operation.  
> I'm going to show the Python version of the code, but the Scala version 
> exhibits the exact same behavior (you can also [browse the source on 
> GitHub|https://github.com/apache/spark/blob/8cb23a1f9a3ed08e57865bcb6cc1cc7902881073/python/pyspark/rdd.py#L780]):
> {code}
> def fold(self, zeroValue, op):
> """
> Aggregate the elements of each partition, and then the results for all
> the partitions, using a given associative function and a neutral "zero
> value."
> The function C{op(t1, t2)} is allowed to modify C{t1} and return it
> as its result value to avoid object allocation; however, it should not
> modify C{t2}.
> >>> from operator import add
> >>> sc.parallelize([1, 2, 3, 4, 5]).fold(0, add)
> 15
> """
> def func(iterator):
> acc = zeroValue
> for obj in iterator:
> acc = op(obj, acc)
> yield acc
> vals = self.mapPartitions(func).collect()
> return reduce(op, vals, zeroValue)
> {code}
> (For comparison, see the [Scala implementation of 
> `RDD.fold`|https://github.com/apache/spark/blob/8cb23a1f9a3ed08e57865bcb6cc1cc7902881073/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L943]).
> Spark's `fold` operates by first folding each partition and then folding the 
> results.  The problem is that an empty partition gets folded down to the zero 
> element, so the final driver-side fold ends up folding one value for _every_ 
> partition rather than one value for each _non-empty_ partition.  This means 
> that the result of `fold` is sensitive to the number of partitions:
> {code}
> >>> sc.parallelize([1,25,8,4,2], 100).fold(0,lambda a,b:a+1 )
> 100
> >>> sc.parallelize([1,25,8,4,2], 50).fold(0,lambda a,b:a+1 )
> 50
> >>> sc.parallelize([1,25,8,4,2], 1).fold(0,lambda a,b:a+1 )
> 1
> {code}
> In this last case, what's happening is that the single partition is being 
> folded down to the correct value, then that value is folded with the 
> zero-value at the driver to yield 1.
> I think the underlying problem here is that our fold() operation implicitly 
> requires the operator to be commutative in addition to associative, but this 
> isn't documented anywhere.  Due to ordering non-determinism elsewhere in 
> Spark, such as SPARK-5750, I don't think there's an easy way to fix this.  
> Therefore, I think we should update the documentation and examples to clarify 
> this requirement and explain that our fold acts more like a reduce with a 
> default value than the type of ordering-sensitive fold() that users may 
> expect in functional languages.






[jira] [Commented] (SPARK-12607) spark-class produced null command strings for "exec"

2016-01-03 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15080367#comment-15080367
 ] 

Sean Owen commented on SPARK-12607:
---

Can you elaborate? It's not clear what you mean. What is the problem it causes? 
What do you think the fix is?

> spark-class produced null command strings for "exec"
> 
>
> Key: SPARK-12607
> URL: https://issues.apache.org/jira/browse/SPARK-12607
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 1.4.0, 1.4.1, 1.5.2
> Environment: MSYS64 on Windows 7 64 bit
>Reporter: SM Wang
>
> When using the run-example script in 1.4.0 to run the SparkPi example, I 
> found that it did not print any text to the terminal (e.g., stdout, stderr). 
> After further investigation I found the while loop for producing the exec 
> command from the launcher class produced a null command array.
> This discrepancy was observed on 1.5.2 and 1.4.1.  1.3.1's behavior seems 
> to be correct.






[jira] [Commented] (SPARK-12537) Add option to accept quoting of all character backslash quoting mechanism

2016-01-03 Thread Cazen Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15080374#comment-15080374
 ] 

Cazen Lee commented on SPARK-12537:
---

Thank you for advising me. I will change the default to false soon.

> Add option to accept quoting of all character backslash quoting mechanism
> -
>
> Key: SPARK-12537
> URL: https://issues.apache.org/jira/browse/SPARK-12537
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: Cazen Lee
>Assignee: Apache Spark
>
> We can provide an option to choose whether the JSON parser accepts backslash 
> quoting of all characters or not.
> For example, if a JSON file includes an escape sequence not covered by the JSON 
> backslash-quoting specification, it returns corrupt_record:
> {code:title=JSON File|borderStyle=solid}
> {"name": "Cazen Lee", "price": "$10"}
> {"name": "John Doe", "price": "\$20"}
> {"name": "Tracy", "price": "$10"}
> {code}
> corrupt_record(returns null)
> {code}
> scala> df.show
> ++-+-+
> | _corrupt_record| name|price|
> ++-+-+
> |null|Cazen Lee|  $10|
> |{"name": "John Do...| null| null|
> |null|Tracy|  $10|
> ++-+-+
> {code}
> And after applying this patch, we can enable the allowBackslashEscapingAnyCharacter 
> option like below:
> {code}
> scala> val df = sqlContext.read.option("allowBackslashEscapingAnyCharacter", 
> "true").json("/user/Cazen/test/test2.txt")
> df: org.apache.spark.sql.DataFrame = [name: string, price: string]
> scala> df.show
> +-+-+
> | name|price|
> +-+-+
> |Cazen Lee|  $10|
> | John Doe|  $20|
> |Tracy|  $10|
> +-+-+
> {code}
> This issue is similar to HIVE-11825 and HIVE-12717.






[jira] [Updated] (SPARK-12609) Make R to JVM timeout configurable

2016-01-03 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman updated SPARK-12609:
--
Component/s: (was: rkR)

> Make R to JVM timeout configurable 
> ---
>
> Key: SPARK-12609
> URL: https://issues.apache.org/jira/browse/SPARK-12609
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Reporter: Shivaram Venkataraman
>
> The timeout from R to the JVM is hardcoded at 6000 seconds in 
> https://github.com/apache/spark/blob/6c5bbd628aaedb6efb44c15f816fea8fb600decc/R/pkg/R/client.R#L22
> This means Spark jobs that take more than 100 minutes always fail. We 
> should make this timeout configurable through SparkConf.






[jira] [Resolved] (SPARK-12327) lint-r checks fail with commented code

2016-01-03 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman resolved SPARK-12327.
---
   Resolution: Fixed
 Assignee: Felix Cheung
Fix Version/s: 2.0.0
   1.6.1

Resolved by https://github.com/apache/spark/pull/10408

> lint-r checks fail with commented code
> --
>
> Key: SPARK-12327
> URL: https://issues.apache.org/jira/browse/SPARK-12327
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Reporter: Shivaram Venkataraman
>Assignee: Felix Cheung
> Fix For: 1.6.1, 2.0.0
>
>
> We get this after our R version downgrade
> {code}
> R/RDD.R:183:68: style: Commented code should be removed.
> rdd@env$jrdd_val <- callJMethod(rddRef, "asJavaRDD") # 
> rddRef$asJavaRDD()
>
> ^~
> R/RDD.R:228:63: style: Commented code should be removed.
> #' http://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence.
>   ^~~~
> R/RDD.R:388:24: style: Commented code should be removed.
> #' collectAsMap(rdd) # list(`1` = 2, `3` = 4)
>^~
> R/RDD.R:603:61: style: Commented code should be removed.
> #' unlist(collect(filterRDD(rdd, function (x) { x < 3 }))) # c(1, 2)
> ^~~~
> R/RDD.R:762:20: style: Commented code should be removed.
> #' take(rdd, 2L) # list(1, 2)
>^~
> R/RDD.R:830:42: style: Commented code should be removed.
> #' sort(unlist(collect(distinct(rdd # c(1, 2, 3)
>  ^~~
> R/RDD.R:980:47: style: Commented code should be removed.
> #' collect(keyBy(rdd, function(x) { x*x })) # list(list(1, 1), list(4, 2), 
> list(9, 3))
>   
> ^~~~
> R/RDD.R:1194:27: style: Commented code should be removed.
> #' takeOrdered(rdd, 6L) # list(1, 2, 3, 4, 5, 6)
>   ^~
> R/RDD.R:1215:19: style: Commented code should be removed.
> #' top(rdd, 6L) # list(10, 9, 7, 6, 5, 4)
>   ^~~
> R/RDD.R:1270:50: style: Commented code should be removed.
> #' aggregateRDD(rdd, zeroValue, seqOp, combOp) # list(10, 4)
>  ^~~
> R/RDD.R:1374:6: style: Commented code should be removed.
> #' # list(list("a", 0), list("b", 3), list("c", 1), list("d", 4), list("e", 
> 2))
>  
> ^~
> R/RDD.R:1415:6: style: Commented code should be removed.
> #' # list(list("a", 0), list("b", 1), list("c", 2), list("d", 3), list("e", 
> 4))
>  
> ^~
> R/RDD.R:1461:6: style: Commented code should be removed.
> #' # list(list(1, 2), list(3, 4))
>  ^~~~
> R/RDD.R:1527:6: style: Commented code should be removed.
> #' # list(list(0, 1000), list(1, 1001), list(2, 1002), list(3, 1003), list(4, 
> 1004))
>  
> ^~~
> R/RDD.R:1564:6: style: Commented code should be removed.
> #' # list(list(1, 1), list(1, 2), list(2, 1), list(2, 2))
>  ^~~~
> R/RDD.R:1595:6: style: Commented code should be removed.
> #' # list(1, 1, 3)
>  ^
> R/RDD.R:1627:6: style: Commented code should be removed.
> #' # list(1, 2, 3)
>  ^
> R/RDD.R:1663:6: style: Commented code should be removed.
> #' # list(list(1, c(1,2), c(1,2,3)), list(2, c(3,4), c(4,5,6)))
>  ^~
> R/deserialize.R:22:3: style: Commented code should be removed.
> # void -> NULL
>   ^~~~
> R/deserialize.R:23:3: style: Commented code should be removed.
> # Int -> integer
>   ^~
> R/deserialize.R:24:3: style: Commented code should be removed.
> # String -> character
>   ^~~
> R/deserialize.R:25:3: style: Commented code should be removed.
> # Boolean -> logical
>   ^~
> R/deserialize.R:26:3: style: Commented code should be removed.
> # Float -> double
>   ^~~
> R/deserialize.R:27:3: style: Commented code should be removed.
> # Double -> double
>   ^~~~
> R/deserialize.R:28:3: style: Commented code should be removed.
> # Long -> double
>   ^~
> R/deserialize.R:29:3: style: Commented code should be removed.
> # Array[Byte] -> raw
>   ^~
> R/deserialize.R:30:3: style: Commented code should be removed.

[jira] [Assigned] (SPARK-12610) Add Anti Join Operators

2016-01-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12610:


Assignee: (was: Apache Spark)

> Add Anti Join Operators
> ---
>
> Key: SPARK-12610
> URL: https://issues.apache.org/jira/browse/SPARK-12610
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Cheng Hao
>
> We need to implement the anti join operators to support the NOT 
> predicates in subqueries.






[jira] [Assigned] (SPARK-12610) Add Anti Join Operators

2016-01-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12610:


Assignee: Apache Spark

> Add Anti Join Operators
> ---
>
> Key: SPARK-12610
> URL: https://issues.apache.org/jira/browse/SPARK-12610
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Cheng Hao
>Assignee: Apache Spark
>
> We need to implement the anti join operators to support the NOT 
> predicates in subqueries.






[jira] [Commented] (SPARK-12610) Add Anti Join Operators

2016-01-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15080454#comment-15080454
 ] 

Apache Spark commented on SPARK-12610:
--

User 'chenghao-intel' has created a pull request for this issue:
https://github.com/apache/spark/pull/10563

> Add Anti Join Operators
> ---
>
> Key: SPARK-12610
> URL: https://issues.apache.org/jira/browse/SPARK-12610
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Cheng Hao
>
> We need to implement the anti join operators to support the NOT 
> predicates in subqueries.






[jira] [Created] (SPARK-12609) Make R to JVM timeout configurable

2016-01-03 Thread Shivaram Venkataraman (JIRA)
Shivaram Venkataraman created SPARK-12609:
-

 Summary: Make R to JVM timeout configurable 
 Key: SPARK-12609
 URL: https://issues.apache.org/jira/browse/SPARK-12609
 Project: Spark
  Issue Type: Improvement
  Components: rkR, SparkR
Reporter: Shivaram Venkataraman


The timeout from R to the JVM is hardcoded at 6000 seconds in 
https://github.com/apache/spark/blob/6c5bbd628aaedb6efb44c15f816fea8fb600decc/R/pkg/R/client.R#L22

This means Spark jobs that take more than 100 minutes always fail. We 
should make this timeout configurable through SparkConf.
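A sketch of that direction, reading the value from SparkConf with the current hard-coded value as the default; the property name here is illustrative, not final:

{code}
import org.apache.spark.SparkConf

// Illustrative property name; the default preserves today's hard-coded 6000 seconds.
val conf = new SparkConf()
val backendTimeoutSecs = conf.getInt("spark.r.backendConnectionTimeout", 6000)
{code}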






[jira] [Commented] (SPARK-9111) Dumping the memory info when an executor dies abnormally

2016-01-03 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15080452#comment-15080452
 ] 

Steve Loughran commented on SPARK-9111:
---

The heap dump option could itself be useful; within a YARN container the 
launch command could be set to something like {{-XX:HeapDumpPath=<>}}, and 
then the heap dump would be automatically grabbed by the YARN NodeManager and 
copied to HDFS, where it would then be cleaned up by the normal YARN history 
cleanup routines.
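A hedged sketch of how that could be wired up through executor JVM options; the dump path below is a made-up example, not the placeholder used in the comment above:

{code}
import org.apache.spark.SparkConf

// Ask the executor JVMs to write a heap dump on OOM; the path is illustrative only.
val conf = new SparkConf()
  .set("spark.executor.extraJavaOptions",
    "-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/executor-heapdumps")
{code}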

> Dumping the memory info when an executor dies abnormally
> 
>
> Key: SPARK-9111
> URL: https://issues.apache.org/jira/browse/SPARK-9111
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Zhang, Liye
>Priority: Minor
>
> When an executor does not finish normally, we should dump its memory 
> info right before the JVM shuts down, so that if the executor is killed 
> because of OOM, we can easily check how the memory was used and which part 
> caused the OOM.






[jira] [Updated] (SPARK-12610) Add Anti Join Operators

2016-01-03 Thread Cheng Hao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Hao updated SPARK-12610:
--
Issue Type: Sub-task  (was: New Feature)
Parent: SPARK-4226

> Add Anti Join Operators
> ---
>
> Key: SPARK-12610
> URL: https://issues.apache.org/jira/browse/SPARK-12610
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Cheng Hao
>
> We need to implement the anti join operators to support the NOT 
> predicates in subqueries.






[jira] [Created] (SPARK-12610) Add Anti Join Operators

2016-01-03 Thread Cheng Hao (JIRA)
Cheng Hao created SPARK-12610:
-

 Summary: Add Anti Join Operators
 Key: SPARK-12610
 URL: https://issues.apache.org/jira/browse/SPARK-12610
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Reporter: Cheng Hao


We need to implement the anti join operators to support the NOT 
predicates in subqueries.
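A semantics-only sketch of what an anti join computes, using plain Scala collections rather than Catalyst operators, and ignoring SQL NULL subtleties:

{code}
// An anti join keeps the left rows with no match on the right, which is what a
// NOT IN / NOT EXISTS subquery needs.
case class Record(id: Int, value: String)

val left     = Seq(Record(1, "a"), Record(2, "b"), Record(3, "c"))
val rightIds = Set(2)

// Roughly: SELECT * FROM left WHERE id NOT IN (SELECT id FROM right)
val antiJoined = left.filterNot(r => rightIds.contains(r.id))
// antiJoined == Seq(Record(1, "a"), Record(3, "c"))
{code}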






[jira] [Commented] (SPARK-12340) overstep the bounds of Int in SparkPlan.executeTake

2016-01-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15080435#comment-15080435
 ] 

Apache Spark commented on SPARK-12340:
--

User 'QiangCai' has created a pull request for this issue:
https://github.com/apache/spark/pull/10562

> overstep the bounds of Int in SparkPlan.executeTake
> ---
>
> Key: SPARK-12340
> URL: https://issues.apache.org/jira/browse/SPARK-12340
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: QiangCai
>Assignee: Apache Spark
>
> Reproduce
>
> sql e.g.  select * from table1 where c1 = 'abc' limit 2147483638
>
> n will be 2147483638 in SparkPlan.executeTake(n: Int). If the first 
> partition has just one row (buf.size will be one), the result of the code 
> numPartsToTry = (1.5 * n * partsScanned / buf.size).toInt will be 
> Int.MaxValue. Then math.min(partsScanned + numPartsToTry, totalParts) will be 
> Int.MinValue (-2147483648).
> Exception
> java.lang.IllegalArgumentException: Attempting to access a non-existent 
> partition: -2147483648. Total number of partitions: 200
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$submitJob$2.apply(DAGScheduler.scala:531)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$submitJob$2.apply(DAGScheduler.scala:530)
> at scala.Option.foreach(Option.scala:236)
> at 
> org.apache.spark.scheduler.DAGScheduler.submitJob(DAGScheduler.scala:530)
> at 
> org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:558)
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:1813)
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:1826)
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:1839)
> at 
> org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:215)
> at 
> org.apache.spark.sql.execution.Limit.executeCollect(basicOperators.scala:207)
> at 
> org.apache.spark.sql.DataFrame$$anonfun$collect$1.apply(DataFrame.scala:1386)
> at 
> org.apache.spark.sql.DataFrame$$anonfun$collect$1.apply(DataFrame.scala:1386)
> at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56)
> at 
> org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:1904)
> at org.apache.spark.sql.DataFrame.collect(DataFrame.scala:1385)
> at org.apache.spark.sql.DataFrame.head(DataFrame.scala:1315)
> at org.apache.spark.sql.DataFrame.take(DataFrame.scala:1378)
> at org.apache.spark.sql.DataFrame.showString(DataFrame.scala:178)
> at 
> org.apache.spark.sql.hbase.HBaseSQLCliDriver$.process(HBaseSQLCliDriver.scala:122)
> at 
> org.apache.spark.sql.hbase.HBaseSQLCliDriver$.processLine(HBaseSQLCliDriver.scala:102)
> at 
> org.apache.spark.sql.hbase.HBaseSQLCliDriver$.main(HBaseSQLCliDriver.scala:80)
> at 
> org.apache.spark.sql.hbase.HBaseSQLCliDriver.main(HBaseSQLCliDriver.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:601)
> at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:672)
> at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
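The arithmetic from the description can be reproduced directly; the last line is only a sketch of one possible fix (clamping in Double before converting back to Int), not the merged patch:

{code}
// Values taken from the report above.
val n            = 2147483638
val partsScanned = 1
val bufSize      = 1
val totalParts   = 200

// 1.5 * n is ~3.2e9, so .toInt saturates at Int.MaxValue; adding partsScanned
// then wraps around to Int.MinValue, which is the bogus partition id reported.
val numPartsToTry = (1.5 * n * partsScanned / bufSize).toInt           // 2147483647
val partsToScan   = math.min(partsScanned + numPartsToTry, totalParts) // -2147483648

// Possible fix sketch: clamp to the remaining partitions before converting to Int.
val safeNumPartsToTry =
  math.min(1.5 * n * partsScanned / bufSize, (totalParts - partsScanned).toDouble).toInt
{code}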






[jira] [Commented] (SPARK-12607) spark-class produced null command strings for "exec"

2016-01-03 Thread SM Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15080499#comment-15080499
 ] 

SM Wang commented on SPARK-12607:
-

Sure.  I think the problem is with the while loop's delimiter setting (-d '')  
or the launcher class' behavior in the MSYS64 environment.

Here is the section of the script in version 1.4.0 where I added some echo commands 
(marked with the +++ prefix).

CMD=()
while IFS= read -d '' -r ARG; do
echo "+++ Parsed Arguments in while loop: $ARG"
  CMD+=("$ARG")
done < <("$RUNNER" -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main "$@")
echo "+++ Launcher Command" "$RUNNER" -cp "$LAUNCH_CLASSPATH" 
org.apache.spark.launcher.Main "$@"

echo "+++ First Element: ${CMD[0]}"
echo "+++ Command Array: ${CMD[@]}"

if [ "${CMD[0]}" = "usage" ]; then
  "${CMD[@]}"
else
  exec "${CMD[@]}"
fi

The output from "run-example SparkPi" is as follows:

+++ Launcher Command /apps/jdk1.7.0_80/bin/java -cp 
/apps/tmp/spark-1.4.0-bin-hadoop2.4/lib/spark-assembly-1.4.0-hadoop2.4.0.jar 
org.apache.spark.launcher.Main org.apache.spark.deploy.SparkSubmit --master 
local[*] --class org.apache.spark.examples.SparkPi 
/apps/tmp/spark-1.4.0-bin-hadoop2.4/lib/spark-examples-1.4.0-hadoop2.4.0.jar
+++ First Element:
+++ Command Array:

As you can see the command array is empty.

However, when running the launcher command manually I got the following:

C:/msys64/apps/jdk1.7.0_80\bin\java -cp 
"C:/msys64/apps/tmp/spark-1.4.0-bin-hadoop2.4\conf\;C:\msys64\apps\tmp\spark-1.4.0-bin-hadoop2.4\lib\spark-assembly-1.4.0-hadoop2.4.0.jar;C:\msys64\apps\tmp\spark-1.4.0-bin-hadoop2.4\lib\datanucleus-api-jdo-3.2.6.jar;C:\msys64\apps\tmp\spark-1.4.0-bin-hadoop2.4\lib\datanucleus-core-3.2.10.jar;C:\msys64\apps\tmp\spark-1.4.0-bin-hadoop2.4\lib\datanucleus-rdbms-3.2.9.jar"
 -Xms512m -Xmx512m "-XX:MaxPermSize=128m" org.apache.spark.deploy.SparkSubmit 
--master local[*] --class org.apache.spark.examples.SparkPi 
C:/msys64/apps/tmp/spark-1.4.0-bin-hadoop2.4/lib/spark-examples-1.4.0-hadoop2.4.0.jar

When I changed the delimiter to "-d ' '" I was able to get a non-empty command 
array.

This is why I think the issue is either with the delimiter setting or with the 
launcher not producing the command string with the expected delimiter.

Hope this helps.

Thank you for looking into this.

> spark-class produced null command strings for "exec"
> 
>
> Key: SPARK-12607
> URL: https://issues.apache.org/jira/browse/SPARK-12607
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 1.4.0, 1.4.1, 1.5.2
> Environment: MSYS64 on Windows 7 64 bit
>Reporter: SM Wang
>
> When using the run-example script in 1.4.0 to run the SparkPi example, I 
> found that it did not print any text to the terminal (e.g., stdout, stderr). 
> After further investigation I found the while loop for producing the exec 
> command from the launcher class produced a null command array.
> This discrepancy was observed on 1.5.2 and 1.4.1.  1.3.1's behavior seems 
> to be correct.






[jira] [Comment Edited] (SPARK-12607) spark-class produced null command strings for "exec"

2016-01-03 Thread SM Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15080499#comment-15080499
 ] 

SM Wang edited comment on SPARK-12607 at 1/3/16 7:00 PM:
-

Sure.  I think the problem is with the while loop's delimiter setting (-d '')  
or the launcher class' behavior in the MSYS64 environment.

Here is the section of the script in version 1.4.0 where I added some echo commands 
(marked with the +++ prefix).

{{
CMD=()
while IFS= read -d '' -r ARG; do
echo "+++ Parsed Arguments in while loop: $ARG"
  CMD+=("$ARG")
done < <("$RUNNER" -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main "$@")
echo "+++ Launcher Command" "$RUNNER" -cp "$LAUNCH_CLASSPATH" 
org.apache.spark.launcher.Main "$@"

echo "+++ First Element: ${CMD[0]}"
echo "+++ Command Array: ${CMD[@]}"

if [ "${CMD[0]}" = "usage" ]; then
  "${CMD[@]}"
else
  exec "${CMD[@]}"
fi
}}

The output from "run-example SparkPi" is as follows:

+++ Launcher Command /apps/jdk1.7.0_80/bin/java -cp 
/apps/tmp/spark-1.4.0-bin-hadoop2.4/lib/spark-assembly-1.4.0-hadoop2.4.0.jar 
org.apache.spark.launcher.Main org.apache.spark.deploy.SparkSubmit --master 
local[*] --class org.apache.spark.examples.SparkPi 
/apps/tmp/spark-1.4.0-bin-hadoop2.4/lib/spark-examples-1.4.0-hadoop2.4.0.jar
+++ First Element:
+++ Command Array:

As you can see the command array is empty.

However, when running the launcher command manually I got the following:

C:/msys64/apps/jdk1.7.0_80\bin\java -cp 
"C:/msys64/apps/tmp/spark-1.4.0-bin-hadoop2.4\conf\;C:\msys64\apps\tmp\spark-1.4.0-bin-hadoop2.4\lib\spark-assembly-1.4.0-hadoop2.4.0.jar;C:\msys64\apps\tmp\spark-1.4.0-bin-hadoop2.4\lib\datanucleus-api-jdo-3.2.6.jar;C:\msys64\apps\tmp\spark-1.4.0-bin-hadoop2.4\lib\datanucleus-core-3.2.10.jar;C:\msys64\apps\tmp\spark-1.4.0-bin-hadoop2.4\lib\datanucleus-rdbms-3.2.9.jar"
 -Xms512m -Xmx512m "-XX:MaxPermSize=128m" org.apache.spark.deploy.SparkSubmit 
--master local[*] --class org.apache.spark.examples.SparkPi 
C:/msys64/apps/tmp/spark-1.4.0-bin-hadoop2.4/lib/spark-examples-1.4.0-hadoop2.4.0.jar

When I changed the delimiter to "-d ' '" I was able to get a non-empty command 
array.

This is why I think the issue is either with the delimiter setting or with the 
launcher not producing the command string with the expected delimiter.

Hope this helps.

Thank you for looking into this.


was (Author: swang):
Sure.  I think the problem is with the while loop's delimiter setting (-d '')  
or the launcher class' behavior in the MSYS64 environment.

Here is the section of the script in version 1.4.0 where I added some echo commands 
(marked with the +++ prefix).

CMD=()
while IFS= read -d '' -r ARG; do
echo "+++ Parsed Arguments in while loop: $ARG"
  CMD+=("$ARG")
done < <("$RUNNER" -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main "$@")
echo "+++ Launcher Command" "$RUNNER" -cp "$LAUNCH_CLASSPATH" 
org.apache.spark.launcher.Main "$@"

echo "+++ First Element: ${CMD[0]}"
echo "+++ Command Array: ${CMD[@]}"

if [ "${CMD[0]}" = "usage" ]; then
  "${CMD[@]}"
else
  exec "${CMD[@]}"
fi

The output from "run-example SparkPi" is as follows:

+++ Launcher Command /apps/jdk1.7.0_80/bin/java -cp 
/apps/tmp/spark-1.4.0-bin-hadoop2.4/lib/spark-assembly-1.4.0-hadoop2.4.0.jar 
org.apache.spark.launcher.Main org.apache.spark.deploy.SparkSubmit --master 
local[*] --class org.apache.spark.examples.SparkPi 
/apps/tmp/spark-1.4.0-bin-hadoop2.4/lib/spark-examples-1.4.0-hadoop2.4.0.jar
+++ First Element:
+++ Command Array:

As you can see the command array is empty.

However, when running the launcher command manually I got the following:

C:/msys64/apps/jdk1.7.0_80\bin\java -cp 
"C:/msys64/apps/tmp/spark-1.4.0-bin-hadoop2.4\conf\;C:\msys64\apps\tmp\spark-1.4.0-bin-hadoop2.4\lib\spark-assembly-1.4.0-hadoop2.4.0.jar;C:\msys64\apps\tmp\spark-1.4.0-bin-hadoop2.4\lib\datanucleus-api-jdo-3.2.6.jar;C:\msys64\apps\tmp\spark-1.4.0-bin-hadoop2.4\lib\datanucleus-core-3.2.10.jar;C:\msys64\apps\tmp\spark-1.4.0-bin-hadoop2.4\lib\datanucleus-rdbms-3.2.9.jar"
 -Xms512m -Xmx512m "-XX:MaxPermSize=128m" org.apache.spark.deploy.SparkSubmit 
--master local[*] --class org.apache.spark.examples.SparkPi 
C:/msys64/apps/tmp/spark-1.4.0-bin-hadoop2.4/lib/spark-examples-1.4.0-hadoop2.4.0.jar

When I changed the delimiter to "-d ' '" I was able to get a non-empty command 
array.

This is why I think the issue is either with the delimiter setting or with the 
launcher not producing the command string with the expected delimiter.

Hope this helps.

Thank you for looking into this.

> spark-class produced null command strings for "exec"
> 
>
> Key: SPARK-12607
> URL: https://issues.apache.org/jira/browse/SPARK-12607
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 

[jira] [Resolved] (SPARK-12533) hiveContext.table() throws the wrong exception

2016-01-03 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-12533.
-
   Resolution: Fixed
 Assignee: Thomas Sebastian
Fix Version/s: 2.0.0

> hiveContext.table() throws the wrong exception
> --
>
> Key: SPARK-12533
> URL: https://issues.apache.org/jira/browse/SPARK-12533
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Michael Armbrust
>Assignee: Thomas Sebastian
> Fix For: 2.0.0
>
>
> This should throw an {{AnalysisException}} that includes the table name 
> instead of the following:
> {code}
> org.apache.spark.sql.catalyst.analysis.NoSuchTableException
>   at 
> org.apache.spark.sql.hive.client.ClientInterface$$anonfun$getTable$1.apply(ClientInterface.scala:122)
>   at 
> org.apache.spark.sql.hive.client.ClientInterface$$anonfun$getTable$1.apply(ClientInterface.scala:122)
>   at scala.Option.getOrElse(Option.scala:120)
>   at 
> org.apache.spark.sql.hive.client.ClientInterface$class.getTable(ClientInterface.scala:122)
>   at 
> org.apache.spark.sql.hive.client.ClientWrapper.getTable(ClientWrapper.scala:60)
>   at 
> org.apache.spark.sql.hive.HiveMetastoreCatalog.lookupRelation(HiveMetastoreCatalog.scala:384)
>   at 
> org.apache.spark.sql.hive.HiveContext$$anon$2.org$apache$spark$sql$catalyst$analysis$OverrideCatalog$$super$lookupRelation(HiveContext.scala:458)
>   at 
> org.apache.spark.sql.catalyst.analysis.OverrideCatalog$class.lookupRelation(Catalog.scala:161)
>   at 
> org.apache.spark.sql.hive.HiveContext$$anon$2.lookupRelation(HiveContext.scala:458)
>   at org.apache.spark.sql.SQLContext.table(SQLContext.scala:830)
>   at org.apache.spark.sql.SQLContext.table(SQLContext.scala:826)
> {code}


