[jira] [Created] (SPARK-11000) Derby has booted the database twice in YARN security mode.

2015-10-08 Thread SaintBacchus (JIRA)
SaintBacchus created SPARK-11000:


 Summary: Derby has booted the database twice in YARN security mode.
 Key: SPARK-11000
 URL: https://issues.apache.org/jira/browse/SPARK-11000
 Project: Spark
  Issue Type: Bug
  Components: Spark Shell, SQL, YARN
Affects Versions: 1.6.0
Reporter: SaintBacchus


*bin/spark-shell --master yarn-client*
If Spark was built with Hive, even this simple command fails with: 
_Another instance of Derby may have already booted the database_
{code:title=Exception|borderStyle=solid}
Caused by: java.sql.SQLException: Another instance of Derby may have already 
booted the database /opt/client/Spark/spark/metastore_db.
at 
org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source)
at 
org.apache.derby.impl.jdbc.SQLExceptionFactory40.wrapArgsForTransportAcrossDRDA(Unknown
 Source)
at 
org.apache.derby.impl.jdbc.SQLExceptionFactory40.getSQLException(Unknown Source)
at org.apache.derby.impl.jdbc.Util.generateCsSQLException(Unknown 
Source)
... 130 more
Caused by: ERROR XSDB6: Another instance of Derby may have already booted the 
database /opt/client/Spark/spark/metastore_db.
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11000) Derby has booted the database twice in YARN security mode.

2015-10-08 Thread SaintBacchus (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14948138#comment-14948138
 ] 

SaintBacchus commented on SPARK-11000:
--

Very similar to it, but this is in YARN security mode.

> Derby has booted the database twice in YARN security mode.
> ---
>
> Key: SPARK-11000
> URL: https://issues.apache.org/jira/browse/SPARK-11000
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, SQL, YARN
>Affects Versions: 1.6.0
>Reporter: SaintBacchus
>
> *bin/spark-shell --master yarn-client*
> If Spark was built with Hive, even this simple command fails with: 
> _Another instance of Derby may have already booted the database_
> {code:title=Exception|borderStyle=solid}
> Caused by: java.sql.SQLException: Another instance of Derby may have already 
> booted the database /opt/client/Spark/spark/metastore_db.
> at 
> org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source)
> at 
> org.apache.derby.impl.jdbc.SQLExceptionFactory40.wrapArgsForTransportAcrossDRDA(Unknown
>  Source)
> at 
> org.apache.derby.impl.jdbc.SQLExceptionFactory40.getSQLException(Unknown 
> Source)
> at org.apache.derby.impl.jdbc.Util.generateCsSQLException(Unknown 
> Source)
> ... 130 more
> Caused by: ERROR XSDB6: Another instance of Derby may have already booted the 
> database /opt/client/Spark/spark/metastore_db.
> {code}






[jira] [Commented] (SPARK-11000) Derby has booted the database twice in YARN security mode.

2015-10-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14948237#comment-14948237
 ] 

Apache Spark commented on SPARK-11000:
--

User 'SaintBacchus' has created a pull request for this issue:
https://github.com/apache/spark/pull/9026

> Derby has booted the database twice in YARN security mode.
> ---
>
> Key: SPARK-11000
> URL: https://issues.apache.org/jira/browse/SPARK-11000
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, SQL, YARN
>Affects Versions: 1.6.0
>Reporter: SaintBacchus
>
> *bin/spark-shell --master yarn-client*
> If Spark was built with Hive, even this simple command fails with: 
> _Another instance of Derby may have already booted the database_
> {code:title=Exception|borderStyle=solid}
> Caused by: java.sql.SQLException: Another instance of Derby may have already 
> booted the database /opt/client/Spark/spark/metastore_db.
> at 
> org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source)
> at 
> org.apache.derby.impl.jdbc.SQLExceptionFactory40.wrapArgsForTransportAcrossDRDA(Unknown
>  Source)
> at 
> org.apache.derby.impl.jdbc.SQLExceptionFactory40.getSQLException(Unknown 
> Source)
> at org.apache.derby.impl.jdbc.Util.generateCsSQLException(Unknown 
> Source)
> ... 130 more
> Caused by: ERROR XSDB6: Another instance of Derby may have already booted the 
> database /opt/client/Spark/spark/metastore_db.
> {code}






[jira] [Assigned] (SPARK-11000) Derby has booted the database twice in YARN security mode.

2015-10-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11000:


Assignee: Apache Spark

> Derby has booted the database twice in YARN security mode.
> ---
>
> Key: SPARK-11000
> URL: https://issues.apache.org/jira/browse/SPARK-11000
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, SQL, YARN
>Affects Versions: 1.6.0
>Reporter: SaintBacchus
>Assignee: Apache Spark
>
> *bin/spark-shell --master yarn-client*
> If Spark was built with Hive, even this simple command fails with: 
> _Another instance of Derby may have already booted the database_
> {code:title=Exception|borderStyle=solid}
> Caused by: java.sql.SQLException: Another instance of Derby may have already 
> booted the database /opt/client/Spark/spark/metastore_db.
> at 
> org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source)
> at 
> org.apache.derby.impl.jdbc.SQLExceptionFactory40.wrapArgsForTransportAcrossDRDA(Unknown
>  Source)
> at 
> org.apache.derby.impl.jdbc.SQLExceptionFactory40.getSQLException(Unknown 
> Source)
> at org.apache.derby.impl.jdbc.Util.generateCsSQLException(Unknown 
> Source)
> ... 130 more
> Caused by: ERROR XSDB6: Another instance of Derby may have already booted the 
> database /opt/client/Spark/spark/metastore_db.
> {code}






[jira] [Assigned] (SPARK-11000) Derby has booted the database twice in YARN security mode.

2015-10-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11000:


Assignee: (was: Apache Spark)

> Derby has booted the database twice in YARN security mode.
> ---
>
> Key: SPARK-11000
> URL: https://issues.apache.org/jira/browse/SPARK-11000
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, SQL, YARN
>Affects Versions: 1.6.0
>Reporter: SaintBacchus
>
> *bin/spark-shell --master yarn-client*
> If Spark was built with Hive, even this simple command fails with: 
> _Another instance of Derby may have already booted the database_
> {code:title=Exception|borderStyle=solid}
> Caused by: java.sql.SQLException: Another instance of Derby may have already 
> booted the database /opt/client/Spark/spark/metastore_db.
> at 
> org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source)
> at 
> org.apache.derby.impl.jdbc.SQLExceptionFactory40.wrapArgsForTransportAcrossDRDA(Unknown
>  Source)
> at 
> org.apache.derby.impl.jdbc.SQLExceptionFactory40.getSQLException(Unknown 
> Source)
> at org.apache.derby.impl.jdbc.Util.generateCsSQLException(Unknown 
> Source)
> ... 130 more
> Caused by: ERROR XSDB6: Another instance of Derby may have already booted the 
> database /opt/client/Spark/spark/metastore_db.
> {code}






[jira] [Comment Edited] (SPARK-10981) R semijoin leads to Java errors, R leftsemi leads to Spark errors

2015-10-08 Thread Sun Rui (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14948299#comment-14948299
 ] 

Sun Rui edited comment on SPARK-10981 at 10/8/15 8:48 AM:
--

Yes, this is a bug in SparkR. Your fix looks good. Could you submit a PR for this?

In the PR, please:
1. Support all join types defined in 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/joinTypes.scala 
(you can remove the "_" char from the join types currently supported in SparkR); 
see the sketch below.

2. Add test cases for the missing join types, including "leftsemi".
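
For reference, a hedged sketch of the string-to-JoinType mapping on the Scala side (the real parser lives in joinTypes.scala and may differ in detail); this is the set of strings the R wrapper should end up accepting:
{code}
// Hedged sketch, not the actual catalyst source: strings are lower-cased and
// "_" is stripped before matching, which is why SparkR can simply pass
// "left_outer" etc. through once the R-side whitelist is relaxed.
sealed trait JoinType
case object Inner extends JoinType
case object FullOuter extends JoinType
case object LeftOuter extends JoinType
case object RightOuter extends JoinType
case object LeftSemi extends JoinType

def parseJoinType(typ: String): JoinType = typ.toLowerCase.replace("_", "") match {
  case "inner" => Inner
  case "outer" | "full" | "fullouter" => FullOuter
  case "leftouter" | "left" => LeftOuter
  case "rightouter" | "right" => RightOuter
  case "leftsemi" => LeftSemi
  case other => throw new IllegalArgumentException(
    s"Unsupported join type '$other'. Supported join types include: " +
      "'inner', 'outer', 'full', 'fullouter', 'leftouter', 'left', " +
      "'rightouter', 'right', 'leftsemi'.")
}
{code}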


was (Author: sunrui):
yes, this is a bug in SparkR. your fix looks good. Could you submit a PR for 
this?

In the PR, please:
1. Support all join types defined in 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/joinTypes.scala 
(You can move the "_" char from the currently supported join types in SparkR)

2. Add test cases for missing join types including "leftsemi"

> R semijoin leads to Java errors, R leftsemi leads to Spark errors
> -
>
> Key: SPARK-10981
> URL: https://issues.apache.org/jira/browse/SPARK-10981
> Project: Spark
>  Issue Type: Bug
>  Components: R
>Affects Versions: 1.5.0
> Environment: SparkR from RStudio on Macbook
>Reporter: Monica Liu
>Priority: Minor
>  Labels: easyfix, newbie
>
> I am using SparkR from RStudio, and I ran into an error with the join 
> function that I recreated with a smaller example:
> {code:title=joinTest.R|borderStyle=solid}
> Sys.setenv(SPARK_HOME="/Users/liumo1/Applications/spark/")
> .libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
> library(SparkR)
> sc <- sparkR.init("local[4]")
> sqlContext <- sparkRSQL.init(sc) 
> n = c(2, 3, 5)
> s = c("aa", "bb", "cc")
> b = c(TRUE, FALSE, TRUE)
> df = data.frame(n, s, b)
> df1= createDataFrame(sqlContext, df)
> showDF(df1)
> x = c(2, 3, 10)
> t = c("dd", "ee", "ff")
> c = c(FALSE, FALSE, TRUE)
> dff = data.frame(x, t, c)
> df2 = createDataFrame(sqlContext, dff)
> showDF(df2)
> res = join(df1, df2, df1$n == df2$x, "semijoin")
> showDF(res)
> {code}
> Running this code, I encountered the error:
> {panel}
> Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) : 
>   java.lang.IllegalArgumentException: Unsupported join type 'semijoin'. 
> Supported join types include: 'inner', 'outer', 'full', 'fullouter', 
> 'leftouter', 'left', 'rightouter', 'right', 'leftsemi'.
> {panel}
> However, if I changed the joinType to "leftsemi", 
> {code}
> res = join(df1, df2, df1$n == df2$x, "leftsemi")
> {code}
> I would get the error:
> {panel}
> Error in .local(x, y, ...) : 
>   joinType must be one of the following types: 'inner', 'outer', 
> 'left_outer', 'right_outer', 'semijoin'
> {panel}
> Since the join function in R appears to invoke a Java method, I went into 
> DataFrame.R and changed the code on lines 1374 and 1378, replacing 
> "semijoin" with "leftsemi" to match the Java function's parameters. This also 
> makes the accepted R joinType values match Scala's. 
> semijoin:
> {code:title=DataFrame.R: join(x, y, joinExpr, joinType)|borderStyle=solid}
> if (joinType %in% c("inner", "outer", "left_outer", "right_outer", 
> "semijoin")) {
> sdf <- callJMethod(x@sdf, "join", y@sdf, joinExpr@jc, joinType)
> } 
> else {
>  stop("joinType must be one of the following types: ",
>  "'inner', 'outer', 'left_outer', 'right_outer', 'semijoin'")
> }
> {code}
> leftsemi:
> {code:title=DataFrame.R: join(x, y, joinExpr, joinType)|borderStyle=solid}
> if (joinType %in% c("inner", "outer", "left_outer", "right_outer", 
> "leftsemi")) {
> sdf <- callJMethod(x@sdf, "join", y@sdf, joinExpr@jc, joinType)
> } 
> else {
>  stop("joinType must be one of the following types: ",
>  "'inner', 'outer', 'left_outer', 'right_outer', 'leftsemi'")
> }
> {code}
> This fixed the issue. I'm not sure whether this solution breaks Hive 
> compatibility or causes other issues, but I can submit a pull request to 
> change this.






[jira] [Commented] (SPARK-10981) R semijoin leads to Java errors, R leftsemi leads to Spark errors

2015-10-08 Thread Sun Rui (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14948299#comment-14948299
 ] 

Sun Rui commented on SPARK-10981:
-

Yes, this is a bug in SparkR. Your fix looks good. Could you submit a PR for this?

In the PR, please:
1. Support all join types defined in 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/joinTypes.scala 
(you can remove the "_" char from the join types currently supported in SparkR)

2. Add test cases for the missing join types, including "leftsemi".

> R semijoin leads to Java errors, R leftsemi leads to Spark errors
> -
>
> Key: SPARK-10981
> URL: https://issues.apache.org/jira/browse/SPARK-10981
> Project: Spark
>  Issue Type: Bug
>  Components: R
>Affects Versions: 1.5.0
> Environment: SparkR from RStudio on Macbook
>Reporter: Monica Liu
>Priority: Minor
>  Labels: easyfix, newbie
>
> I am using SparkR from RStudio, and I ran into an error with the join 
> function that I recreated with a smaller example:
> {code:title=joinTest.R|borderStyle=solid}
> Sys.setenv(SPARK_HOME="/Users/liumo1/Applications/spark/")
> .libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
> library(SparkR)
> sc <- sparkR.init("local[4]")
> sqlContext <- sparkRSQL.init(sc) 
> n = c(2, 3, 5)
> s = c("aa", "bb", "cc")
> b = c(TRUE, FALSE, TRUE)
> df = data.frame(n, s, b)
> df1= createDataFrame(sqlContext, df)
> showDF(df1)
> x = c(2, 3, 10)
> t = c("dd", "ee", "ff")
> c = c(FALSE, FALSE, TRUE)
> dff = data.frame(x, t, c)
> df2 = createDataFrame(sqlContext, dff)
> showDF(df2)
> res = join(df1, df2, df1$n == df2$x, "semijoin")
> showDF(res)
> {code}
> Running this code, I encountered the error:
> {panel}
> Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) : 
>   java.lang.IllegalArgumentException: Unsupported join type 'semijoin'. 
> Supported join types include: 'inner', 'outer', 'full', 'fullouter', 
> 'leftouter', 'left', 'rightouter', 'right', 'leftsemi'.
> {panel}
> However, if I changed the joinType to "leftsemi", 
> {code}
> res = join(df1, df2, df1$n == df2$x, "leftsemi")
> {code}
> I would get the error:
> {panel}
> Error in .local(x, y, ...) : 
>   joinType must be one of the following types: 'inner', 'outer', 
> 'left_outer', 'right_outer', 'semijoin'
> {panel}
> Since the join function in R appears to invoke a Java method, I went into 
> DataFrame.R and changed the code on lines 1374 and 1378, replacing 
> "semijoin" with "leftsemi" to match the Java function's parameters. This also 
> makes the accepted R joinType values match Scala's. 
> semijoin:
> {code:title=DataFrame.R: join(x, y, joinExpr, joinType)|borderStyle=solid}
> if (joinType %in% c("inner", "outer", "left_outer", "right_outer", 
> "semijoin")) {
> sdf <- callJMethod(x@sdf, "join", y@sdf, joinExpr@jc, joinType)
> } 
> else {
>  stop("joinType must be one of the following types: ",
>  "'inner', 'outer', 'left_outer', 'right_outer', 'semijoin'")
> }
> {code}
> leftsemi:
> {code:title=DataFrame.R: join(x, y, joinExpr, joinType)|borderStyle=solid}
> if (joinType %in% c("inner", "outer", "left_outer", "right_outer", 
> "leftsemi")) {
> sdf <- callJMethod(x@sdf, "join", y@sdf, joinExpr@jc, joinType)
> } 
> else {
>  stop("joinType must be one of the following types: ",
>  "'inner', 'outer', 'left_outer', 'right_outer', 'leftsemi'")
> }
> {code}
> This fixed the issue. I'm not sure whether this solution breaks Hive 
> compatibility or causes other issues, but I can submit a pull request to 
> change this.






[jira] [Closed] (SPARK-10879) spark on yarn support priority option

2015-10-08 Thread Yun Zhao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yun Zhao closed SPARK-10879.

Resolution: Later

> spark on yarn support priority option
> -
>
> Key: SPARK-10879
> URL: https://issues.apache.org/jira/browse/SPARK-10879
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Submit, YARN
>Reporter: Yun Zhao
>
> Add a YARN-only option to spark-submit: *--priority PRIORITY*, the priority 
> of your YARN application (default: 0).
> Add a corresponding property: *spark.yarn.priority*.






[jira] [Commented] (SPARK-10971) sparkR: RRunner should allow setting path to Rscript

2015-10-08 Thread Sun Rui (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14948307#comment-14948307
 ] 

Sun Rui commented on SPARK-10971:
-

Just curious: how do you distribute Rscript to the YARN nodes? Why not install 
R on all YARN nodes, so that it need not be distributed for each job and 
performance improves?

> sparkR: RRunner should allow setting path to Rscript
> 
>
> Key: SPARK-10971
> URL: https://issues.apache.org/jira/browse/SPARK-10971
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.5.1
>Reporter: Thomas Graves
>
> I'm running Spark on YARN and trying to use R in cluster mode. RRunner seems 
> to just call Rscript and assumes it's on the path. But on our YARN deployment 
> R isn't installed on the nodes, so it needs to be distributed along with the 
> job, and we need the ability to point to where it gets installed. sparkR in 
> client mode has the config spark.sparkr.r.command to point to Rscript; 
> RRunner should have something similar so it works in cluster mode.
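
As a rough illustration of the request (a hedged sketch, not the actual RRunner code; the config name is the client-mode one mentioned above, and myScript.R is a placeholder), cluster mode could resolve the R executable from configuration before falling back to the PATH:
{code}
import java.util.Arrays
import org.apache.spark.SparkConf

// Hedged sketch: prefer an explicit config over assuming "Rscript" is on PATH,
// so a distributed R installation can be pointed to on YARN nodes.
val conf = new SparkConf()
val rCommand = conf.get("spark.sparkr.r.command", "Rscript")  // config name taken from the description
val builder = new ProcessBuilder(Arrays.asList(rCommand, "--vanilla", "myScript.R"))
// builder.start() would launch the user's R script with the resolved interpreter
{code}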






[jira] [Updated] (SPARK-10960) SQL with windowing function cannot reference column in inner select block

2015-10-08 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-10960:

Description: 
There seems to be a bug in the Spark SQL parser when I use windowing functions. 
Specifically, when the SELECT refers to a column from an inner select block, 
the parser throws an error.

Here is an example:
--
When I use a windowing function and add a '1' constant to the result, 
{code}
   select Rank() OVER ( ORDER BY D1.c3 ) + 1 as c1
{code}

The Spark SQL parser works. The whole SQL is:
{code}
select Rank() OVER ( ORDER BY D1.c3 ) + 1 as c1,
 D1.c3 as c3,
 D1.c4 as c4,
 D1.c5 as c5
from 
 (select T3671.ROW_WID as c3,
   T3671.CAL_MONTH as c4,
   T3671.CAL_YEAR as c5,
   1 as c6
  from 
   W_DAY_D T3671
 ) D1
{code}

However, if I change the projection so that it refers to a column in an inner 
select block, D1.C6, whose value is itself a '1' literal (making it functionally 
equivalent to the SQL above), Spark SQL throws an error:
{code}
select Rank() OVER ( ORDER BY D1.c3 ) + D1.C6 as c1,
 D1.c3 as c3,
 D1.c4 as c4,
 D1.c5 as c5
from 
 (select T3671.ROW_WID as c3,
   T3671.CAL_MONTH as c4,
   T3671.CAL_YEAR as c5,
   1 as c6
  from 
   W_DAY_D T3671
 ) D1
{code}

The error message is:
{code}
java.lang.NullPointerException
Error: org.apache.spark.sql.AnalysisException: resolved attribute(s) c6#3386 missing from c5#3390,c3#3383,c4#3389,_we0#3461,c3#3388 in operator !Project [c3#3388,c4#3389,c5#3390,c3#3383,_we0#3461,(_we0#3461 + c6#3386) AS c1#3387]; (state=,code=0)
{code}

The above example is a simplified version of the SQL I was testing. The full 
SQL I was using, which fails with a similar error, is as follows:

{code}
select Case when case D1.c6 when 1 then D1.c3 else NULL end  is not null then 
Rank() OVER ( ORDER BY case when ( case D1.c6 when 1 then D1.c3 else NULL end  
) is null then 1 else 0 end, case D1.c6 when 1 then D1.c3 else NULL end ) end 
as c1,
 Case when case D1.c7 when 1 then D1.c3 else NULL end  
is not null then Rank() OVER ( PARTITION BY D1.c4, D1.c5 ORDER BY case when ( 
case D1.c7 when 1 then D1.c3 else NULL end  ) is null then 1 else 0 end, case 
D1.c7 when 1 then D1.c3 else NULL end ) end as c2,
 D1.c3 as c3,
 D1.c4 as c4,
 D1.c5 as c5
from 
 (select T3671.ROW_WID as c3,
   T3671.CAL_MONTH as c4,
   T3671.CAL_YEAR as c5,
   ROW_NUMBER() OVER (PARTITION BY 
T3671.CAL_MONTH, T3671.CAL_YEAR ORDER BY T3671.CAL_MONTH DESC, T3671.CAL_YEAR 
DESC) as c6,
   ROW_NUMBER() OVER (PARTITION BY 
T3671.CAL_MONTH, T3671.CAL_YEAR, T3671.ROW_WID ORDER BY T3671.CAL_MONTH DESC, 
T3671.CAL_YEAR DESC, T3671.ROW_WID DESC) as c7
  from 
   W_DAY_D T3671
 ) D1
{code}

Hopefully when fixed, both these sample SQLs should work!


  was:
There seems to be a bug in the Spark SQL parser when I use windowing functions. 
Specifically, when the SELECT refers to a column from an inner select block, 
the parser throws an error.

Here is an example:
--
When I use a windowing function and add a '1' constant to the result, 

   select Rank() OVER ( ORDER BY D1.c3 ) + 1 as c1

The Spark SQL parser works. The whole SQL is:

select Rank() OVER ( ORDER BY D1.c3 ) + 1 as c1,
 D1.c3 as c3,
 D1.c4 as c4,
 D1.c5 as c5
from 
 (select T3671.ROW_WID as c3,
   T3671.CAL_MONTH as c4,
   T3671.CAL_YEAR as c5,
   1 as c6
  from 
   W_DAY_D T3671
 ) D1
--
However, if I change the projection so that it refers to a column in an inner 
select block, D1.C6, whose value is itself a '1' literal, so it is functionally 
equivalent to the SQL above, Spark SQL will throw an error:

select Rank() OVER ( ORDER BY D1.c3 ) + 

[jira] [Commented] (SPARK-10903) Make sqlContext global

2015-10-08 Thread Sun Rui (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14948268#comment-14948268
 ] 

Sun Rui commented on SPARK-10903:
-

There are a number of functions defined in SQLContext.R that take a sqlContext 
instance as their first argument. Instead of removing the first argument from 
these functions, I'd rather we allow both cases (that is, the sqlContext 
parameter may be passed or omitted) in these functions, for backward 
compatibility.

> Make sqlContext global 
> ---
>
> Key: SPARK-10903
> URL: https://issues.apache.org/jira/browse/SPARK-10903
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Narine Kokhlikyan
>Priority: Minor
>
> Make sqlContext global so that we don't have to always specify it.
> e.g. createDataFrame(iris) instead of createDataFrame(sqlContext, iris)






[jira] [Resolved] (SPARK-7869) Spark Data Frame Fails to Load Postgres Tables with JSONB DataType Columns

2015-10-08 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-7869.

   Resolution: Fixed
Fix Version/s: 1.6.0

> Spark Data Frame Fails to Load Postgres Tables with JSONB DataType Columns
> --
>
> Key: SPARK-7869
> URL: https://issues.apache.org/jira/browse/SPARK-7869
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.3.0, 1.3.1
> Environment: Spark 1.3.1
>Reporter: Brad Willard
>Assignee: Alexey Grishchenko
>Priority: Minor
> Fix For: 1.6.0
>
>
> Most of our tables load into DataFrames just fine with Postgres. However, we 
> have a number of tables leveraging the JSONB datatype, and Spark errors and 
> refuses to load them. While asking for Spark to support JSONB might be a 
> tall order in the short term, it would be great if Spark would at least load 
> the table while ignoring the columns it can't load, or offer that as an option.
> {code}
> pdf = sql_context.load(source="jdbc", url=url, dbtable="table_of_json")
> Py4JJavaError: An error occurred while calling o41.load.
> : java.sql.SQLException: Unsupported type 
> at org.apache.spark.sql.jdbc.JDBCRDD$.getCatalystType(JDBCRDD.scala:78)
> at org.apache.spark.sql.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:112)
> at org.apache.spark.sql.jdbc.JDBCRelation.(JDBCRelation.scala:133)
> at 
> org.apache.spark.sql.jdbc.DefaultSource.createRelation(JDBCRelation.scala:121)
> at org.apache.spark.sql.sources.ResolvedDataSource$.apply(ddl.scala:219)
> at org.apache.spark.sql.SQLContext.load(SQLContext.scala:697)
> at org.apache.spark.sql.SQLContext.load(SQLContext.scala:685)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
> at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
> at py4j.Gateway.invoke(Gateway.java:259)
> at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
> at py4j.commands.CallCommand.execute(CallCommand.java:79)
> at py4j.GatewayConnection.run(GatewayConnection.java:207)
> at java.lang.Thread.run(Thread.java:745)
> {code}
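
Until the fix is available, a possible workaround (a hedged sketch on Spark 1.4+, assuming the JSONB column is named data and relying on dbtable accepting a subquery) is to cast the JSONB column to text on the database side:
{code}
// Hedged workaround sketch: cast JSONB to text in a subquery so the JDBC driver
// reports a plain string column instead of an unsupported type. The url and
// column name "data" are placeholders.
val url = "jdbc:postgresql://host:5432/db?user=...&password=..."
val pdf = sqlContext.read.format("jdbc").options(Map(
  "url" -> url,
  "dbtable" -> "(SELECT id, data::text AS data FROM table_of_json) AS t"
)).load()
{code}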






[jira] [Commented] (SPARK-10977) SQL injection bugs in JdbcUtils and DataFrameWriter

2015-10-08 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14948302#comment-14948302
 ] 

Sean Owen commented on SPARK-10977:
---

It's a JDBC thing rather than database-specific (e.g. parsed by JDBC), but yes, I 
don't know if it works for every possible part of the query. For example, I don't 
know whether you can prepare a statement like "SELECT * FROM ?". It's worth ruling 
that out before considering another approach, as it would be by far the easiest 
thing. Quoting is probably the next easiest.

Would love to see a test case involving Little Bobby Tables here. 
https://xkcd.com/327/
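
To make the quoting option concrete, a hedged sketch (an illustration, not the eventual patch): JDBC parameter markers only bind values, not identifiers, so the fallback is to treat the user-supplied name as delimited identifiers before splicing it into the statement:
{code}
import java.sql.Connection

// Hedged sketch of the quoting approach from the description: double-quote each
// identifier part and double any embedded quotes. The split on '.' is naive;
// real identifier parsing (e.g. Derby's IdUtil) is more careful.
def quoteIdentifier(name: String): String =
  "\"" + name.replace("\"", "\"\"") + "\""

def dropTable(conn: Connection, table: String): Unit = {
  val sql = table.split('.').map(quoteIdentifier).mkString("DROP TABLE ", ".", "")
  conn.prepareStatement(sql).executeUpdate()
}

// dropTable(conn, "trafficSchema.congestionTable") issues:
//   DROP TABLE "trafficSchema"."congestionTable"
{code}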

> SQL injection bugs in JdbcUtils and DataFrameWriter
> ---
>
> Key: SPARK-10977
> URL: https://issues.apache.org/jira/browse/SPARK-10977
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Rick Hillegas
>Priority: Minor
>
> SPARK-10857 identifies a SQL injection bug in the JDBC dialect code. A 
> similar SQL injection bug can be found in 2 places in JdbcUtils and another 
> place in DataFrameWriter:
> {noformat}
> The DROP TABLE logic in JdbcUtils concatenates boilerplate with a 
> user-supplied string:
> def dropTable(conn: Connection, table: String): Unit = {
> conn.prepareStatement(s"DROP TABLE $table").executeUpdate()
>   }
> Same for the INSERT logic in JdbcUtils:
> def insertStatement(conn: Connection, table: String, rddSchema: StructType): 
> PreparedStatement = {
> val sql = new StringBuilder(s"INSERT INTO $table VALUES (")
> var fieldsLeft = rddSchema.fields.length
> while (fieldsLeft > 0) {
>   sql.append("?")
>   if (fieldsLeft > 1) sql.append(", ") else sql.append(")")
>   fieldsLeft = fieldsLeft - 1
> }
> conn.prepareStatement(sql.toString())
>   }
> Same for the CREATE TABLE logic in DataFrameWriter:
>   def jdbc(url: String, table: String, connectionProperties: Properties): 
> Unit = {
>...
>
> if (!tableExists) {
> val schema = JdbcUtils.schemaString(df, url)
> val sql = s"CREATE TABLE $table ($schema)"
> conn.prepareStatement(sql).executeUpdate()
>   }
>...
>   }
> {noformat}
> Maybe we can find a common solution to all of these SQL injection bugs. 
> Something like this:
> 1) Parse the user-supplied table name into a table identifier and an optional 
> schema identifier. We can borrow logic from org.apache.derby.iapi.util.IdUtil 
> in order to do this.
> 2) Double-quote (and escape as necessary) the schema and table identifiers so 
> that the database interprets them as delimited ids.
> That should prevent the SQL injection attacks.
> With this solution, if the user specifies table names like cityTable and 
> trafficSchema.congestionTable, then the generated DROP TABLE statements would 
> be
> {noformat}
> DROP TABLE "CITYTABLE"
> DROP TABLE "TRAFFICSCHEMA"."CONGESTIONTABLE"
> {noformat}






[jira] [Commented] (SPARK-10999) Physical plan node Coalesce should be able to handle UnsafeRow

2015-10-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14948133#comment-14948133
 ] 

Apache Spark commented on SPARK-10999:
--

User 'liancheng' has created a pull request for this issue:
https://github.com/apache/spark/pull/9024

> Physical plan node Coalesce should be able to handle UnsafeRow
> --
>
> Key: SPARK-10999
> URL: https://issues.apache.org/jira/browse/SPARK-10999
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>Priority: Minor
>
> The following PySpark snippet shows the problem:
> {noformat}
> >>> sqlContext.range(10).selectExpr('id AS a').coalesce(1).explain(True)
> ...
> == Physical Plan ==
> Coalesce 1
>  ConvertToSafe
>   TungstenProject [id#3L AS a#4L]
>Scan PhysicalRDD[id#3L]
> {noformat}
> The {{ConvertToSafe}} is unnecessary.






[jira] [Assigned] (SPARK-10999) Physical plan node Coalesce should be able to handle UnsafeRow

2015-10-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10999:


Assignee: Cheng Lian  (was: Apache Spark)

> Physical plan node Coalesce should be able to handle UnsafeRow
> --
>
> Key: SPARK-10999
> URL: https://issues.apache.org/jira/browse/SPARK-10999
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>Priority: Minor
>
> The following PySpark snippet shows the problem:
> {noformat}
> >>> sqlContext.range(10).selectExpr('id AS a').coalesce(1).explain(True)
> ...
> == Physical Plan ==
> Coalesce 1
>  ConvertToSafe
>   TungstenProject [id#3L AS a#4L]
>Scan PhysicalRDD[id#3L]
> {noformat}
> The {{ConvertToSafe}} is unnecessary.






[jira] [Assigned] (SPARK-10999) Physical plan node Coalesce should be able to handle UnsafeRow

2015-10-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10999:


Assignee: Apache Spark  (was: Cheng Lian)

> Physical plan node Coalesce should be able to handle UnsafeRow
> --
>
> Key: SPARK-10999
> URL: https://issues.apache.org/jira/browse/SPARK-10999
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Cheng Lian
>Assignee: Apache Spark
>Priority: Minor
>
> The following PySpark snippet shows the problem:
> {noformat}
> >>> sqlContext.range(10).selectExpr('id AS a').coalesce(1).explain(True)
> ...
> == Physical Plan ==
> Coalesce 1
>  ConvertToSafe
>   TungstenProject [id#3L AS a#4L]
>Scan PhysicalRDD[id#3L]
> {noformat}
> The {{ConvertToSafe}} is unnecessary.






[jira] [Updated] (SPARK-10974) Add progress bar for output operation column and use red dots for failed batches

2015-10-08 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-10974:
-
Summary: Add progress bar for output operation column and use red dots for 
failed batches  (was: Add progress bar for output operation column)

> Add progress bar for output operation column and use red dots for failed 
> batches
> 
>
> Key: SPARK-10974
> URL: https://issues.apache.org/jira/browse/SPARK-10974
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Reporter: Shixiong Zhu
>







[jira] [Created] (SPARK-10999) Physical plan node Coalesce should be able to handle UnsafeRow

2015-10-08 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-10999:
--

 Summary: Physical plan node Coalesce should be able to handle 
UnsafeRow
 Key: SPARK-10999
 URL: https://issues.apache.org/jira/browse/SPARK-10999
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.6.0
Reporter: Cheng Lian
Assignee: Cheng Lian
Priority: Minor


The following PySpark snippet shows the problem:
{noformat}
>>> sqlContext.range(10).selectExpr('id AS a').coalesce(1).explain(True)
...
== Physical Plan ==
Coalesce 1
 ConvertToSafe
  TungstenProject [id#3L AS a#4L]
   Scan PhysicalRDD[id#3L]
{noformat}
The {{ConvertToSafe}} is unnecessary.






[jira] [Commented] (SPARK-10326) Cannot launch YARN job on Windows

2015-10-08 Thread Jose Antonio (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14948206#comment-14948206
 ] 

Jose Antonio commented on SPARK-10326:
--

C:\WINDOWS\system32>pyspark --master yarn-client
Python 2.7.10 |Anaconda 2.3.0 (64-bit)| (default, Sep 15 2015, 14:26:14) [MSC 
v.1500 64 bit (AMD64)]
Type "copyright", "credits" or "license" for more information.

IPython 4.0.0 -- An enhanced Interactive Python.
? -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help  -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.
15/10/08 09:28:05 WARN MetricsSystem: Using default name DAGScheduler for 
source because spark.app.id is not set.
15/10/08 09:28:06 WARN : Your hostname, PC-509512 resolves to a 
loopback/non-reachable address: fe80:0:0:0:0:5efe:a5f:c318%net3, but we 
couldn't find any external IP address!
15/10/08 09:28:08 WARN BlockReaderLocal: The short-circuit local reads feature 
cannot be used because UNIX Domain sockets are not available on Windows.
15/10/08 09:28:08 ERROR SparkContext: Error initializing SparkContext.
java.net.URISyntaxException: Illegal character in opaque part at index 2: 
C:\spark\bin\..\python\lib\pyspark.zip
at java.net.URI$Parser.fail(Unknown Source)
at java.net.URI$Parser.checkChars(Unknown Source)
at java.net.URI$Parser.parse(Unknown Source)
at java.net.URI.<init>(Unknown Source)
at 
org.apache.spark.deploy.yarn.Client$$anonfun$setupLaunchEnv$7.apply(Client.scala:558)
at 
org.apache.spark.deploy.yarn.Client$$anonfun$setupLaunchEnv$7.apply(Client.scala:557)
at scala.collection.immutable.List.foreach(List.scala:318)
at org.apache.spark.deploy.yarn.Client.setupLaunchEnv(Client.scala:557)
at 
org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:628)
at 
org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:119)
at 
org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:56)
at 
org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:144)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:523)
at 
org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:61)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(Unknown Source)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown 
Source)
at java.lang.reflect.Constructor.newInstance(Unknown Source)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:234)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
at py4j.Gateway.invoke(Gateway.java:214)
at 
py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:79)
at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:68)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Unknown Source)
15/10/08 09:28:08 ERROR Utils: Uncaught exception in thread Thread-2
java.lang.NullPointerException
at 
org.apache.spark.network.netty.NettyBlockTransferService.close(NettyBlockTransferService.scala:152)
at org.apache.spark.storage.BlockManager.stop(BlockManager.scala:1228)
at org.apache.spark.SparkEnv.stop(SparkEnv.scala:100)
at 
org.apache.spark.SparkContext$$anonfun$stop$12.apply$mcV$sp(SparkContext.scala:1749)
at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1185)
at org.apache.spark.SparkContext.stop(SparkContext.scala:1748)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:593)
at 
org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:61)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(Unknown Source)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown 
Source)
at java.lang.reflect.Constructor.newInstance(Unknown Source)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:234)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
at py4j.Gateway.invoke(Gateway.java:214)
at 
py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:79)
at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:68)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Unknown Source)
---
Py4JJavaError Traceback (most recent call last)
C:\spark\bin\..\python\pyspark\shell.py in <module>()
 41 
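
The root of the failure is the URISyntaxException above: a raw Windows path such as C:\spark\...\pyspark.zip is not a valid URI, because "C:" parses as a scheme and the backslash at index 2 is an illegal character. A hedged sketch of the distinction (not the actual Spark fix):
{code}
import java.io.File
import java.net.URI

val path = """C:\spark\bin\..\python\lib\pyspark.zip"""

// new URI(path) would throw URISyntaxException("Illegal character in opaque
// part at index 2"), exactly as in the log above, so it is left commented out.
// val broken = new URI(path)

// Converting through File yields a well-formed file: URI (on Windows this
// becomes something like file:/C:/spark/bin/../python/lib/pyspark.zip).
val uri: URI = new File(path).toURI
println(uri)
{code}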

[jira] [Commented] (SPARK-10919) Association rules class should return the support of each rule

2015-10-08 Thread Tofigh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14948235#comment-14948235
 ] 

Tofigh commented on SPARK-10919:


sure

> Association rules class should return the support of each rule
> --
>
> Key: SPARK-10919
> URL: https://issues.apache.org/jira/browse/SPARK-10919
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Tofigh
>Priority: Minor
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> The current implementation of association rules does not return the frequency 
> of appearance of each rule. This piece of information is essential for 
> implementing functional dependencies on top of the association rules. In order 
> to return the frequency (support) of each rule, freqUnion: Double and 
> freqAntecedent: Double should become: val freqUnion: Double, val freqAntecedent: Double
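
A hedged sketch of the idea (not the actual MLlib class, whose fields and names may differ): once the two frequencies are exposed as vals, support can be derived alongside the existing confidence, given the total transaction count:
{code}
// Hedged sketch: exposing freqUnion and freqAntecedent as vals lets callers
// derive support in addition to confidence.
class Rule[Item](
    val antecedent: Array[Item],
    val consequent: Array[Item],
    val freqUnion: Double,        // transactions containing antecedent ++ consequent
    val freqAntecedent: Double) { // transactions containing the antecedent

  def confidence: Double = freqUnion / freqAntecedent

  // The total transaction count is supplied by the caller.
  def support(numTransactions: Long): Double = freqUnion / numTransactions
}
{code}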






[jira] [Issue Comment Deleted] (SPARK-11000) Derby has booted the database twice in YARN security mode.

2015-10-08 Thread SaintBacchus (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

SaintBacchus updated SPARK-11000:
-
Comment: was deleted

(was: Very similar with it but this is in yarn security mode.)

> Derby has booted the database twice in YARN security mode.
> ---
>
> Key: SPARK-11000
> URL: https://issues.apache.org/jira/browse/SPARK-11000
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, SQL, YARN
>Affects Versions: 1.6.0
>Reporter: SaintBacchus
>
> *bin/spark-shell --master yarn-client*
> If Spark was built with Hive, even this simple command fails with: 
> _Another instance of Derby may have already booted the database_
> {code:title=Exception|borderStyle=solid}
> Caused by: java.sql.SQLException: Another instance of Derby may have already 
> booted the database /opt/client/Spark/spark/metastore_db.
> at 
> org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source)
> at 
> org.apache.derby.impl.jdbc.SQLExceptionFactory40.wrapArgsForTransportAcrossDRDA(Unknown
>  Source)
> at 
> org.apache.derby.impl.jdbc.SQLExceptionFactory40.getSQLException(Unknown 
> Source)
> at org.apache.derby.impl.jdbc.Util.generateCsSQLException(Unknown 
> Source)
> ... 130 more
> Caused by: ERROR XSDB6: Another instance of Derby may have already booted the 
> database /opt/client/Spark/spark/metastore_db.
> {code}






[jira] [Resolved] (SPARK-9040) StructField datatype Conversion Error

2015-10-08 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-9040.
--
   Resolution: Not A Problem
Fix Version/s: (was: 1.4.0)

> StructField datatype Conversion Error
> -
>
> Key: SPARK-9040
> URL: https://issues.apache.org/jira/browse/SPARK-9040
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core, SQL
>Affects Versions: 1.3.0
> Environment: Cloudera 5.3 on CDH 6
>Reporter: Sandeep Pal
>
> The following issue occurs if I specify the StructFields in a specific order in 
> StructType, as follows:
> fields = [StructField("d", IntegerType(), True),StructField("b", 
> IntegerType(), True),StructField("a", StringType(), True),StructField("c", 
> IntegerType(), True)]
> But the following code works fine:
> fields = [StructField("d", IntegerType(), True),StructField("b", 
> IntegerType(), True),StructField("c", IntegerType(), True),StructField("a", 
> StringType(), True)]
>  in ()
>  18 
>  19 schema = StructType(fields)
> ---> 20 schemasimid_simple = 
> sqlContext.createDataFrame(simid_simplereqfields, schema)
>  21 schemasimid_simple.registerTempTable("simid_simple")
> /usr/local/bin/spark-1.3.1-bin-hadoop2.6/python/pyspark/sql/context.py in 
> createDataFrame(self, data, schema, samplingRatio)
> 302 
> 303 for row in rows:
> --> 304 _verify_type(row, schema)
> 305 
> 306 # convert python objects to sql data
> /usr/local/bin/spark-1.3.1-bin-hadoop2.6/python/pyspark/sql/types.py in 
> _verify_type(obj, dataType)
> 986  "length of fields (%d)" % (len(obj), 
> len(dataType.fields)))
> 987 for v, f in zip(obj, dataType.fields):
> --> 988 _verify_type(v, f.dataType)
> 989 
> 990 _cached_cls = weakref.WeakValueDictionary()
> /usr/local/bin/spark-1.3.1-bin-hadoop2.6/python/pyspark/sql/types.py in 
> _verify_type(obj, dataType)
> 970 if type(obj) not in _acceptable_types[_type]:
> 971 raise TypeError("%s can not accept object in type %s"
> --> 972 % (dataType, type(obj)))
> 973 
> 974 if isinstance(dataType, ArrayType):
> TypeError: StringType can not accept object in type 
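
The likely reason this is resolved as "Not A Problem" (hedged, inferred from the truncated TypeError above): the order of the StructFields must match the order of the values inside each row of the data. A rough Scala illustration of the same rule, with an assumed data shape of three integers followed by a string:
{code}
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// Hedged illustration with assumed data: schema field order has to line up with
// the value order inside each Row.
val rows = sc.parallelize(Seq(Row(1, 2, 3, "x")))   // three ints, then a string

// Matches the data: the StringType field sits where the string value sits.
val ok = StructType(Seq(
  StructField("d", IntegerType), StructField("b", IntegerType),
  StructField("c", IntegerType), StructField("a", StringType)))

// Mismatched: StringType is declared third, but the third value is an Int,
// which surfaces as an error when the rows are verified or read.
val bad = StructType(Seq(
  StructField("d", IntegerType), StructField("b", IntegerType),
  StructField("a", StringType),  StructField("c", IntegerType)))

sqlContext.createDataFrame(rows, ok).show()
{code}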






[jira] [Updated] (SPARK-10752) Implement corr() and cov in DataFrameStatFunctions

2015-10-08 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-10752:
--
Assignee: Sun Rui

> Implement corr() and cov in DataFrameStatFunctions
> --
>
> Key: SPARK-10752
> URL: https://issues.apache.org/jira/browse/SPARK-10752
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 1.5.0
>Reporter: Sun Rui
>Assignee: Sun Rui
> Fix For: 1.6.0
>
>







[jira] [Commented] (SPARK-10914) Incorrect empty join sets when executor-memory >= 32g

2015-10-08 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14948511#comment-14948511
 ] 

Sean Owen commented on SPARK-10914:
---

I don't think having it on one machine necessarily matters. You still have two 
JVMs in play, whereas you can't reproduce it when just one JVM is involved. What 
if the driver has oops on but the executor does not, and the results from the 
executor are parsed somewhere as if oops were on? Normally this would be wholly 
transparent to JVM bytecode, but Tungsten / SizeEstimator depend in part 
on the actual representation of the object in memory.
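
One hedged way to check that hypothesis from the spark-shell (assuming HotSpot JVMs on both sides) is to read the flag on the driver and on an executor and compare:
{code}
import java.lang.management.ManagementFactory
import com.sun.management.HotSpotDiagnosticMXBean

// Hedged sketch: report UseCompressedOops as seen by the driver JVM and by an
// executor JVM; a mismatch would support the theory above.
def compressedOops(): String =
  ManagementFactory.getPlatformMXBean(classOf[HotSpotDiagnosticMXBean])
    .getVMOption("UseCompressedOops").getValue

println(s"driver   UseCompressedOops = ${compressedOops()}")
println(s"executor UseCompressedOops = " +
  sc.parallelize(Seq(1), 1).map { _ =>
    ManagementFactory.getPlatformMXBean(classOf[HotSpotDiagnosticMXBean])
      .getVMOption("UseCompressedOops").getValue
  }.first())
{code}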

> Incorrect empty join sets when executor-memory >= 32g
> -
>
> Key: SPARK-10914
> URL: https://issues.apache.org/jira/browse/SPARK-10914
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0, 1.5.1
> Environment: Ubuntu 14.04 (spark-slave), 12.04 (master)
>Reporter: Ben Moran
>
> Using an inner join to match together two integer columns, I generally get 
> no results when there should be matches. But the results vary and depend on 
> whether the dataframes are coming from SQL, JSON, or cached, as well as the 
> order in which I cache things and query them.
> This minimal example reproduces it consistently for me in the spark-shell, on 
> new installs of both 1.5.0 and 1.5.1 (pre-built against Hadoop 2.6 from 
> http://spark.apache.org/downloads.html.)
> {code}
> /* x is {"xx":1}{"xx":2} and y is just {"yy":1}{"yy:2} */
> val x = sql("select 1 xx union all select 2") 
> val y = sql("select 1 yy union all select 2")
> x.join(y, $"xx" === $"yy").count() /* expect 2, get 0 */
> /* If I cache both tables it works: */
> x.cache()
> y.cache()
> x.join(y, $"xx" === $"yy").count() /* expect 2, get 2 */
> /* but this still doesn't work: */
> x.join(y, $"xx" === $"yy").filter("yy=1").count() /* expect 1, get 0 */
> {code}






[jira] [Commented] (SPARK-10914) Incorrect empty join sets when executor-memory >= 32g

2015-10-08 Thread Ben Moran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14948492#comment-14948492
 ] 

Ben Moran commented on SPARK-10914:
---

I just tried moving the master to the worker box, so it's entirely on one 
machine.  (Ubuntu 14.04 + now Oracle JDK 1.8).

It still reproduces the bug.  So, entirely on spark-worker:

{code}
spark@spark-worker:~/spark-1.5.1-bin-hadoop2.6$ sbin/start-master.sh
spark@spark-worker:~/spark-1.5.1-bin-hadoop2.6$ sbin/start-slave.sh --master 
spark://spark-worker:7077
spark@spark-worker:~/spark-1.5.1-bin-hadoop2.6$ bin/spark-shell --master 
spark://spark-worker:7077  --conf 
"spark.executor.extraJavaOptions=-XX:-UseCompressedOops"
log4j:WARN No appenders could be found for logger 
(org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more 
info.
Using Spark's repl log4j profile: 
org/apache/spark/log4j-defaults-repl.properties
To adjust logging level use sc.setLogLevel("INFO")
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.5.1
      /_/

Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_60)
Type in expressions to have them evaluated.
Type :help for more information.
15/10/08 12:15:12 WARN MetricsSystem: Using default name DAGScheduler for 
source because spark.app.id is not set.
Spark context available as sc.
15/10/08 12:15:14 WARN Connection: BoneCP specified but not present in 
CLASSPATH (or one of dependencies)
15/10/08 12:15:14 WARN Connection: BoneCP specified but not present in 
CLASSPATH (or one of dependencies)
15/10/08 12:15:19 WARN ObjectStore: Version information not found in metastore. 
hive.metastore.schema.verification is not enabled so recording the schema 
version 1.2.0
15/10/08 12:15:20 WARN ObjectStore: Failed to get database default, returning 
NoSuchObjectException
15/10/08 12:15:21 WARN NativeCodeLoader: Unable to load native-hadoop library 
for your platform... using builtin-java classes where applicable
15/10/08 12:15:21 WARN Connection: BoneCP specified but not present in 
CLASSPATH (or one of dependencies)
15/10/08 12:15:21 WARN Connection: BoneCP specified but not present in 
CLASSPATH (or one of dependencies)
SQL context available as sqlContext.

scala> val x = sql("select 1 xx union all select 2")
x: org.apache.spark.sql.DataFrame = [xx: int]

scala> val y = sql("select 1 yy union all select 2")
y: org.apache.spark.sql.DataFrame = [yy: int]

scala> 

scala> x.join(y, $"xx" === $"yy").count() /* expect 2, get 0 */ 
res0: Long = 0

{code}

does give me the incorrect count.

> Incorrect empty join sets when executor-memory >= 32g
> -
>
> Key: SPARK-10914
> URL: https://issues.apache.org/jira/browse/SPARK-10914
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0, 1.5.1
> Environment: Ubuntu 14.04 (spark-slave), 12.04 (master)
>Reporter: Ben Moran
>
> Using an inner join to match together two integer columns, I generally get 
> no results when there should be matches. But the results vary and depend on 
> whether the dataframes are coming from SQL, JSON, or cached, as well as the 
> order in which I cache things and query them.
> This minimal example reproduces it consistently for me in the spark-shell, on 
> new installs of both 1.5.0 and 1.5.1 (pre-built against Hadoop 2.6 from 
> http://spark.apache.org/downloads.html.)
> {code}
> /* x is {"xx":1}{"xx":2} and y is just {"yy":1}{"yy:2} */
> val x = sql("select 1 xx union all select 2") 
> val y = sql("select 1 yy union all select 2")
> x.join(y, $"xx" === $"yy").count() /* expect 2, get 0 */
> /* If I cache both tables it works: */
> x.cache()
> y.cache()
> x.join(y, $"xx" === $"yy").count() /* expect 2, get 2 */
> /* but this still doesn't work: */
> x.join(y, $"xx" === $"yy").filter("yy=1").count() /* expect 1, get 0 */
> {code}






[jira] [Reopened] (SPARK-9040) StructField datatype Conversion Error

2015-10-08 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reopened SPARK-9040:
--

> StructField datatype Conversion Error
> -
>
> Key: SPARK-9040
> URL: https://issues.apache.org/jira/browse/SPARK-9040
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core, SQL
>Affects Versions: 1.3.0
> Environment: Cloudera 5.3 on CDH 6
>Reporter: Sandeep Pal
>
> The following issue occurs if I specify the StructFields in a specific order in 
> StructType, as follows:
> fields = [StructField("d", IntegerType(), True),StructField("b", 
> IntegerType(), True),StructField("a", StringType(), True),StructField("c", 
> IntegerType(), True)]
> But the following code works fine:
> fields = [StructField("d", IntegerType(), True),StructField("b", 
> IntegerType(), True),StructField("c", IntegerType(), True),StructField("a", 
> StringType(), True)]
>  in ()
>  18 
>  19 schema = StructType(fields)
> ---> 20 schemasimid_simple = 
> sqlContext.createDataFrame(simid_simplereqfields, schema)
>  21 schemasimid_simple.registerTempTable("simid_simple")
> /usr/local/bin/spark-1.3.1-bin-hadoop2.6/python/pyspark/sql/context.py in 
> createDataFrame(self, data, schema, samplingRatio)
> 302 
> 303 for row in rows:
> --> 304 _verify_type(row, schema)
> 305 
> 306 # convert python objects to sql data
> /usr/local/bin/spark-1.3.1-bin-hadoop2.6/python/pyspark/sql/types.py in 
> _verify_type(obj, dataType)
> 986  "length of fields (%d)" % (len(obj), 
> len(dataType.fields)))
> 987 for v, f in zip(obj, dataType.fields):
> --> 988 _verify_type(v, f.dataType)
> 989 
> 990 _cached_cls = weakref.WeakValueDictionary()
> /usr/local/bin/spark-1.3.1-bin-hadoop2.6/python/pyspark/sql/types.py in 
> _verify_type(obj, dataType)
> 970 if type(obj) not in _acceptable_types[_type]:
> 971 raise TypeError("%s can not accept object in type %s"
> --> 972 % (dataType, type(obj)))
> 973 
> 974 if isinstance(dataType, ArrayType):
> TypeError: StringType can not accept object in type 






[jira] [Resolved] (SPARK-10939) Misaligned data with RDD.zip after repartition

2015-10-08 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-10939.
---
Resolution: Not A Problem

Provisionally resolving as not a problem since RDDs don't have a guaranteed 
ordering. You can only really use zip with something that's sorted. Here you 
may indeed see different orderings within the same RDD if it gets recalculated.
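
A hedged sketch of what that implies in practice, assuming a DataFrame df with a numeric column a as in the repro: either derive the paired value in a single pass instead of zipping the DataFrame with a transformed copy of itself, or give the data a defined order before pairing:
{code}
import org.apache.spark.sql.Row

// Hedged sketch, assuming a DataFrame `df` with a column "a".

// 1) Derive both values in one map; there is no second lineage to misalign.
val paired = df.rdd.map(r => Row(r.getAs[Long]("a"), r.getAs[Long]("a")))
paired.filter(r => r.getLong(0) != r.getLong(1)).count()   // 0 by construction

// 2) If two RDDs really must be zipped, sort first so the ordering is defined.
val ordered = df.sort("a")
val zipped  = ordered.rdd.zip(ordered.rdd.map(r => r.getAs[Long]("a")))
{code}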

> Misaligned data with RDD.zip after repartition
> --
>
> Key: SPARK-10939
> URL: https://issues.apache.org/jira/browse/SPARK-10939
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.3.0, 1.4.1, 1.5.0
> Environment: - OSX 10.10.4, java 1.7.0_51, hadoop 2.6.0-cdh5.4.5
> - Ubuntu 12.04, java 1.7.0_80, hadoop 2.6.0-cdh5.4.5
>Reporter: Dan Brown
>
> Split out from https://issues.apache.org/jira/browse/SPARK-10685:
> Here's a weird behavior where {{RDD.zip}} after a {{repartition}} produces 
> "misaligned" data, meaning different column values in the same row aren't 
> matched, as if a zip shuffled the collections before zipping them. It's 
> difficult to reproduce because it's nondeterministic, doesn't occur in local 
> mode, and requires ≥2 workers (≥3 in one case). I was able to repro it using 
> pyspark 1.3.0 (cdh5.4.5), 1.4.1 (bin-without-hadoop), and 1.5.0 
> (bin-without-hadoop).
> Also, this {{DataFrame.zip}} issue is related in spirit, since we were trying 
> to build it ourselves when we ran into this problem. Let me put in my vote 
> for reopening the issue and supporting {{DataFrame.zip}} in the standard lib.
> - https://issues.apache.org/jira/browse/SPARK-7460
> h3. Repro
> Fail: RDD.zip after repartition
> {code}
> df  = sqlCtx.createDataFrame(Row(a=a) for a in xrange(1))
> df  = df.repartition(100)
> rdd = df.rdd.zip(df.map(lambda r: Row(b=r.a))).map(lambda (x,y): Row(a=x.a, 
> b=y.b))
> [r for r in rdd.collect() if r.a != r.b][:3] # Should be []
> {code}
> Sample outputs (nondeterministic):
> {code}
> []
> [Row(a=50, b=6947), Row(a=150, b=7047), Row(a=250, b=7147)]
> []
> []
> [Row(a=44, b=644), Row(a=144, b=744), Row(a=244, b=844)]
> []
> {code}
> Test setup:
> - local\[8]: {{MASTER=local\[8]}}
> - dist\[N]: 1 driver + 1 master + N workers
> {code}
> "Fail" tests pass?  cluster mode  spark version
> 
> yes local[8]  1.3.0-cdh5.4.5
> no  dist[4]   1.3.0-cdh5.4.5
> yes local[8]  1.4.1
> yes dist[1]   1.4.1
> no  dist[2]   1.4.1
> no  dist[4]   1.4.1
> yes local[8]  1.5.0
> yes dist[1]   1.5.0
> no  dist[2]   1.5.0
> no  dist[4]   1.5.0
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10979) SparkR: Add merge to DataFrame

2015-10-08 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-10979:
--
Component/s: SparkR

> SparkR: Add merge to DataFrame
> --
>
> Key: SPARK-10979
> URL: https://issues.apache.org/jira/browse/SPARK-10979
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Narine Kokhlikyan
>
> Add merge function to DataFrame, which supports R signature.
> https://stat.ethz.ch/R-manual/R-devel/library/base/html/merge.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10914) Incorrect empty join sets when executor-memory >= 32g

2015-10-08 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14948486#comment-14948486
 ] 

Sean Owen commented on SPARK-10914:
---

Still kind of guessing here... but what if the problem is that the computation 
spans machines that have a different oops configuration (driver vs executor) 
and that breaks some assumption in the low-level byte munging?

> Incorrect empty join sets when executor-memory >= 32g
> -
>
> Key: SPARK-10914
> URL: https://issues.apache.org/jira/browse/SPARK-10914
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0, 1.5.1
> Environment: Ubuntu 14.04 (spark-slave), 12.04 (master)
>Reporter: Ben Moran
>
> Using an inner join, to match together two integer columns, I generally get 
> no results when there should be matches.  But the results vary and depend on 
> whether the dataframes are coming from SQL, JSON, or cached, as well as the 
> order in which I cache things and query them.
> This minimal example reproduces it consistently for me in the spark-shell, on 
> new installs of both 1.5.0 and 1.5.1 (pre-built against Hadoop 2.6 from 
> http://spark.apache.org/downloads.html.)
> {code}
> /* x is {"xx":1}{"xx":2} and y is just {"yy":1}{"yy:2} */
> val x = sql("select 1 xx union all select 2") 
> val y = sql("select 1 yy union all select 2")
> x.join(y, $"xx" === $"yy").count() /* expect 2, get 0 */
> /* If I cache both tables it works: */
> x.cache()
> y.cache()
> x.join(y, $"xx" === $"yy").count() /* expect 2, get 2 */
> /* but this still doesn't work: */
> x.join(y, $"xx" === $"yy").filter("yy=1").count() /* expect 1, get 0 */
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-10940) Too many open files Spark Shuffle

2015-10-08 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reopened SPARK-10940:
---

> Too many open files Spark Shuffle
> -
>
> Key: SPARK-10940
> URL: https://issues.apache.org/jira/browse/SPARK-10940
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, SQL
>Affects Versions: 1.5.0
> Environment: 6 node standalone spark cluster with 1 master and 5 
> worker nodes on Centos 6.6 for all nodes. Each node has > 100 GB memory and 
> 36 cores.
>Reporter: Sandeep Pal
>
> Executing terasort via Spark SQL on data generated by teragen in Hadoop. 
> The generated data size is ~456 GB. 
> Terasort passes with --total-executor-cores = 40, whereas it fails with 
> --total-executor-cores = 120. 
> I have tried increasing the ulimit to 10k but the problem persists.
> Note: The failing configuration of 120 cores worked with Spark core code 
> on top of RDDs. The failure occurs only when using Spark SQL.
> Below is the error message from one of the executor nodes:
> java.io.FileNotFoundException: 
> /tmp/spark-e15993e8-51a4-452a-8b86-da0169445065/executor-0c661152-3837-4711-bba2-2abf4fd15240/blockmgr-973aab72-feb8-4c60-ba3d-1b2ee27a1cc2/3f/temp_shuffle_7741538d-3ccf-4566-869f-265655ca9c90
>  (Too many open files)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10939) Misaligned data with RDD.zip after repartition

2015-10-08 Thread Michael Malak (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14948596#comment-14948596
 ] 

Michael Malak commented on SPARK-10939:
---

Here Matei explains the explicit design decision to prefer shuffle performance 
arising from randomization over deterministic RDD computation:

https://issues.apache.org/jira/browse/SPARK-3098?focusedCommentId=14110183=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14110183

It has made it into the documentation (though perhaps not clearly enough, 
especially regarding the rationale):

https://issues.apache.org/jira/browse/SPARK-3356
https://github.com/apache/spark/pull/2508/files


> Misaligned data with RDD.zip after repartition
> --
>
> Key: SPARK-10939
> URL: https://issues.apache.org/jira/browse/SPARK-10939
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.3.0, 1.4.1, 1.5.0
> Environment: - OSX 10.10.4, java 1.7.0_51, hadoop 2.6.0-cdh5.4.5
> - Ubuntu 12.04, java 1.7.0_80, hadoop 2.6.0-cdh5.4.5
>Reporter: Dan Brown
>
> Split out from https://issues.apache.org/jira/browse/SPARK-10685:
> Here's a weird behavior where {{RDD.zip}} after a {{repartition}} produces 
> "misaligned" data, meaning different column values in the same row aren't 
> matched, as if a zip shuffled the collections before zipping them. It's 
> difficult to reproduce because it's nondeterministic, doesn't occur in local 
> mode, and requires ≥2 workers (≥3 in one case). I was able to repro it using 
> pyspark 1.3.0 (cdh5.4.5), 1.4.1 (bin-without-hadoop), and 1.5.0 
> (bin-without-hadoop).
> Also, this {{DataFrame.zip}} issue is related in spirit, since we were trying 
> to build it ourselves when we ran into this problem. Let me put in my vote 
> for reopening the issue and supporting {{DataFrame.zip}} in the standard lib.
> - https://issues.apache.org/jira/browse/SPARK-7460
> h3. Repro
> Fail: RDD.zip after repartition
> {code}
> df  = sqlCtx.createDataFrame(Row(a=a) for a in xrange(1))
> df  = df.repartition(100)
> rdd = df.rdd.zip(df.map(lambda r: Row(b=r.a))).map(lambda (x,y): Row(a=x.a, 
> b=y.b))
> [r for r in rdd.collect() if r.a != r.b][:3] # Should be []
> {code}
> Sample outputs (nondeterministic):
> {code}
> []
> [Row(a=50, b=6947), Row(a=150, b=7047), Row(a=250, b=7147)]
> []
> []
> [Row(a=44, b=644), Row(a=144, b=744), Row(a=244, b=844)]
> []
> {code}
> Test setup:
> - local\[8]: {{MASTER=local\[8]}}
> - dist\[N]: 1 driver + 1 master + N workers
> {code}
> "Fail" tests pass?  cluster mode  spark version
> 
> yes local[8]  1.3.0-cdh5.4.5
> no  dist[4]   1.3.0-cdh5.4.5
> yes local[8]  1.4.1
> yes dist[1]   1.4.1
> no  dist[2]   1.4.1
> no  dist[4]   1.4.1
> yes local[8]  1.5.0
> yes dist[1]   1.5.0
> no  dist[2]   1.5.0
> no  dist[4]   1.5.0
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11001) SQLContext doesn't support window function

2015-10-08 Thread jixing.ji (JIRA)
jixing.ji created SPARK-11001:
-

 Summary: SQLContext doesn't support window function
 Key: SPARK-11001
 URL: https://issues.apache.org/jira/browse/SPARK-11001
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.5.1
 Environment: windows pyspark
ubuntu pyspark
Reporter: jixing.ji


Currently, SQLContext doesn't support window functions, which makes a lot of useful 
features like lag and rank very hard to implement (in 1.5 they require HiveContext).
If Spark supported window functions in SQLContext, it would help a lot for data 
analysis.
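
For reference, this is the kind of query being asked for; in 1.5 the DataFrame window 
API below works only against a HiveContext (a rough Scala sketch; the DataFrame {{df}} 
and its columns are hypothetical):

{code}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{lag, rank}

// Assumes a HiveContext-backed DataFrame `df` with columns "dept" and "salary".
val w = Window.partitionBy("dept").orderBy("salary")
val withWindowCols = df
  .withColumn("rank", rank().over(w))
  .withColumn("prev_salary", lag("salary", 1).over(w))
{code}

The request is for the same calls (and the equivalent SQL OVER clause) to work on a 
plain SQLContext.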



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10914) Incorrect empty join sets when executor-memory >= 32g

2015-10-08 Thread Ben Moran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14948460#comment-14948460
 ] 

Ben Moran commented on SPARK-10914:
---

On latest master for me .count() also always seems to return 5 for everything! 
I think that is a separate bug - I think I saw it filed already but I can't 
find it now.

> Incorrect empty join sets when executor-memory >= 32g
> -
>
> Key: SPARK-10914
> URL: https://issues.apache.org/jira/browse/SPARK-10914
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0, 1.5.1
> Environment: Ubuntu 14.04 (spark-slave), 12.04 (master)
>Reporter: Ben Moran
>
> Using an inner join, to match together two integer columns, I generally get 
> no results when there should be matches.  But the results vary and depend on 
> whether the dataframes are coming from SQL, JSON, or cached, as well as the 
> order in which I cache things and query them.
> This minimal example reproduces it consistently for me in the spark-shell, on 
> new installs of both 1.5.0 and 1.5.1 (pre-built against Hadoop 2.6 from 
> http://spark.apache.org/downloads.html.)
> {code}
> /* x is {"xx":1}{"xx":2} and y is just {"yy":1}{"yy:2} */
> val x = sql("select 1 xx union all select 2") 
> val y = sql("select 1 yy union all select 2")
> x.join(y, $"xx" === $"yy").count() /* expect 2, get 0 */
> /* If I cache both tables it works: */
> x.cache()
> y.cache()
> x.join(y, $"xx" === $"yy").count() /* expect 2, get 2 */
> /* but this still doesn't work: */
> x.join(y, $"xx" === $"yy").filter("yy=1").count() /* expect 1, get 0 */
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10883) Document building each module individually

2015-10-08 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-10883.
---
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 8993
[https://github.com/apache/spark/pull/8993]

> Document building each module individually
> --
>
> Key: SPARK-10883
> URL: https://issues.apache.org/jira/browse/SPARK-10883
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Jean-Baptiste Onofré
>Priority: Trivial
> Fix For: 1.6.0
>
>
> Right now, due to the location of the scalastyle-config.xml location, it's 
> not possible to build an individual module.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10940) Too many open files Spark Shuffle

2015-10-08 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-10940.
---
Resolution: Cannot Reproduce

> Too many open files Spark Shuffle
> -
>
> Key: SPARK-10940
> URL: https://issues.apache.org/jira/browse/SPARK-10940
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, SQL
>Affects Versions: 1.5.0
> Environment: 6 node standalone spark cluster with 1 master and 5 
> worker nodes on Centos 6.6 for all nodes. Each node has > 100 GB memory and 
> 36 cores.
>Reporter: Sandeep Pal
>
> Executing terasort via Spark SQL on data generated by teragen in Hadoop. 
> The generated data size is ~456 GB. 
> Terasort passes with --total-executor-cores = 40, whereas it fails with 
> --total-executor-cores = 120. 
> I have tried increasing the ulimit to 10k but the problem persists.
> Note: The failing configuration of 120 cores worked with Spark core code 
> on top of RDDs. The failure occurs only when using Spark SQL.
> Below is the error message from one of the executor nodes:
> java.io.FileNotFoundException: 
> /tmp/spark-e15993e8-51a4-452a-8b86-da0169445065/executor-0c661152-3837-4711-bba2-2abf4fd15240/blockmgr-973aab72-feb8-4c60-ba3d-1b2ee27a1cc2/3f/temp_shuffle_7741538d-3ccf-4566-869f-265655ca9c90
>  (Too many open files)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10914) Incorrect empty join sets when executor-memory >= 32g

2015-10-08 Thread Ben Moran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14948513#comment-14948513
 ] 

Ben Moran commented on SPARK-10914:
---

I think you've got it - if I also turn off UseCompressedOops for the driver as 
well as the executor, it gives correct results:

{code}
bin/spark-shell --master spark://spark-worker:7077 --conf 
"spark.executor.extraJavaOptions=-XX:-UseCompressedOops" --driver-java-options 
"-XX:-UseCompressedOops"
{code}

Does this leave me with a viable workaround? I'm not sure of the impact of 
disabling UseCompressedOops.

> Incorrect empty join sets when executor-memory >= 32g
> -
>
> Key: SPARK-10914
> URL: https://issues.apache.org/jira/browse/SPARK-10914
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0, 1.5.1
> Environment: Ubuntu 14.04 (spark-slave), 12.04 (master)
>Reporter: Ben Moran
>
> Using an inner join, to match together two integer columns, I generally get 
> no results when there should be matches.  But the results vary and depend on 
> whether the dataframes are coming from SQL, JSON, or cached, as well as the 
> order in which I cache things and query them.
> This minimal example reproduces it consistently for me in the spark-shell, on 
> new installs of both 1.5.0 and 1.5.1 (pre-built against Hadoop 2.6 from 
> http://spark.apache.org/downloads.html.)
> {code}
> /* x is {"xx":1}{"xx":2} and y is just {"yy":1}{"yy:2} */
> val x = sql("select 1 xx union all select 2") 
> val y = sql("select 1 yy union all select 2")
> x.join(y, $"xx" === $"yy").count() /* expect 2, get 0 */
> /* If I cache both tables it works: */
> x.cache()
> y.cache()
> x.join(y, $"xx" === $"yy").count() /* expect 2, get 2 */
> /* but this still doesn't work: */
> x.join(y, $"xx" === $"yy").filter("yy=1").count() /* expect 1, get 0 */
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10978) Allow PrunedFilterScan to eliminate predicates from further evaluation

2015-10-08 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-10978:
--
Priority: Minor  (was: Major)

([~rspitzer] don't set Fix Version)

> Allow PrunedFilterScan to eliminate predicates from further evaluation
> --
>
> Key: SPARK-10978
> URL: https://issues.apache.org/jira/browse/SPARK-10978
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 1.3.0, 1.4.0, 1.5.0
>Reporter: Russell Alexander Spitzer
>Priority: Minor
> Fix For: 1.6.0
>
>
> Currently PrunedFilterScan allows implementors to push down predicates to an 
> underlying datasource. This is done solely as an optimization as the 
> predicate will be reapplied on the Spark side as well. This allows for 
> bloom-filter like operations but ends up doing a redundant scan for those 
> sources which can do accurate pushdowns.
> In addition it makes it difficult for underlying sources to accept queries 
> which reference non-existent to provide ancillary function. In our case we 
> allow a solr query to be passed in via a non-existent solr_query column. 
> Since this column is not returned when Spark does a filter on "solr_query" 
> nothing passes. 
> Suggestion on the ML from [~marmbrus] 
> {quote}
> We have to try and maintain binary compatibility here, so probably the 
> easiest thing to do here would be to add a method to the class.  Perhaps 
> something like:
> def unhandledFilters(filters: Array[Filter]): Array[Filter] = filters
> By default, this could return all filters so behavior would remain the same, 
> but specific implementations could override it.  There is still a chance that 
> this would conflict with existing methods, but hopefully that would not be a 
> problem in practice.
> {quote}
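
As a rough illustration of the suggestion quoted above (hypothetical: the 
{{unhandledFilters}} method does not exist in any released version yet), a source that 
evaluates equality pushdowns exactly could hand everything else back to Spark:

{code}
import org.apache.spark.sql.sources.{BaseRelation, EqualTo, Filter, PrunedFilteredScan}

// Hypothetical sketch assuming the proposed method is added to the API.
// The relation claims exact handling of EqualTo filters and reports the rest
// as unhandled, so Spark only re-evaluates those.
trait ExactEqualToRelation extends BaseRelation with PrunedFilteredScan {
  def unhandledFilters(filters: Array[Filter]): Array[Filter] =
    filters.filterNot(_.isInstanceOf[EqualTo])
}
{code}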



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11002) pyspark doesn't support UDAF

2015-10-08 Thread jixing.ji (JIRA)
jixing.ji created SPARK-11002:
-

 Summary: pyspark doesn't support UDAF
 Key: SPARK-11002
 URL: https://issues.apache.org/jira/browse/SPARK-11002
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 1.5.1
 Environment: windows pyspark
ubuntu pyspark
Reporter: jixing.ji


Currently, pyspark doesn't support user-defined aggregate functions (UDAFs), which 
means a data analyst cannot compute certain aggregated results, for example: 
group by column A and concatenate the values of column B into a list.
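
For reference, this is the Scala-side API that PySpark cannot currently reach; a rough 
sketch (class and column names are made up for illustration) of a UDAF that collects 
grouped values into a list:

{code}
import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

// Rough sketch only: a Scala UDAF that gathers the values of one string column
// into a list, so that groupBy("A").agg(...) yields one list per group.
class CollectListUDAF extends UserDefinedAggregateFunction {
  def inputSchema: StructType = StructType(StructField("value", StringType) :: Nil)
  def bufferSchema: StructType = StructType(StructField("items", ArrayType(StringType)) :: Nil)
  def dataType: DataType = ArrayType(StringType)
  def deterministic: Boolean = true
  def initialize(buffer: MutableAggregationBuffer): Unit = buffer(0) = Seq.empty[String]
  def update(buffer: MutableAggregationBuffer, input: Row): Unit =
    buffer(0) = buffer.getSeq[String](0) :+ input.getString(0)
  def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit =
    buffer1(0) = buffer1.getSeq[String](0) ++ buffer2.getSeq[String](0)
  def evaluate(buffer: Row): Any = buffer.getSeq[String](0)
}
{code}

From Scala this could be used as {{df.groupBy("A").agg(new CollectListUDAF()(df("B")))}}; 
the request here is for an equivalent hook in PySpark.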



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10914) Incorrect empty join sets when executor-memory >= 32g

2015-10-08 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14948438#comment-14948438
 ] 

Sean Owen commented on SPARK-10914:
---

I ran the latest master in standalone mode, with {{/bin/spark-shell --master 
spark://localhost:7077 --executor-memory 31g --conf 
"spark.executor.extraJavaOptions=-XX:-UseCompressedOops"}}

{code}
scala> val x = sql("select 1 xx union all select 2")
x: org.apache.spark.sql.DataFrame = [xx: int]

scala> val y = sql("select 1 yy union all select 2")
y: org.apache.spark.sql.DataFrame = [yy: int]

scala> x.join(y, $"xx" === $"yy").count()
res0: Long = 5  
{code}

Without the {{-XX:-UseCompressedOops}} the answer is 2.

Could be unrelated to {{SizeEstimator}}, yes. I get 5 when setting 
{{spark.test.useCompressedOops=false}} too, which seems to indicate it's not 
{{SizeEstimator}}.

Could it be something in the Tungsten machinery? I see it's part of the plan 
above. I wasn't sure how to test with that disabled.
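
A possible way to try that, sketched under the assumption that the 1.5-era 
{{spark.sql.tungsten.enabled}} switch is still honored on the branch being tested:

{code}
// Sketch: flip the 1.5-era Tungsten switch and re-run the repro.
// (The flag may have been removed on more recent branches.)
sqlContext.setConf("spark.sql.tungsten.enabled", "false")
val x = sql("select 1 xx union all select 2")
val y = sql("select 1 yy union all select 2")
x.join(y, $"xx" === $"yy").count()
{code}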

> Incorrect empty join sets when executor-memory >= 32g
> -
>
> Key: SPARK-10914
> URL: https://issues.apache.org/jira/browse/SPARK-10914
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0, 1.5.1
> Environment: Ubuntu 14.04 (spark-slave), 12.04 (master)
>Reporter: Ben Moran
>
> Using an inner join, to match together two integer columns, I generally get 
> no results when there should be matches.  But the results vary and depend on 
> whether the dataframes are coming from SQL, JSON, or cached, as well as the 
> order in which I cache things and query them.
> This minimal example reproduces it consistently for me in the spark-shell, on 
> new installs of both 1.5.0 and 1.5.1 (pre-built against Hadoop 2.6 from 
> http://spark.apache.org/downloads.html.)
> {code}
> /* x is {"xx":1}{"xx":2} and y is just {"yy":1}{"yy:2} */
> val x = sql("select 1 xx union all select 2") 
> val y = sql("select 1 yy union all select 2")
> x.join(y, $"xx" === $"yy").count() /* expect 2, get 0 */
> /* If I cache both tables it works: */
> x.cache()
> y.cache()
> x.join(y, $"xx" === $"yy").count() /* expect 2, get 2 */
> /* but this still doesn't work: */
> x.join(y, $"xx" === $"yy").filter("yy=1").count() /* expect 1, get 0 */
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10883) Document building each module individually

2015-10-08 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-10883:
--
Assignee: Jean-Baptiste Onofré

> Document building each module individually
> --
>
> Key: SPARK-10883
> URL: https://issues.apache.org/jira/browse/SPARK-10883
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Jean-Baptiste Onofré
>Assignee: Jean-Baptiste Onofré
>Priority: Trivial
> Fix For: 1.6.0
>
>
> Right now, due to the location of the scalastyle-config.xml location, it's 
> not possible to build an individual module.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10914) Incorrect empty join sets when executor-memory >= 32g

2015-10-08 Thread Ben Moran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14948494#comment-14948494
 ] 

Ben Moran commented on SPARK-10914:
---

Either using the large heap, or -XX:-UseCompressedOops triggers the bug.

> Incorrect empty join sets when executor-memory >= 32g
> -
>
> Key: SPARK-10914
> URL: https://issues.apache.org/jira/browse/SPARK-10914
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0, 1.5.1
> Environment: Ubuntu 14.04 (spark-slave), 12.04 (master)
>Reporter: Ben Moran
>
> Using an inner join, to match together two integer columns, I generally get 
> no results when there should be matches.  But the results vary and depend on 
> whether the dataframes are coming from SQL, JSON, or cached, as well as the 
> order in which I cache things and query them.
> This minimal example reproduces it consistently for me in the spark-shell, on 
> new installs of both 1.5.0 and 1.5.1 (pre-built against Hadoop 2.6 from 
> http://spark.apache.org/downloads.html.)
> {code}
> /* x is {"xx":1}{"xx":2} and y is just {"yy":1}{"yy:2} */
> val x = sql("select 1 xx union all select 2") 
> val y = sql("select 1 yy union all select 2")
> x.join(y, $"xx" === $"yy").count() /* expect 2, get 0 */
> /* If I cache both tables it works: */
> x.cache()
> y.cache()
> x.join(y, $"xx" === $"yy").count() /* expect 2, get 2 */
> /* but this still doesn't work: */
> x.join(y, $"xx" === $"yy").filter("yy=1").count() /* expect 1, get 0 */
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10914) Incorrect empty join sets when executor-memory >= 32g

2015-10-08 Thread Ben Moran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14948464#comment-14948464
 ] 

Ben Moran commented on SPARK-10914:
---

I also don't see it if I run spark-shell without setting --master.

> Incorrect empty join sets when executor-memory >= 32g
> -
>
> Key: SPARK-10914
> URL: https://issues.apache.org/jira/browse/SPARK-10914
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0, 1.5.1
> Environment: Ubuntu 14.04 (spark-slave), 12.04 (master)
>Reporter: Ben Moran
>
> Using an inner join, to match together two integer columns, I generally get 
> no results when there should be matches.  But the results vary and depend on 
> whether the dataframes are coming from SQL, JSON, or cached, as well as the 
> order in which I cache things and query them.
> This minimal example reproduces it consistently for me in the spark-shell, on 
> new installs of both 1.5.0 and 1.5.1 (pre-built against Hadoop 2.6 from 
> http://spark.apache.org/downloads.html.)
> {code}
> /* x is {"xx":1}{"xx":2} and y is just {"yy":1}{"yy:2} */
> val x = sql("select 1 xx union all select 2") 
> val y = sql("select 1 yy union all select 2")
> x.join(y, $"xx" === $"yy").count() /* expect 2, get 0 */
> /* If I cache both tables it works: */
> x.cache()
> y.cache()
> x.join(y, $"xx" === $"yy").count() /* expect 2, get 2 */
> /* but this still doesn't work: */
> x.join(y, $"xx" === $"yy").filter("yy=1").count() /* expect 1, get 0 */
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10925) Exception when joining DataFrames

2015-10-08 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-10925:
--
Component/s: SQL

> Exception when joining DataFrames
> -
>
> Key: SPARK-10925
> URL: https://issues.apache.org/jira/browse/SPARK-10925
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0, 1.5.1
> Environment: Tested with Spark 1.5.0 and Spark 1.5.1
>Reporter: Alexis Seigneurin
> Attachments: Photo 05-10-2015 14 31 16.jpg, TestCase2.scala
>
>
> I get an exception when joining a DataFrame with another DataFrame. The 
> second DataFrame was created by performing an aggregation on the first 
> DataFrame.
> My complete workflow is:
> # read the DataFrame
> # apply an UDF on column "name"
> # apply an UDF on column "surname"
> # apply an UDF on column "birthDate"
> # aggregate on "name" and re-join with the DF
> # aggregate on "surname" and re-join with the DF
> If I remove one step, the process completes normally.
> Here is the exception:
> {code}
> Exception in thread "main" org.apache.spark.sql.AnalysisException: resolved 
> attribute(s) surname#20 missing from id#0,birthDate#3,name#10,surname#7 in 
> operator !Project [id#0,birthDate#3,name#10,surname#20,UDF(birthDate#3) AS 
> birthDate_cleaned#8];
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:49)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44)
>   at 
> org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:914)
>   at org.apache.spark.sql.DataFrame.(DataFrame.scala:132)
>   at 
> org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$logicalPlanToDataFrame(DataFrame.scala:154)
>   at org.apache.spark.sql.DataFrame.join(DataFrame.scala:553)
>   at org.apache.spark.sql.DataFrame.join(DataFrame.scala:520)
>   at TestCase2$.main(TestCase2.scala:51)
>   at TestCase2.main(TestCase2.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:497)
>   at com.intellij.rt.execution.application.AppMain.main(AppMain.java:140)
> {code}
> I'm attaching a test case 

[jira] [Updated] (SPARK-10939) Misaligned data with RDD.zip after repartition

2015-10-08 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-10939:
--
Component/s: Spark Core

> Misaligned data with RDD.zip after repartition
> --
>
> Key: SPARK-10939
> URL: https://issues.apache.org/jira/browse/SPARK-10939
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.3.0, 1.4.1, 1.5.0
> Environment: - OSX 10.10.4, java 1.7.0_51, hadoop 2.6.0-cdh5.4.5
> - Ubuntu 12.04, java 1.7.0_80, hadoop 2.6.0-cdh5.4.5
>Reporter: Dan Brown
>
> Split out from https://issues.apache.org/jira/browse/SPARK-10685:
> Here's a weird behavior where {{RDD.zip}} after a {{repartition}} produces 
> "misaligned" data, meaning different column values in the same row aren't 
> matched, as if a zip shuffled the collections before zipping them. It's 
> difficult to reproduce because it's nondeterministic, doesn't occur in local 
> mode, and requires ≥2 workers (≥3 in one case). I was able to repro it using 
> pyspark 1.3.0 (cdh5.4.5), 1.4.1 (bin-without-hadoop), and 1.5.0 
> (bin-without-hadoop).
> Also, this {{DataFrame.zip}} issue is related in spirit, since we were trying 
> to build it ourselves when we ran into this problem. Let me put in my vote 
> for reopening the issue and supporting {{DataFrame.zip}} in the standard lib.
> - https://issues.apache.org/jira/browse/SPARK-7460
> h3. Repro
> Fail: RDD.zip after repartition
> {code}
> df  = sqlCtx.createDataFrame(Row(a=a) for a in xrange(1))
> df  = df.repartition(100)
> rdd = df.rdd.zip(df.map(lambda r: Row(b=r.a))).map(lambda (x,y): Row(a=x.a, 
> b=y.b))
> [r for r in rdd.collect() if r.a != r.b][:3] # Should be []
> {code}
> Sample outputs (nondeterministic):
> {code}
> []
> [Row(a=50, b=6947), Row(a=150, b=7047), Row(a=250, b=7147)]
> []
> []
> [Row(a=44, b=644), Row(a=144, b=744), Row(a=244, b=844)]
> []
> {code}
> Test setup:
> - local\[8]: {{MASTER=local\[8]}}
> - dist\[N]: 1 driver + 1 master + N workers
> {code}
> "Fail" tests pass?  cluster mode  spark version
> 
> yes local[8]  1.3.0-cdh5.4.5
> no  dist[4]   1.3.0-cdh5.4.5
> yes local[8]  1.4.1
> yes dist[1]   1.4.1
> no  dist[2]   1.4.1
> no  dist[4]   1.4.1
> yes local[8]  1.5.0
> yes dist[1]   1.5.0
> no  dist[2]   1.5.0
> no  dist[4]   1.5.0
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10995) Graceful shutdown drops processing in Spark Streaming

2015-10-08 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14948693#comment-14948693
 ] 

Sean Owen commented on SPARK-10995:
---

TD's the expert, but I don't really get that -- if you're processing the last 
hour of data each minute, then I'd expect shutdown to process the current 
minute, not for another hour.

Here however your batch interval and window are the same. In that case do you 
need a window at all?
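
(One reading of that suggestion, as a rough Scala sketch: with the window and slide 
both at 15 minutes, the same output could come from a 15-minute batch interval with no 
window at all. The receiver and storage calls stand in for the ones in the Java snippet 
below; {{parseRecord}} is hypothetical.)

{code}
import org.apache.spark.streaming.{Minutes, StreamingContext}

// Sketch only: batch interval set to the former slide duration, no window.
val ssc = new StreamingContext(sparkConf, Minutes(15))
ssc.checkpoint("/test")
val records = ssc.receiverStream(myReliableReceiver).flatMap(parseRecord)
records.foreachRDD(rdd => saveToMyStorage(rdd))
ssc.start()
{code}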

> Graceful shutdown drops processing in Spark Streaming
> -
>
> Key: SPARK-10995
> URL: https://issues.apache.org/jira/browse/SPARK-10995
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.4.1
>Reporter: Michal Cizmazia
>
> After triggering the graceful shutdown on the following application, the 
> application stops before the windowed stream reaches its slide duration. As a 
> result, the data is not completely processed (i.e. saveToMyStorage is not 
> called) before shutdown.
> According to the documentation, graceful shutdown should ensure that the 
> data, which has been received, is completely processed before shutdown.
> https://spark.apache.org/docs/1.4.0/streaming-programming-guide.html#upgrading-application-code
> Spark version: 1.4.1
> Code snippet:
> {code:java}
> Function0 factory = () -> {
> JavaStreamingContext context = new JavaStreamingContext(sparkConf, 
> Durations.minutes(1));
> context.checkpoint("/test");
> JavaDStream records = 
> context.receiverStream(myReliableReceiver).flatMap(...);
> records.persist(StorageLevel.MEMORY_AND_DISK());
> records.foreachRDD(rdd -> { rdd.count(); return null; });
> records
> .window(Durations.minutes(15), Durations.minutes(15))
> .foreachRDD(rdd -> saveToMyStorage(rdd));
> return context;
> };
> try (JavaStreamingContext context = JavaStreamingContext.getOrCreate("/test", 
> factory)) {
> context.start();
> waitForShutdownSignal();
> Boolean stopSparkContext = true;
> Boolean stopGracefully = true;
> context.stop(stopSparkContext, stopGracefully);
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4105) FAILED_TO_UNCOMPRESS(5) errors when fetching shuffle data with sort-based shuffle

2015-10-08 Thread Imran Rashid (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14948678#comment-14948678
 ] 

Imran Rashid commented on SPARK-4105:
-

[~nadenf] The FetchFailures don't need to be on the same node as the snappy 
exceptions for the root cause to be SPARK-8029.  Of course, we can't be sure 
that is the cause either, but it is at least a working hypothesis for now.

> FAILED_TO_UNCOMPRESS(5) errors when fetching shuffle data with sort-based 
> shuffle
> -
>
> Key: SPARK-4105
> URL: https://issues.apache.org/jira/browse/SPARK-4105
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, Spark Core
>Affects Versions: 1.2.0, 1.2.1, 1.3.0, 1.4.1
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>Priority: Blocker
> Attachments: JavaObjectToSerialize.java, 
> SparkFailedToUncompressGenerator.scala
>
>
> We have seen non-deterministic {{FAILED_TO_UNCOMPRESS(5)}} errors during 
> shuffle read.  Here's a sample stacktrace from an executor:
> {code}
> 14/10/23 18:34:11 ERROR Executor: Exception in task 1747.3 in stage 11.0 (TID 
> 33053)
> java.io.IOException: FAILED_TO_UNCOMPRESS(5)
>   at org.xerial.snappy.SnappyNative.throw_error(SnappyNative.java:78)
>   at org.xerial.snappy.SnappyNative.rawUncompress(Native Method)
>   at org.xerial.snappy.Snappy.rawUncompress(Snappy.java:391)
>   at org.xerial.snappy.Snappy.uncompress(Snappy.java:427)
>   at 
> org.xerial.snappy.SnappyInputStream.readFully(SnappyInputStream.java:127)
>   at 
> org.xerial.snappy.SnappyInputStream.readHeader(SnappyInputStream.java:88)
>   at org.xerial.snappy.SnappyInputStream.(SnappyInputStream.java:58)
>   at 
> org.apache.spark.io.SnappyCompressionCodec.compressedInputStream(CompressionCodec.scala:128)
>   at 
> org.apache.spark.storage.BlockManager.wrapForCompression(BlockManager.scala:1090)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator$$anon$1$$anonfun$onBlockFetchSuccess$1.apply(ShuffleBlockFetcherIterator.scala:116)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator$$anon$1$$anonfun$onBlockFetchSuccess$1.apply(ShuffleBlockFetcherIterator.scala:115)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:243)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:52)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
>   at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:30)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at 
> org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:129)
>   at 
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159)
>   at 
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:158)
>   at 
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   at 
> scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
>   at org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:158)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>   at 
> org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>   at 
> org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>   at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>   at org.apache.spark.scheduler.Task.run(Task.scala:56)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:181)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>  

[jira] [Created] (SPARK-11003) Allowing UserDefinedTypes to extend primitives

2015-10-08 Thread John Muller (JIRA)
John Muller created SPARK-11003:
---

 Summary: Allowing UserDefinedTypes to extend primitives
 Key: SPARK-11003
 URL: https://issues.apache.org/jira/browse/SPARK-11003
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.5.1, 1.5.0
Reporter: John Muller
Priority: Minor


Currently, the classes and constructors of all the primitive DataTypes (of 
StructFields) are private:

https://github.com/apache/spark/tree/master/sql/catalyst/src/main/scala/org/apache/spark/sql/types

This means that even for simple String-based UDTs, users will always have to 
implement serialize() and deserialize(). UDTs for something as simple as a 
Northwind database (products, orders, customers) would be very useful for 
pattern matching / validation. For example:

{code}
import org.apache.spark.sql.types._

@SQLUserDefinedType(udt = classOf[ProductNameUDT])
case class ProductName(name: String) extends StringType with Validator {
  import scala.util.matching.Regex
  private val pattern = """[A-Z][A-Za-z]*""".r
  def validate(): Boolean = {
    name match {
      case pattern(_*) => true
      case _ => false
    }
  }
}

class ProductNameUDT extends UserDefinedType[ProductName] {
  // No need for this; ProductName is a StringType so we know how to serialize it
  override def serialize(p: Any): Any = {
    p match {
      case p: ProductName => Seq(p.name)
    }
  }

  // Not sure why this override is needed at all; can't we always get this
  // simply by the UDT type param?
  override def userClass: Class[ProductName] = classOf[ProductName]

  // Instead of the below, just infer the StructField name via reflection of
  // the wrapper class' name
  override def sqlType: DataType =
    StructType(Seq(StructField("ProductName", StringType)))

  // Still needed.
  override def deserialize(datum: Any): ProductName = {
    datum match {
      case values: Seq[_] =>
        assert(values.length == 1)
        ProductName(values.head.asInstanceOf[String])
    }
  }
}
{code}

This would simplify the process of creating "primitive extension" UDTs down to 
just 2 steps:
1. An annotated case class that extends a primitive DataType
2. A UDT that only needs a deserializer



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7874) Add a global setting for the fine-grained mesos scheduler that limits the number of concurrent tasks of a job

2015-10-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14948638#comment-14948638
 ] 

Apache Spark commented on SPARK-7874:
-

User 'dragos' has created a pull request for this issue:
https://github.com/apache/spark/pull/9027

> Add a global setting for the fine-grained mesos scheduler that limits the 
> number of concurrent tasks of a job
> -
>
> Key: SPARK-7874
> URL: https://issues.apache.org/jira/browse/SPARK-7874
> Project: Spark
>  Issue Type: Wish
>  Components: Mesos
>Affects Versions: 1.3.1
>Reporter: Thomas Dudziak
>Priority: Minor
>
> This would be a very simple yet effective way to prevent a job dominating the 
> cluster. A way to override it per job would also be nice but not required.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10995) Graceful shutdown drops processing in Spark Streaming

2015-10-08 Thread Michal Cizmazia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14948667#comment-14948667
 ] 

Michal Cizmazia commented on SPARK-10995:
-

[On 7 October 2015 at 21:24, Tathagata 
Das:|http://search-hadoop.com/m/q3RTtftj6Y1Mu9z1]
{quote}
Aaah, interesting, you are doing 15 minute slide duration. Yeah, internally the 
streaming scheduler waits for the last "batch" interval which has data to be 
processed, but if there is a sliding interval (i.e. 15 mins) that is higher 
than batch interval, then that might not be run. This is indeed a bug and 
should be fixed. Mind setting up a JIRA and assigning it to me. 
{quote}

> Graceful shutdown drops processing in Spark Streaming
> -
>
> Key: SPARK-10995
> URL: https://issues.apache.org/jira/browse/SPARK-10995
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.4.1
>Reporter: Michal Cizmazia
>
> After triggering the graceful shutdown on the following application, the 
> application stops before the windowed stream reaches its slide duration. As a 
> result, the data is not completely processed (i.e. saveToMyStorage is not 
> called) before shutdown.
> According to the documentation, graceful shutdown should ensure that the 
> data, which has been received, is completely processed before shutdown.
> https://spark.apache.org/docs/1.4.0/streaming-programming-guide.html#upgrading-application-code
> Spark version: 1.4.1
> Code snippet:
> {code:java}
> Function0 factory = () -> {
> JavaStreamingContext context = new JavaStreamingContext(sparkConf, 
> Durations.minutes(1));
> context.checkpoint("/test");
> JavaDStream records = 
> context.receiverStream(myReliableReceiver).flatMap(...);
> records.persist(StorageLevel.MEMORY_AND_DISK());
> records.foreachRDD(rdd -> { rdd.count(); return null; });
> records
> .window(Durations.minutes(15), Durations.minutes(15))
> .foreachRDD(rdd -> saveToMyStorage(rdd));
> return context;
> };
> try (JavaStreamingContext context = JavaStreamingContext.getOrCreate("/test", 
> factory)) {
> context.start();
> waitForShutdownSignal();
> Boolean stopSparkContext = true;
> Boolean stopGracefully = true;
> context.stop(stopSparkContext, stopGracefully);
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11004) MapReduce Hive-like join operations for RDDs

2015-10-08 Thread Glenn Strycker (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Glenn Strycker updated SPARK-11004:
---
Description: 
Could a feature be added to Spark that would use disk-only MapReduce operations 
for the very largest RDD joins?

MapReduce is able to handle incredibly large table joins in a stable, 
predictable way with gracious failures and recovery.  I have applications that 
are able to join 2 tables without error in Hive, but these same tables, when 
converted into RDDs, are unable to join in Spark (I am using the same cluster, 
and have played around with all of the memory configurations, persisting to 
disk, checkpointing, etc., and the RDDs are just too big for Spark on my 
cluster)

So, Spark is usually able to handle fairly large RDD joins, but occasionally 
runs into problems when the tables are just too big (e.g. the notorious 2GB 
shuffle limit issue, memory problems, etc.)  There are so many parameters to 
adjust (number of partitions, number of cores, memory per core, etc.) that it 
is difficult to guarantee stability on a shared cluster (say, running Yarn) 
with other jobs.

Could a feature be added to Spark that would use disk-only MapReduce commands 
to do very large joins?

That is, instead of myRDD1.join(myRDD2), we would have a special operation 
myRDD1.mapReduceJoin(myRDD2) that would checkpoint both RDDs to disk, run 
MapReduce, and then convert the results of the join back into a standard RDD.

This might add stability for Spark jobs that deal with extremely large data, 
and enable developers to mix-and-match some Spark and MapReduce operations in 
the same program, rather than writing Hive scripts and stringing together Spark 
and MapReduce programs, which has extremely large overhead to convert RDDs to 
Hive tables and back again.

Despite memory-level operations being where most of Spark's speed gains lie, 
sometimes using disk-only may help with stability!


  was:
Could a feature be added to Spark that would use disk-only MapReduce operations 
for the very largest RDD joins?

MapReduce is able to handle incredibly large table joins in a stable, 
predictable way with gracious failures and recovery.  I have applications that 
are able to join 2 tables without error in Hive, but these same tables, when 
converted into RDDs, are unable to join in Spark (I am using the same cluster, 
and have played around with all of the memory configurations, persisting to 
disk, checkpointing, etc., and the RDDs are just too big for Spark on my 
cluster)

So, Spark is usually able to handle fairly large RDD joins, but occasionally 
runs into problems when the tables are just too big (e.g. the notorious 2GB 
shuffle limit issue, memory problems, etc.)  There are so many parameters to 
adjust (number of partitions, number of cores, memory per core, etc.) that it 
is difficult to guarantee stability on a shared cluster (say, running Yarn) 
with other jobs.

Could a feature be added to Spark that would use disk-only MapReduce commands 
to do very large joins?

That is, instead of myRDD1.join(myRDD2), we would have a special operation 
myRDD1.mapReduceJoin(myRDD2) that would checkpoint both RDDs to disk, run 
MapReduce, and then convert the results of the join back into a standard RDD.

This might add stability for Spark jobs that deal with extremely large data, 
and enable developers to mix-and-max some Spark and MapReduce operations in the 
same program, rather than writing Hive scripts and stringing together Spark and 
MapReduce programs, which has extremely large overhead to convert RDDs to Hive 
tables and back again.

Despite memory-level operations being where most of Spark's speed gains lie, 
sometimes using disk-only may help with stability!



> MapReduce Hive-like join operations for RDDs
> 
>
> Key: SPARK-11004
> URL: https://issues.apache.org/jira/browse/SPARK-11004
> Project: Spark
>  Issue Type: New Feature
>Reporter: Glenn Strycker
>
> Could a feature be added to Spark that would use disk-only MapReduce 
> operations for the very largest RDD joins?
> MapReduce is able to handle incredibly large table joins in a stable, 
> predictable way with gracious failures and recovery.  I have applications 
> that are able to join 2 tables without error in Hive, but these same tables, 
> when converted into RDDs, are unable to join in Spark (I am using the same 
> cluster, and have played around with all of the memory configurations, 
> persisting to disk, checkpointing, etc., and the RDDs are just too big for 
> Spark on my cluster)
> So, Spark is usually able to handle fairly large RDD joins, but occasionally 
> runs into problems when the tables are just too big (e.g. the notorious 2GB 
> shuffle limit issue, memory problems, etc.)  There are so many parameters to 
> adjust (number of 

[jira] [Created] (SPARK-11004) MapReduce Hive-like join operations for RDDs

2015-10-08 Thread Glenn Strycker (JIRA)
Glenn Strycker created SPARK-11004:
--

 Summary: MapReduce Hive-like join operations for RDDs
 Key: SPARK-11004
 URL: https://issues.apache.org/jira/browse/SPARK-11004
 Project: Spark
  Issue Type: New Feature
Reporter: Glenn Strycker


Could a feature be added to Spark that would use disk-only MapReduce operations 
for the very largest RDD joins?

MapReduce is able to handle incredibly large table joins in a stable, 
predictable way with gracious failures and recovery.  I have applications that 
are able to join 2 tables without error in Hive, but these same tables, when 
converted into RDDs, are unable to join in Spark (I am using the same cluster, 
and have played around with all of the memory configurations, persisting to 
disk, checkpointing, etc., and the RDDs are just too big for Spark on my 
cluster)

So, Spark is usually able to handle fairly large RDD joins, but occasionally 
runs into problems when the tables are just too big (e.g. the notorious 2GB 
shuffle limit issue, memory problems, etc.)  There are so many parameters to 
adjust (number of partitions, number of cores, memory per core, etc.) that it 
is difficult to guarantee stability on a shared cluster (say, running Yarn) 
with other jobs.

Could a feature be added to Spark that would use disk-only MapReduce commands 
to do very large joins?

That is, instead of myRDD1.join(myRDD2), we would have a special operation 
myRDD1.mapReduceJoin(myRDD2) that would checkpoint both RDDs to disk, run 
MapReduce, and then convert the results of the join back into a standard RDD.

This might add stability for Spark jobs that deal with extremely large data, 
and enable developers to mix-and-max some Spark and MapReduce operations in the 
same program, rather than writing Hive scripts and stringing together Spark and 
MapReduce programs, which has extremely large overhead to convert RDDs to Hive 
tables and back again.

Despite memory-level operations being where most of Spark's speed gains lie, 
sometimes using disk-only may help with stability!
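
(For contrast, not the proposed mapReduceJoin: a rough Scala sketch of the kind of 
disk-oriented tuning already mentioned above, assuming myRDD1 and myRDD2 are pair RDDs. 
The partition count is an arbitrary illustration.)

{code}
import org.apache.spark.storage.StorageLevel

// Sketch only: spill both sides to disk and raise the partition count so that
// individual shuffle blocks stay small, then join as usual.
val left   = myRDD1.persist(StorageLevel.DISK_ONLY)
val right  = myRDD2.persist(StorageLevel.DISK_ONLY)
val joined = left.join(right, 4096)
{code}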




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10999) Physical plan node Coalesce should be able to handle UnsafeRow

2015-10-08 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian resolved SPARK-10999.

   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 9024
[https://github.com/apache/spark/pull/9024]

> Physical plan node Coalesce should be able to handle UnsafeRow
> --
>
> Key: SPARK-10999
> URL: https://issues.apache.org/jira/browse/SPARK-10999
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>Priority: Minor
> Fix For: 1.6.0
>
>
> The following PySpark snippet shows the problem:
> {noformat}
> >>> sqlContext.range(10).selectExpr('id AS a').coalesce(1).explain(True)
> ...
> == Physical Plan ==
> Coalesce 1
>  ConvertToSafe
>   TungstenProject [id#3L AS a#4L]
>Scan PhysicalRDD[id#3L]
> {noformat}
> The {{ConvertToSafe}} is unnecessary.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11004) MapReduce Hive-like join operations for RDDs

2015-10-08 Thread Glenn Strycker (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14948943#comment-14948943
 ] 

Glenn Strycker commented on SPARK-11004:


True, fixing the 2GB limit will go a long way.  However, this isn't a bug ticket, but 
a request for a new Spark feature in future versions: the ability to run MapReduce 
on RDDs.  This would add a lot of functionality for many possible use cases.

This request is asking for an alternate back-end for the join operation, so if 
the normal rdd.join operation fails, a developer could force using this 
alternate MapReduce method that would run on disk rather than memory.

The reason for this request is that there are actually scenarios where 
MapReduce really is preferable to Spark, but it is inefficient for developers 
to mix code and have to run programs with scripts.  The request here is for 
additional back-end functionality for basic RDDs operations that fail at scale 
for Spark but work correctly in MapReduce.

> MapReduce Hive-like join operations for RDDs
> 
>
> Key: SPARK-11004
> URL: https://issues.apache.org/jira/browse/SPARK-11004
> Project: Spark
>  Issue Type: New Feature
>Reporter: Glenn Strycker
>
> Could a feature be added to Spark that would use disk-only MapReduce 
> operations for the very largest RDD joins?
> MapReduce is able to handle incredibly large table joins in a stable, 
> predictable way with gracious failures and recovery.  I have applications 
> that are able to join 2 tables without error in Hive, but these same tables, 
> when converted into RDDs, are unable to join in Spark (I am using the same 
> cluster, and have played around with all of the memory configurations, 
> persisting to disk, checkpointing, etc., and the RDDs are just too big for 
> Spark on my cluster)
> So, Spark is usually able to handle fairly large RDD joins, but occasionally 
> runs into problems when the tables are just too big (e.g. the notorious 2GB 
> shuffle limit issue, memory problems, etc.)  There are so many parameters to 
> adjust (number of partitions, number of cores, memory per core, etc.) that it 
> is difficult to guarantee stability on a shared cluster (say, running Yarn) 
> with other jobs.
> Could a feature be added to Spark that would use disk-only MapReduce commands 
> to do very large joins?
> That is, instead of myRDD1.join(myRDD2), we would have a special operation 
> myRDD1.mapReduceJoin(myRDD2) that would checkpoint both RDDs to disk, run 
> MapReduce, and then convert the results of the join back into a standard RDD.
> This might add stability for Spark jobs that deal with extremely large data, 
> and enable developers to mix-and-match some Spark and MapReduce operations in 
> the same program, rather than writing Hive scripts and stringing together 
> Spark and MapReduce programs, which has extremely large overhead to convert 
> RDDs to Hive tables and back again.
> Despite memory-level operations being where most of Spark's speed gains lie, 
> sometimes using disk-only may help with stability!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11005) Spark 1.5 Shuffle performance - (sort-based shuffle)

2015-10-08 Thread Sandeep Pal (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Pal updated SPARK-11005:

Summary: Spark 1.5 Shuffle performance - (sort-based shuffle)  (was: Spark 
1.5 Shuffle performance)

> Spark 1.5 Shuffle performance - (sort-based shuffle)
> 
>
> Key: SPARK-11005
> URL: https://issues.apache.org/jira/browse/SPARK-11005
> Project: Spark
>  Issue Type: Question
>  Components: Shuffle, SQL
>Affects Versions: 1.5.0
> Environment: 6 node cluster with 1 master and 5 worker nodes.
> Memory > 100 GB each
> Cores = 72 each
> Input data ~496 GB
>Reporter: Sandeep Pal
>
> In the case of terasort via Spark SQL with 20 total cores (4 cores per 
> executor), the map tasks take 14 minutes (around 26s-30s each), whereas if I 
> increase the number of cores to 60 (12 cores per executor), the map stage 
> degrades to 30 minutes (~2.3 minutes per task). I believe the map tasks are 
> independent of each other in the shuffle. 
> Each map task has 128 MB of input (the HDFS block size) in both of the above cases. 
> So what causes the performance degradation as the number of cores increases?
>   



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10987) yarn-client mode misbehaving with netty-based RPC backend

2015-10-08 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-10987.

   Resolution: Fixed
 Assignee: Marcelo Vanzin
Fix Version/s: 1.6.0

> yarn-client mode misbehaving with netty-based RPC backend
> -
>
> Key: SPARK-10987
> URL: https://issues.apache.org/jira/browse/SPARK-10987
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, YARN
>Affects Versions: 1.6.0
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
>Priority: Blocker
> Fix For: 1.6.0
>
>
> YARN running in cluster deploy mode seems to be having issues with the new 
> RPC backend; if you look at unit test runs, tests that run in cluster mode 
> are taking several minutes to run, instead of the more usual 20-30 seconds.
> For example, 
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43349/consoleFull:
> {noformat}
> [info] YarnClusterSuite:
> [info] - run Spark in yarn-client mode (13 seconds, 953 milliseconds)
> [info] - run Spark in yarn-cluster mode (6 minutes, 50 seconds)
> [info] - run Spark in yarn-cluster mode unsuccessfully (1 minute, 53 seconds)
> [info] - run Python application in yarn-client mode (21 seconds, 842 
> milliseconds)
> [info] - run Python application in yarn-cluster mode (7 minutes, 0 seconds)
> [info] - user class path first in client mode (1 minute, 58 seconds)
> [info] - user class path first in cluster mode (4 minutes, 49 seconds)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11004) MapReduce Hive-like join operations for RDDs

2015-10-08 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14948920#comment-14948920
 ] 

Sean Owen commented on SPARK-11004:
---

Spark has had a sort-based shuffle for a while, which is a lot of the battle.
Yes, there's already a known issue about the 2GB block size limit.
Is there a specific issue here beyond what's already in JIRA?

> MapReduce Hive-like join operations for RDDs
> 
>
> Key: SPARK-11004
> URL: https://issues.apache.org/jira/browse/SPARK-11004
> Project: Spark
>  Issue Type: New Feature
>Reporter: Glenn Strycker
>
> Could a feature be added to Spark that would use disk-only MapReduce 
> operations for the very largest RDD joins?
> MapReduce is able to handle incredibly large table joins in a stable, 
> predictable way with graceful failures and recovery.  I have applications 
> that are able to join 2 tables without error in Hive, but these same tables, 
> when converted into RDDs, are unable to join in Spark (I am using the same 
> cluster, and have played around with all of the memory configurations, 
> persisting to disk, checkpointing, etc., and the RDDs are just too big for 
> Spark on my cluster)
> So, Spark is usually able to handle fairly large RDD joins, but occasionally 
> runs into problems when the tables are just too big (e.g. the notorious 2GB 
> shuffle limit issue, memory problems, etc.)  There are so many parameters to 
> adjust (number of partitions, number of cores, memory per core, etc.) that it 
> is difficult to guarantee stability on a shared cluster (say, running Yarn) 
> with other jobs.
> Could a feature be added to Spark that would use disk-only MapReduce commands 
> to do very large joins?
> That is, instead of myRDD1.join(myRDD2), we would have a special operation 
> myRDD1.mapReduceJoin(myRDD2) that would checkpoint both RDDs to disk, run 
> MapReduce, and then convert the results of the join back into a standard RDD.
> This might add stability for Spark jobs that deal with extremely large data, 
> and enable developers to mix-and-match some Spark and MapReduce operations in 
> the same program, rather than writing Hive scripts and stringing together 
> Spark and MapReduce programs, which has extremely large overhead to convert 
> RDDs to Hive tables and back again.
> Despite memory-level operations being where most of Spark's speed gains lie, 
> sometimes using disk-only may help with stability!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10942) Not all cached RDDs are unpersisted

2015-10-08 Thread Nick Pritchard (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14948928#comment-14948928
 ] 

Nick Pritchard commented on SPARK-10942:


Thanks [~sowen] for trying! I'll let it go.

> Not all cached RDDs are unpersisted
> ---
>
> Key: SPARK-10942
> URL: https://issues.apache.org/jira/browse/SPARK-10942
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Reporter: Nick Pritchard
>Priority: Minor
> Attachments: SPARK-10942_1.png, SPARK-10942_2.png, SPARK-10942_3.png
>
>
> I have a Spark Streaming application that caches RDDs inside of a 
> {{transform}} closure. Looking at the Spark UI, it seems that most of these 
> RDDs are unpersisted after the batch completes, but not all.
> I have copied a minimal reproducible example below to highlight the problem. 
> I run this and monitor the Spark UI "Storage" tab. The example generates and 
> caches 30 RDDs, and I see most get cleaned up. However in the end, some still 
> remain cached. There is some randomness going on because I see different RDDs 
> remain cached for each run.
> I have marked this as Major because I haven't been able to workaround it and 
> it is a memory leak for my application. I tried setting {{spark.cleaner.ttl}} 
> but that did not change anything.
> {code}
> import scala.collection.mutable
> import org.apache.spark.streaming.dstream.DStream
>
> // sc is the SparkContext and ssc the StreamingContext (assumed already created).
> val inputRDDs = mutable.Queue.tabulate(30) { i =>
>   sc.parallelize(Seq(i))
> }
> val input: DStream[Int] = ssc.queueStream(inputRDDs)
> val output = input.transform { rdd =>
>   if (rdd.isEmpty()) {
> rdd
>   } else {
> val rdd2 = rdd.map(identity)
> rdd2.setName(rdd.first().toString)
> rdd2.cache()
> val rdd3 = rdd2.map(identity)
> rdd3
>   }
> }
> output.print()
> ssc.start()
> ssc.awaitTermination()
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10858) YARN: archives/jar/files rename with # doesn't work unless scheme given

2015-10-08 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves reassigned SPARK-10858:
-

Assignee: Thomas Graves

> YARN: archives/jar/files rename with # doesn't work unless scheme given
> ---
>
> Key: SPARK-10858
> URL: https://issues.apache.org/jira/browse/SPARK-10858
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.5.1
>Reporter: Thomas Graves
>Assignee: Thomas Graves
>Priority: Minor
>
> The YARN distributed cache feature with --jars, --archives, --files where you 
> can rename the file/archive using a # symbol only works if you explicitly 
> include the scheme in the path:
> works:
> --jars file:///home/foo/my.jar#renamed.jar
> doesn't work:
> --jars /home/foo/my.jar#renamed.jar
> Exception in thread "main" java.io.FileNotFoundException: File 
> file:/home/foo/my.jar#renamed.jar does not exist
> at 
> org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:534)
> at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:747)
> at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:524)
> at 
> org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:416)
> at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:337)
> at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:289)
> at 
> org.apache.spark.deploy.yarn.Client.copyFileToRemote(Client.scala:240)
> at 
> org.apache.spark.deploy.yarn.Client.org$apache$spark$deploy$yarn$Client$$distribute$1(Client.scala:329)
> at 
> org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$6$$anonfun$apply$2.apply(Client.scala:393)
> at 
> org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$6$$anonfun$apply$2.apply(Client.scala:392)
> at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
> at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10858) YARN: archives/jar/files rename with # doesn't work unless scheme given

2015-10-08 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14948941#comment-14948941
 ] 

Thomas Graves commented on SPARK-10858:
---

Sorry for the delay on this; I didn't have time to look at it.  Not sure why you 
are seeing something different from me.

Thanks for looking into this.  I agree it's in the parsing, in resolveURI, 
where it's calling new File(path).getAbsoluteFile().toURI().

When I don't specify file://:
15/10/08 15:35:56 INFO Client: local uri is: 
file:/homes/tgraves/R_install/R_install.tgz%23R_installation

with file://
15/10/08 15:38:27 INFO Client: local uri is: 
file:/homes/tgraves/R_install/R_install.tgz#R_installation

That is coming back with the %23 encoded instead of the #.  When I originally 
wrote that code it wasn't calling Utils.resolveURIs.

Looking at the actual code for File.toURI() you will see it's not really 
parsing the fragment out before calling URI(), which I think is the problem:

    public URI toURI() {
        try {
            File f = getAbsoluteFile();
            String sp = slashify(f.getPath(), f.isDirectory());
            if (sp.startsWith("//"))
                sp = "//" + sp;
            return new URI("file", null, sp, null);
        } catch (URISyntaxException x) {
            throw new Error(x); // Can't happen
        }
    }


It seems like a bad idea to call this, given that the string might already be in 
URI format.  So we are now going from a possible URI to a File and back to a URI. 
When we change it to a File, it's not expecting the string to already be a URI 
with a fragment, so it treats the fragment as part of the path.
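
A small sketch of the round trip being described (the path is illustrative only): 
File.toURI treats the whole string as a path and percent-encodes the #, whereas 
building the URI with an explicit fragment keeps it separate:
{code}
// Hypothetical local path using the # rename syntax from --jars/--files/--archives.
val raw = "/home/foo/my.jar#renamed.jar"

// Going through File treats the whole string as a path, so the # is
// percent-encoded into the path: file:/home/foo/my.jar%23renamed.jar
println(new java.io.File(raw).getAbsoluteFile.toURI)

// Building the URI with an explicit fragment keeps the # intact:
// file:/home/foo/my.jar#renamed.jar
println(new java.net.URI("file", null, "/home/foo/my.jar", "renamed.jar"))
{code}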


> YARN: archives/jar/files rename with # doesn't work unless scheme given
> ---
>
> Key: SPARK-10858
> URL: https://issues.apache.org/jira/browse/SPARK-10858
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.5.1
>Reporter: Thomas Graves
>Priority: Minor
>
> The YARN distributed cache feature with --jars, --archives, --files where you 
> can rename the file/archive using a # symbol only works if you explicitly 
> include the scheme in the path:
> works:
> --jars file:///home/foo/my.jar#renamed.jar
> doesn't work:
> --jars /home/foo/my.jar#renamed.jar
> Exception in thread "main" java.io.FileNotFoundException: File 
> file:/home/foo/my.jar#renamed.jar does not exist
> at 
> org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:534)
> at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:747)
> at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:524)
> at 
> org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:416)
> at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:337)
> at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:289)
> at 
> org.apache.spark.deploy.yarn.Client.copyFileToRemote(Client.scala:240)
> at 
> org.apache.spark.deploy.yarn.Client.org$apache$spark$deploy$yarn$Client$$distribute$1(Client.scala:329)
> at 
> org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$6$$anonfun$apply$2.apply(Client.scala:393)
> at 
> org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$6$$anonfun$apply$2.apply(Client.scala:392)
> at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
> at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11005) Spark 1.5 Shuffle performance

2015-10-08 Thread Sandeep Pal (JIRA)
Sandeep Pal created SPARK-11005:
---

 Summary: Spark 1.5 Shuffle performance
 Key: SPARK-11005
 URL: https://issues.apache.org/jira/browse/SPARK-11005
 Project: Spark
  Issue Type: Question
  Components: Shuffle, SQL
Affects Versions: 1.5.0
 Environment: 6 node cluster with 1 master and 5 worker nodes.
Memory > 100 GB each
Cores = 72 each
Input data ~496 GB
Reporter: Sandeep Pal


In the case of terasort via Spark SQL with 20 total cores (4 cores per executor), 
the map tasks take 14 minutes (around 26s-30s each), whereas if I increase the 
number of cores to 60 (12 cores per executor), the map stage degrades to 30 
minutes (~2.3 minutes per task). I believe the map tasks are independent of each 
other in the shuffle. 
Each map task has 128 MB of input (the HDFS block size) in both of the above cases. 
So what causes the performance degradation as the number of cores increases?
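
For reference, a sketch of the two configurations being compared, written as 
Spark properties (assuming YARN and one executor per worker; the property names 
are standard, the sizes come from the description above). If executor memory is 
held constant, each concurrent task in the second setup gets roughly a third of 
the memory, which is one plausible contributor to the slowdown:
{code}
import org.apache.spark.SparkConf

// ~20 total cores: 5 executors x 4 cores each.
val lowParallelism = new SparkConf()
  .set("spark.executor.instances", "5")
  .set("spark.executor.cores", "4")

// ~60 total cores: 5 executors x 12 cores each, i.e. 3x more concurrent tasks
// sharing the same executor memory and disks.
val highParallelism = new SparkConf()
  .set("spark.executor.instances", "5")
  .set("spark.executor.cores", "12")
{code}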









--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11005) Spark 1.5 Shuffle performance - (sort-based shuffle)

2015-10-08 Thread Sandeep Pal (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Pal updated SPARK-11005:

Environment: 
6 node cluster with 1 master and 5 worker nodes.
Memory > 100 GB each
Cores = 72 each
Input data ~94 GB

  was:
6 node cluster with 1 master and 5 worker nodes.
Memory > 100 GB each
Cores = 72 each
Input data ~496 GB


> Spark 1.5 Shuffle performance - (sort-based shuffle)
> 
>
> Key: SPARK-11005
> URL: https://issues.apache.org/jira/browse/SPARK-11005
> Project: Spark
>  Issue Type: Question
>  Components: Shuffle, SQL
>Affects Versions: 1.5.0
> Environment: 6 node cluster with 1 master and 5 worker nodes.
> Memory > 100 GB each
> Cores = 72 each
> Input data ~94 GB
>Reporter: Sandeep Pal
>
> In the case of terasort via Spark SQL with 20 total cores (4 cores per 
> executor), the map tasks take 14 minutes (around 26s-30s each), whereas if I 
> increase the number of cores to 60 (12 cores per executor), the map stage 
> degrades to 30 minutes (~2.3 minutes per task). I believe the map tasks are 
> independent of each other in the shuffle. 
> Each map task has 128 MB of input (the HDFS block size) in both of the above cases. 
> So what causes the performance degradation as the number of cores increases?
>   



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10836) SparkR: Add sort function to dataframe

2015-10-08 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman resolved SPARK-10836.
---
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 8920
[https://github.com/apache/spark/pull/8920]

> SparkR: Add sort function to dataframe
> --
>
> Key: SPARK-10836
> URL: https://issues.apache.org/jira/browse/SPARK-10836
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Narine Kokhlikyan
>Priority: Minor
> Fix For: 1.6.0
>
>
> Hi everyone,
> the sort function can be used as an alternative to arrange(... ).
> As arguments it accepts x - dataframe, decreasing - TRUE/FALSE, a list of 
> orderings for columns and the list of columns, represented as string names
> for example: 
>  sort(df, TRUE, "col1","col2","col3","col5") # for example, if we want to 
> sort some of the columns in the same order
>  sort(df, decreasing=TRUE, "col1")
>  sort(df, decreasing=c(TRUE,FALSE), "col1","col2")
> Thanks,
> Narine



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10836) SparkR: Add sort function to dataframe

2015-10-08 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman updated SPARK-10836:
--
Assignee: Narine Kokhlikyan

> SparkR: Add sort function to dataframe
> --
>
> Key: SPARK-10836
> URL: https://issues.apache.org/jira/browse/SPARK-10836
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Narine Kokhlikyan
>Assignee: Narine Kokhlikyan
>Priority: Minor
> Fix For: 1.6.0
>
>
> Hi everyone,
> the sort function can be used as an alternative to arrange(... ).
> As arguments it accepts x - dataframe, decreasing - TRUE/FALSE, a list of 
> orderings for columns and the list of columns, represented as string names
> for example: 
>  sort(df, TRUE, "col1","col2","col3","col5") # for example, if we want to 
> sort some of the columns in the same order
>  sort(df, decreasing=TRUE, "col1")
>  sort(df, decreasing=c(TRUE,FALSE), "col1","col2")
> Thanks,
> Narine



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10998) Show non-children in default Expression.toString

2015-10-08 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-10998.
--
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 9022
[https://github.com/apache/spark/pull/9022]

> Show non-children in default Expression.toString
> 
>
> Key: SPARK-10998
> URL: https://issues.apache.org/jira/browse/SPARK-10998
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Michael Armbrust
> Fix For: 1.6.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-8654) Analysis exception when using "NULL IN (...)": invalid cast

2015-10-08 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-8654.
-
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 8983
[https://github.com/apache/spark/pull/8983]

> Analysis exception when using "NULL IN (...)": invalid cast
> ---
>
> Key: SPARK-8654
> URL: https://issues.apache.org/jira/browse/SPARK-8654
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Santiago M. Mola
>Priority: Minor
> Fix For: 1.6.0
>
>
> The following query throws an analysis exception:
> {code}
> SELECT * FROM t WHERE NULL NOT IN (1, 2, 3);
> {code}
> The exception is:
> {code}
> org.apache.spark.sql.AnalysisException: invalid cast from int to null;
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:38)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:42)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:66)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:52)
> {code}
> Here is a test that can be added to AnalysisSuite to check the issue:
> {code}
>   test("SPARK- regression test") {
> val plan = Project(Alias(In(Literal(null), Seq(Literal(1), Literal(2))), 
> "a")() :: Nil,
>   LocalRelation()
> )
> caseInsensitiveAnalyze(plan)
>   }
> {code}
> Note that this kind of query is a corner case, but it is still valid SQL. An 
> expression such as "NULL IN (...)" or "NULL NOT IN (...)" always gives NULL 
> as a result, even if the list contains NULL. So it is safe to translate these 
> expressions to Literal(null) during analysis.
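
A minimal spark-shell illustration of the intended semantics (assuming the fix 
from the pull request above; before it, the same predicate failed analysis with 
the invalid-cast error):
{code}
// A tiny table to filter against.
sqlContext.range(5).registerTempTable("t")

// NULL NOT IN (1, 2, 3) evaluates to NULL rather than true, so the predicate
// never passes and the query returns no rows instead of throwing.
println(sql("SELECT * FROM t WHERE NULL NOT IN (1, 2, 3)").count())   // expected: 0
{code}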



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11004) MapReduce Hive-like join operations for RDDs

2015-10-08 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14949060#comment-14949060
 ] 

Sean Owen commented on SPARK-11004:
---

Literally run a Mapper and Reducer on Spark? I think it would be more efficient 
to run them via MapReduce. I imagine that most people using MR still have a 
Hadoop cluster handy that they're running Spark on too, so this is quite viable.

What's the MapReduce method you're referring to? Not literally calling MapReduce, 
right?
I'm trying to isolate the distinct change being described here. Otherwise I think 
this is already covered by existing improvements to Spark itself.

> MapReduce Hive-like join operations for RDDs
> 
>
> Key: SPARK-11004
> URL: https://issues.apache.org/jira/browse/SPARK-11004
> Project: Spark
>  Issue Type: New Feature
>Reporter: Glenn Strycker
>
> Could a feature be added to Spark that would use disk-only MapReduce 
> operations for the very largest RDD joins?
> MapReduce is able to handle incredibly large table joins in a stable, 
> predictable way with graceful failures and recovery.  I have applications 
> that are able to join 2 tables without error in Hive, but these same tables, 
> when converted into RDDs, are unable to join in Spark (I am using the same 
> cluster, and have played around with all of the memory configurations, 
> persisting to disk, checkpointing, etc., and the RDDs are just too big for 
> Spark on my cluster)
> So, Spark is usually able to handle fairly large RDD joins, but occasionally 
> runs into problems when the tables are just too big (e.g. the notorious 2GB 
> shuffle limit issue, memory problems, etc.)  There are so many parameters to 
> adjust (number of partitions, number of cores, memory per core, etc.) that it 
> is difficult to guarantee stability on a shared cluster (say, running Yarn) 
> with other jobs.
> Could a feature be added to Spark that would use disk-only MapReduce commands 
> to do very large joins?
> That is, instead of myRDD1.join(myRDD2), we would have a special operation 
> myRDD1.mapReduceJoin(myRDD2) that would checkpoint both RDDs to disk, run 
> MapReduce, and then convert the results of the join back into a standard RDD.
> This might add stability for Spark jobs that deal with extremely large data, 
> and enable developers to mix-and-match some Spark and MapReduce operations in 
> the same program, rather than writing Hive scripts and stringing together 
> Spark and MapReduce programs, which has extremely large overhead to convert 
> RDDs to Hive tables and back again.
> Despite memory-level operations being where most of Spark's speed gains lie, 
> sometimes using disk-only may help with stability!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11005) Spark 1.5 Shuffle performance - (sort-based shuffle)

2015-10-08 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14949059#comment-14949059
 ] 

Sean Owen commented on SPARK-11005:
---

[~vnayak053] could I ask you to put this on the mailing list? It's not quite a 
JIRA as you're asking a question. There's a similar thread at this very moment 
about why jobs may not scale linearly: 
https://www.mail-archive.com/user@spark.apache.org/msg38382.html

> Spark 1.5 Shuffle performance - (sort-based shuffle)
> 
>
> Key: SPARK-11005
> URL: https://issues.apache.org/jira/browse/SPARK-11005
> Project: Spark
>  Issue Type: Question
>  Components: Shuffle, SQL
>Affects Versions: 1.5.0
> Environment: 6 node cluster with 1 master and 5 worker nodes.
> Memory > 100 GB each
> Cores = 72 each
> Input data ~94 GB
>Reporter: Sandeep Pal
>
> In the case of terasort via Spark SQL with 20 total cores (4 cores per 
> executor), the map tasks take 14 minutes (around 26s-30s each), whereas if I 
> increase the number of cores to 60 (12 cores per executor), the map stage 
> degrades to 30 minutes (~2.3 minutes per task). I believe the map tasks are 
> independent of each other in the shuffle. 
> Each map task has 128 MB of input (the HDFS block size) in both of the above cases. 
> So what causes the performance degradation as the number of cores increases?
>   



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10914) UnsafeRow serialization breaks when two machines have different Oops size

2015-10-08 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-10914:

Description: 
Updated description (by rxin on Oct 8, 2015)

To reproduce, launch Spark using
{code}
MASTER=local-cluster[2,1,1024] bin/spark-shell --conf 
"spark.executor.extraJavaOptions=-XX:-UseCompressedOops"
{code}





*Original bug report description*:

Using an inner join, to match together two integer columns, I generally get no 
results when there should be matches.  But the results vary and depend on 
whether the dataframes are coming from SQL, JSON, or cached, as well as the 
order in which I cache things and query them.

This minimal example reproduces it consistently for me in the spark-shell, on 
new installs of both 1.5.0 and 1.5.1 (pre-built against Hadoop 2.6 from 
http://spark.apache.org/downloads.html.)

{code}
/* x is {"xx":1}{"xx":2} and y is just {"yy":1}{"yy:2} */
val x = sql("select 1 xx union all select 2") 
val y = sql("select 1 yy union all select 2")

x.join(y, $"xx" === $"yy").count() /* expect 2, get 0 */
/* If I cache both tables it works: */
x.cache()
y.cache()
x.join(y, $"xx" === $"yy").count() /* expect 2, get 2 */

/* but this still doesn't work: */
x.join(y, $"xx" === $"yy").filter("yy=1").count() /* expect 1, get 0 */
{code}

  was:

Updated description (by rxin on Oct 8, 2015)

To reproduce, launch Spark using
{code}
MASTER=local-cluster[2,1,1024] bin/spark-shell --conf 
"spark.executor.extraJavaOptions=-XX:-UseCompressedOops"
{code}





Original bug report description:

Using an inner join, to match together two integer columns, I generally get no 
results when there should be matches.  But the results vary and depend on 
whether the dataframes are coming from SQL, JSON, or cached, as well as the 
order in which I cache things and query them.

This minimal example reproduces it consistently for me in the spark-shell, on 
new installs of both 1.5.0 and 1.5.1 (pre-built against Hadoop 2.6 from 
http://spark.apache.org/downloads.html.)

{code}
/* x is {"xx":1}{"xx":2} and y is just {"yy":1}{"yy:2} */
val x = sql("select 1 xx union all select 2") 
val y = sql("select 1 yy union all select 2")

x.join(y, $"xx" === $"yy").count() /* expect 2, get 0 */
/* If I cache both tables it works: */
x.cache()
y.cache()
x.join(y, $"xx" === $"yy").count() /* expect 2, get 2 */

/* but this still doesn't work: */
x.join(y, $"xx" === $"yy").filter("yy=1").count() /* expect 1, get 0 */
{code}


> UnsafeRow serialization breaks when two machines have different Oops size
> -
>
> Key: SPARK-10914
> URL: https://issues.apache.org/jira/browse/SPARK-10914
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0, 1.5.1
> Environment: Ubuntu 14.04 (spark-slave), 12.04 (master)
>Reporter: Ben Moran
>
> Updated description (by rxin on Oct 8, 2015)
> To reproduce, launch Spark using
> {code}
> MASTER=local-cluster[2,1,1024] bin/spark-shell --conf 
> "spark.executor.extraJavaOptions=-XX:-UseCompressedOops"
> {code}
> *Original bug report description*:
> Using an inner join, to match together two integer columns, I generally get 
> no results when there should be matches.  But the results vary and depend on 
> whether the dataframes are coming from SQL, JSON, or cached, as well as the 
> order in which I cache things and query them.
> This minimal example reproduces it consistently for me in the spark-shell, on 
> new installs of both 1.5.0 and 1.5.1 (pre-built against Hadoop 2.6 from 
> http://spark.apache.org/downloads.html.)
> {code}
> /* x is {"xx":1}{"xx":2} and y is just {"yy":1}{"yy:2} */
> val x = sql("select 1 xx union all select 2") 
> val y = sql("select 1 yy union all select 2")
> x.join(y, $"xx" === $"yy").count() /* expect 2, get 0 */
> /* If I cache both tables it works: */
> x.cache()
> y.cache()
> x.join(y, $"xx" === $"yy").count() /* expect 2, get 2 */
> /* but this still doesn't work: */
> x.join(y, $"xx" === $"yy").filter("yy=1").count() /* expect 1, get 0 */
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11007) Add dictionary support for CatalystDecimalConverter

2015-10-08 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-11007:
--

 Summary: Add dictionary support for CatalystDecimalConverter
 Key: SPARK-11007
 URL: https://issues.apache.org/jira/browse/SPARK-11007
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.1, 1.5.0, 1.4.1, 1.4.0
Reporter: Cheng Lian
Assignee: Cheng Lian


Currently {{CatalystDecimalConverter}} doesn't explicitly support dictionary 
encoding. The consequence is that the underlying Parquet {{ColumnReader}} 
always sends raw {{Int}}/{{Long}}/{{Binary}} values decoded from the dictionary 
to {{CatalystDecimalConverter}}, even if the column is dictionary-encoded. By 
adding explicit dictionary support (similar to what {{CatalystStringConverter}} 
does), we can avoid constructing decimals repeatedly. This should be especially 
effective for binary-backed decimals.
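
A rough sketch of the pattern being described, written against Parquet's public 
converter API rather than Spark's internal CatalystDecimalConverter (the class 
names and the DecimalSink are hypothetical; Spark's real converter pushes values 
into an internal row updater instead): the dictionary is decoded into decimals 
once in setDictionary, and addValueFromDictionary then reuses those instances 
instead of rebuilding a decimal from raw bytes for every row.
{code}
import java.math.{BigDecimal => JBigDecimal, BigInteger}

import org.apache.parquet.column.Dictionary
import org.apache.parquet.io.api.{Binary, PrimitiveConverter}

// Hypothetical sink for decoded decimals, standing in for the real row updater.
class DecimalSink {
  def add(d: JBigDecimal): Unit = println(d)
}

// Converter for binary-backed decimals that pre-decodes the dictionary once.
class DictionaryDecimalConverter(scale: Int, sink: DecimalSink) extends PrimitiveConverter {
  private var decoded: Array[JBigDecimal] = _

  override def hasDictionarySupport(): Boolean = true

  // Decode every dictionary entry into a decimal exactly once per page group.
  override def setDictionary(dictionary: Dictionary): Unit = {
    decoded = Array.tabulate(dictionary.getMaxId + 1) { id =>
      toDecimal(dictionary.decodeToBinary(id))
    }
  }

  // Fast path: the column reader hands us a dictionary id instead of raw bytes.
  override def addValueFromDictionary(dictionaryId: Int): Unit =
    sink.add(decoded(dictionaryId))

  // Fallback for pages that are not dictionary-encoded.
  override def addBinary(value: Binary): Unit = sink.add(toDecimal(value))

  // Parquet stores the unscaled value as a big-endian two's-complement byte array.
  private def toDecimal(binary: Binary): JBigDecimal =
    new JBigDecimal(new BigInteger(binary.getBytes), scale)
}
{code}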



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10914) UnsafeRow serialization breaks when two machines have different Oops size

2015-10-08 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-10914:

Summary: UnsafeRow serialization breaks when two machines have different 
Oops size  (was: Incorrect empty join sets when executor-memory >= 32g)

> UnsafeRow serialization breaks when two machines have different Oops size
> -
>
> Key: SPARK-10914
> URL: https://issues.apache.org/jira/browse/SPARK-10914
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0, 1.5.1
> Environment: Ubuntu 14.04 (spark-slave), 12.04 (master)
>Reporter: Ben Moran
>
> Using an inner join, to match together two integer columns, I generally get 
> no results when there should be matches.  But the results vary and depend on 
> whether the dataframes are coming from SQL, JSON, or cached, as well as the 
> order in which I cache things and query them.
> This minimal example reproduces it consistently for me in the spark-shell, on 
> new installs of both 1.5.0 and 1.5.1 (pre-built against Hadoop 2.6 from 
> http://spark.apache.org/downloads.html.)
> {code}
> /* x is {"xx":1}{"xx":2} and y is just {"yy":1}{"yy:2} */
> val x = sql("select 1 xx union all select 2") 
> val y = sql("select 1 yy union all select 2")
> x.join(y, $"xx" === $"yy").count() /* expect 2, get 0 */
> /* If I cache both tables it works: */
> x.cache()
> y.cache()
> x.join(y, $"xx" === $"yy").count() /* expect 2, get 2 */
> /* but this still doesn't work: */
> x.join(y, $"xx" === $"yy").filter("yy=1").count() /* expect 1, get 0 */
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11004) MapReduce Hive-like join operations for RDDs

2015-10-08 Thread Glenn Strycker (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14949076#comment-14949076
 ] 

Glenn Strycker commented on SPARK-11004:


Currently we could do the following from within a Linux script:

1) run Spark through creating the 2 RDDs, 2) save both to a Hive table, 3) end 
the Spark program, 4) run Hive Beeline on a "select * from table1 join table2 on 
columnname" and insert the result into a new, third table, 5) start a new Spark 
program that continues from where the first one stopped in step 3, loading the 
output of the Hive join into the result RDD.

Rather than incurring all of this overhead, it would be pretty cool if we could 
run the map and reduce jobs the way a Hive join would, only on our existing Spark 
RDDs, without needing to involve Hive or wrapper scripts, doing everything from 
within a single Spark program.

I believe the main stability gain comes from Hive performing everything on disk 
instead of in memory.  Since we can already checkpoint RDDs to disk, perhaps this 
request could be accomplished by adding a feature to RDDs that operates on the 
checkpointed files rather than the in-memory RDDs.

We're just looking for a stability gain and the ability to increase the potential 
size of RDDs in these operations.  Right now our Hive Beeline scripts are 
out-performing Spark for these super massive table joins.
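
For comparison, a rough sketch (table and column names are made up) of collapsing 
steps 1-5 into a single Spark program by writing both sides as Hive tables and 
issuing the join through the HiveContext available as sqlContext in a 
Hive-enabled spark-shell. Note the join itself still executes on Spark rather 
than MapReduce, so this does not give the disk-only MR execution the ticket asks 
for:
{code}
import sqlContext.implicits._

// Stand-ins for the two RDDs produced earlier in the program.
val df1 = sc.parallelize(Seq((1, "a"), (2, "b"))).toDF("columnname", "v1")
val df2 = sc.parallelize(Seq((1, "x"), (3, "y"))).toDF("columnname", "v2")

// Persist both sides as Hive tables without leaving the Spark program...
df1.write.saveAsTable("table1")
df2.write.saveAsTable("table2")

// ...run the join as SQL, and come straight back to an RDD, with no Beeline or
// wrapper scripts involved.
val resultRDD = sqlContext.sql(
  """SELECT t1.columnname, t1.v1, t2.v2
    |FROM table1 t1 JOIN table2 t2 ON t1.columnname = t2.columnname""".stripMargin).rdd
{code}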

> MapReduce Hive-like join operations for RDDs
> 
>
> Key: SPARK-11004
> URL: https://issues.apache.org/jira/browse/SPARK-11004
> Project: Spark
>  Issue Type: New Feature
>Reporter: Glenn Strycker
>
> Could a feature be added to Spark that would use disk-only MapReduce 
> operations for the very largest RDD joins?
> MapReduce is able to handle incredibly large table joins in a stable, 
> predictable way with graceful failures and recovery.  I have applications 
> that are able to join 2 tables without error in Hive, but these same tables, 
> when converted into RDDs, are unable to join in Spark (I am using the same 
> cluster, and have played around with all of the memory configurations, 
> persisting to disk, checkpointing, etc., and the RDDs are just too big for 
> Spark on my cluster)
> So, Spark is usually able to handle fairly large RDD joins, but occasionally 
> runs into problems when the tables are just too big (e.g. the notorious 2GB 
> shuffle limit issue, memory problems, etc.)  There are so many parameters to 
> adjust (number of partitions, number of cores, memory per core, etc.) that it 
> is difficult to guarantee stability on a shared cluster (say, running Yarn) 
> with other jobs.
> Could a feature be added to Spark that would use disk-only MapReduce commands 
> to do very large joins?
> That is, instead of myRDD1.join(myRDD2), we would have a special operation 
> myRDD1.mapReduceJoin(myRDD2) that would checkpoint both RDDs to disk, run 
> MapReduce, and then convert the results of the join back into a standard RDD.
> This might add stability for Spark jobs that deal with extremely large data, 
> and enable developers to mix-and-match some Spark and MapReduce operations in 
> the same program, rather than writing Hive scripts and stringing together 
> Spark and MapReduce programs, which has extremely large overhead to convert 
> RDDs to Hive tables and back again.
> Despite memory-level operations being where most of Spark's speed gains lie, 
> sometimes using disk-only may help with stability!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11006) Rename NullColumnAccess as NullColumnAccessor

2015-10-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14949010#comment-14949010
 ] 

Apache Spark commented on SPARK-11006:
--

User 'tedyu' has created a pull request for this issue:
https://github.com/apache/spark/pull/9028

> Rename NullColumnAccess as NullColumnAccessor
> -
>
> Key: SPARK-11006
> URL: https://issues.apache.org/jira/browse/SPARK-11006
> Project: Spark
>  Issue Type: Task
>Reporter: Ted Yu
>Priority: Trivial
>
> In sql/core/src/main/scala/org/apache/spark/sql/columnar/ColumnAccessor.scala, 
> NullColumnAccess should be renamed to NullColumnAccessor so that the same 
> convention is adhered to for the accessors.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11006) Rename NullColumnAccess as NullColumnAccessor

2015-10-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11006:


Assignee: (was: Apache Spark)

> Rename NullColumnAccess as NullColumnAccessor
> -
>
> Key: SPARK-11006
> URL: https://issues.apache.org/jira/browse/SPARK-11006
> Project: Spark
>  Issue Type: Task
>Reporter: Ted Yu
>Priority: Trivial
>
> In sql/core/src/main/scala/org/apache/spark/sql/columnar/ColumnAccessor.scala, 
> NullColumnAccess should be renamed to NullColumnAccessor so that the same 
> convention is adhered to for the accessors.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11006) Rename NullColumnAccess as NullColumnAccessor

2015-10-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11006:


Assignee: Apache Spark

> Rename NullColumnAccess as NullColumnAccessor
> -
>
> Key: SPARK-11006
> URL: https://issues.apache.org/jira/browse/SPARK-11006
> Project: Spark
>  Issue Type: Task
>Reporter: Ted Yu
>Assignee: Apache Spark
>Priority: Trivial
>
> In sql/core/src/main/scala/org/apache/spark/sql/columnar/ColumnAccessor.scala, 
> NullColumnAccess should be renamed to NullColumnAccessor so that the same 
> convention is adhered to for the accessors.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10914) UnsafeRow serialization breaks when two machines have different Oops size

2015-10-08 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-10914:

Description: 
*Updated description (by rxin on Oct 8, 2015)*

To reproduce, launch Spark using
{code}
MASTER=local-cluster[2,1,1024] bin/spark-shell --conf 
"spark.executor.extraJavaOptions=-XX:-UseCompressedOops"
{code}





*Original bug report description*:

Using an inner join, to match together two integer columns, I generally get no 
results when there should be matches.  But the results vary and depend on 
whether the dataframes are coming from SQL, JSON, or cached, as well as the 
order in which I cache things and query them.

This minimal example reproduces it consistently for me in the spark-shell, on 
new installs of both 1.5.0 and 1.5.1 (pre-built against Hadoop 2.6 from 
http://spark.apache.org/downloads.html.)

{code}
/* x is {"xx":1}{"xx":2} and y is just {"yy":1}{"yy:2} */
val x = sql("select 1 xx union all select 2") 
val y = sql("select 1 yy union all select 2")

x.join(y, $"xx" === $"yy").count() /* expect 2, get 0 */
/* If I cache both tables it works: */
x.cache()
y.cache()
x.join(y, $"xx" === $"yy").count() /* expect 2, get 2 */

/* but this still doesn't work: */
x.join(y, $"xx" === $"yy").filter("yy=1").count() /* expect 1, get 0 */
{code}

  was:
Updated description (by rxin on Oct 8, 2015)

To reproduce, launch Spark using
{code}
MASTER=local-cluster[2,1,1024] bin/spark-shell --conf 
"spark.executor.extraJavaOptions=-XX:-UseCompressedOops"
{code}





*Original bug report description*:

Using an inner join, to match together two integer columns, I generally get no 
results when there should be matches.  But the results vary and depend on 
whether the dataframes are coming from SQL, JSON, or cached, as well as the 
order in which I cache things and query them.

This minimal example reproduces it consistently for me in the spark-shell, on 
new installs of both 1.5.0 and 1.5.1 (pre-built against Hadoop 2.6 from 
http://spark.apache.org/downloads.html.)

{code}
/* x is {"xx":1}{"xx":2} and y is just {"yy":1}{"yy:2} */
val x = sql("select 1 xx union all select 2") 
val y = sql("select 1 yy union all select 2")

x.join(y, $"xx" === $"yy").count() /* expect 2, get 0 */
/* If I cache both tables it works: */
x.cache()
y.cache()
x.join(y, $"xx" === $"yy").count() /* expect 2, get 2 */

/* but this still doesn't work: */
x.join(y, $"xx" === $"yy").filter("yy=1").count() /* expect 1, get 0 */
{code}


> UnsafeRow serialization breaks when two machines have different Oops size
> -
>
> Key: SPARK-10914
> URL: https://issues.apache.org/jira/browse/SPARK-10914
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0, 1.5.1
> Environment: Ubuntu 14.04 (spark-slave), 12.04 (master)
>Reporter: Ben Moran
>
> *Updated description (by rxin on Oct 8, 2015)*
> To reproduce, launch Spark using
> {code}
> MASTER=local-cluster[2,1,1024] bin/spark-shell --conf 
> "spark.executor.extraJavaOptions=-XX:-UseCompressedOops"
> {code}
> *Original bug report description*:
> Using an inner join, to match together two integer columns, I generally get 
> no results when there should be matches.  But the results vary and depend on 
> whether the dataframes are coming from SQL, JSON, or cached, as well as the 
> order in which I cache things and query them.
> This minimal example reproduces it consistently for me in the spark-shell, on 
> new installs of both 1.5.0 and 1.5.1 (pre-built against Hadoop 2.6 from 
> http://spark.apache.org/downloads.html.)
> {code}
> /* x is {"xx":1}{"xx":2} and y is just {"yy":1}{"yy:2} */
> val x = sql("select 1 xx union all select 2") 
> val y = sql("select 1 yy union all select 2")
> x.join(y, $"xx" === $"yy").count() /* expect 2, get 0 */
> /* If I cache both tables it works: */
> x.cache()
> y.cache()
> x.join(y, $"xx" === $"yy").count() /* expect 2, get 2 */
> /* but this still doesn't work: */
> x.join(y, $"xx" === $"yy").filter("yy=1").count() /* expect 1, get 0 */
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10914) UnsafeRow serialization breaks when two machines have different Oops size

2015-10-08 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-10914:

Description: 

Updated description (by rxin on Oct 8, 2015)

To reproduce, launch Spark using
{code}
MASTER=local-cluster[2,1,1024] bin/spark-shell --conf 
"spark.executor.extraJavaOptions=-XX:-UseCompressedOops"
{code}





Original bug report description:

Using an inner join, to match together two integer columns, I generally get no 
results when there should be matches.  But the results vary and depend on 
whether the dataframes are coming from SQL, JSON, or cached, as well as the 
order in which I cache things and query them.

This minimal example reproduces it consistently for me in the spark-shell, on 
new installs of both 1.5.0 and 1.5.1 (pre-built against Hadoop 2.6 from 
http://spark.apache.org/downloads.html.)

{code}
/* x is {"xx":1}{"xx":2} and y is just {"yy":1}{"yy:2} */
val x = sql("select 1 xx union all select 2") 
val y = sql("select 1 yy union all select 2")

x.join(y, $"xx" === $"yy").count() /* expect 2, get 0 */
/* If I cache both tables it works: */
x.cache()
y.cache()
x.join(y, $"xx" === $"yy").count() /* expect 2, get 2 */

/* but this still doesn't work: */
x.join(y, $"xx" === $"yy").filter("yy=1").count() /* expect 1, get 0 */
{code}

  was:
Using an inner join, to match together two integer columns, I generally get no 
results when there should be matches.  But the results vary and depend on 
whether the dataframes are coming from SQL, JSON, or cached, as well as the 
order in which I cache things and query them.

This minimal example reproduces it consistently for me in the spark-shell, on 
new installs of both 1.5.0 and 1.5.1 (pre-built against Hadoop 2.6 from 
http://spark.apache.org/downloads.html.)

{code}
/* x is {"xx":1}{"xx":2} and y is just {"yy":1}{"yy:2} */
val x = sql("select 1 xx union all select 2") 
val y = sql("select 1 yy union all select 2")

x.join(y, $"xx" === $"yy").count() /* expect 2, get 0 */
/* If I cache both tables it works: */
x.cache()
y.cache()
x.join(y, $"xx" === $"yy").count() /* expect 2, get 2 */

/* but this still doesn't work: */
x.join(y, $"xx" === $"yy").filter("yy=1").count() /* expect 1, get 0 */
{code}


> UnsafeRow serialization breaks when two machines have different Oops size
> -
>
> Key: SPARK-10914
> URL: https://issues.apache.org/jira/browse/SPARK-10914
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0, 1.5.1
> Environment: Ubuntu 14.04 (spark-slave), 12.04 (master)
>Reporter: Ben Moran
>
> Updated description (by rxin on Oct 8, 2015)
> To reproduce, launch Spark using
> {code}
> MASTER=local-cluster[2,1,1024] bin/spark-shell --conf 
> "spark.executor.extraJavaOptions=-XX:-UseCompressedOops"
> {code}
> Original bug report description:
> Using an inner join, to match together two integer columns, I generally get 
> no results when there should be matches.  But the results vary and depend on 
> whether the dataframes are coming from SQL, JSON, or cached, as well as the 
> order in which I cache things and query them.
> This minimal example reproduces it consistently for me in the spark-shell, on 
> new installs of both 1.5.0 and 1.5.1 (pre-built against Hadoop 2.6 from 
> http://spark.apache.org/downloads.html.)
> {code}
> /* x is {"xx":1}{"xx":2} and y is just {"yy":1}{"yy:2} */
> val x = sql("select 1 xx union all select 2") 
> val y = sql("select 1 yy union all select 2")
> x.join(y, $"xx" === $"yy").count() /* expect 2, get 0 */
> /* If I cache both tables it works: */
> x.cache()
> y.cache()
> x.join(y, $"xx" === $"yy").count() /* expect 2, get 2 */
> /* but this still doesn't work: */
> x.join(y, $"xx" === $"yy").filter("yy=1").count() /* expect 1, get 0 */
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10914) UnsafeRow serialization breaks when two machines have different Oops size

2015-10-08 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14949079#comment-14949079
 ] 

Reynold Xin commented on SPARK-10914:
-

OK I figured it out. Updated the description.


> UnsafeRow serialization breaks when two machines have different Oops size
> -
>
> Key: SPARK-10914
> URL: https://issues.apache.org/jira/browse/SPARK-10914
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0, 1.5.1
> Environment: Ubuntu 14.04 (spark-slave), 12.04 (master)
>Reporter: Ben Moran
>
> *Updated description (by rxin on Oct 8, 2015)*
> To reproduce, launch Spark using
> {code}
> MASTER=local-cluster[2,1,1024] bin/spark-shell --conf 
> "spark.executor.extraJavaOptions=-XX:-UseCompressedOops"
> {code}
> And then run the following
> {code}
> scala> sql("select 1 xx").collect()
> {code}
> The problem is that UnsafeRow contains 3 pieces of information when pointing 
> to some data in memory (an object, a base offset, and length). When the row 
> is serialized with Java/Kryo serialization, the object layout in memory can 
> change if two machines have different pointer width (Oops in JVM).
> *Original bug report description*:
> Using an inner join, to match together two integer columns, I generally get 
> no results when there should be matches.  But the results vary and depend on 
> whether the dataframes are coming from SQL, JSON, or cached, as well as the 
> order in which I cache things and query them.
> This minimal example reproduces it consistently for me in the spark-shell, on 
> new installs of both 1.5.0 and 1.5.1 (pre-built against Hadoop 2.6 from 
> http://spark.apache.org/downloads.html.)
> {code}
> /* x is {"xx":1}{"xx":2} and y is just {"yy":1}{"yy:2} */
> val x = sql("select 1 xx union all select 2") 
> val y = sql("select 1 yy union all select 2")
> x.join(y, $"xx" === $"yy").count() /* expect 2, get 0 */
> /* If I cache both tables it works: */
> x.cache()
> y.cache()
> x.join(y, $"xx" === $"yy").count() /* expect 2, get 2 */
> /* but this still doesn't work: */
> x.join(y, $"xx" === $"yy").filter("yy=1").count() /* expect 1, get 0 */
> {code}
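
A hedged spark-shell sketch for checking whether the driver and the executors 
agree on the compressed-oops setting described above (it uses the HotSpot-specific 
diagnostic MXBean, so a HotSpot JVM is assumed):
{code}
import java.lang.management.ManagementFactory
import com.sun.management.HotSpotDiagnosticMXBean

// Value of -XX:[+|-]UseCompressedOops on the driver JVM.
val driverOops = ManagementFactory
  .getPlatformMXBean(classOf[HotSpotDiagnosticMXBean])
  .getVMOption("UseCompressedOops").getValue
println(s"driver UseCompressedOops = $driverOops")

// Same check inside an executor; a mismatch here is the situation this bug describes.
val executorOops = sc.parallelize(Seq(1), 1).map { _ =>
  ManagementFactory.getPlatformMXBean(classOf[HotSpotDiagnosticMXBean])
    .getVMOption("UseCompressedOops").getValue
}.collect().head
println(s"executor UseCompressedOops = $executorOops")
{code}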



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10914) UnsafeRow serialization breaks when two machines have different Oops size

2015-10-08 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-10914:

Description: 
*Updated description (by rxin on Oct 8, 2015)*

To reproduce, launch Spark using
{code}
MASTER=local-cluster[2,1,1024] bin/spark-shell --conf 
"spark.executor.extraJavaOptions=-XX:-UseCompressedOops"
{code}

And then run the following
{code}
scala> sql("select 1 xx").collect()
{code}

The problem is that UnsafeRow contains 3 pieces of information when pointing to 
some data in memory (an object, a base offset, and length). When the row is 
serialized with Java/Kryo serialization, the object layout in memory can change 
if two machines have different pointer width (Oops in JVM).



*Original bug report description*:

Using an inner join, to match together two integer columns, I generally get no 
results when there should be matches.  But the results vary and depend on 
whether the dataframes are coming from SQL, JSON, or cached, as well as the 
order in which I cache things and query them.

This minimal example reproduces it consistently for me in the spark-shell, on 
new installs of both 1.5.0 and 1.5.1 (pre-built against Hadoop 2.6 from 
http://spark.apache.org/downloads.html.)

{code}
/* x is {"xx":1}{"xx":2} and y is just {"yy":1}{"yy:2} */
val x = sql("select 1 xx union all select 2") 
val y = sql("select 1 yy union all select 2")

x.join(y, $"xx" === $"yy").count() /* expect 2, get 0 */
/* If I cache both tables it works: */
x.cache()
y.cache()
x.join(y, $"xx" === $"yy").count() /* expect 2, get 2 */

/* but this still doesn't work: */
x.join(y, $"xx" === $"yy").filter("yy=1").count() /* expect 1, get 0 */
{code}

  was:
*Updated description (by rxin on Oct 8, 2015)*

To reproduce, launch Spark using
{code}
MASTER=local-cluster[2,1,1024] bin/spark-shell --conf 
"spark.executor.extraJavaOptions=-XX:-UseCompressedOops"
{code}





*Original bug report description*:

Using an inner join, to match together two integer columns, I generally get no 
results when there should be matches.  But the results vary and depend on 
whether the dataframes are coming from SQL, JSON, or cached, as well as the 
order in which I cache things and query them.

This minimal example reproduces it consistently for me in the spark-shell, on 
new installs of both 1.5.0 and 1.5.1 (pre-built against Hadoop 2.6 from 
http://spark.apache.org/downloads.html.)

{code}
/* x is {"xx":1}{"xx":2} and y is just {"yy":1}{"yy:2} */
val x = sql("select 1 xx union all select 2") 
val y = sql("select 1 yy union all select 2")

x.join(y, $"xx" === $"yy").count() /* expect 2, get 0 */
/* If I cache both tables it works: */
x.cache()
y.cache()
x.join(y, $"xx" === $"yy").count() /* expect 2, get 2 */

/* but this still doesn't work: */
x.join(y, $"xx" === $"yy").filter("yy=1").count() /* expect 1, get 0 */
{code}


> UnsafeRow serialization breaks when two machines have different Oops size
> -
>
> Key: SPARK-10914
> URL: https://issues.apache.org/jira/browse/SPARK-10914
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0, 1.5.1
> Environment: Ubuntu 14.04 (spark-slave), 12.04 (master)
>Reporter: Ben Moran
>
> *Updated description (by rxin on Oct 8, 2015)*
> To reproduce, launch Spark using
> {code}
> MASTER=local-cluster[2,1,1024] bin/spark-shell --conf 
> "spark.executor.extraJavaOptions=-XX:-UseCompressedOops"
> {code}
> And then run the following
> {code}
> scala> sql("select 1 xx").collect()
> {code}
> The problem is that UnsafeRow contains 3 pieces of information when pointing 
> to some data in memory (an object, a base offset, and length). When the row 
> is serialized with Java/Kryo serialization, the object layout in memory can 
> change if two machines have different pointer width (Oops in JVM).
> *Original bug report description*:
> Using an inner join, to match together two integer columns, I generally get 
> no results when there should be matches.  But the results vary and depend on 
> whether the dataframes are coming from SQL, JSON, or cached, as well as the 
> order in which I cache things and query them.
> This minimal example reproduces it consistently for me in the spark-shell, on 
> new installs of both 1.5.0 and 1.5.1 (pre-built against Hadoop 2.6 from 
> http://spark.apache.org/downloads.html.)
> {code}
> /* x is {"xx":1}{"xx":2} and y is just {"yy":1}{"yy":2} */
> val x = sql("select 1 xx union all select 2") 
> val y = sql("select 1 yy union all select 2")
> x.join(y, $"xx" === $"yy").count() /* expect 2, get 0 */
> /* If I cache both tables it works: */
> x.cache()
> y.cache()
> x.join(y, $"xx" === $"yy").count() /* expect 2, get 2 */
> /* but this still doesn't work: */
> x.join(y, $"xx" === $"yy").filter("yy=1").count() /* expect 

[jira] [Updated] (SPARK-11008) Spark window function returns inconsistent/wrong results

2015-10-08 Thread Prasad Chalasani (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prasad Chalasani updated SPARK-11008:
-
Description: 
Summary: applying a windowing function on a data-frame, followed by count() 
gives widely varying results in repeated runs: none exceed the correct value, 
but of course all but one are wrong. On large data-sets I sometimes get as 
small as HALF of the correct value.

A minimal reproducible example is here: 

(1) start spark-shell
(2) run these:
val data = 1.to(100).map(x => (x,1))
import sqlContext.implicits._
val tbl = sc.parallelize(data).toDF("id", "time")
tbl.write.parquet("s3n://path/to/mybucket/id-time-tiny.pqt")

(3) exit the shell (this is important)
(4) start spark-shell again
(5) run these:
import org.apache.spark.sql.expressions.Window
val df = sqlContext.read.parquet("s3n://path/to/mybucket/id-time-tiny.pqt")
val win = Window.partitionBy("id").orderBy("time")

df.select($"id", 
(rank().over(win)).alias("rnk")).filter("rnk=1").select("id").count()

I get 98, but the correct result is 100. 
If I re-run the code in step 5 in the same shell, then the result gets "fixed" 
and I always get 100.

Note this is only a minimal example to reproduce the error. In my 
real application the size of the data is much larger and the window function is 
not trivial as above (i.e. there are multiple "time" values per "id", etc), and 
I see results sometimes as small as HALF of the correct value (e.g. 120,000 
while the correct value is 250,000). So this is a serious problem.



  was:
Summary: applying a windowing function on a data-frame, followed by count() 
gives widely varying results in repeated runs: none exceed the correct value, 
but of course all but one are wrong. On large data-sets I sometimes get as 
small as HALF of the correct value.

A minimal reproducible example is here: 

(1) start spark-shell
(2) run these:
val data = 1.to(100).map(x => (x,1))
import sqlContext.implicits._
val tbl = sc.parallelize(data).toDF("id", "time")
tbl.write.parquet("s3n://path/to/mybucket/id-time-tiny.pqt")

(3) exit the shell (this is important)
(4) start spark-shell again
(5) run these:
import org.apache.spark.sql.expressions.Window
val df = sqlContext.read.parquet("s3n://path/to/mybucket/id-time-tiny.pqt")
val win = Window.partitionBy("id").orderBy("time")

df.select($"id", 
(rank().over(win)).alias("rnk")).filter("rnk=1").select("id").count()

I get 98, but the correct result is 100. 
If I re-run the above, then the result gets "fixed" and I always get 100.

Note this is only a minimal example to reproduce the error. In my 
real application the size of the data is much larger and the window function is 
not trivial as above (i.e. there are multiple "time" values per "id", etc), and 
I see results sometimes as small as HALF of the correct value (e.g. 120,000 
while the correct value is 250,000). So this is a serious problem.




> Spark window function returns inconsistent/wrong results
> 
>
> Key: SPARK-11008
> URL: https://issues.apache.org/jira/browse/SPARK-11008
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 1.4.0, 1.5.0
> Environment: Amazon Linux AMI (Amazon Linux version 2015.09)
>Reporter: Prasad Chalasani
>Priority: Blocker
>
> Summary: applying a windowing function on a data-frame, followed by count() 
> gives widely varying results in repeated runs: none exceed the correct value, 
> but of course all but one are wrong. On large data-sets I sometimes get as 
> small as HALF of the correct value.
> A minimal reproducible example is here: 
> (1) start spark-shell
> (2) run these:
> val data = 1.to(100).map(x => (x,1))
> import sqlContext.implicits._
> val tbl = sc.parallelize(data).toDF("id", "time")
> tbl.write.parquet("s3n://path/to/mybucket/id-time-tiny.pqt")
> (3) exit the shell (this is important)
> (4) start spark-shell again
> (5) run these:
> import org.apache.spark.sql.expressions.Window
> val df = sqlContext.read.parquet("s3n://path/to/mybucket/id-time-tiny.pqt")
> val win = Window.partitionBy("id").orderBy("time")
> df.select($"id", 
> (rank().over(win)).alias("rnk")).filter("rnk=1").select("id").count()
> I get 98, but the correct result is 100. 
> If I re-run the code in step 5 in the same shell, then the result gets 
> "fixed" and I always get 100.
> Note this is only a minimal example to reproduce the error. In 
> my real application the size of the data is much larger and the window 
> function is not trivial as above (i.e. there are multiple "time" values per 
> "id", etc), and I see results sometimes as small as HALF of the correct value 
> (e.g. 120,000 while the correct value is 

[jira] [Updated] (SPARK-11009) RowNumber in HiveContext returns negative values in cluster mode

2015-10-08 Thread Saif Addin Ellafi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saif Addin Ellafi updated SPARK-11009:
--
Environment: Standalone cluster mode. No hadoop/hive is present in the 
environment (no hive-site.xml), only using HiveContext. Spark build as with 
hadoop 2.6.0. Default spark configuration variables. cluster has 4 nodes, but 
happens with n nodes as well.  (was: Standalone cluster mode
No hadoop/hive is present in the environment (no hive-site.xml), only using 
HiveContext. Spark build as with hadoop 2.6.0.
Default spark configuration variables.
cluster has 4 nodes, but happens with n nodes as well.)

> RowNumber in HiveContext returns negative values in cluster mode
> 
>
> Key: SPARK-11009
> URL: https://issues.apache.org/jira/browse/SPARK-11009
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.1
> Environment: Standalone cluster mode. No hadoop/hive is present in 
> the environment (no hive-site.xml), only using HiveContext. Spark build as 
> with hadoop 2.6.0. Default spark configuration variables. cluster has 4 
> nodes, but happens with n nodes as well.
>Reporter: Saif Addin Ellafi
>
> This issue happens when submitting the job into a standalone cluster. Have 
> not tried YARN or MESOS. Repartition df into 1 piece or default parallelism=1 
> does not fix the issue. Also tried having only one node in the cluster, with 
> same result. Other shuffle configuration changes do not alter the results 
> either.
> The issue does NOT happen in --master local[*].
> val ws = Window.
> partitionBy("client_id").
> orderBy("date")
>  
> val nm = "repeatMe"
> df.select(df.col("*"), rowNumber().over(ws).as(nm))
>  
> 
> df.filter(df("repeatMe").isNotNull).orderBy("repeatMe").take(50).foreach(println(_))
>  
> --->
>  
> Long, DateType, Int
> [219483904822,2006-06-01,-1863462909]
> [219483904822,2006-09-01,-1863462909]
> [219483904822,2007-01-01,-1863462909]
> [219483904822,2007-08-01,-1863462909]
> [219483904822,2007-07-01,-1863462909]
> [192489238423,2007-07-01,-1863462774]
> [192489238423,2007-02-01,-1863462774]
> [192489238423,2006-11-01,-1863462774]
> [192489238423,2006-08-01,-1863462774]
> [192489238423,2007-08-01,-1863462774]
> [192489238423,2006-09-01,-1863462774]
> [192489238423,2007-03-01,-1863462774]
> [192489238423,2006-10-01,-1863462774]
> [192489238423,2007-05-01,-1863462774]
> [192489238423,2006-06-01,-1863462774]
> [192489238423,2006-12-01,-1863462774]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10971) sparkR: RRunner should allow setting path to Rscript

2015-10-08 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14949167#comment-14949167
 ] 

Felix Cheung commented on SPARK-10971:
--

I think he is suggesting that the path to R/Rscript be configurable, not that we 
distribute Rscript itself.

> sparkR: RRunner should allow setting path to Rscript
> 
>
> Key: SPARK-10971
> URL: https://issues.apache.org/jira/browse/SPARK-10971
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.5.1
>Reporter: Thomas Graves
>
> I'm running spark on yarn and trying to use R in cluster mode. RRunner seems 
> to just call Rscript and assumes its in the path. But on our YARN deployment 
> R isn't installed on the nodes so it needs to be distributed along with the 
> job and we need the ability to point to where it gets installed. sparkR in 
> client mode has the config spark.sparkr.r.command to point to Rscript. 
> RRunner should have something similar so it works in cluster mode



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10995) Graceful shutdown drops processing in Spark Streaming

2015-10-08 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14949169#comment-14949169
 ] 

Sean Owen commented on SPARK-10995:
---

Ah right, what I mean is that your _slide duration_ is equal to your window 
size. Isn't this the same as using a 15-minute batch duration with no windows? 
This is probably orthogonal anyway.

I understand the point now -- it only keeps running for one more batch 
interval, not one more slide duration. Maybe a 15-minute batch interval is a 
workaround here, but it would not be in a slightly different situation.

I suppose I'm asking what {{waitForShutdownSignal()}} waits for, but that may 
also be inconsequential. It's not killing threads or initiating another shutdown 
in parallel, or doing anything else that might interfere. If you see shutdown 
starting normally, then that seems OK.


> Graceful shutdown drops processing in Spark Streaming
> -
>
> Key: SPARK-10995
> URL: https://issues.apache.org/jira/browse/SPARK-10995
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.4.1
>Reporter: Michal Cizmazia
>
> After triggering the graceful shutdown on the following application, the 
> application stops before the windowed stream reaches its slide duration. As a 
> result, the data is not completely processed (i.e. saveToMyStorage is not 
> called) before shutdown.
> According to the documentation, graceful shutdown should ensure that the 
> data, which has been received, is completely processed before shutdown.
> https://spark.apache.org/docs/1.4.0/streaming-programming-guide.html#upgrading-application-code
> Spark version: 1.4.1
> Code snippet:
> {code:java}
> Function0 factory = () -> {
> JavaStreamingContext context = new JavaStreamingContext(sparkConf, 
> Durations.minutes(1));
> context.checkpoint("/test");
> JavaDStream records = 
> context.receiverStream(myReliableReceiver).flatMap(...);
> records.persist(StorageLevel.MEMORY_AND_DISK());
> records.foreachRDD(rdd -> { rdd.count(); return null; });
> records
> .window(Durations.minutes(15), Durations.minutes(15))
> .foreachRDD(rdd -> saveToMyStorage(rdd));
> return context;
> };
> try (JavaStreamingContext context = JavaStreamingContext.getOrCreate("/test", 
> factory)) {
> context.start();
> waitForShutdownSignal();
> Boolean stopSparkContext = true;
> Boolean stopGracefully = true;
> context.stop(stopSparkContext, stopGracefully);
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11004) MapReduce Hive-like join operations for RDDs

2015-10-08 Thread Glenn Strycker (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14949171#comment-14949171
 ] 

Glenn Strycker commented on SPARK-11004:


So maybe we can simplify this idea down to forcing "disk-only shuffles" only in 
particular places, since it is the memory shuffles that seem to be causing my 
join problems. Spark could add a "force disk-only" parameter to the existing 
RDD join function, so the call would look like this:

{{rdd1.join(rdd2, diskonly = true)}}

> MapReduce Hive-like join operations for RDDs
> 
>
> Key: SPARK-11004
> URL: https://issues.apache.org/jira/browse/SPARK-11004
> Project: Spark
>  Issue Type: New Feature
>Reporter: Glenn Strycker
>
> Could a feature be added to Spark that would use disk-only MapReduce 
> operations for the very largest RDD joins?
> MapReduce is able to handle incredibly large table joins in a stable, 
> predictable way with gracious failures and recovery.  I have applications 
> that are able to join 2 tables without error in Hive, but these same tables, 
> when converted into RDDs, are unable to join in Spark (I am using the same 
> cluster, and have played around with all of the memory configurations, 
> persisting to disk, checkpointing, etc., and the RDDs are just too big for 
> Spark on my cluster)
> So, Spark is usually able to handle fairly large RDD joins, but occasionally 
> runs into problems when the tables are just too big (e.g. the notorious 2GB 
> shuffle limit issue, memory problems, etc.)  There are so many parameters to 
> adjust (number of partitions, number of cores, memory per core, etc.) that it 
> is difficult to guarantee stability on a shared cluster (say, running Yarn) 
> with other jobs.
> Could a feature be added to Spark that would use disk-only MapReduce commands 
> to do very large joins?
> That is, instead of myRDD1.join(myRDD2), we would have a special operation 
> myRDD1.mapReduceJoin(myRDD2) that would checkpoint both RDDs to disk, run 
> MapReduce, and then convert the results of the join back into a standard RDD.
> This might add stability for Spark jobs that deal with extremely large data, 
> and enable developers to mix-and-match some Spark and MapReduce operations in 
> the same program, rather than writing Hive scripts and stringing together 
> Spark and MapReduce programs, which has extremely large overhead to convert 
> RDDs to Hive tables and back again.
> Despite memory-level operations being where most of Spark's speed gains lie, 
> sometimes using disk-only may help with stability!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11004) MapReduce Hive-like join operations for RDDs

2015-10-08 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14949176#comment-14949176
 ] 

Sean Owen commented on SPARK-11004:
---

I suppose I'd be surprised if using disk over memory helped, but you can easily 
test this already with {{spark.shuffle.memoryFraction=0}} (or maybe it has to 
be a very small value). Here are the configuration knobs: 
http://spark.apache.org/docs/latest/configuration.html#shuffle-behavior
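
For example, a minimal sketch of trying that (assuming the Spark 1.x shuffle 
settings linked above; the values are illustrative, not a recommendation):

{code}
// Shrink the in-memory shuffle fraction so shuffle data spills to disk early,
// approximating a "disk-only" shuffle for testing purposes.
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("disk-heavy-join-test")
  .set("spark.shuffle.memoryFraction", "0.05") // near zero; 0 may also work
val sc = new SparkContext(conf)
{code}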

> MapReduce Hive-like join operations for RDDs
> 
>
> Key: SPARK-11004
> URL: https://issues.apache.org/jira/browse/SPARK-11004
> Project: Spark
>  Issue Type: New Feature
>Reporter: Glenn Strycker
>
> Could a feature be added to Spark that would use disk-only MapReduce 
> operations for the very largest RDD joins?
> MapReduce is able to handle incredibly large table joins in a stable, 
> predictable way with gracious failures and recovery.  I have applications 
> that are able to join 2 tables without error in Hive, but these same tables, 
> when converted into RDDs, are unable to join in Spark (I am using the same 
> cluster, and have played around with all of the memory configurations, 
> persisting to disk, checkpointing, etc., and the RDDs are just too big for 
> Spark on my cluster)
> So, Spark is usually able to handle fairly large RDD joins, but occasionally 
> runs into problems when the tables are just too big (e.g. the notorious 2GB 
> shuffle limit issue, memory problems, etc.)  There are so many parameters to 
> adjust (number of partitions, number of cores, memory per core, etc.) that it 
> is difficult to guarantee stability on a shared cluster (say, running Yarn) 
> with other jobs.
> Could a feature be added to Spark that would use disk-only MapReduce commands 
> to do very large joins?
> That is, instead of myRDD1.join(myRDD2), we would have a special operation 
> myRDD1.mapReduceJoin(myRDD2) that would checkpoint both RDDs to disk, run 
> MapReduce, and then convert the results of the join back into a standard RDD.
> This might add stability for Spark jobs that deal with extremely large data, 
> and enable developers to mix-and-match some Spark and MapReduce operations in 
> the same program, rather than writing Hive scripts and stringing together 
> Spark and MapReduce programs, which has extremely large overhead to convert 
> RDDs to Hive tables and back again.
> Despite memory-level operations being where most of Spark's speed gains lie, 
> sometimes using disk-only may help with stability!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10903) Make sqlContext global

2015-10-08 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14949185#comment-14949185
 ] 

Felix Cheung commented on SPARK-10903:
--

[~sunrui] Agreed. I'd like to propose adding .Deprecated to the existing 
functions too, and plan to move away from those in the next 1.x release.

Thoughts? [~shivaram][~davies][~falaki]

> Make sqlContext global 
> ---
>
> Key: SPARK-10903
> URL: https://issues.apache.org/jira/browse/SPARK-10903
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Narine Kokhlikyan
>Priority: Minor
>
> Make sqlContext global so that we don't have to always specify it.
> e.g. createDataFrame(iris) instead of createDataFrame(sqlContext, iris)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11004) MapReduce Hive-like join operations for RDDs

2015-10-08 Thread Glenn Strycker (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14949187#comment-14949187
 ] 

Glenn Strycker commented on SPARK-11004:


Awesome -- thanks, I'll try that out.

Is there a way to change this setting dynamically from within a Spark job, so 
that the fraction can be higher for most of the job and then drop down to 0 
only for the difficult parts?

> MapReduce Hive-like join operations for RDDs
> 
>
> Key: SPARK-11004
> URL: https://issues.apache.org/jira/browse/SPARK-11004
> Project: Spark
>  Issue Type: New Feature
>Reporter: Glenn Strycker
>
> Could a feature be added to Spark that would use disk-only MapReduce 
> operations for the very largest RDD joins?
> MapReduce is able to handle incredibly large table joins in a stable, 
> predictable way with gracious failures and recovery.  I have applications 
> that are able to join 2 tables without error in Hive, but these same tables, 
> when converted into RDDs, are unable to join in Spark (I am using the same 
> cluster, and have played around with all of the memory configurations, 
> persisting to disk, checkpointing, etc., and the RDDs are just too big for 
> Spark on my cluster)
> So, Spark is usually able to handle fairly large RDD joins, but occasionally 
> runs into problems when the tables are just too big (e.g. the notorious 2GB 
> shuffle limit issue, memory problems, etc.)  There are so many parameters to 
> adjust (number of partitions, number of cores, memory per core, etc.) that it 
> is difficult to guarantee stability on a shared cluster (say, running Yarn) 
> with other jobs.
> Could a feature be added to Spark that would use disk-only MapReduce commands 
> to do very large joins?
> That is, instead of myRDD1.join(myRDD2), we would have a special operation 
> myRDD1.mapReduceJoin(myRDD2) that would checkpoint both RDDs to disk, run 
> MapReduce, and then convert the results of the join back into a standard RDD.
> This might add stability for Spark jobs that deal with extremely large data, 
> and enable developers to mix-and-match some Spark and MapReduce operations in 
> the same program, rather than writing Hive scripts and stringing together 
> Spark and MapReduce programs, which has extremely large overhead to convert 
> RDDs to Hive tables and back again.
> Despite memory-level operations being where most of Spark's speed gains lie, 
> sometimes using disk-only may help with stability!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10971) sparkR: RRunner should allow setting path to Rscript

2015-10-08 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14949199#comment-14949199
 ] 

Thomas Graves commented on SPARK-10971:
---

You shouldn't have to install everything a user needs on the YARN nodes. That 
can cause many different types of issues, the main ones being version conflicts 
and a maintenance headache. The only downside is that, if you aren't using the 
distributed cache properly, there is overhead in downloading the dependencies. 
Perhaps there are distributions that don't recommend this, or cases where you 
want R installed locally for performance reasons, but a general-use YARN cluster 
needs to allow users to ship their dependencies with their applications.

So yes, I am just suggesting that the path to Rscript be configurable. You should 
be able to set a config like spark.sparkr.r.command to point to where Rscript is 
located.
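
For illustration only, a hypothetical sketch of what that could look like once 
RRunner honors such a config (the property name is the client-mode one mentioned 
in this issue; the archive name and layout are made up):

{code}
# Ship an R installation with the job and point the Rscript path config at the
# unpacked archive (hypothetical for cluster mode until this issue is fixed).
bin/spark-submit --master yarn-cluster \
  --archives R-installation.tgz#R \
  --conf spark.sparkr.r.command=./R/bin/Rscript \
  my_script.R
{code}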




> sparkR: RRunner should allow setting path to Rscript
> 
>
> Key: SPARK-10971
> URL: https://issues.apache.org/jira/browse/SPARK-10971
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.5.1
>Reporter: Thomas Graves
>
> I'm running spark on yarn and trying to use R in cluster mode. RRunner seems 
> to just call Rscript and assumes its in the path. But on our YARN deployment 
> R isn't installed on the nodes so it needs to be distributed along with the 
> job and we need the ability to point to where it gets installed. sparkR in 
> client mode has the config spark.sparkr.r.command to point to Rscript. 
> RRunner should have something similar so it works in cluster mode



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10382) Make example code in user guide testable

2015-10-08 Thread Xusen Yin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14949326#comment-14949326
 ] 

Xusen Yin commented on SPARK-10382:
---

Or I can customize it so that it is easy to use in the Spark project.

> Make example code in user guide testable
> 
>
> Key: SPARK-10382
> URL: https://issues.apache.org/jira/browse/SPARK-10382
> Project: Spark
>  Issue Type: Brainstorming
>  Components: Documentation, ML, MLlib
>Reporter: Xiangrui Meng
>Assignee: Xusen Yin
>Priority: Critical
>
> The example code in the user guide is embedded in the markdown and hence it 
> is not easy to test. It would be nice to automatically test them. This JIRA 
> is to discuss options to automate example code testing and see what we can do 
> in Spark 1.6.
> One option I propose is to move actual example code to spark/examples and 
> test compilation in Jenkins builds. Then in the markdown, we can reference 
> part of the code to show in the user guide. This requires adding a Jekyll tag 
> that is similar to 
> https://github.com/jekyll/jekyll/blob/master/lib/jekyll/tags/include.rb, 
> e.g., called include_example.
> {code}
> {% include_example scala ml.KMeansExample guide %}
> {code}
> Jekyll will find 
> `examples/src/main/scala/org/apache/spark/examples/ml/KMeansExample.scala` 
> and pick code blocks marked "guide" and put them under `{% highlight %}` in 
> the markdown. We can discuss the syntax for marker comments.
> Just one way to implement this. It would be nice to hear more ideas.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8546) PMML export for Naive Bayes

2015-10-08 Thread Xusen Yin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14949361#comment-14949361
 ] 

Xusen Yin commented on SPARK-8546:
--

Hi [~mengxr], I'd like to work on it.

> PMML export for Naive Bayes
> ---
>
> Key: SPARK-8546
> URL: https://issues.apache.org/jira/browse/SPARK-8546
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> The naive Bayes section of PMML standard can be found at 
> http://www.dmg.org/v4-1/NaiveBayes.html. We should first figure out how to 
> generate PMML for both binomial and multinomial naive Bayes models using 
> JPMML (maybe [~vfed] can help).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10858) YARN: archives/jar/files rename with # doesn't work unless scheme given

2015-10-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14949372#comment-14949372
 ] 

Apache Spark commented on SPARK-10858:
--

User 'tgravescs' has created a pull request for this issue:
https://github.com/apache/spark/pull/9035

> YARN: archives/jar/files rename with # doesn't work unless scheme given
> ---
>
> Key: SPARK-10858
> URL: https://issues.apache.org/jira/browse/SPARK-10858
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.5.1
>Reporter: Thomas Graves
>Assignee: Thomas Graves
>Priority: Minor
>
> The YARN distributed cache feature with --jars, --archives, --files where you 
> can rename the file/archive using a # symbol only works if you explicitly 
> include the scheme in the path:
> works:
> --jars file:///home/foo/my.jar#renamed.jar
> doesn't work:
> --jars /home/foo/my.jar#renamed.jar
> Exception in thread "main" java.io.FileNotFoundException: File 
> file:/home/foo/my.jar#renamed.jar does not exist
> at 
> org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:534)
> at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:747)
> at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:524)
> at 
> org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:416)
> at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:337)
> at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:289)
> at 
> org.apache.spark.deploy.yarn.Client.copyFileToRemote(Client.scala:240)
> at 
> org.apache.spark.deploy.yarn.Client.org$apache$spark$deploy$yarn$Client$$distribute$1(Client.scala:329)
> at 
> org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$6$$anonfun$apply$2.apply(Client.scala:393)
> at 
> org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$6$$anonfun$apply$2.apply(Client.scala:392)
> at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
> at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11015) Add computeCost and clusterCenters to KMeansModel in spark.ml package

2015-10-08 Thread Richard Garris (JIRA)
Richard Garris created SPARK-11015:
--

 Summary: Add computeCost and clusterCenters to KMeansModel in 
spark.ml package
 Key: SPARK-11015
 URL: https://issues.apache.org/jira/browse/SPARK-11015
 Project: Spark
  Issue Type: Improvement
  Components: ML
Reporter: Richard Garris


The Transformer version of KMeansModel does not currently have methods to compute 
the cost (computeCost) or to get the cluster centroids (clusterCenters). If there 
were a way to get these, either by exposing the parentModel or by adding these 
methods directly, it would make things easier.
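
A rough sketch of what the requested additions might look like (hypothetical, 
assuming the spark.ml model can delegate to the wrapped spark.mllib KMeansModel):

{code}
// Hypothetical shape of the requested API, delegating to spark.mllib.
import org.apache.spark.mllib.clustering.{KMeansModel => MLlibKMeansModel}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

class KMeansModelSketch(parentModel: MLlibKMeansModel) {
  // centers of the fitted clusters
  def clusterCenters: Array[Vector] = parentModel.clusterCenters

  // sum of squared distances of points to their nearest center,
  // delegated to the existing spark.mllib implementation
  def computeCost(data: RDD[Vector]): Double = parentModel.computeCost(data)
}
{code}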



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5949) Driver program has to register roaring bitmap classes used by spark with Kryo when number of partitions is greater than 2000

2015-10-08 Thread Charles Allen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14949525#comment-14949525
 ] 

Charles Allen commented on SPARK-5949:
--

[~lemire] pinging to see if you have any suggestions on how to handle 
situations like this.

> Driver program has to register roaring bitmap classes used by spark with Kryo 
> when number of partitions is greater than 2000
> 
>
> Key: SPARK-5949
> URL: https://issues.apache.org/jira/browse/SPARK-5949
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.0
>Reporter: Peter Torok
>Assignee: Imran Rashid
>  Labels: kryo, partitioning, serialization
> Fix For: 1.4.0
>
>
> When more than 2000 partitions are being used with Kryo, the following 
> classes need to be registered by driver program:
> - org.apache.spark.scheduler.HighlyCompressedMapStatus
> - org.roaringbitmap.RoaringBitmap
> - org.roaringbitmap.RoaringArray
> - org.roaringbitmap.ArrayContainer
> - org.roaringbitmap.RoaringArray$Element
> - org.roaringbitmap.RoaringArray$Element[]
> - short[]
> Our project doesn't have dependency on roaring bitmap and 
> HighlyCompressedMapStatus is intended for internal spark usage. Spark should 
> take care of this registration when Kryo is used.
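
For reference, a sketch of the manual registration the issue describes (the class 
list is copied from the report; whether each class exists depends on the Spark and 
RoaringBitmap versions in use):

{code}
// Driver-side workaround: register the internal classes with Kryo explicitly.
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(
    Class.forName("org.apache.spark.scheduler.HighlyCompressedMapStatus"),
    Class.forName("org.roaringbitmap.RoaringBitmap"),
    Class.forName("org.roaringbitmap.RoaringArray"),
    Class.forName("org.roaringbitmap.ArrayContainer"),
    Class.forName("org.roaringbitmap.RoaringArray$Element"),
    Class.forName("[Lorg.roaringbitmap.RoaringArray$Element;"),
    classOf[Array[Short]]
  ))
{code}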



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10382) Make example code in user guide testable

2015-10-08 Thread Xusen Yin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14949324#comment-14949324
 ] 

Xusen Yin commented on SPARK-10382:
---

Hi Xiangrui,

As far as I know, several Jekyll plugins have already implemented this function. 
Here are the links:

- https://github.com/imathis/octopress/blob/master/plugins/include_code.rb
This one inserts code from a local file.

- https://github.com/bwillis/jekyll-github-sample
This one inserts code from GitHub.

- https://github.com/octopress/render-code
This one inserts code from a local file.

The first one is exactly what we want, so instead of writing a new plugin I think 
we can use it. What do you think?

> Make example code in user guide testable
> 
>
> Key: SPARK-10382
> URL: https://issues.apache.org/jira/browse/SPARK-10382
> Project: Spark
>  Issue Type: Brainstorming
>  Components: Documentation, ML, MLlib
>Reporter: Xiangrui Meng
>Assignee: Xusen Yin
>Priority: Critical
>
> The example code in the user guide is embedded in the markdown and hence it 
> is not easy to test. It would be nice to automatically test them. This JIRA 
> is to discuss options to automate example code testing and see what we can do 
> in Spark 1.6.
> One option I propose is to move actual example code to spark/examples and 
> test compilation in Jenkins builds. Then in the markdown, we can reference 
> part of the code to show in the user guide. This requires adding a Jekyll tag 
> that is similar to 
> https://github.com/jekyll/jekyll/blob/master/lib/jekyll/tags/include.rb, 
> e.g., called include_example.
> {code}
> {% include_example scala ml.KMeansExample guide %}
> {code}
> Jekyll will find 
> `examples/src/main/scala/org/apache/spark/examples/ml/KMeansExample.scala` 
> and pick code blocks marked "guide" and put them under `{% highlight %}` in 
> the markdown. We can discuss the syntax for marker comments.
> Just one way to implement this. It would be nice to hear more ideas.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11014) RPC Time Out Exceptions

2015-10-08 Thread Gurpreet Singh (JIRA)
Gurpreet Singh created SPARK-11014:
--

 Summary: RPC Time Out Exceptions
 Key: SPARK-11014
 URL: https://issues.apache.org/jira/browse/SPARK-11014
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.5.1
 Environment: YARN
Reporter: Gurpreet Singh


I am seeing lots of the following RPC exception messages in YARN logs:



15/10/08 13:04:27 WARN executor.Executor: Issue communicating with driver in 
heartbeater
org.apache.spark.SparkException: Error sending message [message = 
Heartbeat(437,[Lscala.Tuple2;@34199eb1,BlockManagerId(437, 
phxaishdc9dn1294.stratus.phx.ebay.com, 47480))]
at 
org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:118)
at org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:77)
at 
org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:452)
at 
org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply$mcV$sp(Executor.scala:472)
at 
org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:472)
at 
org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:472)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1699)
at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:472)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304)
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.spark.rpc.RpcTimeoutException: Futures timed out after 
[120 seconds]. This timeout is controlled by spark.rpc.askTimeout
at 
org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcEnv.scala:214)
at 
org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcEnv.scala:229)
at 
org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcEnv.scala:225)
at 
scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcEnv.scala:242)
at 
org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:101)
... 14 more
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [120 
seconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
at 
scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
at scala.concurrent.Await$.result(package.scala:107)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcEnv.scala:241)
... 15 more

##



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6723) Model import/export for ChiSqSelector

2015-10-08 Thread Jayant Shekhar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14949548#comment-14949548
 ] 

Jayant Shekhar commented on SPARK-6723:
---

Thanks [~fliang]

[~mengxr] Can you trigger tests on the PR? Thanks.

> Model import/export for ChiSqSelector
> -
>
> Key: SPARK-6723
> URL: https://issues.apache.org/jira/browse/SPARK-6723
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10936) UDAF "mode" for categorical variables

2015-10-08 Thread Frederick Reiss (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14949306#comment-14949306
 ] 

Frederick Reiss commented on SPARK-10936:
-

Mode is not an algebraic aggregate.  To find the mode in a single pass over the 
original data, one needs to track the full set of distinct values in the 
underlying set, as well as the number of times each value occurs in the records 
seen so far. For high-cardinality columns, this requirement will result in 
unbounded state.

I can see three ways forward here: 

a) Stuff a hash table into the PartialAggregate2 API's result buffer, and hope 
that this buffer does not exhaust the heap, produce O(n^2) behavior when the 
column cardinality is high, or stop working on future (or present?) versions of 
codegen.

b) Implement an approximate mode with fixed-size intermediate state (for 
example, a compressed reservoir sample), similar to how the current 
HyperLogLog++ aggregate works. Approximate computation of the mode will work 
well most of the time but will have unbounded error in corner cases.

c) Add mode as another member of the "distinct" family of Spark aggregates, 
such as SUM/COUNT/AVERAGE DISTINCT. Use the same pre-Tungsten style of 
processing to handle mode for now.  Create a follow-on JIRA to move mode over 
to the fast path at the same time that the other DISTINCT aggregates switch 
over.

I think that (c) is the best option overall, but I'm happy to defer to others 
with deeper understanding. My thinking is that, while it would be good to have 
a mode aggregate available, mode is a relatively uncommon use case. Slow-path 
processing for mode is ok as a short-term expedient. Once SUM DISTINCT and 
related aggregates are fully moved onto the new framework, transitioning mode 
to the fast path should be easy.
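
To make the unbounded-state concern in (a) concrete, here is a minimal sketch using 
the public UDAF API rather than the internal PartialAggregate2 buffer (illustrative 
only; it assumes a StringType column). The aggregation buffer is a value-to-count 
map, so it keeps one entry per distinct value seen:

{code}
// Sketch: exact mode over a String column; the Map buffer is the unbounded state.
import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

class ModeUDAF extends UserDefinedAggregateFunction {
  def inputSchema: StructType = StructType(StructField("value", StringType) :: Nil)
  def bufferSchema: StructType =
    StructType(StructField("counts", MapType(StringType, LongType)) :: Nil)
  def dataType: DataType = StringType
  def deterministic: Boolean = true

  def initialize(buffer: MutableAggregationBuffer): Unit =
    buffer(0) = Map.empty[String, Long]

  def update(buffer: MutableAggregationBuffer, input: Row): Unit =
    if (!input.isNullAt(0)) {
      val counts = buffer.getAs[scala.collection.Map[String, Long]](0)
      val v = input.getString(0)
      buffer(0) = counts + (v -> (counts.getOrElse(v, 0L) + 1L))
    }

  def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
    val m1 = buffer1.getAs[scala.collection.Map[String, Long]](0)
    val m2 = buffer2.getAs[scala.collection.Map[String, Long]](0)
    buffer1(0) = m2.foldLeft(m1) { case (acc, (k, n)) =>
      acc + (k -> (acc.getOrElse(k, 0L) + n))
    }
  }

  def evaluate(buffer: Row): Any = {
    val counts = buffer.getAs[scala.collection.Map[String, Long]](0)
    if (counts.isEmpty) null else counts.maxBy(_._2)._1
  }
}
{code}

Registered with {{sqlContext.udf.register("mode", new ModeUDAF)}}, this works for 
low-cardinality columns, but the per-group state grows with the number of distinct 
values, which is exactly the blow-up described above.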


> UDAF "mode" for categorical variables
> -
>
> Key: SPARK-10936
> URL: https://issues.apache.org/jira/browse/SPARK-10936
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, SQL
>Reporter: Xiangrui Meng
>
> This is similar to frequent items except that we don't have a threshold on 
> the frequency. So an exact implementation might require a global shuffle.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11013) SparkPlan may mistakenly register child plan's accumulators for SQL metrics

2015-10-08 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-11013:
---

 Summary: SparkPlan may mistakenly register child plan's 
accumulators for SQL metrics
 Key: SPARK-11013
 URL: https://issues.apache.org/jira/browse/SPARK-11013
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Wenchen Fan


The reason is that when we call the RDD API inside a SparkPlan, we are very likely 
to reference the SparkPlan in the closure and thus serialize and transfer a 
SparkPlan tree to the executor side. When we deserialize it, the accumulators in 
the child SparkPlan are also deserialized and registered, and they always report a 
zero value.
This is not a problem currently, because we only have one operation to aggregate 
the accumulators: add. However, if we want to support a more complex metric like 
min, the extra zero values will lead to wrong results.

Take TungstenAggregate as an example: I logged "stageId, partitionId, 
accumName, accumId" when an accumulator is deserialized and registered, and 
logged the "accumId -> accumValue" map when a task ends. The output is:
{code}
scala> val df = Seq(1 -> "a", 2 -> "b").toDF("a", "b").groupBy().count()
df: org.apache.spark.sql.DataFrame = [count: bigint]

scala> df.collect
register: 0 0 Some(number of input rows) 4
register: 0 0 Some(number of output rows) 5
register: 1 0 Some(number of input rows) 4
register: 1 0 Some(number of output rows) 5
register: 1 0 Some(number of input rows) 2
register: 1 0 Some(number of output rows) 3
Map(5 -> 1, 4 -> 2, 6 -> 4458496)
Map(5 -> 0, 2 -> 1, 7 -> 4458496, 3 -> 1, 4 -> 0)
res0: Array[org.apache.spark.sql.Row] = Array([2])
{code}


The best choice is to avoid serializing and deserializing a SparkPlan tree, which 
can be achieved by LocalNode.
Or we can apply a workaround to fix this serialization problem for the 
problematic SparkPlans like TungstenAggregate and TungstenSort.
Or we can improve the SQL metrics framework to make it more robust to this case.
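
As a standalone illustration of the closure-capture pattern described above (a 
sketch, not the actual SparkPlan code):

{code}
// Referencing a field of `this` inside an RDD closure drags the whole enclosing
// object -- including its accumulator -- into every task, and deserializing the
// accumulator on the executor registers it again with a zero value.
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

class MyFakePlan(sc: SparkContext) extends Serializable {
  val numOutputRows = sc.accumulator(0, "number of output rows")

  def execute(input: RDD[Int]): RDD[Int] =
    input.map { i =>
      numOutputRows += 1 // closure captures `this`, so MyFakePlan is serialized
      i
    }
}
{code}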



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8654) Analysis exception when using "NULL IN (...)": invalid cast

2015-10-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14949360#comment-14949360
 ] 

Apache Spark commented on SPARK-8654:
-

User 'marmbrus' has created a pull request for this issue:
https://github.com/apache/spark/pull/9034

> Analysis exception when using "NULL IN (...)": invalid cast
> ---
>
> Key: SPARK-8654
> URL: https://issues.apache.org/jira/browse/SPARK-8654
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Santiago M. Mola
>Priority: Minor
> Fix For: 1.6.0
>
>
> The following query throws an analysis exception:
> {code}
> SELECT * FROM t WHERE NULL NOT IN (1, 2, 3);
> {code}
> The exception is:
> {code}
> org.apache.spark.sql.AnalysisException: invalid cast from int to null;
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:38)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:42)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:66)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:52)
> {code}
> Here is a test that can be added to AnalysisSuite to check the issue:
> {code}
>   test("SPARK- regression test") {
> val plan = Project(Alias(In(Literal(null), Seq(Literal(1), Literal(2))), 
> "a")() :: Nil,
>   LocalRelation()
> )
> caseInsensitiveAnalyze(plan)
>   }
> {code}
> Note that this kind of query is a corner case, but it is still valid SQL. An 
> expression such as "NULL IN (...)" or "NULL NOT IN (...)" always gives NULL 
> as a result, even if the list contains NULL. So it is safe to translate these 
> expressions to Literal(null) during analysis.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10988) Reduce duplication in Aggregate2's expression rewriting logic

2015-10-08 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-10988.
--
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 9015
[https://github.com/apache/spark/pull/9015]

> Reduce duplication in Aggregate2's expression rewriting logic
> -
>
> Key: SPARK-10988
> URL: https://issues.apache.org/jira/browse/SPARK-10988
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Josh Rosen
>Assignee: Josh Rosen
> Fix For: 1.6.0
>
>
> In `aggregate/utils.scala`, there is a substantial amount of duplication in 
> the expression-rewriting logic. As a prerequisite to supporting imperative 
> aggregate functions in `TungstenAggregate`, we should refactor this file so 
> that the same expression-rewriting logic is used for both `SortAggregate` and 
> `TungstenAggregate`.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5949) Driver program has to register roaring bitmap classes used by spark with Kryo when number of partitions is greater than 2000

2015-10-08 Thread Charles Allen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14949518#comment-14949518
 ] 

Charles Allen commented on SPARK-5949:
--

This breaks when using more recent versions of Roaring where 
org.roaringbitmap.RoaringArray$Element is no longer present. The following 
stack trace appears:

{code}
A needed class was not found. This could be due to an error in your runpath. 
Missing class: org/roaringbitmap/RoaringArray$Element
java.lang.NoClassDefFoundError: org/roaringbitmap/RoaringArray$Element
at 
org.apache.spark.serializer.KryoSerializer$.<init>(KryoSerializer.scala:338)
at 
org.apache.spark.serializer.KryoSerializer$.<clinit>(KryoSerializer.scala)
at 
org.apache.spark.serializer.KryoSerializer.newKryo(KryoSerializer.scala:93)
at 
org.apache.spark.serializer.KryoSerializerInstance.borrowKryo(KryoSerializer.scala:237)
at 
org.apache.spark.serializer.KryoSerializerInstance.<init>(KryoSerializer.scala:222)
at 
org.apache.spark.serializer.KryoSerializer.newInstance(KryoSerializer.scala:138)
at 
org.apache.spark.broadcast.TorrentBroadcast$.blockifyObject(TorrentBroadcast.scala:201)
at 
org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:102)
at 
org.apache.spark.broadcast.TorrentBroadcast.<init>(TorrentBroadcast.scala:85)
at 
org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34)
at 
org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:63)
at org.apache.spark.SparkContext.broadcast(SparkContext.scala:1318)
at 
org.apache.spark.SparkContext$$anonfun$hadoopFile$1.apply(SparkContext.scala:1006)
at 
org.apache.spark.SparkContext$$anonfun$hadoopFile$1.apply(SparkContext.scala:1003)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
at org.apache.spark.SparkContext.withScope(SparkContext.scala:700)
at org.apache.spark.SparkContext.hadoopFile(SparkContext.scala:1003)
at 
org.apache.spark.SparkContext$$anonfun$textFile$1.apply(SparkContext.scala:818)
at 
org.apache.spark.SparkContext$$anonfun$textFile$1.apply(SparkContext.scala:816)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
at org.apache.spark.SparkContext.withScope(SparkContext.scala:700)
at org.apache.spark.SparkContext.textFile(SparkContext.scala:816)
at 
io.druid.indexer.spark.SparkDruidIndexer$$anonfun$2.apply(SparkDruidIndexer.scala:84)
at 
io.druid.indexer.spark.SparkDruidIndexer$$anonfun$2.apply(SparkDruidIndexer.scala:84)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.immutable.List.foreach(List.scala:318)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at 
io.druid.indexer.spark.SparkDruidIndexer$.loadData(SparkDruidIndexer.scala:84)
at 
io.druid.indexer.spark.TestSparkDruidIndexer$$anonfun$1.apply$mcV$sp(TestSparkDruidIndexer.scala:131)
at 
io.druid.indexer.spark.TestSparkDruidIndexer$$anonfun$1.apply(TestSparkDruidIndexer.scala:40)
at 
io.druid.indexer.spark.TestSparkDruidIndexer$$anonfun$1.apply(TestSparkDruidIndexer.scala:40)
at 
org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
at org.scalatest.Transformer.apply(Transformer.scala:22)
at org.scalatest.Transformer.apply(Transformer.scala:20)
at org.scalatest.FlatSpecLike$$anon$1.apply(FlatSpecLike.scala:1647)
at org.scalatest.Suite$class.withFixture(Suite.scala:1122)
at org.scalatest.FlatSpec.withFixture(FlatSpec.scala:1683)
at 
org.scalatest.FlatSpecLike$class.invokeWithFixture$1(FlatSpecLike.scala:1644)
at 
org.scalatest.FlatSpecLike$$anonfun$runTest$1.apply(FlatSpecLike.scala:1656)
at 
org.scalatest.FlatSpecLike$$anonfun$runTest$1.apply(FlatSpecLike.scala:1656)
at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
at org.scalatest.FlatSpecLike$class.runTest(FlatSpecLike.scala:1656)
at org.scalatest.FlatSpec.runTest(FlatSpec.scala:1683)
at 
org.scalatest.FlatSpecLike$$anonfun$runTests$1.apply(FlatSpecLike.scala:1714)
at 
org.scalatest.FlatSpecLike$$anonfun$runTests$1.apply(FlatSpecLike.scala:1714)
at 
