[jira] [Resolved] (SPARK-12198) SparkR support read.parquet and deprecate parquetFile

2015-12-10 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman resolved SPARK-12198.
---
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 10191
[https://github.com/apache/spark/pull/10191]

> SparkR support read.parquet and deprecate parquetFile
> -
>
> Key: SPARK-12198
> URL: https://issues.apache.org/jira/browse/SPARK-12198
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Yanbo Liang
> Fix For: 1.6.0
>
>
> SparkR support read.parquet and deprecate parquetFile.
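
For context, a minimal Scala sketch of the JVM-side reader call that the new SparkR {{read.parquet}} presumably maps onto (the SPARK-12198 change itself is in R; the path below is only illustrative):

{code:scala}
// Sketch only: the JVM-side DataFrameReader call. SparkR's read.parquet is assumed
// to delegate to this reader; the deprecated parquetFile is the older entry point.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object ReadParquetSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("read-parquet-sketch").setMaster("local"))
    val sqlContext = new SQLContext(sc)
    // New-style reader API (preferred over the deprecated parquetFile):
    val df = sqlContext.read.parquet("/tmp/people.parquet") // hypothetical path
    df.show()
    sc.stop()
  }
}
{code}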






[jira] [Commented] (SPARK-12172) Consider removing SparkR internal RDD APIs

2015-12-10 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15051320#comment-15051320
 ] 

Shivaram Venkataraman commented on SPARK-12172:
---

My opinion is that we should introduce DataFrame UDF support for SparkR before 
removing the RDD API. That would enable almost all of the use cases that the 
current private API supports.
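
To make the trade-off concrete, here is a minimal Scala sketch contrasting an RDD-level transformation with the same logic expressed as a DataFrame UDF (illustrative only; the SparkR work would expose the UDF path from R):

{code:scala}
// Sketch: the same per-row computation via the RDD API and via a DataFrame UDF.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.udf

object UdfVsRddSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("udf-sketch").setMaster("local"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val df = sc.parallelize(Seq(("a", 1), ("b", 2))).toDF("key", "value")

    // RDD-style per-row transformation (the kind of use case the private API covers):
    val doubledRdd = df.rdd.map(row => row.getInt(1) * 2)

    // The same logic as a DataFrame UDF, which keeps the computation in the DataFrame API:
    val double = udf((v: Int) => v * 2)
    df.select($"key", double($"value").as("doubled")).show()

    println(doubledRdd.collect().mkString(","))
    sc.stop()
  }
}
{code}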

> Consider removing SparkR internal RDD APIs
> --
>
> Key: SPARK-12172
> URL: https://issues.apache.org/jira/browse/SPARK-12172
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Felix Cheung
>







[jira] [Updated] (SPARK-12266) cannot handle postgis raster type

2015-12-10 Thread Severin Thaler (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Severin Thaler updated SPARK-12266:
---
Description: 
on server postgis running with a table that has a column of type raster
on client running spark standalone application that pulls data from remote 
postgis server using spark-sql and jdbc driver:

{code:none}
val conf = new SparkConf().setAppName("Simple Application").setMaster("local")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
val jdbcDF = sqlContext.read.format("jdbc").options(Map("url" -> 
"jdbc:postgresql://postgishost/test?user=postgres=userpassword", 
"dbtable" -> "atlas")).load()
val rows = jdbcDF.take(3)
println(rows(0).toString())
println(rows(1).toString())
println(rows(2).toString())
{code}

when running this app i get:

{code:java}
java.sql.SQLException: Unsupported type 
  at 
org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.org$apache$spark$sql$execution$datasources$jdbc$JDBCRDD$$getCatalystType(JDBCRDD.scala:103)
  at 
org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$$anonfun$1.apply(JDBCRDD.scala:140)
  at 
org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$$anonfun$1.apply(JDBCRDD.scala:140)
  at scala.Option.getOrElse(Option.scala:121)
  at 
org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:139)
  at 
org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation.<init>(JDBCRelation.scala:91)
  at 
org.apache.spark.sql.execution.datasources.jdbc.DefaultSource.createRelation(DefaultSource.scala:60)
  at 
org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:125)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:114)
  at rdd.BasicRDDTests$$anonfun$1.apply$mcV$sp(BasicRDDTests.scala:65)
{code}


  was:
on server postgis server running with a table that has a column of type raster
on client running spark standalone application that pulls data from remote 
postgis server using spark-sql and jdbc driver:

{code:none}
val conf = new SparkConf().setAppName("Simple Application").setMaster("local")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
val jdbcDF = sqlContext.read.format("jdbc").options(Map("url" -> 
"jdbc:postgresql://postgishost/test?user=postgres=userpassword", 
"dbtable" -> "atlas")).load()
val rows = jdbcDF.take(3)
println(rows(0).toString())
println(rows(1).toString())
println(rows(2).toString())
{code}

when running this app i get:

{code:java}
java.sql.SQLException: Unsupported type 
  at 
org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.org$apache$spark$sql$execution$datasources$jdbc$JDBCRDD$$getCatalystType(JDBCRDD.scala:103)
  at 
org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$$anonfun$1.apply(JDBCRDD.scala:140)
  at 
org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$$anonfun$1.apply(JDBCRDD.scala:140)
  at scala.Option.getOrElse(Option.scala:121)
  at 
org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:139)
  at 
org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation.<init>(JDBCRelation.scala:91)
  at 
org.apache.spark.sql.execution.datasources.jdbc.DefaultSource.createRelation(DefaultSource.scala:60)
  at 
org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:125)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:114)
  at rdd.BasicRDDTests$$anonfun$1.apply$mcV$sp(BasicRDDTests.scala:65)
{code}



> cannot handle postgis raster type
> -
>
> Key: SPARK-12266
> URL: https://issues.apache.org/jira/browse/SPARK-12266
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 1.5.2
> Environment: PostGIS Server:
> Ubuntu-14.04
> Postgres 9.4.5
> PostGIS 2.2.1
> Spark standalone client:
> Java 7
> Spark 1.5.2
> PostGIS JDBC driver built within java directory of PostGIS 2.1.8, see: 
> http://postgis.net/docs/manual-2.1/postgis_installation.html#idp58575520
>Reporter: Severin Thaler
>  Labels: dataframe, postgres, sql
>
> on server postgis running with a table that has a column of type raster
> on client running spark standalone application that pulls data from remote 
> postgis server using spark-sql and jdbc driver:
> {code:none}
> val conf = new SparkConf().setAppName("Simple Application").setMaster("local")
> val sc = new SparkContext(conf)
> val sqlContext = new SQLContext(sc)
> val jdbcDF = sqlContext.read.format("jdbc").options(Map("url" -> 
> "jdbc:postgresql://postgishost/test?user=postgres=userpassword", 
> "dbtable" -> "atlas")).load()
> val rows = jdbcDF.take(3)
> println(rows(0).toString())
> println(rows(1).toString())
> println(rows(2).toString())
> {code}
> when running this app i get:
> {code:java}
> java.sql.SQLException: 

[jira] [Updated] (SPARK-12266) cannot handle postgis raster type

2015-12-10 Thread Severin Thaler (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Severin Thaler updated SPARK-12266:
---
Description: 
on server postgis server running with a table that has a column of type raster
on client running spark standalone application that pulls data from remote 
postgis server using spark-sql and jdbc driver:

{code:none}
val conf = new SparkConf().setAppName("Simple Application").setMaster("local")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
val jdbcDF = sqlContext.read.format("jdbc").options(Map("url" -> 
"jdbc:postgresql://postgishost/test?user=postgres=userpassword", 
"dbtable" -> "atlas")).load()
val rows = jdbcDF.take(3)
println(rows(0).toString())
println(rows(1).toString())
println(rows(2).toString())
{code}

  was:
on server postgis server running with a table that has a column of type raster
on client running spark standalone application that pulls data from remote 
postgis server using spark-sql and jdbc driver:

{code:scala}
val conf = new SparkConf().setAppName("Simple Application").setMaster("local")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
val jdbcDF = sqlContext.read.format("jdbc").options(Map("url" -> 
"jdbc:postgresql://postgishost/test?user=postgres=userpassword", 
"dbtable" -> "atlas")).load()
val rows = jdbcDF.take(3)
println(rows(0).toString())
println(rows(1).toString())
println(rows(2).toString())
{code}


> cannot handle postgis raster type
> -
>
> Key: SPARK-12266
> URL: https://issues.apache.org/jira/browse/SPARK-12266
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 1.5.2
> Environment: PostGIS Server:
> Ubuntu-14.04
> Postgres 9.4.5
> PostGIS 2.2.1
> Spark standalone client:
> Java 7
> Spark 1.5.2
> PostGIS JDBC driver built within java directory of PostGIS 2.1.8, see: 
> http://postgis.net/docs/manual-2.1/postgis_installation.html#idp58575520
>Reporter: Severin Thaler
>  Labels: dataframe, postgres, sql
>
> on server postgis server running with a table that has a column of type raster
> on client running spark standalone application that pulls data from remote 
> postgis server using spark-sql and jdbc driver:
> {code:none}
> val conf = new SparkConf().setAppName("Simple Application").setMaster("local")
> val sc = new SparkContext(conf)
> val sqlContext = new SQLContext(sc)
> val jdbcDF = sqlContext.read.format("jdbc").options(Map("url" -> 
> "jdbc:postgresql://postgishost/test?user=postgres=userpassword", 
> "dbtable" -> "atlas")).load()
> val rows = jdbcDF.take(3)
> println(rows(0).toString())
> println(rows(1).toString())
> println(rows(2).toString())
> {code}






[jira] [Updated] (SPARK-12266) cannot handle postgis raster type

2015-12-10 Thread Severin Thaler (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Severin Thaler updated SPARK-12266:
---
Description: 
on server postgis running with a table that has a column of type raster
on client running spark standalone application that pulls data from postgis 
server using and jdbc driver and spark-sql:

{code:none}
val conf = new SparkConf().setAppName("Simple Application").setMaster("local")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
val jdbcDF = sqlContext.read.format("jdbc").options(Map("url" -> 
"jdbc:postgresql://postgishost/test?user=postgres=userpassword", 
"dbtable" -> "atlas")).load()
val rows = jdbcDF.take(3)
println(rows(0).toString())
println(rows(1).toString())
println(rows(2).toString())
{code}

when running this app i get:

{code:java}
java.sql.SQLException: Unsupported type 
  at 
org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.org$apache$spark$sql$execution$datasources$jdbc$JDBCRDD$$getCatalystType(JDBCRDD.scala:103)
  at 
org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$$anonfun$1.apply(JDBCRDD.scala:140)
  at 
org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$$anonfun$1.apply(JDBCRDD.scala:140)
  at scala.Option.getOrElse(Option.scala:121)
  at 
org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:139)
  at 
org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation.<init>(JDBCRelation.scala:91)
  at 
org.apache.spark.sql.execution.datasources.jdbc.DefaultSource.createRelation(DefaultSource.scala:60)
  at 
org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:125)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:114)
  at rdd.BasicRDDTests$$anonfun$1.apply$mcV$sp(BasicRDDTests.scala:65)
{code}


  was:
on server postgis running with a table that has a column of type raster
on client running spark standalone application that pulls data from postgis 
server using spark-sql and jdbc driver:

{code:none}
val conf = new SparkConf().setAppName("Simple Application").setMaster("local")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
val jdbcDF = sqlContext.read.format("jdbc").options(Map("url" -> 
"jdbc:postgresql://postgishost/test?user=postgres=userpassword", 
"dbtable" -> "atlas")).load()
val rows = jdbcDF.take(3)
println(rows(0).toString())
println(rows(1).toString())
println(rows(2).toString())
{code}

when running this app i get:

{code:java}
java.sql.SQLException: Unsupported type 
  at 
org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.org$apache$spark$sql$execution$datasources$jdbc$JDBCRDD$$getCatalystType(JDBCRDD.scala:103)
  at 
org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$$anonfun$1.apply(JDBCRDD.scala:140)
  at 
org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$$anonfun$1.apply(JDBCRDD.scala:140)
  at scala.Option.getOrElse(Option.scala:121)
  at 
org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:139)
  at 
org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation.<init>(JDBCRelation.scala:91)
  at 
org.apache.spark.sql.execution.datasources.jdbc.DefaultSource.createRelation(DefaultSource.scala:60)
  at 
org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:125)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:114)
  at rdd.BasicRDDTests$$anonfun$1.apply$mcV$sp(BasicRDDTests.scala:65)
{code}



> cannot handle postgis raster type
> -
>
> Key: SPARK-12266
> URL: https://issues.apache.org/jira/browse/SPARK-12266
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 1.5.2
> Environment: PostGIS Server:
> Ubuntu-14.04
> Postgres 9.4.5
> PostGIS 2.2.1
> Spark standalone client:
> Java 7
> Spark 1.5.2
> PostGIS JDBC driver built within java directory of PostGIS 2.1.8, see: 
> http://postgis.net/docs/manual-2.1/postgis_installation.html#idp58575520
>Reporter: Severin Thaler
>  Labels: dataframe, postgres, sql
>
> on server postgis running with a table that has a column of type raster
> on client running spark standalone application that pulls data from postgis 
> server using and jdbc driver and spark-sql:
> {code:none}
> val conf = new SparkConf().setAppName("Simple Application").setMaster("local")
> val sc = new SparkContext(conf)
> val sqlContext = new SQLContext(sc)
> val jdbcDF = sqlContext.read.format("jdbc").options(Map("url" -> 
> "jdbc:postgresql://postgishost/test?user=postgres=userpassword", 
> "dbtable" -> "atlas")).load()
> val rows = jdbcDF.take(3)
> println(rows(0).toString())
> println(rows(1).toString())
> println(rows(2).toString())
> {code}
> when running this app i get:
> {code:java}
> java.sql.SQLException: Unsupported type 
>   at 

[jira] [Resolved] (SPARK-12266) cannot handle postgis raster type

2015-12-10 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-12266.
---
Resolution: Not A Problem

That sounds like a custom type as far as JDBC is concerned. I don't think you can 
expect Spark to handle this data type directly; it's not a standard one.
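
For reference, a simplified sketch (not the actual Spark source) of the kind of mapping {{JDBCRDD.getCatalystType}} in the stack trace performs: standard {{java.sql.Types}} codes are mapped to Catalyst types, while a vendor-specific column such as the PostGIS {{raster}} type is reported by the driver with a non-standard code and falls through to the "Unsupported type" exception:

{code:scala}
// Simplified sketch of a JDBC-to-Catalyst type mapping; not Spark's implementation.
import java.sql.{SQLException, Types}
import org.apache.spark.sql.types._

object JdbcTypeMappingSketch {
  def toCatalystType(sqlType: Int): DataType = sqlType match {
    case Types.INTEGER              => IntegerType
    case Types.BIGINT               => LongType
    case Types.DOUBLE               => DoubleType
    case Types.VARCHAR | Types.CHAR => StringType
    case Types.TIMESTAMP            => TimestampType
    // ... other standard JDBC types ...
    case other => throw new SQLException(s"Unsupported type $other") // raster ends up here
  }

  def main(args: Array[String]): Unit = {
    println(toCatalystType(Types.INTEGER)) // IntegerType
    // A PostGIS raster column is reported with a non-standard code, e.g. Types.OTHER,
    // so toCatalystType(Types.OTHER) would throw: Unsupported type 1111
  }
}
{code}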

> cannot handle postgis raster type
> -
>
> Key: SPARK-12266
> URL: https://issues.apache.org/jira/browse/SPARK-12266
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 1.5.2
> Environment: PostGIS Server:
> Ubuntu-14.04
> Postgres 9.4.5
> PostGIS 2.2.1
> Spark standalone client:
> Java 7
> Spark 1.5.2
> PostGIS JDBC driver built within java directory of PostGIS 2.1.8, see: 
> http://postgis.net/docs/manual-2.1/postgis_installation.html#idp58575520
>Reporter: Severin Thaler
>  Labels: dataframe, postgres, sql
>
> on server postgis running with a table that has a column of type raster
> on client running spark standalone application that pulls data from postgis 
> server using and jdbc driver and spark-sql:
> {code:none}
> val conf = new SparkConf().setAppName("Simple Application").setMaster("local")
> val sc = new SparkContext(conf)
> val sqlContext = new SQLContext(sc)
> val jdbcDF = sqlContext.read.format("jdbc").options(Map("url" -> 
> "jdbc:postgresql://postgishost/test?user=postgres=userpassword", 
> "dbtable" -> "atlas")).load()
> val rows = jdbcDF.take(3)
> println(rows(0).toString())
> println(rows(1).toString())
> println(rows(2).toString())
> {code}
> when running this spark app i get:
> {code:java}
> java.sql.SQLException: Unsupported type 
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.org$apache$spark$sql$execution$datasources$jdbc$JDBCRDD$$getCatalystType(JDBCRDD.scala:103)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$$anonfun$1.apply(JDBCRDD.scala:140)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$$anonfun$1.apply(JDBCRDD.scala:140)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:139)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation.<init>(JDBCRelation.scala:91)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.DefaultSource.createRelation(DefaultSource.scala:60)
>   at 
> org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:125)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:114)
>   at rdd.BasicRDDTests$$anonfun$1.apply$mcV$sp(BasicRDDTests.scala:65)
> {code}






[jira] [Updated] (SPARK-12266) cannot handle postgis raster type

2015-12-10 Thread Severin Thaler (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Severin Thaler updated SPARK-12266:
---
Description: 
on server postgis running with a table that has a column of type raster
on client running spark standalone application that pulls data from postgis 
server using and jdbc driver and spark-sql:

{code:none}
val conf = new SparkConf().setAppName("Simple Application").setMaster("local")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
val jdbcDF = sqlContext.read.format("jdbc").options(Map("url" -> 
"jdbc:postgresql://postgishost/test?user=postgres=userpassword", 
"dbtable" -> "atlas")).load()
val rows = jdbcDF.take(3)
println(rows(0).toString())
println(rows(1).toString())
println(rows(2).toString())
{code}

when running this spark app i get:

{code:java}
java.sql.SQLException: Unsupported type 
  at 
org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.org$apache$spark$sql$execution$datasources$jdbc$JDBCRDD$$getCatalystType(JDBCRDD.scala:103)
  at 
org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$$anonfun$1.apply(JDBCRDD.scala:140)
  at 
org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$$anonfun$1.apply(JDBCRDD.scala:140)
  at scala.Option.getOrElse(Option.scala:121)
  at 
org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:139)
  at 
org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation.<init>(JDBCRelation.scala:91)
  at 
org.apache.spark.sql.execution.datasources.jdbc.DefaultSource.createRelation(DefaultSource.scala:60)
  at 
org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:125)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:114)
  at rdd.BasicRDDTests$$anonfun$1.apply$mcV$sp(BasicRDDTests.scala:65)
{code}


  was:
on server postgis running with a table that has a column of type raster
on client running spark standalone application that pulls data from postgis 
server using and jdbc driver and spark-sql:

{code:none}
val conf = new SparkConf().setAppName("Simple Application").setMaster("local")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
val jdbcDF = sqlContext.read.format("jdbc").options(Map("url" -> 
"jdbc:postgresql://postgishost/test?user=postgres=userpassword", 
"dbtable" -> "atlas")).load()
val rows = jdbcDF.take(3)
println(rows(0).toString())
println(rows(1).toString())
println(rows(2).toString())
{code}

when running this app i get:

{code:java}
java.sql.SQLException: Unsupported type 
  at 
org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.org$apache$spark$sql$execution$datasources$jdbc$JDBCRDD$$getCatalystType(JDBCRDD.scala:103)
  at 
org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$$anonfun$1.apply(JDBCRDD.scala:140)
  at 
org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$$anonfun$1.apply(JDBCRDD.scala:140)
  at scala.Option.getOrElse(Option.scala:121)
  at 
org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:139)
  at 
org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation.<init>(JDBCRelation.scala:91)
  at 
org.apache.spark.sql.execution.datasources.jdbc.DefaultSource.createRelation(DefaultSource.scala:60)
  at 
org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:125)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:114)
  at rdd.BasicRDDTests$$anonfun$1.apply$mcV$sp(BasicRDDTests.scala:65)
{code}



> cannot handle postgis raster type
> -
>
> Key: SPARK-12266
> URL: https://issues.apache.org/jira/browse/SPARK-12266
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 1.5.2
> Environment: PostGIS Server:
> Ubuntu-14.04
> Postgres 9.4.5
> PostGIS 2.2.1
> Spark standalone client:
> Java 7
> Spark 1.5.2
> PostGIS JDBC driver built within java directory of PostGIS 2.1.8, see: 
> http://postgis.net/docs/manual-2.1/postgis_installation.html#idp58575520
>Reporter: Severin Thaler
>  Labels: dataframe, postgres, sql
>
> on server postgis running with a table that has a column of type raster
> on client running spark standalone application that pulls data from postgis 
> server using and jdbc driver and spark-sql:
> {code:none}
> val conf = new SparkConf().setAppName("Simple Application").setMaster("local")
> val sc = new SparkContext(conf)
> val sqlContext = new SQLContext(sc)
> val jdbcDF = sqlContext.read.format("jdbc").options(Map("url" -> 
> "jdbc:postgresql://postgishost/test?user=postgres=userpassword", 
> "dbtable" -> "atlas")).load()
> val rows = jdbcDF.take(3)
> println(rows(0).toString())
> println(rows(1).toString())
> println(rows(2).toString())
> {code}
> when running this spark app i get:
> {code:java}
> java.sql.SQLException: Unsupported 

[jira] [Comment Edited] (SPARK-12266) cannot handle postgis raster type

2015-12-10 Thread Severin Thaler (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15051475#comment-15051475
 ] 

Severin Thaler edited comment on SPARK-12266 at 12/10/15 7:08 PM:
--

Yes, it is not Spark that is to blame but the PostGIS JDBC driver.
I'll file a bug report there. Thanks.


was (Author: severin thaler):
yes, im going to simplify the example and just use a simple client, not a spark 
app. i should still see the issue showing up there and then ill file a bug to 
the devs of the jdbc postgis driver. thanks.

> cannot handle postgis raster type
> -
>
> Key: SPARK-12266
> URL: https://issues.apache.org/jira/browse/SPARK-12266
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 1.5.2
> Environment: PostGIS Server:
> Ubuntu-14.04
> Postgres 9.4.5
> PostGIS 2.2.1
> Spark standalone client:
> Java 7
> Spark 1.5.2
> PostGIS JDBC driver built within java directory of PostGIS 2.1.8, see: 
> http://postgis.net/docs/manual-2.1/postgis_installation.html#idp58575520
>Reporter: Severin Thaler
>  Labels: dataframe, postgres, sql
>
> on server postgis running with a table that has a column of type raster
> on client running spark standalone application that pulls data from postgis 
> server using and jdbc driver and spark-sql:
> {code:none}
> val conf = new SparkConf().setAppName("Simple Application").setMaster("local")
> val sc = new SparkContext(conf)
> val sqlContext = new SQLContext(sc)
> val jdbcDF = sqlContext.read.format("jdbc").options(Map("url" -> 
> "jdbc:postgresql://postgishost/test?user=postgres=userpassword", 
> "dbtable" -> "atlas")).load()
> val rows = jdbcDF.take(3)
> println(rows(0).toString())
> println(rows(1).toString())
> println(rows(2).toString())
> {code}
> when running this spark app i get:
> {code:java}
> java.sql.SQLException: Unsupported type 
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.org$apache$spark$sql$execution$datasources$jdbc$JDBCRDD$$getCatalystType(JDBCRDD.scala:103)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$$anonfun$1.apply(JDBCRDD.scala:140)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$$anonfun$1.apply(JDBCRDD.scala:140)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:139)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation.<init>(JDBCRelation.scala:91)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.DefaultSource.createRelation(DefaultSource.scala:60)
>   at 
> org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:125)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:114)
>   at rdd.BasicRDDTests$$anonfun$1.apply$mcV$sp(BasicRDDTests.scala:65)
> {code}






[jira] [Created] (SPARK-12267) Standalone master keeps references to disassociated workers until they sent no heartbeats

2015-12-10 Thread Jacek Laskowski (JIRA)
Jacek Laskowski created SPARK-12267:
---

 Summary: Standalone master keeps references to disassociated 
workers until they sent no heartbeats
 Key: SPARK-12267
 URL: https://issues.apache.org/jira/browse/SPARK-12267
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.6.0
Reporter: Jacek Laskowski


While toying with Spark Standalone I've noticed the following messages
in the logs of the master:

{code}
INFO Master: Registering worker 192.168.1.6:59919 with 2 cores, 2.0 GB RAM
INFO Master: localhost:59920 got disassociated, removing it.
...
WARN Master: Removing worker-20151210090708-192.168.1.6-59919 because
we got no heartbeat in 60 seconds
INFO Master: Removing worker worker-20151210090708-192.168.1.6-59919
on 192.168.1.6:59919
{code}

Why does the message "WARN Master: Removing
worker-20151210090708-192.168.1.6-59919 because we got no heartbeat in
60 seconds" appear when the worker should've been removed already (as
pointed out in "INFO Master: localhost:59920 got disassociated,
removing it.")?

Could it be that the ids are different - 192.168.1.6:59919 vs localhost:59920?

I started the master using {{./sbin/start-master.sh -h localhost}} and the
workers using {{./sbin/start-slave.sh spark://localhost:7077}}.
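
To illustrate the hypothesis above, a simplified model (not the actual Master code) of how the two removal paths can disagree: the disassociation event carries the {{localhost:59920}} address while the worker is registered under {{192.168.1.6:59919}}, so only the heartbeat timeout eventually drops it:

{code:scala}
// Simplified model of the two removal paths seen in the log; illustrative only.
object MasterBookkeepingSketch {
  case class WorkerInfo(id: String, host: String, port: Int, var lastHeartbeat: Long)

  private val workers = scala.collection.mutable.Map[String, WorkerInfo]() // keyed by "host:port"

  def register(w: WorkerInfo): Unit = workers(s"${w.host}:${w.port}") = w

  // "got disassociated, removing it." -- a no-op if the address does not match the key.
  def onDisassociated(remoteAddress: String): Unit = workers.remove(remoteAddress)

  // The 60-second heartbeat check -- removes workers by their registered key.
  def checkHeartbeats(now: Long, timeoutMs: Long): Unit =
    workers.retain((_, w) => now - w.lastHeartbeat < timeoutMs)

  def main(args: Array[String]): Unit = {
    register(WorkerInfo("worker-20151210090708-192.168.1.6-59919", "192.168.1.6", 59919, 0L))
    onDisassociated("localhost:59920")                 // different address: worker stays registered
    checkHeartbeats(now = 61000L, timeoutMs = 60000L)  // heartbeat timeout finally removes it
    println(workers.keys)                              // prints an empty set
  }
}
{code}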






[jira] [Commented] (SPARK-12267) Standalone master keeps references to disassociated workers until they sent no heartbeats

2015-12-10 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15051502#comment-15051502
 ] 

Shixiong Zhu commented on SPARK-12267:
--

cc [~vanzin]

> Standalone master keeps references to disassociated workers until they sent 
> no heartbeats
> -
>
> Key: SPARK-12267
> URL: https://issues.apache.org/jira/browse/SPARK-12267
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: Jacek Laskowski
>
> While toying with Spark Standalone I've noticed the following messages
> in the logs of the master:
> {code}
> INFO Master: Registering worker 192.168.1.6:59919 with 2 cores, 2.0 GB RAM
> INFO Master: localhost:59920 got disassociated, removing it.
> ...
> WARN Master: Removing worker-20151210090708-192.168.1.6-59919 because
> we got no heartbeat in 60 seconds
> INFO Master: Removing worker worker-20151210090708-192.168.1.6-59919
> on 192.168.1.6:59919
> {code}
> Why does the message "WARN Master: Removing
> worker-20151210090708-192.168.1.6-59919 because we got no heartbeat in
> 60 seconds" appear when the worker should've been removed already (as
> pointed out in "INFO Master: localhost:59920 got disassociated,
> removing it.")?
> Could it be that the ids are different - 192.168.1.6:59919 vs localhost:59920?
> I started master using {{./sbin/start-master.sh -h localhost}} and the
> workers {{./sbin/start-slave.sh spark://localhost:7077}}.






[jira] [Updated] (SPARK-12266) cannot handle postgis raster type

2015-12-10 Thread Severin Thaler (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Severin Thaler updated SPARK-12266:
---
Description: 
on server postgis server running with a table that has a column of type raster
on client running spark standalone application that pulls data from remote 
postgis server using spark-sql and jdbc driver:

{code:scala}
val conf = new SparkConf().setAppName("Simple Application").setMaster("local")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
val jdbcDF = sqlContext.read.format("jdbc").options(Map("url" -> 
"jdbc:postgresql://postgishost/test?user=postgres=userpassword", 
"dbtable" -> "atlas")).load()
val rows = jdbcDF.take(3)
println(rows(0).toString())
println(rows(1).toString())
println(rows(2).toString())
{code}

  was:
on server postgis server running with a table that has a column of type raster
on client running spark standalone application that pulls data from remote 
postgis server using spark-sql and jdbc driver:

{code:scala}
val conf = new SparkConf().setAppName("Simple Application").setMaster("local")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
val jdbcDF = sqlContext.read.format("jdbc").options(
  Map("url" -> 
"jdbc:postgresql://postgishost/test?user=postgres=userpassword",
"dbtable" -> "atlas")).load()

val rows = jdbcDF.take(3)
println(rows(0).toString())
println(rows(1).toString())
println(rows(2).toString())
{code}


> cannot handle postgis raster type
> -
>
> Key: SPARK-12266
> URL: https://issues.apache.org/jira/browse/SPARK-12266
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 1.5.2
> Environment: PostGIS Server:
> Ubuntu-14.04
> Postgres 9.4.5
> PostGIS 2.2.1
> Spark standalone client:
> Java 7
> Spark 1.5.2
> PostGIS JDBC driver built within java directory of PostGIS 2.1.8, see: 
> http://postgis.net/docs/manual-2.1/postgis_installation.html#idp58575520
>Reporter: Severin Thaler
>  Labels: dataframe, postgres, sql
>
> on server postgis server running with a table that has a column of type raster
> on client running spark standalone application that pulls data from remote 
> postgis server using spark-sql and jdbc driver:
> {code:scala}
> val conf = new SparkConf().setAppName("Simple Application").setMaster("local")
> val sc = new SparkContext(conf)
> val sqlContext = new SQLContext(sc)
> val jdbcDF = sqlContext.read.format("jdbc").options(Map("url" -> 
> "jdbc:postgresql://postgishost/test?user=postgres=userpassword", 
> "dbtable" -> "atlas")).load()
> val rows = jdbcDF.take(3)
> println(rows(0).toString())
> println(rows(1).toString())
> println(rows(2).toString())
> {code}






[jira] [Updated] (SPARK-12266) cannot handle postgis raster type

2015-12-10 Thread Severin Thaler (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Severin Thaler updated SPARK-12266:
---
Description: 
on server postgis running with a table that has a column of type raster
on client running spark standalone application that pulls data from postgis 
server using spark-sql and jdbc driver:

{code:none}
val conf = new SparkConf().setAppName("Simple Application").setMaster("local")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
val jdbcDF = sqlContext.read.format("jdbc").options(Map("url" -> 
"jdbc:postgresql://postgishost/test?user=postgres=userpassword", 
"dbtable" -> "atlas")).load()
val rows = jdbcDF.take(3)
println(rows(0).toString())
println(rows(1).toString())
println(rows(2).toString())
{code}

when running this app i get:

{code:java}
java.sql.SQLException: Unsupported type 
  at 
org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.org$apache$spark$sql$execution$datasources$jdbc$JDBCRDD$$getCatalystType(JDBCRDD.scala:103)
  at 
org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$$anonfun$1.apply(JDBCRDD.scala:140)
  at 
org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$$anonfun$1.apply(JDBCRDD.scala:140)
  at scala.Option.getOrElse(Option.scala:121)
  at 
org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:139)
  at 
org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation.<init>(JDBCRelation.scala:91)
  at 
org.apache.spark.sql.execution.datasources.jdbc.DefaultSource.createRelation(DefaultSource.scala:60)
  at 
org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:125)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:114)
  at rdd.BasicRDDTests$$anonfun$1.apply$mcV$sp(BasicRDDTests.scala:65)
{code}


  was:
on server postgis running with a table that has a column of type raster
on client running spark standalone application that pulls data from remote 
postgis server using spark-sql and jdbc driver:

{code:none}
val conf = new SparkConf().setAppName("Simple Application").setMaster("local")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
val jdbcDF = sqlContext.read.format("jdbc").options(Map("url" -> 
"jdbc:postgresql://postgishost/test?user=postgres=userpassword", 
"dbtable" -> "atlas")).load()
val rows = jdbcDF.take(3)
println(rows(0).toString())
println(rows(1).toString())
println(rows(2).toString())
{code}

when running this app i get:

{code:java}
java.sql.SQLException: Unsupported type 
  at 
org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.org$apache$spark$sql$execution$datasources$jdbc$JDBCRDD$$getCatalystType(JDBCRDD.scala:103)
  at 
org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$$anonfun$1.apply(JDBCRDD.scala:140)
  at 
org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$$anonfun$1.apply(JDBCRDD.scala:140)
  at scala.Option.getOrElse(Option.scala:121)
  at 
org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:139)
  at 
org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation.<init>(JDBCRelation.scala:91)
  at 
org.apache.spark.sql.execution.datasources.jdbc.DefaultSource.createRelation(DefaultSource.scala:60)
  at 
org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:125)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:114)
  at rdd.BasicRDDTests$$anonfun$1.apply$mcV$sp(BasicRDDTests.scala:65)
{code}



> cannot handle postgis raster type
> -
>
> Key: SPARK-12266
> URL: https://issues.apache.org/jira/browse/SPARK-12266
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 1.5.2
> Environment: PostGIS Server:
> Ubuntu-14.04
> Postgres 9.4.5
> PostGIS 2.2.1
> Spark standalone client:
> Java 7
> Spark 1.5.2
> PostGIS JDBC driver built within java directory of PostGIS 2.1.8, see: 
> http://postgis.net/docs/manual-2.1/postgis_installation.html#idp58575520
>Reporter: Severin Thaler
>  Labels: dataframe, postgres, sql
>
> on server postgis running with a table that has a column of type raster
> on client running spark standalone application that pulls data from postgis 
> server using spark-sql and jdbc driver:
> {code:none}
> val conf = new SparkConf().setAppName("Simple Application").setMaster("local")
> val sc = new SparkContext(conf)
> val sqlContext = new SQLContext(sc)
> val jdbcDF = sqlContext.read.format("jdbc").options(Map("url" -> 
> "jdbc:postgresql://postgishost/test?user=postgres=userpassword", 
> "dbtable" -> "atlas")).load()
> val rows = jdbcDF.take(3)
> println(rows(0).toString())
> println(rows(1).toString())
> println(rows(2).toString())
> {code}
> when running this app i get:
> {code:java}
> java.sql.SQLException: Unsupported type 
>   at 

[jira] [Commented] (SPARK-12237) Unsupported message RpcMessage causes message retries

2015-12-10 Thread Nan Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15051499#comment-15051499
 ] 

Nan Zhu commented on SPARK-12237:
-

If that's the case, I don't think it would happen in the real world, as the 
Executor does not communicate directly with the Master.

> Unsupported message RpcMessage causes message retries
> -
>
> Key: SPARK-12237
> URL: https://issues.apache.org/jira/browse/SPARK-12237
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: Jacek Laskowski
>
> When an unsupported message is sent to an endpoint, Spark throws 
> {{org.apache.spark.SparkException}} and retries sending the message. It 
> should *not* since the message is unsupported.
> {code}
> WARN NettyRpcEndpointRef: Error sending message [message = 
> RetrieveSparkProps] in 1 attempts
> org.apache.spark.SparkException: Unsupported message 
> RpcMessage(localhost:51137,RetrieveSparkProps,org.apache.spark.rpc.netty.RemoteNettyRpcCallContext@c0a6275)
>  from localhost:51137
>   at 
> org.apache.spark.rpc.netty.Inbox$$anonfun$process$1$$anonfun$apply$mcV$sp$1.apply(Inbox.scala:105)
>   at 
> org.apache.spark.rpc.netty.Inbox$$anonfun$process$1$$anonfun$apply$mcV$sp$1.apply(Inbox.scala:104)
>   at 
> org.apache.spark.deploy.master.Master$$anonfun$receiveAndReply$1.applyOrElse(Master.scala:373)
>   at 
> org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:104)
>   at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:204)
>   at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
>   at 
> org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:215)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> WARN NettyRpcEndpointRef: Error sending message [message = 
> RetrieveSparkProps] in 2 attempts
> org.apache.spark.SparkException: Unsupported message 
> RpcMessage(localhost:51137,RetrieveSparkProps,org.apache.spark.rpc.netty.RemoteNettyRpcCallContext@73a76a5a)
>  from localhost:51137
>   at 
> org.apache.spark.rpc.netty.Inbox$$anonfun$process$1$$anonfun$apply$mcV$sp$1.apply(Inbox.scala:105)
>   at 
> org.apache.spark.rpc.netty.Inbox$$anonfun$process$1$$anonfun$apply$mcV$sp$1.apply(Inbox.scala:104)
>   at 
> org.apache.spark.deploy.master.Master$$anonfun$receiveAndReply$1.applyOrElse(Master.scala:373)
>   at 
> org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:104)
>   at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:204)
>   at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
>   at 
> org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:215)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> WARN NettyRpcEndpointRef: Error sending message [message = 
> RetrieveSparkProps] in 3 attempts
> org.apache.spark.SparkException: Unsupported message 
> RpcMessage(localhost:51137,RetrieveSparkProps,org.apache.spark.rpc.netty.RemoteNettyRpcCallContext@670bfda7)
>  from localhost:51137
>   at 
> org.apache.spark.rpc.netty.Inbox$$anonfun$process$1$$anonfun$apply$mcV$sp$1.apply(Inbox.scala:105)
>   at 
> org.apache.spark.rpc.netty.Inbox$$anonfun$process$1$$anonfun$apply$mcV$sp$1.apply(Inbox.scala:104)
>   at 
> org.apache.spark.deploy.master.Master$$anonfun$receiveAndReply$1.applyOrElse(Master.scala:373)
>   at 
> org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:104)
>   at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:204)
>   at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
>   at 
> org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:215)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Exception in thread "main" java.lang.reflect.UndeclaredThrowableException
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1672)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:68)
>   at 
> org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:151)
>   at 
> 

[jira] [Commented] (SPARK-12237) Unsupported message RpcMessage causes message retries

2015-12-10 Thread Jacek Laskowski (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15051500#comment-15051500
 ] 

Jacek Laskowski commented on SPARK-12237:
-

Sure, but the issue is not who talks to whom; it's how improper messages are 
handled when they somehow get routed incorrectly. I think unsupported messages 
should end up in a dead-letter inbox for later inspection rather than lead to 
retries.
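
A minimal Scala sketch of the handling suggested above (illustrative only, not Spark's actual RPC code): unrecognized messages go to a dead-letter sink and the caller gets a single definitive failure instead of retries:

{code:scala}
// Sketch: route unrecognized messages to a dead-letter sink and fail the caller once.
import scala.collection.mutable.ListBuffer

object DeadLetterSketch {
  case object RetrieveSparkProps
  case class Heartbeat(workerId: String)

  val deadLetters = ListBuffer[Any]()

  // Returns Right(reply) on success, or Left(error) exactly once for unsupported messages.
  def receiveAndReply(msg: Any): Either[String, Any] = msg match {
    case Heartbeat(id) => Right(s"ack $id")
    case unsupported =>
      deadLetters += unsupported                 // keep it around for inspection
      Left(s"Unsupported message $unsupported")  // definitive failure, no retry
  }

  def main(args: Array[String]): Unit = {
    println(receiveAndReply(Heartbeat("w-1")))    // Right(ack w-1)
    println(receiveAndReply(RetrieveSparkProps))  // Left(Unsupported message ...)
    println(deadLetters)                          // ListBuffer(RetrieveSparkProps)
  }
}
{code}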

> Unsupported message RpcMessage causes message retries
> -
>
> Key: SPARK-12237
> URL: https://issues.apache.org/jira/browse/SPARK-12237
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: Jacek Laskowski
>
> When an unsupported message is sent to an endpoint, Spark throws 
> {{org.apache.spark.SparkException}} and retries sending the message. It 
> should *not* since the message is unsupported.
> {code}
> WARN NettyRpcEndpointRef: Error sending message [message = 
> RetrieveSparkProps] in 1 attempts
> org.apache.spark.SparkException: Unsupported message 
> RpcMessage(localhost:51137,RetrieveSparkProps,org.apache.spark.rpc.netty.RemoteNettyRpcCallContext@c0a6275)
>  from localhost:51137
>   at 
> org.apache.spark.rpc.netty.Inbox$$anonfun$process$1$$anonfun$apply$mcV$sp$1.apply(Inbox.scala:105)
>   at 
> org.apache.spark.rpc.netty.Inbox$$anonfun$process$1$$anonfun$apply$mcV$sp$1.apply(Inbox.scala:104)
>   at 
> org.apache.spark.deploy.master.Master$$anonfun$receiveAndReply$1.applyOrElse(Master.scala:373)
>   at 
> org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:104)
>   at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:204)
>   at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
>   at 
> org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:215)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> WARN NettyRpcEndpointRef: Error sending message [message = 
> RetrieveSparkProps] in 2 attempts
> org.apache.spark.SparkException: Unsupported message 
> RpcMessage(localhost:51137,RetrieveSparkProps,org.apache.spark.rpc.netty.RemoteNettyRpcCallContext@73a76a5a)
>  from localhost:51137
>   at 
> org.apache.spark.rpc.netty.Inbox$$anonfun$process$1$$anonfun$apply$mcV$sp$1.apply(Inbox.scala:105)
>   at 
> org.apache.spark.rpc.netty.Inbox$$anonfun$process$1$$anonfun$apply$mcV$sp$1.apply(Inbox.scala:104)
>   at 
> org.apache.spark.deploy.master.Master$$anonfun$receiveAndReply$1.applyOrElse(Master.scala:373)
>   at 
> org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:104)
>   at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:204)
>   at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
>   at 
> org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:215)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> WARN NettyRpcEndpointRef: Error sending message [message = 
> RetrieveSparkProps] in 3 attempts
> org.apache.spark.SparkException: Unsupported message 
> RpcMessage(localhost:51137,RetrieveSparkProps,org.apache.spark.rpc.netty.RemoteNettyRpcCallContext@670bfda7)
>  from localhost:51137
>   at 
> org.apache.spark.rpc.netty.Inbox$$anonfun$process$1$$anonfun$apply$mcV$sp$1.apply(Inbox.scala:105)
>   at 
> org.apache.spark.rpc.netty.Inbox$$anonfun$process$1$$anonfun$apply$mcV$sp$1.apply(Inbox.scala:104)
>   at 
> org.apache.spark.deploy.master.Master$$anonfun$receiveAndReply$1.applyOrElse(Master.scala:373)
>   at 
> org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:104)
>   at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:204)
>   at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
>   at 
> org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:215)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Exception in thread "main" java.lang.reflect.UndeclaredThrowableException
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1672)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:68)
>   at 
> 

[jira] [Assigned] (SPARK-12250) Allow users to define a UDAF without providing details of its inputSchema

2015-12-10 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai reassigned SPARK-12250:


Assignee: Yin Huai

> Allow users to define a UDAF without providing details of its inputSchema
> -
>
> Key: SPARK-12250
> URL: https://issues.apache.org/jira/browse/SPARK-12250
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Yin Huai
> Fix For: 1.6.0
>
>
> Right now, users need to provide the exact inputSchema; otherwise, our 
> ScalaUDAF will fail because it checks the declared schema against the input arguments. 
> We should remove this check, because users commonly have a complex input schema 
> and do not want to spell it out in detail.
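
For context, a sketch of what defining a UDAF looks like today in Scala: the exact {{inputSchema}} must be spelled out up front, which is the requirement this ticket proposes to relax (the aggregate and field names below are only illustrative):

{code:scala}
import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

// Sketch of a UDAF as it must be written today; the schema check in ScalaUDAF is
// driven by the inputSchema declaration below.
class SumOfSquares extends UserDefinedAggregateFunction {
  override def inputSchema: StructType = StructType(StructField("value", DoubleType) :: Nil)
  override def bufferSchema: StructType = StructType(StructField("sum", DoubleType) :: Nil)
  override def dataType: DataType = DoubleType
  override def deterministic: Boolean = true

  override def initialize(buffer: MutableAggregationBuffer): Unit = buffer(0) = 0.0
  override def update(buffer: MutableAggregationBuffer, input: Row): Unit =
    if (!input.isNullAt(0)) buffer(0) = buffer.getDouble(0) + math.pow(input.getDouble(0), 2)
  override def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit =
    buffer1(0) = buffer1.getDouble(0) + buffer2.getDouble(0)
  override def evaluate(buffer: Row): Any = buffer.getDouble(0)
}
{code}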






[jira] [Updated] (SPARK-11959) Document normal equation solver for ordinary least squares in user guide

2015-12-10 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-11959:
--
Assignee: Yanbo Liang  (was: Xiangrui Meng)

> Document normal equation solver for ordinary least squares in user guide
> 
>
> Key: SPARK-11959
> URL: https://issues.apache.org/jira/browse/SPARK-11959
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, ML
>Reporter: Joseph K. Bradley
>Assignee: Yanbo Liang
>
> Assigning since you wrote the feature, but please reassign as needed.
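
For reference when writing the guide, the standard closed-form statement the normal-equation solver is based on (textbook material, not quoted from this ticket):

{code:none}
\min_{\beta} \lVert X\beta - y \rVert_2^2
\quad\Longrightarrow\quad
\hat{\beta} = (X^{\top} X)^{-1} X^{\top} y \quad \text{(when } X^{\top}X \text{ is invertible)}
{code}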






[jira] [Updated] (SPARK-12266) cannot handle postgis raster type

2015-12-10 Thread Severin Thaler (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Severin Thaler updated SPARK-12266:
---
Description: 
on server postgis server running with a table that has a column of type raster
on client running spark standalone application that pulls data from remote 
postgis server using spark-sql and jdbc driver:

{code:none}
val conf = new SparkConf().setAppName("Simple Application").setMaster("local")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
val jdbcDF = sqlContext.read.format("jdbc").options(Map("url" -> 
"jdbc:postgresql://postgishost/test?user=postgres=userpassword", 
"dbtable" -> "atlas")).load()
val rows = jdbcDF.take(3)
println(rows(0).toString())
println(rows(1).toString())
println(rows(2).toString())
{code}

when running this app i get:

{code:none}
java.sql.SQLException: Unsupported type 
  at 
org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.org$apache$spark$sql$execution$datasources$jdbc$JDBCRDD$$getCatalystType(JDBCRDD.scala:103)
  at 
org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$$anonfun$1.apply(JDBCRDD.scala:140)
  at 
org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$$anonfun$1.apply(JDBCRDD.scala:140)
  at scala.Option.getOrElse(Option.scala:121)
  at 
org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:139)
  at 
org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation.<init>(JDBCRelation.scala:91)
  at 
org.apache.spark.sql.execution.datasources.jdbc.DefaultSource.createRelation(DefaultSource.scala:60)
  at 
org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:125)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:114)
  at rdd.BasicRDDTests$$anonfun$1.apply$mcV$sp(BasicRDDTests.scala:65)
{code}


  was:
on server postgis server running with a table that has a column of type raster
on client running spark standalone application that pulls data from remote 
postgis server using spark-sql and jdbc driver:

{code:none}
val conf = new SparkConf().setAppName("Simple Application").setMaster("local")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
val jdbcDF = sqlContext.read.format("jdbc").options(Map("url" -> 
"jdbc:postgresql://postgishost/test?user=postgres=userpassword", 
"dbtable" -> "atlas")).load()
val rows = jdbcDF.take(3)
println(rows(0).toString())
println(rows(1).toString())
println(rows(2).toString())
{code}


> cannot handle postgis raster type
> -
>
> Key: SPARK-12266
> URL: https://issues.apache.org/jira/browse/SPARK-12266
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 1.5.2
> Environment: PostGIS Server:
> Ubuntu-14.04
> Postgres 9.4.5
> PostGIS 2.2.1
> Spark standalone client:
> Java 7
> Spark 1.5.2
> PostGIS JDBC driver built within java directory of PostGIS 2.1.8, see: 
> http://postgis.net/docs/manual-2.1/postgis_installation.html#idp58575520
>Reporter: Severin Thaler
>  Labels: dataframe, postgres, sql
>
> on server postgis server running with a table that has a column of type raster
> on client running spark standalone application that pulls data from remote 
> postgis server using spark-sql and jdbc driver:
> {code:none}
> val conf = new SparkConf().setAppName("Simple Application").setMaster("local")
> val sc = new SparkContext(conf)
> val sqlContext = new SQLContext(sc)
> val jdbcDF = sqlContext.read.format("jdbc").options(Map("url" -> 
> "jdbc:postgresql://postgishost/test?user=postgres=userpassword", 
> "dbtable" -> "atlas")).load()
> val rows = jdbcDF.take(3)
> println(rows(0).toString())
> println(rows(1).toString())
> println(rows(2).toString())
> {code}
> when running this app i get:
> {code:none}
> java.sql.SQLException: Unsupported type 
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.org$apache$spark$sql$execution$datasources$jdbc$JDBCRDD$$getCatalystType(JDBCRDD.scala:103)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$$anonfun$1.apply(JDBCRDD.scala:140)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$$anonfun$1.apply(JDBCRDD.scala:140)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:139)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation.<init>(JDBCRelation.scala:91)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.DefaultSource.createRelation(DefaultSource.scala:60)
>   at 
> org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:125)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:114)
>   at rdd.BasicRDDTests$$anonfun$1.apply$mcV$sp(BasicRDDTests.scala:65)
> {code}




[jira] [Updated] (SPARK-12266) cannot handle postgis raster type

2015-12-10 Thread Severin Thaler (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Severin Thaler updated SPARK-12266:
---
Description: 
on server postgis server running with a table that has a column of type raster
on client running spark standalone application that pulls data from remote 
postgis server using spark-sql and jdbc driver:

{code:none}
val conf = new SparkConf().setAppName("Simple Application").setMaster("local")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
val jdbcDF = sqlContext.read.format("jdbc").options(Map("url" -> 
"jdbc:postgresql://postgishost/test?user=postgres=userpassword", 
"dbtable" -> "atlas")).load()
val rows = jdbcDF.take(3)
println(rows(0).toString())
println(rows(1).toString())
println(rows(2).toString())
{code}

when running this app i get:

{code:java}
java.sql.SQLException: Unsupported type 
  at 
org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.org$apache$spark$sql$execution$datasources$jdbc$JDBCRDD$$getCatalystType(JDBCRDD.scala:103)
  at 
org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$$anonfun$1.apply(JDBCRDD.scala:140)
  at 
org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$$anonfun$1.apply(JDBCRDD.scala:140)
  at scala.Option.getOrElse(Option.scala:121)
  at 
org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:139)
  at 
org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation.<init>(JDBCRelation.scala:91)
  at 
org.apache.spark.sql.execution.datasources.jdbc.DefaultSource.createRelation(DefaultSource.scala:60)
  at 
org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:125)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:114)
  at rdd.BasicRDDTests$$anonfun$1.apply$mcV$sp(BasicRDDTests.scala:65)
{code}


  was:
on server postgis server running with a table that has a column of type raster
on client running spark standalone application that pulls data from remote 
postgis server using spark-sql and jdbc driver:

{code:none}
val conf = new SparkConf().setAppName("Simple Application").setMaster("local")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
val jdbcDF = sqlContext.read.format("jdbc").options(Map("url" -> 
"jdbc:postgresql://postgishost/test?user=postgres=userpassword", 
"dbtable" -> "atlas")).load()
val rows = jdbcDF.take(3)
println(rows(0).toString())
println(rows(1).toString())
println(rows(2).toString())
{code}

when running this app i get:

{code:none}
java.sql.SQLException: Unsupported type 
  at 
org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.org$apache$spark$sql$execution$datasources$jdbc$JDBCRDD$$getCatalystType(JDBCRDD.scala:103)
  at 
org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$$anonfun$1.apply(JDBCRDD.scala:140)
  at 
org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$$anonfun$1.apply(JDBCRDD.scala:140)
  at scala.Option.getOrElse(Option.scala:121)
  at 
org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:139)
  at 
org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation.<init>(JDBCRelation.scala:91)
  at 
org.apache.spark.sql.execution.datasources.jdbc.DefaultSource.createRelation(DefaultSource.scala:60)
  at 
org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:125)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:114)
  at rdd.BasicRDDTests$$anonfun$1.apply$mcV$sp(BasicRDDTests.scala:65)
{code}



> cannot handle postgis raster type
> -
>
> Key: SPARK-12266
> URL: https://issues.apache.org/jira/browse/SPARK-12266
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 1.5.2
> Environment: PostGIS Server:
> Ubuntu-14.04
> Postgres 9.4.5
> PostGIS 2.2.1
> Spark standalone client:
> Java 7
> Spark 1.5.2
> PostGIS JDBC driver built within java directory of PostGIS 2.1.8, see: 
> http://postgis.net/docs/manual-2.1/postgis_installation.html#idp58575520
>Reporter: Severin Thaler
>  Labels: dataframe, postgres, sql
>
> On the server, PostGIS is running with a table that has a column of type raster.
> On the client, a Spark standalone application pulls data from the remote 
> PostGIS server using spark-sql and the JDBC driver:
> {code:none}
> val conf = new SparkConf().setAppName("Simple Application").setMaster("local")
> val sc = new SparkContext(conf)
> val sqlContext = new SQLContext(sc)
> val jdbcDF = sqlContext.read.format("jdbc").options(Map("url" -> 
> "jdbc:postgresql://postgishost/test?user=postgres=userpassword", 
> "dbtable" -> "atlas")).load()
> val rows = jdbcDF.take(3)
> println(rows(0).toString())
> println(rows(1).toString())
> println(rows(2).toString())
> {code}
> When running this app I get:
> {code:java}
> 

[jira] [Commented] (SPARK-12266) cannot handle postgis raster type

2015-12-10 Thread Severin Thaler (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15051475#comment-15051475
 ] 

Severin Thaler commented on SPARK-12266:


Yes, I'm going to simplify the example and just use a simple JDBC client, not a 
Spark app. I should still see the issue show up there, and then I'll file a bug 
with the developers of the PostGIS JDBC driver. Thanks.
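
A minimal sketch of such a plain-JDBC check, using only standard java.sql 
metadata calls; the host, database, credentials and table name are placeholders 
taken from the report above:

{code}
import java.sql.DriverManager

// Standalone check, no Spark involved: list the JDBC type code the
// PostgreSQL/PostGIS driver reports for each column of the table. The raster
// column is the one Spark's JDBCRDD cannot map to a Catalyst type.
object RasterColumnTypeCheck {
  def main(args: Array[String]): Unit = {
    val conn = DriverManager.getConnection(
      "jdbc:postgresql://postgishost/test", "postgres", "userpassword")
    try {
      val cols = conn.getMetaData.getColumns(null, null, "atlas", null)
      while (cols.next()) {
        println(cols.getString("COLUMN_NAME") + ": " +
          cols.getInt("DATA_TYPE") + " (" + cols.getString("TYPE_NAME") + ")")
      }
    } finally {
      conn.close()
    }
  }
}
{code}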

> cannot handle postgis raster type
> -
>
> Key: SPARK-12266
> URL: https://issues.apache.org/jira/browse/SPARK-12266
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 1.5.2
> Environment: PostGIS Server:
> Ubuntu-14.04
> Postgres 9.4.5
> PostGIS 2.2.1
> Spark standalone client:
> Java 7
> Spark 1.5.2
> PostGIS JDBC driver built within java directory of PostGIS 2.1.8, see: 
> http://postgis.net/docs/manual-2.1/postgis_installation.html#idp58575520
>Reporter: Severin Thaler
>  Labels: dataframe, postgres, sql
>
> On the server, PostGIS is running with a table that has a column of type raster.
> On the client, a Spark standalone application pulls data from the PostGIS 
> server using spark-sql and the JDBC driver:
> {code:none}
> val conf = new SparkConf().setAppName("Simple Application").setMaster("local")
> val sc = new SparkContext(conf)
> val sqlContext = new SQLContext(sc)
> val jdbcDF = sqlContext.read.format("jdbc").options(Map("url" -> 
> "jdbc:postgresql://postgishost/test?user=postgres=userpassword", 
> "dbtable" -> "atlas")).load()
> val rows = jdbcDF.take(3)
> println(rows(0).toString())
> println(rows(1).toString())
> println(rows(2).toString())
> {code}
> When running this Spark app I get:
> {code:java}
> java.sql.SQLException: Unsupported type 
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.org$apache$spark$sql$execution$datasources$jdbc$JDBCRDD$$getCatalystType(JDBCRDD.scala:103)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$$anonfun$1.apply(JDBCRDD.scala:140)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$$anonfun$1.apply(JDBCRDD.scala:140)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:139)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation.(JDBCRelation.scala:91)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.DefaultSource.createRelation(DefaultSource.scala:60)
>   at 
> org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:125)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:114)
>   at rdd.BasicRDDTests$$anonfun$1.apply$mcV$sp(BasicRDDTests.scala:65)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2791) Fix committing, reverting and state tracking in shuffle file consolidation

2015-12-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15051530#comment-15051530
 ] 

Apache Spark commented on SPARK-2791:
-

User 'aarondav' has created a pull request for this issue:
https://github.com/apache/spark/pull/1678

> Fix committing, reverting and state tracking in shuffle file consolidation
> --
>
> Key: SPARK-2791
> URL: https://issues.apache.org/jira/browse/SPARK-2791
> Project: Spark
>  Issue Type: Sub-task
>  Components: Shuffle, Spark Core
>Reporter: Matei Zaharia
>Assignee: Mridul Muralidharan
> Fix For: 1.1.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12250) Allow users to define a UDAF without providing details of its inputSchema

2015-12-10 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-12250.
--
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 10236
[https://github.com/apache/spark/pull/10236]

> Allow users to define a UDAF without providing details of its inputSchema
> -
>
> Key: SPARK-12250
> URL: https://issues.apache.org/jira/browse/SPARK-12250
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
> Fix For: 1.6.0
>
>
> Right now, users need to provide the exact inputSchema. Otherwise, our 
> ScalaUDAF will fail because it tries to check the schema and input arguments. 
> We should remove this check because it is common that users may have a 
> complex input schema type and they do not want to provide the detailed schema.
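>
> A minimal sketch of what declaring the exact inputSchema looks like today with 
> the public UDAF API; the aggregate itself is illustrative:
> {code}
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
> import org.apache.spark.sql.types._
>
> // Illustrative UDAF: the point is the inputSchema declaration, which currently
> // must match the input arguments exactly.
> class DoubleSum extends UserDefinedAggregateFunction {
>   def inputSchema: StructType = StructType(StructField("value", DoubleType) :: Nil)
>   def bufferSchema: StructType = StructType(StructField("sum", DoubleType) :: Nil)
>   def dataType: DataType = DoubleType
>   def deterministic: Boolean = true
>   def initialize(buffer: MutableAggregationBuffer): Unit = buffer(0) = 0.0
>   def update(buffer: MutableAggregationBuffer, input: Row): Unit =
>     if (!input.isNullAt(0)) buffer(0) = buffer.getDouble(0) + input.getDouble(0)
>   def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit =
>     buffer1(0) = buffer1.getDouble(0) + buffer2.getDouble(0)
>   def evaluate(buffer: Row): Any = buffer.getDouble(0)
> }
> {code}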



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12228) Use in-memory for execution hive's derby metastore

2015-12-10 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-12228.
--
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 10204
[https://github.com/apache/spark/pull/10204]

> Use in-memory for execution hive's derby metastore
> --
>
> Key: SPARK-12228
> URL: https://issues.apache.org/jira/browse/SPARK-12228
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Yin Huai
> Fix For: 2.0.0
>
>
> Starting from Hive 0.13, the derby metastore can use an in-memory backend. 
> Since our execution hive is a fake metastore, if we use in-memory mode, we 
> can reduce the time spent creating the execution hive.
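>
> A rough sketch of the configuration involved; the property keys and the Derby 
> in-memory URL format are the standard Hive/Derby ones, while exactly where 
> Spark sets them for the execution Hive is internal and not shown here:
> {code}
> // Sketch only (illustrative database name): back the execution metastore with
> // Derby's in-memory subprotocol so nothing is written to disk at startup.
> val execMetastoreProps = Map(
>   "javax.jdo.option.ConnectionURL" ->
>     "jdbc:derby:memory:exec-hive-metastore;create=true",
>   "javax.jdo.option.ConnectionDriverName" -> "org.apache.derby.jdbc.EmbeddedDriver")
> {code}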



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12266) cannot handle postgis raster type

2015-12-10 Thread Severin Thaler (JIRA)
Severin Thaler created SPARK-12266:
--

 Summary: cannot handle postgis raster type
 Key: SPARK-12266
 URL: https://issues.apache.org/jira/browse/SPARK-12266
 Project: Spark
  Issue Type: Bug
  Components: Input/Output
Affects Versions: 1.5.2
 Environment: PostGIS Server:
Ubuntu-14.04
Postgres 9.4.5
PostGIS 2.2.1
Spark standalone client:
Java 7
Spark 1.5.2
PostGIS JDBC driver built within java directory of PostGIS 2.1.8
Reporter: Severin Thaler


On the server, PostGIS is running with a table that has a column of type raster.
On the client, a Spark standalone application pulls data from the remote 
PostGIS server using spark-sql and the JDBC driver.





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12266) cannot handle postgis raster type

2015-12-10 Thread Severin Thaler (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Severin Thaler updated SPARK-12266:
---
Description: 
On the server, PostGIS is running with a table that has a column of type raster.
On the client, a Spark standalone application pulls data from the remote 
PostGIS server using spark-sql and the JDBC driver:

val conf = new SparkConf().setAppName("Simple Application").setMaster("local")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
val jdbcDF = sqlContext.read.format("jdbc").options(
  Map("url" -> 
"jdbc:postgresql://postgishost/test?user=postgres=userpassword",
"dbtable" -> "atlas")).load()

val rows = jdbcDF.take(3)
println(rows(0).toString())
println(rows(1).toString())
println(rows(2).toString())

  was:
on server postgis server running with a table that has a column of type raster
on client running spark standalone application that pulls data from remote 
postgis server using spark-sql and jdbc driver




> cannot handle postgis raster type
> -
>
> Key: SPARK-12266
> URL: https://issues.apache.org/jira/browse/SPARK-12266
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 1.5.2
> Environment: PostGIS Server:
> Ubuntu-14.04
> Postgres 9.4.5
> PostGIS 2.2.1
> Spark standalone client:
> Java 7
> Spark 1.5.2
> PostGIS JDBC driver built within java directory of PostGIS 2.1.8, see: 
> http://postgis.net/docs/manual-2.1/postgis_installation.html#idp58575520
>Reporter: Severin Thaler
>  Labels: dataframe, postgres, sql
>
> On the server, PostGIS is running with a table that has a column of type raster.
> On the client, a Spark standalone application pulls data from the remote 
> PostGIS server using spark-sql and the JDBC driver:
> val conf = new SparkConf().setAppName("Simple Application").setMaster("local")
> val sc = new SparkContext(conf)
> val sqlContext = new SQLContext(sc)
> val jdbcDF = sqlContext.read.format("jdbc").options(
>   Map("url" -> 
> "jdbc:postgresql://postgishost/test?user=postgres=userpassword",
> "dbtable" -> "atlas")).load()
> 
> val rows = jdbcDF.take(3)
> println(rows(0).toString())
> println(rows(1).toString())
> println(rows(2).toString())



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-12258) Hive Timestamp UDF is binded with '1969-12-31 15:59:59.999999' for null value

2015-12-10 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-12258:

Comment: was deleted

(was: [~cloud_fan] It sounds like it is related to the PR 
https://github.com/apache/spark/pull/9770. We just need to include the 
timestamp, bigDecimal and date in the filter, since they are not primitive 
types. 

I just tried it. It works! If you do not mind, can I try to fix it? Thanks!)

> Hive Timestamp UDF is binded with '1969-12-31 15:59:59.99' for null value
> -
>
> Key: SPARK-12258
> URL: https://issues.apache.org/jira/browse/SPARK-12258
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Ian
>
> {code}
>   test("Timestamp UDF and Null value") {
> hiveContext.runSqlHive("CREATE TABLE ts_test (ts TIMESTAMP) STORED AS 
> TEXTFILE")
> hiveContext.runSqlHive("INSERT INTO TABLE ts_test VALUES(Null)")
> hiveContext.udf.register("dummy",
>   (ts: Timestamp) => ts
> )
> val result = hiveContext.sql("SELECT dummy(ts) FROM 
> ts_test").collect().mkString("\n")
> assertResult("[null]")(result)
>   }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8162) Run spark-shell cause NullPointerException

2015-12-10 Thread Michael Han (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15050327#comment-15050327
 ] 

Michael Han commented on SPARK-8162:


Same as Aliaksei: I just got the same problem with spark-1.5.2-bin-hadoop2.6 on 
Win7.

> Run spark-shell cause NullPointerException
> --
>
> Key: SPARK-8162
> URL: https://issues.apache.org/jira/browse/SPARK-8162
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Spark Shell
>Affects Versions: 1.4.1, 1.5.0
>Reporter: Weizhong
>Assignee: Andrew Or
>Priority: Blocker
> Fix For: 1.4.1, 1.5.0
>
>
> run spark-shell on the latest master branch; it failed, details are:
> {noformat}
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 1.5.0-SNAPSHOT
>   /_/
> Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_40)
> Type in expressions to have them evaluated.
> Type :help for more information.
> error: error while loading JobProgressListener, Missing dependency 'bad 
> symbolic reference. A signature in JobProgressListener.class refers to term 
> annotations
> in package com.google.common which is not available.
> It may be completely missing from the current classpath, or the version on
> the classpath might be incompatible with the version used when compiling 
> JobProgressListener.class.', required by 
> /opt/apache/spark/lib/spark-assembly-1.5.0-SNAPSHOT-hadoop2.7.0.jar(org/apache/spark/ui/jobs/JobProgressListener.class)
> java.lang.NullPointerException
>   at org.apache.spark.sql.SQLContext.(SQLContext.scala:193)
>   at org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:68)
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
>   at 
> org.apache.spark.repl.SparkILoop.createSQLContext(SparkILoop.scala:1028)
>   at $iwC$$iwC.(:9)
>   at $iwC.(:18)
>   at (:20)
>   at .(:24)
>   at .()
>   at .(:7)
>   at .()
>   at $print()
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:497)
>   at 
> org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
>   at 
> org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1338)
>   at 
> org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
>   at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
>   at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
>   at 
> org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857)
>   at 
> org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
>   at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814)
>   at 
> org.apache.spark.repl.SparkILoopInit$$anonfun$initializeSpark$1.apply(SparkILoopInit.scala:130)
>   at 
> org.apache.spark.repl.SparkILoopInit$$anonfun$initializeSpark$1.apply(SparkILoopInit.scala:122)
>   at org.apache.spark.repl.SparkIMain.beQuietDuring(SparkIMain.scala:324)
>   at 
> org.apache.spark.repl.SparkILoopInit$class.initializeSpark(SparkILoopInit.scala:122)
>   at org.apache.spark.repl.SparkILoop.initializeSpark(SparkILoop.scala:64)
>   at 
> org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1$$anonfun$apply$mcZ$sp$5.apply$mcV$sp(SparkILoop.scala:974)
>   at 
> org.apache.spark.repl.SparkILoopInit$class.runThunks(SparkILoopInit.scala:157)
>   at org.apache.spark.repl.SparkILoop.runThunks(SparkILoop.scala:64)
>   at 
> org.apache.spark.repl.SparkILoopInit$class.postInitialization(SparkILoopInit.scala:106)
>   at 
> org.apache.spark.repl.SparkILoop.postInitialization(SparkILoop.scala:64)
>   at 
> org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:991)
>   at 
> org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
>   at 
> org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
>   at 
> 

[jira] [Commented] (SPARK-12231) Failed to generate predicate Error when using dropna

2015-12-10 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15050357#comment-15050357
 ] 

Liang-Chi Hsieh commented on SPARK-12231:
-

I have opened a PR (https://github.com/apache/spark/pull/10251) for this bug.

> Failed to generate predicate Error when using dropna
> 
>
> Key: SPARK-12231
> URL: https://issues.apache.org/jira/browse/SPARK-12231
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.5.2, 1.6.0
> Environment: python version: 2.7.9
> os: ubuntu 14.04
>Reporter: yahsuan, chang
>
> code to reproduce error
> # write.py
> {code}
> import pyspark
> sc = pyspark.SparkContext()
> sqlc = pyspark.SQLContext(sc)
> df = sqlc.range(10)
> df1 = df.withColumn('a', df['id'] * 2)
> df1.write.partitionBy('id').parquet('./data')
> {code}
> # read.py
> {code}
> import pyspark
> sc = pyspark.SparkContext()
> sqlc = pyspark.SQLContext(sc)
> df2 = sqlc.read.parquet('./data')
> df2.dropna().count()
> {code}
> $ spark-submit write.py
> $ spark-submit read.py
> # error message
> {code}
> 15/12/08 17:20:34 ERROR Filter: Failed to generate predicate, fallback to 
> interpreted org.apache.spark.sql.catalyst.errors.package$TreeNodeException: 
> Binding attribute, tree: a#0L
> ...
> {code}
> If the data is written without partitionBy, the error does not happen



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12247) Documentation for spark.ml's ALS and collaborative filtering in general

2015-12-10 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15050420#comment-15050420
 ] 

Sean Owen commented on SPARK-12247:
---

[~timhunter] no thank you, go ahead

> Documentation for spark.ml's ALS and collaborative filtering in general
> ---
>
> Key: SPARK-12247
> URL: https://issues.apache.org/jira/browse/SPARK-12247
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, MLlib
>Affects Versions: 1.5.2
>Reporter: Timothy Hunter
>
> We need to add a section in the documentation about collaborative filtering 
> in the dataframe API:
>  - copy explanations about collaborative filtering and ALS from spark.mllib
>  - provide an example with spark.ml's ALS



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12261) pyspark crash for large dataset

2015-12-10 Thread zihao (JIRA)
zihao created SPARK-12261:
-

 Summary: pyspark crash for large dataset
 Key: SPARK-12261
 URL: https://issues.apache.org/jira/browse/SPARK-12261
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.5.2
 Environment: windows
Reporter: zihao


I tried to import a local text file (over 100 MB) via textFile in pyspark; when I 
ran data.take(), it failed and gave error messages including:
15/12/10 17:17:43 ERROR TaskSetManager: Task 0 in stage 0.0 failed 1 times; 
aborting job
Traceback (most recent call last):
  File "E:/spark_python/test3.py", line 9, in 
lines.take(5)
  File "D:\spark\spark-1.5.2-bin-hadoop2.6\python\pyspark\rdd.py", line 1299, 
in take
res = self.context.runJob(self, takeUpToNumLeft, p)
  File "D:\spark\spark-1.5.2-bin-hadoop2.6\python\pyspark\context.py", line 
916, in runJob
port = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, 
partitions)
  File "C:\Anaconda2\lib\site-packages\py4j\java_gateway.py", line 813, in 
__call__
answer, self.gateway_client, self.target_id, self.name)
  File "D:\spark\spark-1.5.2-bin-hadoop2.6\python\pyspark\sql\utils.py", line 
36, in deco
return f(*a, **kw)
  File "C:\Anaconda2\lib\site-packages\py4j\protocol.py", line 308, in 
get_return_value
format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling 
z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 
0, localhost): java.net.SocketException: Connection reset by peer: socket write 
error

Then I ran the same code on a small text file, and this time .take() worked fine.
How can I solve this problem?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10364) Support Parquet logical type TIMESTAMP_MILLIS

2015-12-10 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-10364:
---
Description: 
The {{TimestampType}} in Spark SQL is of microsecond precision. Ideally, we 
should convert Spark SQL timestamp values into Parquet {{TIMESTAMP_MICROS}}. 
But unfortunately parquet-mr hasn't supported it yet.

For the read path, we should be able to read {{TIMESTAMP_MILLIS}} Parquet 
values and pad a zero microsecond part onto the values read.

For the write path, currently we are writing timestamps as {{INT96}}, similar 
to Impala and Hive. One alternative is that, we can have a separate SQL option 
to let users be able to write Spark SQL timestamp values as 
{{TIMESTAMP_MILLIS}}. Of course, in this way the microsecond part will be 
truncated.
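
A tiny sketch of the precision mapping being described; the helper names are 
illustrative:

{code}
// Read path: a TIMESTAMP_MILLIS value gains a zero microsecond part.
def millisToMicros(millis: Long): Long = millis * 1000L

// Write path (if the proposed option is enabled): the microsecond part is dropped.
def microsToMillis(micros: Long): Long = micros / 1000L
{code}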

> Support Parquet logical type TIMESTAMP_MILLIS
> -
>
> Key: SPARK-10364
> URL: https://issues.apache.org/jira/browse/SPARK-10364
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Cheng Lian
>
> The {{TimestampType}} in Spark SQL is of microsecond precision. Ideally, we 
> should convert Spark SQL timestamp values into Parquet {{TIMESTAMP_MICROS}}. 
> But unfortunately parquet-mr hasn't supported it yet.
> For the read path, we should be able to read {{TIMESTAMP_MILLIS}} Parquet 
> values and pad a zero microsecond part onto the values read.
> For the write path, currently we are writing timestamps as {{INT96}}, similar 
> to Impala and Hive. One alternative is that, we can have a separate SQL 
> option to let users be able to write Spark SQL timestamp values as 
> {{TIMESTAMP_MILLIS}}. Of course, in this way the microsecond part will be 
> truncated.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8743) Deregister Codahale metrics for streaming when StreamingContext is closed

2015-12-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15050296#comment-15050296
 ] 

Apache Spark commented on SPARK-8743:
-

User 'nssalian' has created a pull request for this issue:
https://github.com/apache/spark/pull/7249

> Deregister Codahale metrics for streaming when StreamingContext is closed 
> --
>
> Key: SPARK-8743
> URL: https://issues.apache.org/jira/browse/SPARK-8743
> Project: Spark
>  Issue Type: Sub-task
>  Components: Streaming
>Affects Versions: 1.4.1
>Reporter: Tathagata Das
>Assignee: Neelesh Srinivas Salian
>  Labels: starter
> Fix For: 1.4.2, 1.5.0
>
>
> Currently, when the StreamingContext is closed, the registered metrics are 
> not deregistered. If another streaming context is started, it throws a 
> warning saying that the metrics are already registered. 
> The solution is to deregister the metrics when the StreamingContext is stopped.
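>
> A minimal sketch of the deregistration idea, assuming direct access to the 
> Codahale registry and a name prefix for the streaming source (both 
> illustrative; Spark's own MetricsSystem wiring is not shown):
> {code}
> import com.codahale.metrics.{Metric, MetricFilter, MetricRegistry}
>
> // Remove every metric registered under the stopped StreamingContext's source,
> // so a new context can register the same names without warnings.
> def removeStreamingMetrics(registry: MetricRegistry, sourcePrefix: String): Unit = {
>   registry.removeMatching(new MetricFilter {
>     override def matches(name: String, metric: Metric): Boolean =
>       name.startsWith(sourcePrefix)
>   })
> }
> {code}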



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8657) Fail to upload conf archive to viewfs

2015-12-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15050295#comment-15050295
 ] 

Apache Spark commented on SPARK-8657:
-

User 'litao-buptsse' has created a pull request for this issue:
https://github.com/apache/spark/pull/7055

> Fail to upload conf archive to viewfs
> -
>
> Key: SPARK-8657
> URL: https://issues.apache.org/jira/browse/SPARK-8657
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.4.0
> Environment: spark-1.4.2 & hadoop-2.5.0-cdh5.3.2
>Reporter: Tao Li
>Assignee: Tao Li
>Priority: Minor
>  Labels: distributed_cache, viewfs
> Fix For: 1.4.1, 1.5.0
>
>
> When I run in spark-1.4 yarn-client mode, it throws the following Exception 
> when trying to upload the conf archive to viewfs:
> 15/06/26 17:56:37 INFO yarn.Client: Uploading resource 
> file:/tmp/spark-095ec3d2-5dad-468c-8d46-2c813457404d/__hadoop_conf__8436284925771788661
> .zip -> 
> viewfs://nsX/user/ultraman/.sparkStaging/application_1434370929997_191242/__hadoop_conf__8436284925771788661.zip
> 15/06/26 17:56:38 INFO yarn.Client: Deleting staging directory 
> .sparkStaging/application_1434370929997_191242
> 15/06/26 17:56:38 ERROR spark.SparkContext: Error initializing SparkContext.
> java.lang.IllegalArgumentException: Wrong FS: 
> hdfs://SunshineNameNode2:8020/user/ultraman/.sparkStaging/application_1434370929997_191242/__had
> oop_conf__8436284925771788661.zip, expected: viewfs://nsX/
> at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:645)
> at 
> org.apache.hadoop.fs.viewfs.ViewFileSystem.getUriPath(ViewFileSystem.java:117)
> at 
> org.apache.hadoop.fs.viewfs.ViewFileSystem.getFileStatus(ViewFileSystem.java:346)
> at 
> org.apache.spark.deploy.yarn.ClientDistributedCacheManager.addResource(ClientDistributedCacheManager.scala:67)
> at 
> org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$5.apply(Client.scala:341)
> at 
> org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$5.apply(Client.scala:338)
> at scala.Option.foreach(Option.scala:236)
> at 
> org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:338)
> at 
> org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:559)
> at 
> org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:115)
> at 
> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:58)
> at 
> org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:141)
> at org.apache.spark.SparkContext.(SparkContext.scala:497)
> at 
> org.apache.spark.repl.SparkILoop.createSparkContext(SparkILoop.scala:1017)
> at $line3.$read$$iwC$$iwC.(:9)
> at $line3.$read$$iwC.(:18)
> at $line3.$read.(:20)
> at $line3.$read$.(:24)
> at $line3.$read$.()
> at $line3.$eval$.(:7)
> at $line3.$eval$.()
> at $line3.$eval.$print()
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at 
> org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
> at 
> org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1338)
> at 
> org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
> The bug is easy to fix: we should pass the correct file system object to 
> addResource. A similar issue is 
> https://github.com/apache/spark/pull/1483. I will attach my bug fix PR very 
> soon.
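>
> A minimal sketch of that fix idea, assuming the destination path and the 
> Hadoop configuration are in hand (names are illustrative):
> {code}
> import org.apache.hadoop.conf.Configuration
> import org.apache.hadoop.fs.{FileStatus, Path}
>
> // Resolve the FileSystem from the destination path itself, so a viewfs:// path
> // is handled by the viewfs FileSystem rather than the default hdfs:// one.
> def statusOfUploadedResource(destPath: Path, hadoopConf: Configuration): FileStatus = {
>   val destFs = destPath.getFileSystem(hadoopConf)
>   destFs.getFileStatus(destPath)
> }
> {code}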



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12262) describe extended doesn't return detailed info on table stored as PARQUET format

2015-12-10 Thread pin_zhang (JIRA)
pin_zhang created SPARK-12262:
-

 Summary: describe extended doesn't return detailed info on a 
table stored as PARQUET format
 Key: SPARK-12262
 URL: https://issues.apache.org/jira/browse/SPARK-12262
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.2
Reporter: pin_zhang


1. start the hive server with start-thriftserver.sh
2. create table table1 (id int);
create table table2 (id int) STORED AS PARQUET;
3. describe extended table1;
returns detailed info
4. describe extended table2;
the result has no detailed info





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12260) Graceful Shutdown with In-Memory State

2015-12-10 Thread Mao, Wei (JIRA)
Mao, Wei created SPARK-12260:


 Summary: Graceful Shutdown with In-Memory State
 Key: SPARK-12260
 URL: https://issues.apache.org/jira/browse/SPARK-12260
 Project: Spark
  Issue Type: New Feature
  Components: Streaming
Reporter: Mao, Wei


Users often stop and restart their streaming jobs for tasks such as 
maintenance, software upgrades or even application logic updates. When a job 
restarts, it should pick up where it left off, i.e. any state information that 
existed when the job stopped should be used as the initial state when the job 
restarts.
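
For reference, a minimal sketch of the graceful-stop call that exists today; the 
proposed in-memory state handoff itself is not shown here:

{code}
import org.apache.spark.streaming.StreamingContext

// Finish processing the data already received before shutting down, then also
// stop the underlying SparkContext.
def shutDown(ssc: StreamingContext): Unit = {
  ssc.stop(stopSparkContext = true, stopGracefully = true)
}
{code}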



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7751) Add @Since annotation to stable and experimental methods in MLlib

2015-12-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15050289#comment-15050289
 ] 

Apache Spark commented on SPARK-7751:
-

User 'petz2000' has created a pull request for this issue:
https://github.com/apache/spark/pull/7370

> Add @Since annotation to stable and experimental methods in MLlib
> -
>
> Key: SPARK-7751
> URL: https://issues.apache.org/jira/browse/SPARK-7751
> Project: Spark
>  Issue Type: Umbrella
>  Components: Documentation, MLlib
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Minor
>  Labels: starter
>
> This is useful to check whether a feature exists in some version of Spark. 
> This is an umbrella JIRA to track the progress. We want to have -@since tag- 
> @Since annotation for both stable (those without any 
> Experimental/DeveloperApi/AlphaComponent annotations) and experimental 
> methods in MLlib:
> (Do NOT tag private or package private classes or methods, nor local 
> variables and methods.)
> * an example PR for Scala: https://github.com/apache/spark/pull/8309
> We need to dig through the git commit history to figure out which Spark 
> version a method was first introduced in. Take `NaiveBayes.setModelType` as 
> an example. We can grep `def setModelType` at different version git tags.
> {code}
> meng@xm:~/src/spark
> $ git show 
> v1.3.0:mllib/src/main/scala/org/apache/spark/mllib/classification/NaiveBayes.scala
>  | grep "def setModelType"
> meng@xm:~/src/spark
> $ git show 
> v1.4.0:mllib/src/main/scala/org/apache/spark/mllib/classification/NaiveBayes.scala
>  | grep "def setModelType"
>   def setModelType(modelType: String): NaiveBayes = {
> {code}
> If there are better ways, please let us know.
> We cannot add all -@since tags- @Since annotation in a single PR, which is 
> hard to review. So we made some subtasks for each package, for example 
> `org.apache.spark.classification`. Feel free to add more sub-tasks for Python 
> and the `spark.ml` package.
> Plan:
> 1. In 1.5, we try to add @Since annotation to all stable/experimental methods 
> under `spark.mllib`.
> 2. Starting from 1.6, we require @Since annotation in all new PRs.
> 3. In 1.6, we try to add @Since annotation to all stable/experimental methods 
> under `spark.ml`, `pyspark.mllib`, and `pyspark.ml`.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12258) Hive Timestamp UDF is binded with '1969-12-31 15:59:59.999999' for null value

2015-12-10 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15050398#comment-15050398
 ] 

Xiao Li commented on SPARK-12258:
-

A PR has been submitted. Thanks


> Hive Timestamp UDF is binded with '1969-12-31 15:59:59.99' for null value
> -
>
> Key: SPARK-12258
> URL: https://issues.apache.org/jira/browse/SPARK-12258
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Ian
>
> {code}
>   test("Timestamp UDF and Null value") {
> hiveContext.runSqlHive("CREATE TABLE ts_test (ts TIMESTAMP) STORED AS 
> TEXTFILE")
> hiveContext.runSqlHive("INSERT INTO TABLE ts_test VALUES(Null)")
> hiveContext.udf.register("dummy",
>   (ts: Timestamp) => ts
> )
> val result = hiveContext.sql("SELECT dummy(ts) FROM 
> ts_test").collect().mkString("\n")
> assertResult("[null]")(result)
>   }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12260) Graceful Shutdown with In-Memory State

2015-12-10 Thread Mao, Wei (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15050282#comment-15050282
 ] 

Mao, Wei commented on SPARK-12260:
--

Here is the design doc: 
https://docs.google.com/document/d/1JS9W370hNTUCtwHLa8WiKRuOuIbo8A-UwYBdbK7WGg8/edit?usp=sharing

> Graceful Shutdown with In-Memory State
> --
>
> Key: SPARK-12260
> URL: https://issues.apache.org/jira/browse/SPARK-12260
> Project: Spark
>  Issue Type: New Feature
>  Components: Streaming
>Reporter: Mao, Wei
>  Labels: streaming
>
> Users often stop and restart their streaming jobs for tasks such as 
> maintenance, software upgrades or even application logic updates. When a job 
> re-starts it should pick up where it left off i.e. any state information that 
> existed when the job stopped should be used as the initial state when the job 
> restarts.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10366) Support Parquet logical type DATE

2015-12-10 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian resolved SPARK-10366.

Resolution: Fixed

Actually this has already been implemented since at least 1.4.

> Support Parquet logical type DATE
> -
>
> Key: SPARK-10366
> URL: https://issues.apache.org/jira/browse/SPARK-10366
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Cheng Lian
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8657) Fail to upload conf archive to viewfs

2015-12-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15050473#comment-15050473
 ] 

Apache Spark commented on SPARK-8657:
-

User 'litao-buptsse' has created a pull request for this issue:
https://github.com/apache/spark/pull/7053

> Fail to upload conf archive to viewfs
> -
>
> Key: SPARK-8657
> URL: https://issues.apache.org/jira/browse/SPARK-8657
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.4.0
> Environment: spark-1.4.2 & hadoop-2.5.0-cdh5.3.2
>Reporter: Tao Li
>Assignee: Tao Li
>Priority: Minor
>  Labels: distributed_cache, viewfs
> Fix For: 1.4.1, 1.5.0
>
>
> When I run in spark-1.4 yarn-client mode, it throws the following Exception 
> when trying to upload the conf archive to viewfs:
> 15/06/26 17:56:37 INFO yarn.Client: Uploading resource 
> file:/tmp/spark-095ec3d2-5dad-468c-8d46-2c813457404d/__hadoop_conf__8436284925771788661
> .zip -> 
> viewfs://nsX/user/ultraman/.sparkStaging/application_1434370929997_191242/__hadoop_conf__8436284925771788661.zip
> 15/06/26 17:56:38 INFO yarn.Client: Deleting staging directory 
> .sparkStaging/application_1434370929997_191242
> 15/06/26 17:56:38 ERROR spark.SparkContext: Error initializing SparkContext.
> java.lang.IllegalArgumentException: Wrong FS: 
> hdfs://SunshineNameNode2:8020/user/ultraman/.sparkStaging/application_1434370929997_191242/__had
> oop_conf__8436284925771788661.zip, expected: viewfs://nsX/
> at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:645)
> at 
> org.apache.hadoop.fs.viewfs.ViewFileSystem.getUriPath(ViewFileSystem.java:117)
> at 
> org.apache.hadoop.fs.viewfs.ViewFileSystem.getFileStatus(ViewFileSystem.java:346)
> at 
> org.apache.spark.deploy.yarn.ClientDistributedCacheManager.addResource(ClientDistributedCacheManager.scala:67)
> at 
> org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$5.apply(Client.scala:341)
> at 
> org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$5.apply(Client.scala:338)
> at scala.Option.foreach(Option.scala:236)
> at 
> org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:338)
> at 
> org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:559)
> at 
> org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:115)
> at 
> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:58)
> at 
> org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:141)
> at org.apache.spark.SparkContext.(SparkContext.scala:497)
> at 
> org.apache.spark.repl.SparkILoop.createSparkContext(SparkILoop.scala:1017)
> at $line3.$read$$iwC$$iwC.(:9)
> at $line3.$read$$iwC.(:18)
> at $line3.$read.(:20)
> at $line3.$read$.(:24)
> at $line3.$read$.()
> at $line3.$eval$.(:7)
> at $line3.$eval$.()
> at $line3.$eval.$print()
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at 
> org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
> at 
> org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1338)
> at 
> org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
> The bug is easy to fix: we should pass the correct file system object to 
> addResource. A similar issue is 
> https://github.com/apache/spark/pull/1483. I will attach my bug fix PR very 
> soon.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8657) Fail to upload conf archive to viewfs

2015-12-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15050471#comment-15050471
 ] 

Apache Spark commented on SPARK-8657:
-

User 'litao-buptsse' has created a pull request for this issue:
https://github.com/apache/spark/pull/7042

> Fail to upload conf archive to viewfs
> -
>
> Key: SPARK-8657
> URL: https://issues.apache.org/jira/browse/SPARK-8657
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.4.0
> Environment: spark-1.4.2 & hadoop-2.5.0-cdh5.3.2
>Reporter: Tao Li
>Assignee: Tao Li
>Priority: Minor
>  Labels: distributed_cache, viewfs
> Fix For: 1.4.1, 1.5.0
>
>
> When I run in spark-1.4 yarn-client mode, it throws the following Exception 
> when trying to upload the conf archive to viewfs:
> 15/06/26 17:56:37 INFO yarn.Client: Uploading resource 
> file:/tmp/spark-095ec3d2-5dad-468c-8d46-2c813457404d/__hadoop_conf__8436284925771788661
> .zip -> 
> viewfs://nsX/user/ultraman/.sparkStaging/application_1434370929997_191242/__hadoop_conf__8436284925771788661.zip
> 15/06/26 17:56:38 INFO yarn.Client: Deleting staging directory 
> .sparkStaging/application_1434370929997_191242
> 15/06/26 17:56:38 ERROR spark.SparkContext: Error initializing SparkContext.
> java.lang.IllegalArgumentException: Wrong FS: 
> hdfs://SunshineNameNode2:8020/user/ultraman/.sparkStaging/application_1434370929997_191242/__had
> oop_conf__8436284925771788661.zip, expected: viewfs://nsX/
> at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:645)
> at 
> org.apache.hadoop.fs.viewfs.ViewFileSystem.getUriPath(ViewFileSystem.java:117)
> at 
> org.apache.hadoop.fs.viewfs.ViewFileSystem.getFileStatus(ViewFileSystem.java:346)
> at 
> org.apache.spark.deploy.yarn.ClientDistributedCacheManager.addResource(ClientDistributedCacheManager.scala:67)
> at 
> org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$5.apply(Client.scala:341)
> at 
> org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$5.apply(Client.scala:338)
> at scala.Option.foreach(Option.scala:236)
> at 
> org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:338)
> at 
> org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:559)
> at 
> org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:115)
> at 
> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:58)
> at 
> org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:141)
> at org.apache.spark.SparkContext.(SparkContext.scala:497)
> at 
> org.apache.spark.repl.SparkILoop.createSparkContext(SparkILoop.scala:1017)
> at $line3.$read$$iwC$$iwC.(:9)
> at $line3.$read$$iwC.(:18)
> at $line3.$read.(:20)
> at $line3.$read$.(:24)
> at $line3.$read$.()
> at $line3.$eval$.(:7)
> at $line3.$eval$.()
> at $line3.$eval.$print()
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at 
> org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
> at 
> org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1338)
> at 
> org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
> The bug is easy to fix: we should pass the correct file system object to 
> addResource. A similar issue is 
> https://github.com/apache/spark/pull/1483. I will attach my bug fix PR very 
> soon.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8657) Fail to upload conf archive to viewfs

2015-12-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15050470#comment-15050470
 ] 

Apache Spark commented on SPARK-8657:
-

User 'litao-buptsse' has created a pull request for this issue:
https://github.com/apache/spark/pull/7041

> Fail to upload conf archive to viewfs
> -
>
> Key: SPARK-8657
> URL: https://issues.apache.org/jira/browse/SPARK-8657
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.4.0
> Environment: spark-1.4.2 & hadoop-2.5.0-cdh5.3.2
>Reporter: Tao Li
>Assignee: Tao Li
>Priority: Minor
>  Labels: distributed_cache, viewfs
> Fix For: 1.4.1, 1.5.0
>
>
> When I run in spark-1.4 yarn-client mode, it throws the following Exception 
> when trying to upload the conf archive to viewfs:
> 15/06/26 17:56:37 INFO yarn.Client: Uploading resource 
> file:/tmp/spark-095ec3d2-5dad-468c-8d46-2c813457404d/__hadoop_conf__8436284925771788661
> .zip -> 
> viewfs://nsX/user/ultraman/.sparkStaging/application_1434370929997_191242/__hadoop_conf__8436284925771788661.zip
> 15/06/26 17:56:38 INFO yarn.Client: Deleting staging directory 
> .sparkStaging/application_1434370929997_191242
> 15/06/26 17:56:38 ERROR spark.SparkContext: Error initializing SparkContext.
> java.lang.IllegalArgumentException: Wrong FS: 
> hdfs://SunshineNameNode2:8020/user/ultraman/.sparkStaging/application_1434370929997_191242/__had
> oop_conf__8436284925771788661.zip, expected: viewfs://nsX/
> at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:645)
> at 
> org.apache.hadoop.fs.viewfs.ViewFileSystem.getUriPath(ViewFileSystem.java:117)
> at 
> org.apache.hadoop.fs.viewfs.ViewFileSystem.getFileStatus(ViewFileSystem.java:346)
> at 
> org.apache.spark.deploy.yarn.ClientDistributedCacheManager.addResource(ClientDistributedCacheManager.scala:67)
> at 
> org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$5.apply(Client.scala:341)
> at 
> org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$5.apply(Client.scala:338)
> at scala.Option.foreach(Option.scala:236)
> at 
> org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:338)
> at 
> org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:559)
> at 
> org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:115)
> at 
> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:58)
> at 
> org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:141)
> at org.apache.spark.SparkContext.(SparkContext.scala:497)
> at 
> org.apache.spark.repl.SparkILoop.createSparkContext(SparkILoop.scala:1017)
> at $line3.$read$$iwC$$iwC.(:9)
> at $line3.$read$$iwC.(:18)
> at $line3.$read.(:20)
> at $line3.$read$.(:24)
> at $line3.$read$.()
> at $line3.$eval$.(:7)
> at $line3.$eval$.()
> at $line3.$eval.$print()
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at 
> org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
> at 
> org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1338)
> at 
> org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
> The bug is easy to fix: we should pass the correct file system object to 
> addResource. A similar issue is 
> https://github.com/apache/spark/pull/1483. I will attach my bug fix PR very 
> soon.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12257) Non partitioned insert into a partitioned Hive table doesn't fail

2015-12-10 Thread Dilip Biswal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15050319#comment-15050319
 ] 

Dilip Biswal commented on SPARK-12257:
--

Was able to reproduce this issue. Looking into it.

> Non partitioned insert into a partitioned Hive table doesn't fail
> -
>
> Key: SPARK-12257
> URL: https://issues.apache.org/jira/browse/SPARK-12257
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Mark Grover
>Priority: Minor
>
> I am using Spark 1.5.1 but I anticipate this to be a problem with master as 
> well (will check later).
> I have a dataframe, and a partitioned Hive table that I want to insert the 
> contents of the data frame into.
> Let's say mytable is a non-partitioned Hive table and mytable_partitioned is 
> a partitioned Hive table. In Hive, if you try to insert from the 
> non-partitioned mytable table into mytable_partitioned without specifying the 
> partition, the query fails, as expected:
> {quote}
> INSERT INTO mytable_partitioned SELECT * FROM mytable;
> {quote}
> Error: Error while compiling statement: FAILED: SemanticException 1:12 Need 
> to specify partition columns because the destination table is partitioned. 
> Error encountered near token 'mytable_partitioned' (state=42000,code=4)
> {quote}
> However, if I do the same in Spark SQL:
> {code}
> val myDfTempTable = myDf.registerTempTable("my_df_temp_table")
> sqlContext.sql("INSERT INTO mytable_partitioned SELECT * FROM 
> my_df_temp_table")
> {code}
> This appears to succeed but does no insertion. This should fail with an error 
> stating the data is being inserted into a partitioned table without 
> specifying the name of the partition.
> Of course, if the name of the partition is explicitly specified, both Hive and 
> Spark SQL do the right thing and function correctly.
> In hive:
> {code}
> INSERT INTO mytable_partitioned PARTITION (y='abc') SELECT * FROM mytable;
> {code}
> In Spark SQL:
> {code}
> val myDfTempTable = myDf.registerTempTable("my_df_temp_table")
> sqlContext.sql("INSERT INTO mytable_partitioned PARTITION (y='abc') SELECT * 
> FROM my_df_temp_table")
> {code}
> And, here are the definitions of my tables, as reference:
> {code}
> CREATE TABLE mytable(x INT);
> CREATE TABLE mytable_partitioned (x INT) PARTITIONED BY (y INT);
> {code}
> You will also need to insert some dummy data into mytable to ensure that the 
> insertion is actually not working:
> {code}
> #!/bin/bash
> rm -rf data.txt;
> for i in {0..9}; do
> echo $i >> data.txt
> done
> sudo -u hdfs hadoop fs -put data.txt /user/hive/warehouse/mytable
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12148) SparkR: rename DataFrame to SparkDataFrame

2015-12-10 Thread Michael Lawrence (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15051720#comment-15051720
 ] 

Michael Lawrence commented on SPARK-12148:
--

I agree. A deprecation cycle is good idea.

> SparkR: rename DataFrame to SparkDataFrame
> --
>
> Key: SPARK-12148
> URL: https://issues.apache.org/jira/browse/SPARK-12148
> Project: Spark
>  Issue Type: Wish
>  Components: R, SparkR
>Reporter: Michael Lawrence
>Priority: Minor
>
> The SparkR package represents a Spark DataFrame with the class "DataFrame". 
> That conflicts with the more general DataFrame class defined in the S4Vectors 
> package. Would it not be more appropriate to use the name "SparkDataFrame" 
> instead?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12269) Update aws-java-sdk version

2015-12-10 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15051741#comment-15051741
 ] 

Sean Owen commented on SPARK-12269:
---

Ah, I realize Spark is on Jackson 2.4, not 2.5. If that difference has been a 
problem elsewhere, I guess I'm surprised if it's not a problem here. The AWS 
SDK will be using 2.4.4 then effectively. {{mvn dependency:tree}} can help diff 
the dependency tree before and after to understand exactly what's changed.

> Update aws-java-sdk version
> ---
>
> Key: SPARK-12269
> URL: https://issues.apache.org/jira/browse/SPARK-12269
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Reporter: Brian London
>Priority: Minor
>
> The current Spark Streaming kinesis connector references a quite old version 
> 1.9.40 of the AWS Java SDK (1.10.40 is current).  Numerous AWS features 
> including Kinesis Firehose are unavailable in 1.9.  Those two versions of  
> the AWS SDK in turn require conflicting versions of Jackson (2.4.4 and 2.5.3 
> respectively) such that one cannot include the current AWS SDK in a project 
> that also uses the Spark Streaming Kinesis ASL.
> Bumping the version of Jackson and the AWS library solves this problem and 
> will allow Firehose integrations.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6270) Standalone Master hangs when streaming job completes and event logging is enabled

2015-12-10 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15051753#comment-15051753
 ] 

Steve Loughran commented on SPARK-6270:
---

[~shivaram]: can you have a look at the logs and see which events were common. 
Just curious: is there something that's too noisy? That could be the root 
cause.

> Standalone Master hangs when streaming job completes and event logging is 
> enabled
> -
>
> Key: SPARK-6270
> URL: https://issues.apache.org/jira/browse/SPARK-6270
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, Streaming
>Affects Versions: 1.2.0, 1.2.1, 1.3.0, 1.5.1
>Reporter: Tathagata Das
>Priority: Critical
>
> If the event logging is enabled, the Spark Standalone Master tries to 
> recreate the web UI of a completed Spark application from its event logs. 
> However if this event log is huge (e.g. for a Spark Streaming application), 
> then the master hangs in its attempt to read and recreate the web ui. This 
> hang causes the whole standalone cluster to be unusable. 
> Workaround is to disable the event logging.
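>
> A minimal sketch of that workaround; the application name is illustrative:
> {code}
> import org.apache.spark.SparkConf
>
> // Disable event logging so the standalone Master never tries to replay a huge
> // event log to rebuild the completed application's web UI.
> val conf = new SparkConf()
>   .setAppName("my-streaming-app")
>   .set("spark.eventLog.enabled", "false")
> {code}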



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12267) Standalone master keeps references to disassociated workers until they sent no heartbeats

2015-12-10 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15051683#comment-15051683
 ] 

Marcelo Vanzin commented on SPARK-12267:


Pasting Shixiong's comments from github 
(https://github.com/apache/spark/pull/9138):

{quote}
@vanzin just found an issue about this change. Now if the master receives 
RegisterWorker, it won't use the workerRef to send the reply. So there is no 
connection from Master to the server in Worker. If the Worker is killed now, 
Master only observes some client is lost, but the address is just a client 
address in Worker and won't match the Worker address. So Master cannot remove 
this dead Worker at once. However, this Worker will be removed in 60 seconds 
because of no heartbeat.
{quote}

> Standalone master keeps references to disassociated workers until they sent 
> no heartbeats
> -
>
> Key: SPARK-12267
> URL: https://issues.apache.org/jira/browse/SPARK-12267
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: Jacek Laskowski
>
> While toying with Spark Standalone I've noticed the following messages
> in the logs of the master:
> {code}
> INFO Master: Registering worker 192.168.1.6:59919 with 2 cores, 2.0 GB RAM
> INFO Master: localhost:59920 got disassociated, removing it.
> ...
> WARN Master: Removing worker-20151210090708-192.168.1.6-59919 because
> we got no heartbeat in 60 seconds
> INFO Master: Removing worker worker-20151210090708-192.168.1.6-59919
> on 192.168.1.6:59919
> {code}
> Why does the message "WARN Master: Removing
> worker-20151210090708-192.168.1.6-59919 because we got no heartbeat in
> 60 seconds" appear when the worker should've been removed already (as
> pointed out in "INFO Master: localhost:59920 got disassociated,
> removing it.")?
> Could it be that the ids are different - 192.168.1.6:59919 vs localhost:59920?
> I started master using {{./sbin/start-master.sh -h localhost}} and the
> workers {{./sbin/start-slave.sh spark://localhost:7077}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11713) Initial RDD for updateStateByKey for pyspark

2015-12-10 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-11713.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 10082
[https://github.com/apache/spark/pull/10082]

> Initial RDD for updateStateByKey for pyspark
> 
>
> Key: SPARK-11713
> URL: https://issues.apache.org/jira/browse/SPARK-11713
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Reporter: David Watson
> Fix For: 2.0.0
>
>
> It would be infinitely useful to add initial rdd to the pyspark DStream 
> interface to match the scala and java interfaces 
> (https://issues.apache.org/jira/browse/SPARK-3660).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12212) Clarify the distinction between spark.mllib and spark.ml

2015-12-10 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-12212:
--
Target Version/s: 1.6.0

> Clarify the distinction between spark.mllib and spark.ml
> 
>
> Key: SPARK-12212
> URL: https://issues.apache.org/jira/browse/SPARK-12212
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Affects Versions: 1.5.2
>Reporter: Timothy Hunter
>Assignee: Timothy Hunter
> Fix For: 1.6.1, 2.0.0
>
>
> There is confusion in the documentation of MLlib as to what exactly MLlib is: 
> is it the package, or is it the whole effort of ML on Spark, and how does it 
> differ from spark.ml? Is MLlib going to be deprecated?
> We should do the following:
>  - refer to the mllib code package as spark.mllib across all the 
> documentation. An alternative name is "RDD API of MLlib".
>  - refer to MLlib, the project that encompasses spark.ml + spark.mllib, as 
> MLlib (it should be the default)
>  - replace references to "Pipeline API" with spark.ml or the "DataFrame API of 
> MLlib". I would deemphasize that this API is for building pipelines. Some 
> users are led to believe from the documentation that spark.ml can only be 
> used for building pipelines and that using a single algorithm can only be 
> done with spark.mllib.
> Most relevant places:
>  - {{mllib-guide.md}}
>  - {{mllib-linear-methods.md}}
>  - {{mllib-dimensionality-reduction.md}}
>  - {{mllib-pmml-model-export.md}}
>  - {{mllib-statistics.md}}
> In these files, most references to {{MLlib}} are meant to refer to 
> {{spark.mllib}} instead.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12217) Document invalid handling for StringIndexer

2015-12-10 Thread Benjamin Fradet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15051698#comment-15051698
 ] 

Benjamin Fradet commented on SPARK-12217:
-

Sorry [~srowen], my bad; I wanted to duplicate the values on a previous JIRA 
but didn't know the implications.

> Document invalid handling for StringIndexer
> ---
>
> Key: SPARK-12217
> URL: https://issues.apache.org/jira/browse/SPARK-12217
> Project: Spark
>  Issue Type: Documentation
>  Components: ML
>Reporter: Benjamin Fradet
>Priority: Minor
>
> Documentation is needed regarding the handling of invalid labels in 
> StringIndexer



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4816) Maven profile netlib-lgpl does not work

2015-12-10 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15051699#comment-15051699
 ] 

Sean Owen commented on SPARK-4816:
--

I just tried building the 1.4.1 tarball with -Pnetlib-lgpl and I see the 
"netlib-native*" items in the assembly JAR as expected. How are you building 
and what are you looking at?
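
One quick way to check what actually ended up in the assembly (a sketch; the jar path is a placeholder for whatever your build produced):

{code}
import zipfile

# Placeholder path: point this at the assembly jar your build produced.
jar_path = "assembly/target/scala-2.10/spark-assembly-1.4.1-hadoop2.3.0.jar"

with zipfile.ZipFile(jar_path) as jar:
    netlib_entries = [name for name in jar.namelist() if "netlib" in name]

print("\n".join(netlib_entries) if netlib_entries else "no netlib entries found")
{code}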

> Maven profile netlib-lgpl does not work
> ---
>
> Key: SPARK-4816
> URL: https://issues.apache.org/jira/browse/SPARK-4816
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 1.1.0
> Environment: maven 3.0.5 / Ubuntu
>Reporter: Guillaume Pitel
>Priority: Minor
> Fix For: 1.1.1
>
>
> When doing what the documentation recommends to recompile Spark with the Netlib 
> native system binding (i.e. to bind with OpenBLAS or, in my case, MKL), 
> mvn -Pnetlib-lgpl -Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0 -DskipTests 
> clean package
> the resulting assembly jar still lacked the netlib-system class. (I checked 
> the content of spark-assembly...jar)
> When forcing the netlib-lgpl profile in the MLlib package to be active, the jar 
> is correctly built.
> So I guess it's a problem with the way Maven passes profile activations to 
> child modules.
> Also, despite the documentation claiming that if the job's jar contains 
> netlib with the necessary bindings it should work, it does not. Is the 
> classloader unhappy with two occurrences of netlib?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12148) SparkR: rename DataFrame to SparkDataFrame

2015-12-10 Thread Dan Putler (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15051716#comment-15051716
 ] 

Dan Putler commented on SPARK-12148:


Michael Lawrence's arguments are very valid. The S4Vector package seems to be 
an important Bioconductor package, and likely has a lot of users in the 
bioinformatics community, which is a community that is also likely to have a 
high share of SparkR users, so the name collision issues are real. The one 
thing that needs to be thought through is how to mitigate the effect on 
existing SparkR code that users have written (I'm less concerned about the 
inconsistencies in the naming convention across Scala, Python, and R). Given 
the fairly short period of time SparkR has supported DataFrames, the amount of 
existing user code is likely not enormous. However, I think it does make sense 
to have a transition period of one Spark release where a call to (SparkR) 
DataFrame results in a warning that the function is being deprecated, and that 
SparkDataFrame (or whatever else we choose to rename it) should be used instead.

> SparkR: rename DataFrame to SparkDataFrame
> --
>
> Key: SPARK-12148
> URL: https://issues.apache.org/jira/browse/SPARK-12148
> Project: Spark
>  Issue Type: Wish
>  Components: R, SparkR
>Reporter: Michael Lawrence
>Priority: Minor
>
> The SparkR package represents a Spark DataFrame with the class "DataFrame". 
> That conflicts with the more general DataFrame class defined in the S4Vectors 
> package. Would it not be more appropriate to use the name "SparkDataFrame" 
> instead?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9578) Stemmer feature transformer

2015-12-10 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15051775#comment-15051775
 ] 

holdenk commented on SPARK-9578:


[~yuhaoyan] Are you working on this?

> Stemmer feature transformer
> ---
>
> Key: SPARK-9578
> URL: https://issues.apache.org/jira/browse/SPARK-9578
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> Transformer mentioned first in [SPARK-5571] based on suggestion from 
> [~aloknsingh].  Very standard NLP preprocessing task.
> From [~aloknsingh]:
> {quote}
> We have one Scala stemmer in scalanlp%chalk 
> https://github.com/scalanlp/chalk/tree/master/src/main/scala/chalk/text/analyze
>   which can easily be copied (as it is an Apache project) and is in Scala too.
> I think this will be a better alternative to the Lucene EnglishAnalyzer or 
> OpenNLP.
> Note: we already use scalanlp%breeze via the Maven dependency, so I think 
> adding a scalanlp%chalk dependency is also an option. But as you said, we 
> can copy the code as it is small.
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12212) Clarify the distinction between spark.mllib and spark.ml

2015-12-10 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-12212.
---
   Resolution: Fixed
Fix Version/s: 1.6.1
   2.0.0

Issue resolved by pull request 10234
[https://github.com/apache/spark/pull/10234]

> Clarify the distinction between spark.mllib and spark.ml
> 
>
> Key: SPARK-12212
> URL: https://issues.apache.org/jira/browse/SPARK-12212
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Affects Versions: 1.5.2
>Reporter: Timothy Hunter
>Assignee: Timothy Hunter
> Fix For: 2.0.0, 1.6.1
>
>
> There is confusion in the documentation of MLlib as to what exactly MLlib is: 
> is it the package, or is it the whole effort of ML on Spark, and how does it 
> differ from spark.ml? Is MLlib going to be deprecated?
> We should do the following:
>  - refer to the mllib code package as spark.mllib across all the 
> documentation. An alternative name is "RDD API of MLlib".
>  - refer to MLlib, the project that encompasses spark.ml + spark.mllib, as 
> MLlib (it should be the default)
>  - replace references to "Pipeline API" with spark.ml or the "DataFrame API of 
> MLlib". I would deemphasize that this API is for building pipelines. Some 
> users are led to believe from the documentation that spark.ml can only be 
> used for building pipelines and that using a single algorithm can only be 
> done with spark.mllib.
> Most relevant places:
>  - {{mllib-guide.md}}
>  - {{mllib-linear-methods.md}}
>  - {{mllib-dimensionality-reduction.md}}
>  - {{mllib-pmml-model-export.md}}
>  - {{mllib-statistics.md}}
> In these files, most references to {{MLlib}} are meant to refer to 
> {{spark.mllib}} instead.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12269) Update aws-java-sdk version

2015-12-10 Thread Brian London (JIRA)
Brian London created SPARK-12269:


 Summary: Update aws-java-sdk version
 Key: SPARK-12269
 URL: https://issues.apache.org/jira/browse/SPARK-12269
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Reporter: Brian London


The current Spark Streaming kinesis connector references a quite old version 
1.9.40 of the AWS Java SDK (1.10.40 is current).  Numerous AWS features 
including Kinesis Firehose are unavailable in 1.9.  Those two versions of  the 
AWS SDK in turn require conflicting versions of Jackson (2.4.4 and 2.5.3 
respectively) such that one cannot include the current AWS SDK in a project 
that also uses the Spark Streaming Kinesis ASL.

Bumping the version of Jackson and the AWS library solves this problem and will 
allow Firehose integrations.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2870) Thorough schema inference directly on RDDs of Python dictionaries

2015-12-10 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15051696#comment-15051696
 ] 

holdenk commented on SPARK-2870:


I can take a crack at implementing this if no one else is planning to.

> Thorough schema inference directly on RDDs of Python dictionaries
> -
>
> Key: SPARK-2870
> URL: https://issues.apache.org/jira/browse/SPARK-2870
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL
>Reporter: Nicholas Chammas
>
> h4. Background
> I love the {{SQLContext.jsonRDD()}} and {{SQLContext.jsonFile()}} methods. 
> They process JSON text directly and infer a schema that covers the entire 
> source data set. 
> This is very important with semi-structured data like JSON since individual 
> elements in the data set are free to have different structures. Matching 
> fields across elements may even have different value types.
> For example:
> {code}
> {"a": 5}
> {"a": "cow"}
> {code}
> To get a queryable schema that covers the whole data set, you need to infer a 
> schema by looking at the whole data set. The aforementioned 
> {{SQLContext.json...()}} methods do this very well. 
> h4. Feature Request
> What we need is for {{SQLContext.inferSchema()}} to do this, too. 
> Alternatively, we need a new {{SQLContext}} method that works on RDDs of 
> Python dictionaries and does something functionally equivalent to this:
> {code}
> SQLContext.jsonRDD(RDD[dict].map(lambda x: json.dumps(x)))
> {code}
> As of 1.0.2, 
> [{{inferSchema()}}|http://spark.apache.org/docs/latest/api/python/pyspark.sql.SQLContext-class.html#inferSchema]
>  just looks at the first element in the data set. This won't help much when 
> the structure of the elements in the target RDD is variable.
> h4. Example Use Case
> * You have some JSON text data that you want to analyze using Spark SQL. 
> * You would use one of the {{SQLContext.json...()}} methods, but you need to 
> do some filtering on the data first to remove bad elements--basically, some 
> minimal schema validation.
> * You deserialize the JSON objects to Python {{dict}} s and filter out the 
> bad ones. You now have an RDD of dictionaries.
> * From this RDD, you want a SchemaRDD that captures the schema for the whole 
> data set.
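
A minimal sketch of the {{json.dumps}} workaround described above, using the 1.x {{SQLContext.jsonRDD}} API (the sample data and the filter are illustrative):

{code}
import json

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="schema-inference-workaround")
sqlContext = SQLContext(sc)

# An RDD of Python dicts whose elements have varying structure and value types.
rdd = sc.parallelize([{"a": 5}, {"a": "cow", "b": 1}])

# Filter out bad elements first (minimal schema validation), then let jsonRDD
# infer a schema that covers the whole data set.
good = rdd.filter(lambda d: "a" in d)
df = sqlContext.jsonRDD(good.map(lambda d: json.dumps(d)))
df.printSchema()
{code}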



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-12267) Standalone master keeps references to disassociated workers until they sent no heartbeats

2015-12-10 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15051700#comment-15051700
 ] 

Marcelo Vanzin edited comment on SPARK-12267 at 12/10/15 9:59 PM:
--

[~zsxwing] you're right but that's the wrong PR you picked; I think this is a 
side-effect of SPARK-10997.

Still trying to figure out the best way to fix this.


was (Author: vanzin):
[~shixi...@databricks.com] you're right but that's the wrong PR you picked; I 
think this is a side-effect of SPARK-10997.

Still trying to figure out the best way to fix this.

> Standalone master keeps references to disassociated workers until they sent 
> no heartbeats
> -
>
> Key: SPARK-12267
> URL: https://issues.apache.org/jira/browse/SPARK-12267
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: Jacek Laskowski
>
> While toying with Spark Standalone I've noticed the following messages
> in the logs of the master:
> {code}
> INFO Master: Registering worker 192.168.1.6:59919 with 2 cores, 2.0 GB RAM
> INFO Master: localhost:59920 got disassociated, removing it.
> ...
> WARN Master: Removing worker-20151210090708-192.168.1.6-59919 because
> we got no heartbeat in 60 seconds
> INFO Master: Removing worker worker-20151210090708-192.168.1.6-59919
> on 192.168.1.6:59919
> {code}
> Why does the message "WARN Master: Removing
> worker-20151210090708-192.168.1.6-59919 because we got no heartbeat in
> 60 seconds" appear when the worker should've been removed already (as
> pointed out in "INFO Master: localhost:59920 got disassociated,
> removing it.")?
> Could it be that the ids are different - 192.168.1.6:59919 vs localhost:59920?
> I started master using {{./sbin/start-master.sh -h localhost}} and the
> workers {{./sbin/start-slave.sh spark://localhost:7077}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12268) pyspark shell uses execfile which breaks python3 compatibility

2015-12-10 Thread Erik Selin (JIRA)
Erik Selin created SPARK-12268:
--

 Summary: pyspark shell uses execfile which breaks python3 
compatibility
 Key: SPARK-12268
 URL: https://issues.apache.org/jira/browse/SPARK-12268
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.5.2, 1.6.1
Reporter: Erik Selin


The pyspark shell allows custom start scripts to run using the PYTHONSTARTUP 
environment variable. The value specified there gets run at the end of the 
shell startup by a call to execfile. However, execfile was removed in python3, 
and thus this does not work for python3 users. The simple fix is to follow the 
2to3 recommendation and read, compile and exec the file manually, as per this 
PR: https://github.com/apache/spark/pull/10255
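
For context, a minimal sketch of the 2to3-style replacement (the variable names are illustrative; the linked PR has the actual change):

{code}
import os

# Read, compile and exec the startup file manually instead of calling
# execfile(), which no longer exists in python3.
filename = os.environ.get("PYTHONSTARTUP")
if filename:
    with open(filename) as f:
        code = compile(f.read(), filename, "exec")
    exec(code)
{code}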



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12267) Standalone master keeps references to disassociated workers until they sent no heartbeats

2015-12-10 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15051700#comment-15051700
 ] 

Marcelo Vanzin commented on SPARK-12267:


[~shixi...@databricks.com] you're right but that's the wrong PR you picked; I 
think this is a side-effect of SPARK-10997.

Still trying to figure out the best way to fix this.

> Standalone master keeps references to disassociated workers until they sent 
> no heartbeats
> -
>
> Key: SPARK-12267
> URL: https://issues.apache.org/jira/browse/SPARK-12267
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: Jacek Laskowski
>
> While toying with Spark Standalone I've noticed the following messages
> in the logs of the master:
> {code}
> INFO Master: Registering worker 192.168.1.6:59919 with 2 cores, 2.0 GB RAM
> INFO Master: localhost:59920 got disassociated, removing it.
> ...
> WARN Master: Removing worker-20151210090708-192.168.1.6-59919 because
> we got no heartbeat in 60 seconds
> INFO Master: Removing worker worker-20151210090708-192.168.1.6-59919
> on 192.168.1.6:59919
> {code}
> Why does the message "WARN Master: Removing
> worker-20151210090708-192.168.1.6-59919 because we got no heartbeat in
> 60 seconds" appear when the worker should've been removed already (as
> pointed out in "INFO Master: localhost:59920 got disassociated,
> removing it.")?
> Could it be that the ids are different - 192.168.1.6:59919 vs localhost:59920?
> I started master using {{./sbin/start-master.sh -h localhost}} and the
> workers {{./sbin/start-slave.sh spark://localhost:7077}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-12042) Python API for mllib.stat.test.StreamingTest

2015-12-10 Thread holdenk (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk updated SPARK-12042:

Comment: was deleted

(was: I could take a crack at this since I've been giving some thought recently 
to testing & Spark if no one else is planning on tackling this JIRA.)

> Python API for mllib.stat.test.StreamingTest
> 
>
> Key: SPARK-12042
> URL: https://issues.apache.org/jira/browse/SPARK-12042
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib, PySpark
>Reporter: Yanbo Liang
>
> Python API for mllib.stat.test.StreamingTest.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12072) python dataframe ._jdf.schema().json() breaks on large metadata dataframes

2015-12-10 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15051705#comment-15051705
 ] 

holdenk commented on SPARK-12072:
-

Ok cool, let me take a look and see if there is another way to fix this.

> python dataframe ._jdf.schema().json() breaks on large metadata dataframes
> --
>
> Key: SPARK-12072
> URL: https://issues.apache.org/jira/browse/SPARK-12072
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.5.2
>Reporter: Rares Mirica
>
> When a dataframe contains a column with a large number of values in ml_attr, 
> schema evaluation will routinely fail when getting the schema as JSON. This 
> will, in turn, cause a bunch of problems with, e.g., calling UDFs on the 
> dataframe, because accessing columns relies on 
> _parse_datatype_json_string(self._jdf.schema().json())



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12267) Standalone master keeps references to disassociated workers until they sent no heartbeats

2015-12-10 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15051732#comment-15051732
 ] 

Marcelo Vanzin commented on SPARK-12267:


I think the following would work. The problem right now is that the Worker 
listens for incoming connections; and when that happens, the {{senderAddress}} 
of RPC messages becomes the listening address of the Worker, instead of the 
address of the socket sending messages to the Master. When the worker 
disconnects, the Master sees a disconnection from that client socket, but 
doesn't know that it actually relates to that listening address, so doesn't 
unregister anything.

I think instead that, in Netty's case, {{RpcCallContext.senderAddress}} should 
always be the address of the client socket, regardless of whether the sender is 
listening. That would fix this problem. RpcEndpoints for those listening 
processes would still have the listen address of the RpcEnv.

There are three places where `senderAddress` is used outside of Master:

- MapOutputTrackerMasterEndpoint when handling GetMapOutputStatuses, but that's 
only logging
- in CoarseGrainedSchedulerBackend when handling {{RegisterExecutor}}, but that 
already seems to be doing the right thing (since in Netty's case executors are 
not listening)
- in ReceiverTracker when handling RegisterReceiver, but that's also only 
logging

So the above suggestion should work as far as I can tell.
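
A toy illustration of that bookkeeping problem (plain Python, not Spark's actual RPC code): unless the Master can map the client socket address back to the worker's listening address, a disconnect cannot be matched to any registered worker and has to wait for the heartbeat timeout.

{code}
# Registered workers are keyed by their listening address; disconnect events
# arrive keyed by the client socket address.
workers_by_listen_addr = {}
client_to_listen_addr = {}

def register_worker(client_addr, listen_addr, worker_id):
    workers_by_listen_addr[listen_addr] = worker_id
    # Without recording this mapping, the disconnect below cannot be resolved.
    client_to_listen_addr[client_addr] = listen_addr

def on_disconnect(client_addr):
    listen_addr = client_to_listen_addr.pop(client_addr, None)
    worker_id = workers_by_listen_addr.pop(listen_addr, None)
    if worker_id is None:
        print("unknown client %s, waiting for heartbeat timeout" % (client_addr,))
    else:
        print("removing %s immediately" % worker_id)

register_worker(("192.168.1.6", 59920), ("192.168.1.6", 59919), "worker-20151210090708")
on_disconnect(("192.168.1.6", 59920))
{code}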

> Standalone master keeps references to disassociated workers until they sent 
> no heartbeats
> -
>
> Key: SPARK-12267
> URL: https://issues.apache.org/jira/browse/SPARK-12267
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: Jacek Laskowski
>
> While toying with Spark Standalone I've noticed the following messages
> in the logs of the master:
> {code}
> INFO Master: Registering worker 192.168.1.6:59919 with 2 cores, 2.0 GB RAM
> INFO Master: localhost:59920 got disassociated, removing it.
> ...
> WARN Master: Removing worker-20151210090708-192.168.1.6-59919 because
> we got no heartbeat in 60 seconds
> INFO Master: Removing worker worker-20151210090708-192.168.1.6-59919
> on 192.168.1.6:59919
> {code}
> Why does the message "WARN Master: Removing
> worker-20151210090708-192.168.1.6-59919 because we got no heartbeat in
> 60 seconds" appear when the worker should've been removed already (as
> pointed out in "INFO Master: localhost:59920 got disassociated,
> removing it.")?
> Could it be that the ids are different - 192.168.1.6:59919 vs localhost:59920?
> I started master using {{./sbin/start-master.sh -h localhost}} and the
> workers {{./sbin/start-slave.sh spark://localhost:7077}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12269) Update aws-java-sdk version

2015-12-10 Thread Brian London (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15051733#comment-15051733
 ] 

Brian London commented on SPARK-12269:
--

The jackson 2.4 and 2.5 incompatibility has appeared elsewhere.  See the 
following for example:

https://github.com/FasterXML/jackson-module-scala/issues/177
http://stackoverflow.com/questions/31039367/spark-parallelize-could-not-find-creator-property-with-name-id

All tests pass after changing those two versions.  Could you elaborate on what 
you mean by researching conflicts beyond that?

> Update aws-java-sdk version
> ---
>
> Key: SPARK-12269
> URL: https://issues.apache.org/jira/browse/SPARK-12269
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Reporter: Brian London
>Priority: Minor
>
> The current Spark Streaming kinesis connector references a quite old version 
> 1.9.40 of the AWS Java SDK (1.10.40 is current).  Numerous AWS features 
> including Kinesis Firehose are unavailable in 1.9.  Those two versions of  
> the AWS SDK in turn require conflicting versions of Jackson (2.4.4 and 2.5.3 
> respectively) such that one cannot include the current AWS SDK in a project 
> that also uses the Spark Streaming Kinesis ASL.
> Bumping the version of Jackson and the AWS library solves this problem and 
> will allow Firehose integrations.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12267) Standalone master keeps references to disassociated workers until they sent no heartbeats

2015-12-10 Thread dont_ping_this_account (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15051757#comment-15051757
 ] 

dont_ping_this_account commented on SPARK-12267:


Sounds correct, but it looks like a lot of changes to {{Master.scala}}; we'd 
need to pass {{senderAddress}} to a lot of places. I think we can just handle 
it in `NettyRpcHandler` the way we did previously.

> Standalone master keeps references to disassociated workers until they sent 
> no heartbeats
> -
>
> Key: SPARK-12267
> URL: https://issues.apache.org/jira/browse/SPARK-12267
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: Jacek Laskowski
>
> While toying with Spark Standalone I've noticed the following messages
> in the logs of the master:
> {code}
> INFO Master: Registering worker 192.168.1.6:59919 with 2 cores, 2.0 GB RAM
> INFO Master: localhost:59920 got disassociated, removing it.
> ...
> WARN Master: Removing worker-20151210090708-192.168.1.6-59919 because
> we got no heartbeat in 60 seconds
> INFO Master: Removing worker worker-20151210090708-192.168.1.6-59919
> on 192.168.1.6:59919
> {code}
> Why does the message "WARN Master: Removing
> worker-20151210090708-192.168.1.6-59919 because we got no heartbeat in
> 60 seconds" appear when the worker should've been removed already (as
> pointed out in "INFO Master: localhost:59920 got disassociated,
> removing it.")?
> Could it be that the ids are different - 192.168.1.6:59919 vs localhost:59920?
> I started master using {{./sbin/start-master.sh -h localhost}} and the
> workers {{./sbin/start-slave.sh spark://localhost:7077}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12269) Update aws-java-sdk version

2015-12-10 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-12269:
--
Priority: Minor  (was: Major)

Makes some sense, but the questions are always: are there any incompatible 
changes? Do dependencies change? For example, I think the Jackson dependency 
change is actually in the right direction, to match Spark's, but this is what 
you'd need to research and establish if you're asking for a dependency update 
across minor versions.

> Update aws-java-sdk version
> ---
>
> Key: SPARK-12269
> URL: https://issues.apache.org/jira/browse/SPARK-12269
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Reporter: Brian London
>Priority: Minor
>
> The current Spark Streaming kinesis connector references a quite old version 
> 1.9.40 of the AWS Java SDK (1.10.40 is current).  Numerous AWS features 
> including Kinesis Firehose are unavailable in 1.9.  Those two versions of  
> the AWS SDK in turn require conflicting versions of Jackson (2.4.4 and 2.5.3 
> respectively) such that one cannot include the current AWS SDK in a project 
> that also uses the Spark Streaming Kinesis ASL.
> Bumping the version of Jackson and the AWS library solves this problem and 
> will allow Firehose integrations.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11563) Use RpcEnv to transfer generated classes in spark-shell

2015-12-10 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-11563.

   Resolution: Fixed
 Assignee: Marcelo Vanzin
Fix Version/s: 2.0.0

> Use RpcEnv to transfer generated classes in spark-shell
> ---
>
> Key: SPARK-11563
> URL: https://issues.apache.org/jira/browse/SPARK-11563
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Shell
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
> Fix For: 2.0.0
>
>
> Similar to SPARK-11140 (and building on top of it), spark-shell should 
> transfer generated classes using the RpcEnv support for file transfers added 
> in that change.
> That would mean easy security configuration and, as a bonus, lower overhead 
> (and one less HTTP server running inside Spark).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12258) Hive Timestamp UDF is binded with '1969-12-31 15:59:59.999999' for null value

2015-12-10 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-12258:
-
Target Version/s: 1.6.0

> Hive Timestamp UDF is binded with '1969-12-31 15:59:59.99' for null value
> -
>
> Key: SPARK-12258
> URL: https://issues.apache.org/jira/browse/SPARK-12258
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Ian
>
> {code}
>   test("Timestamp UDF and Null value") {
> hiveContext.runSqlHive("CREATE TABLE ts_test (ts TIMESTAMP) STORED AS 
> TEXTFILE")
> hiveContext.runSqlHive("INSERT INTO TABLE ts_test VALUES(Null)")
> hiveContext.udf.register("dummy",
>   (ts: Timestamp) => ts
> )
> val result = hiveContext.sql("SELECT dummy(ts) FROM 
> ts_test").collect().mkString("\n")
> assertResult("[null]")(result)
>   }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12213) Query with only one distinct should not having on expand

2015-12-10 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu reassigned SPARK-12213:
--

Assignee: Davies Liu

> Query with only one distinct should not having on expand
> 
>
> Key: SPARK-12213
> URL: https://issues.apache.org/jira/browse/SPARK-12213
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Davies Liu
>
> Expand will double the number of records and slow down projection and 
> aggregation; it's better to generate a plan without Expand for a query with 
> only one distinct aggregate (for example, ss_max in TPC-DS).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12267) Standalone master keeps references to disassociated workers until they sent no heartbeats

2015-12-10 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15051767#comment-15051767
 ] 

Marcelo Vanzin commented on SPARK-12267:


bq.  I think we can just handle it in `NettyRpcHandler` like the previous way.

Yeah that's my plan.

> Standalone master keeps references to disassociated workers until they sent 
> no heartbeats
> -
>
> Key: SPARK-12267
> URL: https://issues.apache.org/jira/browse/SPARK-12267
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: Jacek Laskowski
>
> While toying with Spark Standalone I've noticed the following messages
> in the logs of the master:
> {code}
> INFO Master: Registering worker 192.168.1.6:59919 with 2 cores, 2.0 GB RAM
> INFO Master: localhost:59920 got disassociated, removing it.
> ...
> WARN Master: Removing worker-20151210090708-192.168.1.6-59919 because
> we got no heartbeat in 60 seconds
> INFO Master: Removing worker worker-20151210090708-192.168.1.6-59919
> on 192.168.1.6:59919
> {code}
> Why does the message "WARN Master: Removing
> worker-20151210090708-192.168.1.6-59919 because we got no heartbeat in
> 60 seconds" appear when the worker should've been removed already (as
> pointed out in "INFO Master: localhost:59920 got disassociated,
> removing it.")?
> Could it be that the ids are different - 192.168.1.6:59919 vs localhost:59920?
> I started master using {{./sbin/start-master.sh -h localhost}} and the
> workers {{./sbin/start-slave.sh spark://localhost:7077}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12032) Filter can't be pushed down to correct Join because of bad order of Join

2015-12-10 Thread Min Qiu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15051829#comment-15051829
 ] 

Min Qiu commented on SPARK-12032:
-

We ran into the same problem in our product development and also came up with 
a similar solution.
Here is the pull request: [#10258|https://github.com/apache/spark/pull/10258], 
in case anybody is interested in it.

> Filter can't be pushed down to correct Join because of bad order of Join
> 
>
> Key: SPARK-12032
> URL: https://issues.apache.org/jira/browse/SPARK-12032
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Davies Liu
>Priority: Critical
> Fix For: 2.0.0
>
>
> For this query:
> {code}
>   select d.d_year, count(*) cnt
>FROM store_sales, date_dim d, customer c
>WHERE ss_customer_sk = c.c_customer_sk AND c.c_first_shipto_date_sk = 
> d.d_date_sk
>group by d.d_year
> {code}
> Current optimized plan is
> {code}
> == Optimized Logical Plan ==
> Aggregate [d_year#147], [d_year#147,(count(1),mode=Complete,isDistinct=false) 
> AS cnt#425L]
>  Project [d_year#147]
>   Join Inner, Some(((ss_customer_sk#283 = c_customer_sk#101) && 
> (c_first_shipto_date_sk#106 = d_date_sk#141)))
>Project [d_date_sk#141,d_year#147,ss_customer_sk#283]
> Join Inner, None
>  Project [ss_customer_sk#283]
>   Relation[] ParquetRelation[store_sales]
>  Project [d_date_sk#141,d_year#147]
>   Relation[] ParquetRelation[date_dim]
>Project [c_customer_sk#101,c_first_shipto_date_sk#106]
> Relation[] ParquetRelation[customer]
> {code}
> It will join store_sales and date_dim together without any condition; the 
> condition c.c_first_shipto_date_sk = d.d_date_sk is not pushed to it because 
> of the bad order of joins.
> The optimizer should re-order the joins, joining date_dim after customer; then 
> it can push down the condition correctly.
> The plan should be 
> {code}
> Aggregate [d_year#147], [d_year#147,(count(1),mode=Complete,isDistinct=false) 
> AS cnt#425L]
>  Project [d_year#147]
>   Join Inner, Some((c_first_shipto_date_sk#106 = d_date_sk#141))
>Project [c_first_shipto_date_sk#106]
> Join Inner, Some((ss_customer_sk#283 = c_customer_sk#101))
>  Project [ss_customer_sk#283]
>   Relation[store_sales]
>  Project [c_first_shipto_date_sk#106,c_customer_sk#101]
>   Relation[customer]
>Project [d_year#147,d_date_sk#141]
> Relation[date_dim]
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11796) Docker JDBC integration tests fail in Maven build due to dependency issue

2015-12-10 Thread Mark Grover (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15051937#comment-15051937
 ] 

Mark Grover commented on SPARK-11796:
-

Excellent, thank you!

> Docker JDBC integration tests fail in Maven build due to dependency issue
> -
>
> Key: SPARK-11796
> URL: https://issues.apache.org/jira/browse/SPARK-11796
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 1.6.0
>Reporter: Josh Rosen
>Assignee: Mark Grover
> Fix For: 1.6.0
>
>
> Our new Docker integration tests for JDBC dialects are failing in the Maven 
> builds. For now, I've disabled this for Maven by adding the 
> {{-Dtest.exclude.tags=org.apache.spark.tags.DockerTest}} flag to our Jenkins 
> builds, but we should fix this soon. The test failures seem to be related to 
> dependency or classpath issues:
> {code}
> *** RUN ABORTED ***
>   java.lang.NoSuchMethodError: 
> org.apache.http.impl.client.HttpClientBuilder.setConnectionManagerShared(Z)Lorg/apache/http/impl/client/HttpClientBuilder;
>   at 
> org.glassfish.jersey.apache.connector.ApacheConnector.(ApacheConnector.java:240)
>   at 
> org.glassfish.jersey.apache.connector.ApacheConnectorProvider.getConnector(ApacheConnectorProvider.java:115)
>   at 
> org.glassfish.jersey.client.ClientConfig$State.initRuntime(ClientConfig.java:418)
>   at 
> org.glassfish.jersey.client.ClientConfig$State.access$000(ClientConfig.java:88)
>   at 
> org.glassfish.jersey.client.ClientConfig$State$3.get(ClientConfig.java:120)
>   at 
> org.glassfish.jersey.client.ClientConfig$State$3.get(ClientConfig.java:117)
>   at 
> org.glassfish.jersey.internal.util.collection.Values$LazyValueImpl.get(Values.java:340)
>   at 
> org.glassfish.jersey.client.ClientConfig.getRuntime(ClientConfig.java:726)
>   at 
> org.glassfish.jersey.client.ClientRequest.getConfiguration(ClientRequest.java:285)
>   at 
> org.glassfish.jersey.client.JerseyInvocation.validateHttpMethodAndEntity(JerseyInvocation.java:126)
>   ...
> {code}
> To reproduce locally: {{build/mvn -pl docker-integration-tests package}}. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12258) Hive Timestamp UDF is binded with '1969-12-31 15:59:59.999999' for null value

2015-12-10 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-12258.
--
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 10259
[https://github.com/apache/spark/pull/10259]

> Hive Timestamp UDF is binded with '1969-12-31 15:59:59.99' for null value
> -
>
> Key: SPARK-12258
> URL: https://issues.apache.org/jira/browse/SPARK-12258
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Ian
> Fix For: 1.6.0
>
>
> {code}
>   test("Timestamp UDF and Null value") {
> hiveContext.runSqlHive("CREATE TABLE ts_test (ts TIMESTAMP) STORED AS 
> TEXTFILE")
> hiveContext.runSqlHive("INSERT INTO TABLE ts_test VALUES(Null)")
> hiveContext.udf.register("dummy",
>   (ts: Timestamp) => ts
> )
> val result = hiveContext.sql("SELECT dummy(ts) FROM 
> ts_test").collect().mkString("\n")
> assertResult("[null]")(result)
>   }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12258) Hive Timestamp UDF is binded with '1969-12-31 15:59:59.999999' for null value

2015-12-10 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-12258:
-
Assignee: Davies Liu

> Hive Timestamp UDF is binded with '1969-12-31 15:59:59.99' for null value
> -
>
> Key: SPARK-12258
> URL: https://issues.apache.org/jira/browse/SPARK-12258
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Ian
>Assignee: Davies Liu
> Fix For: 1.6.0
>
>
> {code}
>   test("Timestamp UDF and Null value") {
> hiveContext.runSqlHive("CREATE TABLE ts_test (ts TIMESTAMP) STORED AS 
> TEXTFILE")
> hiveContext.runSqlHive("INSERT INTO TABLE ts_test VALUES(Null)")
> hiveContext.udf.register("dummy",
>   (ts: Timestamp) => ts
> )
> val result = hiveContext.sql("SELECT dummy(ts) FROM 
> ts_test").collect().mkString("\n")
> assertResult("[null]")(result)
>   }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12269) Update aws-java-sdk version

2015-12-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12269:


Assignee: Apache Spark

> Update aws-java-sdk version
> ---
>
> Key: SPARK-12269
> URL: https://issues.apache.org/jira/browse/SPARK-12269
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Reporter: Brian London
>Assignee: Apache Spark
>Priority: Minor
>
> The current Spark Streaming kinesis connector references a quite old version 
> 1.9.40 of the AWS Java SDK (1.10.40 is current).  Numerous AWS features 
> including Kinesis Firehose are unavailable in 1.9.  Those two versions of  
> the AWS SDK in turn require conflicting versions of Jackson (2.4.4 and 2.5.3 
> respectively) such that one cannot include the current AWS SDK in a project 
> that also uses the Spark Streaming Kinesis ASL.
> Bumping the version of Jackson and the AWS library solves this problem and 
> will allow Firehose integrations.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12217) Document invalid handling for StringIndexer

2015-12-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15051832#comment-15051832
 ] 

Apache Spark commented on SPARK-12217:
--

User 'BenFradet' has created a pull request for this issue:
https://github.com/apache/spark/pull/10257

> Document invalid handling for StringIndexer
> ---
>
> Key: SPARK-12217
> URL: https://issues.apache.org/jira/browse/SPARK-12217
> Project: Spark
>  Issue Type: Documentation
>  Components: ML
>Reporter: Benjamin Fradet
>Priority: Minor
>
> Documentation is needed regarding the handling of invalid labels in 
> StringIndexer



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9578) Stemmer feature transformer

2015-12-10 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15051915#comment-15051915
 ] 

yuhao yang commented on SPARK-9578:
---

Oh, I have a Porter implementation now. I'll send it today or tomorrow; then we 
can see whether we can upgrade it to Snowball if you are interested. Thanks.

> Stemmer feature transformer
> ---
>
> Key: SPARK-9578
> URL: https://issues.apache.org/jira/browse/SPARK-9578
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> Transformer mentioned first in [SPARK-5571] based on suggestion from 
> [~aloknsingh].  Very standard NLP preprocessing task.
> From [~aloknsingh]:
> {quote}
> We have one Scala stemmer in scalanlp%chalk 
> https://github.com/scalanlp/chalk/tree/master/src/main/scala/chalk/text/analyze
>   which can easily be copied (as it is an Apache project) and is in Scala too.
> I think this will be a better alternative to the Lucene EnglishAnalyzer or 
> OpenNLP.
> Note: we already use scalanlp%breeze via the Maven dependency, so I think 
> adding a scalanlp%chalk dependency is also an option. But as you said, we 
> can copy the code as it is small.
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12267) Standalone master keeps references to disassociated workers until they sent no heartbeats

2015-12-10 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15051952#comment-15051952
 ] 

Shixiong Zhu commented on SPARK-12267:
--

Could you send a PR quickly so that we can get the fix into RC2?

> Standalone master keeps references to disassociated workers until they sent 
> no heartbeats
> -
>
> Key: SPARK-12267
> URL: https://issues.apache.org/jira/browse/SPARK-12267
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: Jacek Laskowski
>
> While toying with Spark Standalone I've noticed the following messages
> in the logs of the master:
> {code}
> INFO Master: Registering worker 192.168.1.6:59919 with 2 cores, 2.0 GB RAM
> INFO Master: localhost:59920 got disassociated, removing it.
> ...
> WARN Master: Removing worker-20151210090708-192.168.1.6-59919 because
> we got no heartbeat in 60 seconds
> INFO Master: Removing worker worker-20151210090708-192.168.1.6-59919
> on 192.168.1.6:59919
> {code}
> Why does the message "WARN Master: Removing
> worker-20151210090708-192.168.1.6-59919 because we got no heartbeat in
> 60 seconds" appear when the worker should've been removed already (as
> pointed out in "INFO Master: localhost:59920 got disassociated,
> removing it.")?
> Could it be that the ids are different - 192.168.1.6:59919 vs localhost:59920?
> I started master using {{./sbin/start-master.sh -h localhost}} and the
> workers {{./sbin/start-slave.sh spark://localhost:7077}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12267) Standalone master keeps references to disassociated workers until they sent no heartbeats

2015-12-10 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15051953#comment-15051953
 ] 

Shixiong Zhu commented on SPARK-12267:
--

If you don't have time, I can do it.

> Standalone master keeps references to disassociated workers until they sent 
> no heartbeats
> -
>
> Key: SPARK-12267
> URL: https://issues.apache.org/jira/browse/SPARK-12267
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: Jacek Laskowski
>
> While toying with Spark Standalone I've noticed the following messages
> in the logs of the master:
> {code}
> INFO Master: Registering worker 192.168.1.6:59919 with 2 cores, 2.0 GB RAM
> INFO Master: localhost:59920 got disassociated, removing it.
> ...
> WARN Master: Removing worker-20151210090708-192.168.1.6-59919 because
> we got no heartbeat in 60 seconds
> INFO Master: Removing worker worker-20151210090708-192.168.1.6-59919
> on 192.168.1.6:59919
> {code}
> Why does the message "WARN Master: Removing
> worker-20151210090708-192.168.1.6-59919 because we got no heartbeat in
> 60 seconds" appear when the worker should've been removed already (as
> pointed out in "INFO Master: localhost:59920 got disassociated,
> removing it.")?
> Could it be that the ids are different - 192.168.1.6:59919 vs localhost:59920?
> I started master using {{./sbin/start-master.sh -h localhost}} and the
> workers {{./sbin/start-slave.sh spark://localhost:7077}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12268) pyspark shell uses execfile which breaks python3 compatibility

2015-12-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12268:


Assignee: Apache Spark

> pyspark shell uses execfile which breaks python3 compatibility
> --
>
> Key: SPARK-12268
> URL: https://issues.apache.org/jira/browse/SPARK-12268
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.5.2, 1.6.1
>Reporter: Erik Selin
>Assignee: Apache Spark
>
> The pyspark shell allows custom start scripts to run using the PYTHONSTARTUP 
> environment variable. The value specified there gets run at the end of 
> the shell startup by a call to execfile. However, execfile was removed in 
> python3, and thus this does not work for python3 users. The simple fix is to 
> follow the 2to3 recommendation and read, compile and exec the file manually, 
> as per this PR: https://github.com/apache/spark/pull/10255



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12251) Document Spark 1.6's off-heap memory configurations and add config validation

2015-12-10 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or resolved SPARK-12251.
---
Resolution: Fixed

> Document Spark 1.6's off-heap memory configurations and add config validation
> -
>
> Key: SPARK-12251
> URL: https://issues.apache.org/jira/browse/SPARK-12251
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Josh Rosen
>Assignee: Josh Rosen
> Fix For: 1.6.0
>
>
> We need to document the new off-heap memory limit configurations which were 
> added in Spark 1.6, add simple configuration validation (for instance, you 
> shouldn't be able to enable off-heap execution when the off-heap memory limit 
> is zero), and alias the old and confusing `spark.unsafe.offHeap` 
> configuration to something that lives in the `spark.memory` namespace.
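
A sketch of what the resulting configuration could look like from an application (assuming the new names landed in the {{spark.memory.offHeap}} namespace; check the 1.6 configuration docs for the final spelling):

{code}
from pyspark import SparkConf, SparkContext

# Assumed config names; the size must be non-zero when off-heap is enabled.
conf = (SparkConf()
        .set("spark.memory.offHeap.enabled", "true")
        .set("spark.memory.offHeap.size", "2g"))
sc = SparkContext(conf=conf)
{code}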



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12251) Document Spark 1.6's off-heap memory configurations and add config validation

2015-12-10 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-12251:
--
Fix Version/s: 1.6.0

> Document Spark 1.6's off-heap memory configurations and add config validation
> -
>
> Key: SPARK-12251
> URL: https://issues.apache.org/jira/browse/SPARK-12251
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Josh Rosen
>Assignee: Josh Rosen
> Fix For: 1.6.0
>
>
> We need to document the new off-heap memory limit configurations which were 
> added in Spark 1.6, add simple configuration validation (for instance, you 
> shouldn't be able to enable off-heap execution when the off-heap memory limit 
> is zero), and alias the old and confusing `spark.unsafe.offHeap` 
> configuration to something that lives in the `spark.memory` namespace.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11796) Docker JDBC integration tests fail in Maven build due to dependency issue

2015-12-10 Thread Mark Grover (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15051931#comment-15051931
 ] 

Mark Grover commented on SPARK-11796:
-

Hey [~joshrosen], just checking if you have removed the 
{{-Dtest.exclude.tags=org.apache.spark.tags.DockerTest}} flag from the Jenkins 
builds. I am happy to do that too but I don't have the privs.

> Docker JDBC integration tests fail in Maven build due to dependency issue
> -
>
> Key: SPARK-11796
> URL: https://issues.apache.org/jira/browse/SPARK-11796
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 1.6.0
>Reporter: Josh Rosen
>Assignee: Mark Grover
> Fix For: 1.6.0
>
>
> Our new Docker integration tests for JDBC dialects are failing in the Maven 
> builds. For now, I've disabled this for Maven by adding the 
> {{-Dtest.exclude.tags=org.apache.spark.tags.DockerTest}} flag to our Jenkins 
> builds, but we should fix this soon. The test failures seem to be related to 
> dependency or classpath issues:
> {code}
> *** RUN ABORTED ***
>   java.lang.NoSuchMethodError: 
> org.apache.http.impl.client.HttpClientBuilder.setConnectionManagerShared(Z)Lorg/apache/http/impl/client/HttpClientBuilder;
>   at 
> org.glassfish.jersey.apache.connector.ApacheConnector.(ApacheConnector.java:240)
>   at 
> org.glassfish.jersey.apache.connector.ApacheConnectorProvider.getConnector(ApacheConnectorProvider.java:115)
>   at 
> org.glassfish.jersey.client.ClientConfig$State.initRuntime(ClientConfig.java:418)
>   at 
> org.glassfish.jersey.client.ClientConfig$State.access$000(ClientConfig.java:88)
>   at 
> org.glassfish.jersey.client.ClientConfig$State$3.get(ClientConfig.java:120)
>   at 
> org.glassfish.jersey.client.ClientConfig$State$3.get(ClientConfig.java:117)
>   at 
> org.glassfish.jersey.internal.util.collection.Values$LazyValueImpl.get(Values.java:340)
>   at 
> org.glassfish.jersey.client.ClientConfig.getRuntime(ClientConfig.java:726)
>   at 
> org.glassfish.jersey.client.ClientRequest.getConfiguration(ClientRequest.java:285)
>   at 
> org.glassfish.jersey.client.JerseyInvocation.validateHttpMethodAndEntity(JerseyInvocation.java:126)
>   ...
> {code}
> To reproduce locally: {{build/mvn -pl docker-integration-tests package}}. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12155) Execution OOM after a relative large dataset cached in the cluster.

2015-12-10 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or reassigned SPARK-12155:
-

Assignee: Andrew Or  (was: Josh Rosen)

> Execution OOM after a relative large dataset cached in the cluster.
> ---
>
> Key: SPARK-12155
> URL: https://issues.apache.org/jira/browse/SPARK-12155
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Reporter: Yin Huai
>Assignee: Andrew Or
>Priority: Blocker
> Fix For: 1.6.0
>
>
> I have a cluster with roughly 80GB of memory. I cached a 43GB dataframe. 
> When I started to consume the query, I got the following exception (I added 
> more logging to the code).
> {code}
> 15/12/05 00:33:43 INFO UnifiedMemoryManager: Creating UnifedMemoryManager for 
> 4 cores with 16929521664 maxMemory, 8464760832 storageRegionSize.
> 15/12/05 01:20:50 INFO MemoryStore: Ensuring 1048576 bytes of free space for 
> block rdd_94_37(free: 3253659951, max: 16798973952)
> 15/12/05 01:20:50 INFO MemoryStore: Ensuring 5142008 bytes of free space for 
> block rdd_94_37(free: 3252611375, max: 16798973952)
> 15/12/05 01:20:50 INFO Executor: Finished task 36.0 in stage 4.0 (TID 109). 
> 3028 bytes result sent to driver
> 15/12/05 01:20:50 INFO MemoryStore: Ensuring 98948238 bytes of free space for 
> block rdd_94_37(free: 3314840375, max: 16866344960)
> 15/12/05 01:20:50 INFO MemoryStore: Ensuring 98675713 bytes of free space for 
> block rdd_94_37(free: 3215892137, max: 16866344960)
> 15/12/05 01:20:50 INFO MemoryStore: Ensuring 197347565 bytes of free space 
> for block rdd_94_37(free: 3117216424, max: 16866344960)
> 15/12/05 01:20:50 INFO MemoryStore: Ensuring 295995553 bytes of free space 
> for block rdd_94_37(free: 2919868859, max: 16866344960)
> 15/12/05 01:20:51 INFO MemoryStore: Ensuring 394728479 bytes of free space 
> for block rdd_94_37(free: 2687050010, max: 16929521664)
> 15/12/05 01:20:51 INFO Executor: Finished task 32.0 in stage 4.0 (TID 106). 
> 3028 bytes result sent to driver
> 15/12/05 01:20:51 INFO MemoryStore: Ensuring 591258816 bytes of free space 
> for block rdd_94_37(free: 2292321531, max: 16929521664)
> 15/12/05 01:20:51 INFO MemoryStore: Ensuring 901645182 bytes of free space 
> for block rdd_94_37(free: 1701062715, max: 16929521664)
> 15/12/05 01:20:52 INFO MemoryStore: Ensuring 1302179076 bytes of free space 
> for block rdd_94_37(free: 799417533, max: 16929521664)
> 15/12/05 01:20:52 INFO MemoryStore: Will not store rdd_94_37 as it would 
> require dropping another block from the same RDD
> 15/12/05 01:20:52 WARN MemoryStore: Not enough space to cache rdd_94_37 in 
> memory! (computed 2.4 GB so far)
> 15/12/05 01:20:52 INFO MemoryStore: Memory use = 12.6 GB (blocks) + 2.4 GB 
> (scratch space shared across 13 tasks(s)) = 15.0 GB. Storage limit = 15.8 GB.
> 15/12/05 01:20:52 INFO BlockManager: Found block rdd_94_37 locally
> 15/12/05 01:20:52 INFO UnifiedMemoryManager: Try to acquire 262144 bytes 
> memory. But, on-heap execution memory poll only has 0 bytes free memory.
> 15/12/05 01:20:52 INFO UnifiedMemoryManager: memoryReclaimableFromStorage 
> 8464760832, storageMemoryPool.poolSize 16929521664, storageRegionSize 
> 8464760832.
> 15/12/05 01:20:52 INFO UnifiedMemoryManager: Try to reclaim memory space from 
> storage memory pool.
> 15/12/05 01:20:52 INFO StorageMemoryPool: Claiming 262144 bytes free memory 
> space from StorageMemoryPool.
> 15/12/05 01:20:52 INFO UnifiedMemoryManager: Reclaimed 262144 bytes of memory 
> from storage memory pool.Adding them back to onHeapExecutionMemoryPool.
> 15/12/05 01:20:52 INFO UnifiedMemoryManager: Try to acquire 67108864 bytes 
> memory. But, on-heap execution memory poll only has 0 bytes free memory.
> 15/12/05 01:20:52 INFO UnifiedMemoryManager: memoryReclaimableFromStorage 
> 8464498688, storageMemoryPool.poolSize 16929259520, storageRegionSize 
> 8464760832.
> 15/12/05 01:20:52 INFO UnifiedMemoryManager: Try to reclaim memory space from 
> storage memory pool.
> 15/12/05 01:20:52 INFO StorageMemoryPool: Claiming 67108864 bytes free memory 
> space from StorageMemoryPool.
> 15/12/05 01:20:52 INFO UnifiedMemoryManager: Reclaimed 67108864 bytes of 
> memory from storage memory pool.Adding them back to onHeapExecutionMemoryPool.
> 15/12/05 01:20:54 INFO Executor: Finished task 37.0 in stage 4.0 (TID 110). 
> 3077 bytes result sent to driver
> 15/12/05 01:20:56 INFO CoarseGrainedExecutorBackend: Got assigned task 120
> 15/12/05 01:20:56 INFO Executor: Running task 1.0 in stage 5.0 (TID 120)
> 15/12/05 01:20:56 INFO CoarseGrainedExecutorBackend: Got assigned task 124
> 15/12/05 01:20:56 INFO CoarseGrainedExecutorBackend: Got assigned task 128
> 15/12/05 01:20:56 INFO CoarseGrainedExecutorBackend: Got assigned task 132
> 15/12/05 01:20:56 

[jira] [Resolved] (SPARK-12155) Execution OOM after a relative large dataset cached in the cluster.

2015-12-10 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or resolved SPARK-12155.
---
   Resolution: Fixed
Fix Version/s: 1.6.0

> Execution OOM after a relative large dataset cached in the cluster.
> ---
>
> Key: SPARK-12155
> URL: https://issues.apache.org/jira/browse/SPARK-12155
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Reporter: Yin Huai
>Assignee: Josh Rosen
>Priority: Blocker
> Fix For: 1.6.0
>
>
> I have a cluster with roughly 80GB of memory. I cached a 43GB dataframe. 
> When I started to consume the query, I got the following exception (I added 
> more logging to the code).
> {code}
> 15/12/05 00:33:43 INFO UnifiedMemoryManager: Creating UnifedMemoryManager for 
> 4 cores with 16929521664 maxMemory, 8464760832 storageRegionSize.
> 15/12/05 01:20:50 INFO MemoryStore: Ensuring 1048576 bytes of free space for 
> block rdd_94_37(free: 3253659951, max: 16798973952)
> 15/12/05 01:20:50 INFO MemoryStore: Ensuring 5142008 bytes of free space for 
> block rdd_94_37(free: 3252611375, max: 16798973952)
> 15/12/05 01:20:50 INFO Executor: Finished task 36.0 in stage 4.0 (TID 109). 
> 3028 bytes result sent to driver
> 15/12/05 01:20:50 INFO MemoryStore: Ensuring 98948238 bytes of free space for 
> block rdd_94_37(free: 3314840375, max: 16866344960)
> 15/12/05 01:20:50 INFO MemoryStore: Ensuring 98675713 bytes of free space for 
> block rdd_94_37(free: 3215892137, max: 16866344960)
> 15/12/05 01:20:50 INFO MemoryStore: Ensuring 197347565 bytes of free space 
> for block rdd_94_37(free: 3117216424, max: 16866344960)
> 15/12/05 01:20:50 INFO MemoryStore: Ensuring 295995553 bytes of free space 
> for block rdd_94_37(free: 2919868859, max: 16866344960)
> 15/12/05 01:20:51 INFO MemoryStore: Ensuring 394728479 bytes of free space 
> for block rdd_94_37(free: 2687050010, max: 16929521664)
> 15/12/05 01:20:51 INFO Executor: Finished task 32.0 in stage 4.0 (TID 106). 
> 3028 bytes result sent to driver
> 15/12/05 01:20:51 INFO MemoryStore: Ensuring 591258816 bytes of free space 
> for block rdd_94_37(free: 2292321531, max: 16929521664)
> 15/12/05 01:20:51 INFO MemoryStore: Ensuring 901645182 bytes of free space 
> for block rdd_94_37(free: 1701062715, max: 16929521664)
> 15/12/05 01:20:52 INFO MemoryStore: Ensuring 1302179076 bytes of free space 
> for block rdd_94_37(free: 799417533, max: 16929521664)
> 15/12/05 01:20:52 INFO MemoryStore: Will not store rdd_94_37 as it would 
> require dropping another block from the same RDD
> 15/12/05 01:20:52 WARN MemoryStore: Not enough space to cache rdd_94_37 in 
> memory! (computed 2.4 GB so far)
> 15/12/05 01:20:52 INFO MemoryStore: Memory use = 12.6 GB (blocks) + 2.4 GB 
> (scratch space shared across 13 tasks(s)) = 15.0 GB. Storage limit = 15.8 GB.
> 15/12/05 01:20:52 INFO BlockManager: Found block rdd_94_37 locally
> 15/12/05 01:20:52 INFO UnifiedMemoryManager: Try to acquire 262144 bytes 
> memory. But, on-heap execution memory poll only has 0 bytes free memory.
> 15/12/05 01:20:52 INFO UnifiedMemoryManager: memoryReclaimableFromStorage 
> 8464760832, storageMemoryPool.poolSize 16929521664, storageRegionSize 
> 8464760832.
> 15/12/05 01:20:52 INFO UnifiedMemoryManager: Try to reclaim memory space from 
> storage memory pool.
> 15/12/05 01:20:52 INFO StorageMemoryPool: Claiming 262144 bytes free memory 
> space from StorageMemoryPool.
> 15/12/05 01:20:52 INFO UnifiedMemoryManager: Reclaimed 262144 bytes of memory 
> from storage memory pool.Adding them back to onHeapExecutionMemoryPool.
> 15/12/05 01:20:52 INFO UnifiedMemoryManager: Try to acquire 67108864 bytes 
> memory. But, on-heap execution memory poll only has 0 bytes free memory.
> 15/12/05 01:20:52 INFO UnifiedMemoryManager: memoryReclaimableFromStorage 
> 8464498688, storageMemoryPool.poolSize 16929259520, storageRegionSize 
> 8464760832.
> 15/12/05 01:20:52 INFO UnifiedMemoryManager: Try to reclaim memory space from 
> storage memory pool.
> 15/12/05 01:20:52 INFO StorageMemoryPool: Claiming 67108864 bytes free memory 
> space from StorageMemoryPool.
> 15/12/05 01:20:52 INFO UnifiedMemoryManager: Reclaimed 67108864 bytes of 
> memory from storage memory pool.Adding them back to onHeapExecutionMemoryPool.
> 15/12/05 01:20:54 INFO Executor: Finished task 37.0 in stage 4.0 (TID 110). 
> 3077 bytes result sent to driver
> 15/12/05 01:20:56 INFO CoarseGrainedExecutorBackend: Got assigned task 120
> 15/12/05 01:20:56 INFO Executor: Running task 1.0 in stage 5.0 (TID 120)
> 15/12/05 01:20:56 INFO CoarseGrainedExecutorBackend: Got assigned task 124
> 15/12/05 01:20:56 INFO CoarseGrainedExecutorBackend: Got assigned task 128
> 15/12/05 01:20:56 INFO CoarseGrainedExecutorBackend: Got assigned task 132
> 15/12/05 

[jira] [Resolved] (SPARK-12253) UnifiedMemoryManager race condition: storage can starve new tasks

2015-12-10 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or resolved SPARK-12253.
---
   Resolution: Fixed
Fix Version/s: 1.6.0

> UnifiedMemoryManager race condition: storage can starve new tasks
> -
>
> Key: SPARK-12253
> URL: https://issues.apache.org/jira/browse/SPARK-12253
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>Priority: Blocker
> Fix For: 1.6.0
>
>
> The following race condition is possible with the existing code in unified 
> memory management:
> (1) Existing tasks collectively occupy all execution memory
> (2) New task comes in and blocks while existing tasks spill
> (3) After tasks finish spilling, another task jumps in and puts in a large 
> block, stealing the freed memory
> (4) New task still cannot acquire memory and goes back to sleep
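To make the interleaving concrete, here is a deliberately stripped-down sketch (not the actual UnifiedMemoryManager code; all names are illustrative) of how a woken task can lose the freed memory to a storage put that grabs the lock first:

{code}
// Illustration only: one lock guards a shared pool of free bytes. A task that
// finds no free memory waits (steps 1-2); a spill frees memory and notifies;
// but a storage put that wins the lock first consumes the freed bytes (step 3),
// so the waiter re-checks, finds nothing, and sleeps again (step 4).
object TinyPool {
  private var freeBytes = 0L

  def acquireExecution(needed: Long): Unit = synchronized {
    while (freeBytes < needed) {
      wait()                       // new task blocks while others spill
    }
    freeBytes -= needed
  }

  def releaseAfterSpill(bytes: Long): Unit = synchronized {
    freeBytes += bytes             // existing tasks finish spilling
    notifyAll()
  }

  def putStorageBlock(bytes: Long): Boolean = synchronized {
    if (freeBytes >= bytes) {      // another thread steals the freed memory
      freeBytes -= bytes
      true
    } else false
  }
}
{code}

Which thread wins depends only on lock acquisition order, which is the race the fix has to close.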



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9578) Stemmer feature transformer

2015-12-10 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15051917#comment-15051917
 ] 

holdenk commented on SPARK-9578:


Cool :) I'll keep an eye out for the PR :)

> Stemmer feature transformer
> ---
>
> Key: SPARK-9578
> URL: https://issues.apache.org/jira/browse/SPARK-9578
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> Transformer mentioned first in [SPARK-5571] based on suggestion from 
> [~aloknsingh].  Very standard NLP preprocessing task.
> From [~aloknsingh]:
> {quote}
> We have one Scala stemmer in scalanlp%chalk 
> https://github.com/scalanlp/chalk/tree/master/src/main/scala/chalk/text/analyze
> which can easily be copied (as it is an Apache-licensed project) and is in Scala too.
> I think this will be a better alternative than Lucene's EnglishAnalyzer or 
> OpenNLP.
> Note: we already use scalanlp%breeze via the Maven dependency, so I think 
> adding a scalanlp%chalk dependency is also an option. But as you said, we 
> can copy the code since it is small.
> {quote}
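To give a sense of how small the surface area would be, here is a hedged sketch of such a transformer as a spark.ml UnaryTransformer; the suffix-stripping rule is a trivial placeholder standing in for the ported chalk/Porter stemmer, and the class name is hypothetical:

{code}
import org.apache.spark.ml.UnaryTransformer
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.types.{DataType, StringType}

// Sketch only: a String => String feature transformer. A real version would
// delegate createTransformFunc to the stemmer copied from scalanlp/chalk.
class Stemmer(override val uid: String)
  extends UnaryTransformer[String, String, Stemmer] {

  def this() = this(Identifiable.randomUID("stemmer"))

  // Placeholder rule: strip a few common English suffixes.
  override protected def createTransformFunc: String => String = { word =>
    word.toLowerCase.stripSuffix("ing").stripSuffix("ed").stripSuffix("s")
  }

  override protected def outputDataType: DataType = StringType

  override def copy(extra: ParamMap): Stemmer = defaultCopy(extra)
}

// Usage: new Stemmer().setInputCol("word").setOutputCol("stem").transform(df)
{code}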



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11796) Docker JDBC integration tests fail in Maven build due to dependency issue

2015-12-10 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15051935#comment-15051935
 ] 

Josh Rosen commented on SPARK-11796:


Yep, I removed it from the Master and 1.6 Maven builds right after I merged 
your patch yesterday. Thanks for checking in, though!

> Docker JDBC integration tests fail in Maven build due to dependency issue
> -
>
> Key: SPARK-11796
> URL: https://issues.apache.org/jira/browse/SPARK-11796
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 1.6.0
>Reporter: Josh Rosen
>Assignee: Mark Grover
> Fix For: 1.6.0
>
>
> Our new Docker integration tests for JDBC dialects are failing in the Maven 
> builds. For now, I've disabled this for Maven by adding the 
> {{-Dtest.exclude.tags=org.apache.spark.tags.DockerTest}} flag to our Jenkins 
> builds, but we should fix this soon. The test failures seem to be related to 
> dependency or classpath issues:
> {code}
> *** RUN ABORTED ***
>   java.lang.NoSuchMethodError: 
> org.apache.http.impl.client.HttpClientBuilder.setConnectionManagerShared(Z)Lorg/apache/http/impl/client/HttpClientBuilder;
>   at 
> org.glassfish.jersey.apache.connector.ApacheConnector.<init>(ApacheConnector.java:240)
>   at 
> org.glassfish.jersey.apache.connector.ApacheConnectorProvider.getConnector(ApacheConnectorProvider.java:115)
>   at 
> org.glassfish.jersey.client.ClientConfig$State.initRuntime(ClientConfig.java:418)
>   at 
> org.glassfish.jersey.client.ClientConfig$State.access$000(ClientConfig.java:88)
>   at 
> org.glassfish.jersey.client.ClientConfig$State$3.get(ClientConfig.java:120)
>   at 
> org.glassfish.jersey.client.ClientConfig$State$3.get(ClientConfig.java:117)
>   at 
> org.glassfish.jersey.internal.util.collection.Values$LazyValueImpl.get(Values.java:340)
>   at 
> org.glassfish.jersey.client.ClientConfig.getRuntime(ClientConfig.java:726)
>   at 
> org.glassfish.jersey.client.ClientRequest.getConfiguration(ClientRequest.java:285)
>   at 
> org.glassfish.jersey.client.JerseyInvocation.validateHttpMethodAndEntity(JerseyInvocation.java:126)
>   ...
> {code}
> To reproduce locally: {{build/mvn -pl docker-integration-tests package}}. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12260) Graceful Shutdown with In-Memory State

2015-12-10 Thread Mao, Wei (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15051949#comment-15051949
 ] 

Mao, Wei commented on SPARK-12260:
--

In order to recover from a process restart, there are several steps: 
1) dump the in-memory state when StreamingContext.stop is invoked. 
2) load the dump back from HDFS or whatever source, and convert it to an RDD.
3) pass the RDD into updateStateByKey as the initial state.

As you said, step 3) is already supported by the current code, which is great. But 
step 1) is missing. In short, the main purpose of this JIRA is to add a new 
callback function to StreamingContext.stop, so the user has a chance to dump 
specified in-memory state during streaming context shutdown (steps 2 and 3 are 
sketched below).
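For steps 2) and 3), a minimal sketch of what that looks like with the existing API, assuming the dump was written with saveAsObjectFile; the hdfs:///tmp/state-dump path and the (String, Long) state type are placeholders:

{code}
import org.apache.spark.HashPartitioner
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.dstream.DStream

// Sketch of steps 2) and 3): reload a previously dumped (key, count) state and
// pass it to updateStateByKey as the initial state.
def resumeCounts(ssc: StreamingContext,
                 events: DStream[(String, Long)]): DStream[(String, Long)] = {
  val initialState =
    ssc.sparkContext.objectFile[(String, Long)]("hdfs:///tmp/state-dump")  // step 2)

  val updateFunc: (Seq[Long], Option[Long]) => Option[Long] =
    (values, state) => Some(values.sum + state.getOrElse(0L))

  events.updateStateByKey(                                                 // step 3)
    updateFunc,
    new HashPartitioner(ssc.sparkContext.defaultParallelism),
    initialState)
}
{code}

What is still missing is the hook on the shutdown side that produces the dump in the first place.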

> Graceful Shutdown with In-Memory State
> --
>
> Key: SPARK-12260
> URL: https://issues.apache.org/jira/browse/SPARK-12260
> Project: Spark
>  Issue Type: New Feature
>  Components: Streaming
>Reporter: Mao, Wei
>  Labels: streaming
>
> Users often stop and restart their streaming jobs for tasks such as 
> maintenance, software upgrades or even application logic updates. When a job 
> re-starts it should pick up where it left off i.e. any state information that 
> existed when the job stopped should be used as the initial state when the job 
> restarts.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11962) Add getAsOpt[T] functions to org.apache.spark.sql.Row

2015-12-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15050493#comment-15050493
 ] 

Apache Spark commented on SPARK-11962:
--

User 'aa8y' has created a pull request for this issue:
https://github.com/apache/spark/pull/10247

> Add getAsOpt[T] functions to org.apache.spark.sql.Row
> -
>
> Key: SPARK-11962
> URL: https://issues.apache.org/jira/browse/SPARK-11962
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: Arun Allamsetty
>Priority: Minor
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> Add these functions to enable optionally getting a value from a Row object.
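One possible shape for this, sketched as an enrichment on Row rather than a change to Row itself; the object and method placement below are illustrative, not an existing Spark API:

{code}
import org.apache.spark.sql.Row

// Sketch of the proposed helpers: return None instead of throwing (or handing
// back null) when a field is missing or null.
object RowSyntax {
  implicit class RichRow(val row: Row) extends AnyVal {
    def getAsOpt[T](i: Int): Option[T] =
      if (i < 0 || i >= row.length || row.isNullAt(i)) None
      else Option(row.getAs[T](i))

    def getAsOpt[T](fieldName: String): Option[T] =
      if (row.schema == null || !row.schema.fieldNames.contains(fieldName)) None
      else getAsOpt[T](row.fieldIndex(fieldName))
  }
}

// Usage: import RowSyntax._ ; row.getAsOpt[String]("name")
{code}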



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12196) Store blocks in storage devices with hierarchy way

2015-12-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15050563#comment-15050563
 ] 

Apache Spark commented on SPARK-12196:
--

User 'yucai' has created a pull request for this issue:
https://github.com/apache/spark/pull/10192

> Store blocks in storage devices with hierarchy way
> --
>
> Key: SPARK-12196
> URL: https://issues.apache.org/jira/browse/SPARK-12196
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Reporter: yucai
>
> *Problem*
> Nowadays, users have both SSDs and HDDs. 
> SSDs have great performance but small capacity; HDDs have good capacity but 
> are 2x-3x slower than SSDs.
> How can we get the best of both?
> *Solution*
> Our idea is to build a hierarchy store: use SSDs as a cache and HDDs as backup 
> storage. 
> When Spark core allocates blocks for an RDD (either shuffle or RDD cache), it 
> allocates them on SSDs first, and when the SSDs' usable space drops below some 
> threshold, it allocates blocks on HDDs.
> In our implementation we actually go further: we support building a hierarchy 
> store with any number of levels across all storage media (NVM, SSD, HDD, etc.).
> *Performance*
> 1. In the best case, our solution performs the same as all-SSD storage.
> 2. In the worst case, e.g. when all data is spilled to HDDs, there is no 
> performance regression.
> 3. Compared with all-HDD storage, the hierarchy store improves performance by 
> more than *_1.86x_* (it could be higher; the CPU reaches a bottleneck in our 
> test environment).
> 4. Compared with Tachyon, our hierarchy store is still *_1.3x_* faster, because 
> we support both RDD cache and shuffle, with no extra inter-process 
> communication.
> *Usage*
> 1. Set the priority and threshold for each layer in 
> spark.storage.hierarchyStore.
> {code}
> spark.storage.hierarchyStore='nvm 50GB,ssd 80GB'
> {code}
> It builds a 3-layer hierarchy store: the 1st layer is "nvm", the 2nd is "ssd", 
> and all the rest form the last layer.
> 2. Configure each layer's location: the user just needs to put the keywords 
> like "nvm" and "ssd" specified in step 1 into the local dirs, e.g. 
> spark.local.dir or yarn.nodemanager.local-dirs.
> {code}
> spark.local.dir=/mnt/nvm1,/mnt/ssd1,/mnt/ssd2,/mnt/ssd3,/mnt/disk1,/mnt/disk2,/mnt/disk3,/mnt/disk4,/mnt/others
> {code}
> After that, restart your Spark application; it will allocate blocks from nvm 
> first.
> When nvm's usable space drops below 50GB, it starts to allocate from ssd.
> When ssd's usable space drops below 80GB, it starts to allocate from the 
> last layer.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12255) why "cache table a as select * from b" will do shuffle,and create 2 stages

2015-12-10 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-12255.
---
Resolution: Invalid

Questions go to u...@spark.apache.org not JIRA

> why "cache table a as select * from b" will do shuffle,and create 2 stages
> --
>
> Key: SPARK-12255
> URL: https://issues.apache.org/jira/browse/SPARK-12255
> Project: Spark
>  Issue Type: Question
> Environment: spark-1.4.1-bin-hadoop2.4
>Reporter: ant_nebula
> Attachments: pic1.jpg, pic2.jpg
>
>
> why "cache table a as select * from b" will do shuffle,and create 2 stages.
> example:
> table "ods_pay_consume" is from "KafkaUtils.createDirectStream"
> {code}
> hiveContext.sql("cache table dwd_pay_consume as select * from 
> ods_pay_consume")
> {code}
> this code will create the DAG shown in pic1.jpg
> {code}
> hiveContext.sql(""cache table dw_game_server_recharge as select * from 
> dwd_pay_consume")
> {code}
> this code will create the DAG shown in pic2.jpg, and it similarly computes from 
> the beginning; the earlier "cache table dwd_pay_consume" has no effect.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3106) Fix the race condition issue about Connection and ConnectionManager

2015-12-10 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-3106.
--
Resolution: Incomplete

> Fix the race condition issue about Connection and ConnectionManager
> ---
>
> Key: SPARK-3106
> URL: https://issues.apache.org/jira/browse/SPARK-3106
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: Kousuke Saruta
>
> Now, when we run a Spark application, error messages appear in the driver's log.
> The error messages include the following:
> * messages caused by ClosedChannelException
> * messages caused by CancelledKeyException
> * "Corresponding SendingConnectionManagerId not found"
> Those are mainly caused by a race condition around the time a Connection is 
> closed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3461) Support external groupByKey using repartitionAndSortWithinPartitions

2015-12-10 Thread Pere Ferrera Bertran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15050499#comment-15050499
 ] 

Pere Ferrera Bertran commented on SPARK-3461:
-

Hi [~rxin], does this mean that current DataFrames already have no memory 
limitation per key when doing a groupBy? Is the "scalable" group by + 
secondary sort achieved by dataFrame.orderBy(...).groupBy(...)? I am trying to 
find some more detailed information about this.

> Support external groupByKey using repartitionAndSortWithinPartitions
> 
>
> Key: SPARK-3461
> URL: https://issues.apache.org/jira/browse/SPARK-3461
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Patrick Wendell
>Assignee: Reynold Xin
>Priority: Critical
> Fix For: 1.6.0
>
>
> Given that we have SPARK-2978, it seems like we could support an external 
> group by operator pretty easily. We'd just have to wrap the existing iterator 
> exposed by SPARK-2978 with a lookahead iterator that detects the group 
> boundaries. Also, we'd have to override the cache() operator to cache the 
> parent RDD so that if this object is cached it doesn't wind through the 
> iterator.
> I haven't totally followed all the sort-shuffle internals, but just given the 
> stated semantics of SPARK-2978 it seems like this would be possible.
> It would be really nice to externalize this because many beginner users write 
> jobs in terms of groupByKey.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3461) Support external groupByKey using repartitionAndSortWithinPartitions

2015-12-10 Thread Pere Ferrera Bertran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15050537#comment-15050537
 ] 

Pere Ferrera Bertran commented on SPARK-3461:
-

I asked a question before, but I don't think it made much sense. Can you 
elaborate more on how the new Dataset API overcomes the limitations of the 
current groupByKey()?

> Support external groupByKey using repartitionAndSortWithinPartitions
> 
>
> Key: SPARK-3461
> URL: https://issues.apache.org/jira/browse/SPARK-3461
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Patrick Wendell
>Assignee: Reynold Xin
>Priority: Critical
> Fix For: 1.6.0
>
>
> Given that we have SPARK-2978, it seems like we could support an external 
> group by operator pretty easily. We'd just have to wrap the existing iterator 
> exposed by SPARK-2978 with a lookahead iterator that detects the group 
> boundaries. Also, we'd have to override the cache() operator to cache the 
> parent RDD so that if this object is cached it doesn't wind through the 
> iterator.
> I haven't totally followed all the sort-shuffle internals, but just given the 
> stated semantics of SPARK-2978 it seems like this would be possible.
> It would be really nice to externalize this because many beginner users write 
> jobs in terms of groupByKey.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12260) Graceful Shutdown with In-Memory State

2015-12-10 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15050906#comment-15050906
 ] 

Sean Owen commented on SPARK-12260:
---

Isn't this what updateStateByKey and similar methods already accomplish for you?

> Graceful Shutdown with In-Memory State
> --
>
> Key: SPARK-12260
> URL: https://issues.apache.org/jira/browse/SPARK-12260
> Project: Spark
>  Issue Type: New Feature
>  Components: Streaming
>Reporter: Mao, Wei
>  Labels: streaming
>
> Users often stop and restart their streaming jobs for tasks such as 
> maintenance, software upgrades or even application logic updates. When a job 
> re-starts it should pick up where it left off i.e. any state information that 
> existed when the job stopped should be used as the initial state when the job 
> restarts.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12257) Non partitioned insert into a partitioned Hive table doesn't fail

2015-12-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12257:


Assignee: (was: Apache Spark)

> Non partitioned insert into a partitioned Hive table doesn't fail
> -
>
> Key: SPARK-12257
> URL: https://issues.apache.org/jira/browse/SPARK-12257
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Mark Grover
>Priority: Minor
>
> I am using Spark 1.5.1 but I anticipate this to be a problem with master as 
> well (will check later).
> I have a dataframe, and a partitioned Hive table that I want to insert the 
> contents of the data frame into.
> Let's say mytable is a non-partitioned Hive table and mytable_partitioned is 
> a partitioned Hive table. In Hive, if you try to insert from the 
> non-partitioned mytable table into mytable_partitioned without specifying the 
> partition, the query fails, as expected:
> {quote}
> INSERT INTO mytable_partitioned SELECT * FROM mytable;
> {quote}
> {quote}
> Error: Error while compiling statement: FAILED: SemanticException 1:12 Need 
> to specify partition columns because the destination table is partitioned. 
> Error encountered near token 'mytable_partitioned' (state=42000,code=4)
> {quote}
> However, if I do the same in Spark SQL:
> {code}
> val myDfTempTable = myDf.registerTempTable("my_df_temp_table")
> sqlContext.sql("INSERT INTO mytable_partitioned SELECT * FROM 
> my_df_temp_table")
> {code}
> This appears to succeed but does no insertion. This should fail with an error 
> stating the data is being inserted into a partitioned table without 
> specifying the name of the partition.
> Of course, when the name of the partition is explicitly specified, both Hive and 
> Spark SQL do the right thing and function correctly.
> In hive:
> {code}
> INSERT INTO mytable_partitioned PARTITION (y='abc') SELECT * FROM mytable;
> {code}
> In Spark SQL:
> {code}
> val myDfTempTable = myDf.registerTempTable("my_df_temp_table")
> sqlContext.sql("INSERT INTO mytable_partitioned PARTITION (y='abc') SELECT * 
> FROM my_df_temp_table")
> {code}
> And, here are the definitions of my tables, as reference:
> {code}
> CREATE TABLE mytable(x INT);
> CREATE TABLE mytable_partitioned (x INT) PARTITIONED BY (y INT);
> {code}
> You will also need to insert some dummy data into mytable so you can verify that 
> the insertion is actually not happening:
> {code}
> #!/bin/bash
> rm -rf data.txt;
> for i in {0..9}; do
> echo $i >> data.txt
> done
> sudo -u hdfs hadoop fs -put data.txt /user/hive/warehouse/mytable
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12257) Non partitioned insert into a partitioned Hive table doesn't fail

2015-12-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15050587#comment-15050587
 ] 

Apache Spark commented on SPARK-12257:
--

User 'dilipbiswal' has created a pull request for this issue:
https://github.com/apache/spark/pull/10254

> Non partitioned insert into a partitioned Hive table doesn't fail
> -
>
> Key: SPARK-12257
> URL: https://issues.apache.org/jira/browse/SPARK-12257
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Mark Grover
>Priority: Minor
>
> I am using Spark 1.5.1 but I anticipate this to be a problem with master as 
> well (will check later).
> I have a dataframe, and a partitioned Hive table that I want to insert the 
> contents of the data frame into.
> Let's say mytable is a non-partitioned Hive table and mytable_partitioned is 
> a partitioned Hive table. In Hive, if you try to insert from the 
> non-partitioned mytable table into mytable_partitioned without specifying the 
> partition, the query fails, as expected:
> {quote}
> INSERT INTO mytable_partitioned SELECT * FROM mytable;
> {quote}
> {quote}
> Error: Error while compiling statement: FAILED: SemanticException 1:12 Need 
> to specify partition columns because the destination table is partitioned. 
> Error encountered near token 'mytable_partitioned' (state=42000,code=4)
> {quote}
> However, if I do the same in Spark SQL:
> {code}
> val myDfTempTable = myDf.registerTempTable("my_df_temp_table")
> sqlContext.sql("INSERT INTO mytable_partitioned SELECT * FROM 
> my_df_temp_table")
> {code}
> This appears to succeed but does no insertion. This should fail with an error 
> stating the data is being inserted into a partitioned table without 
> specifying the name of the partition.
> Of course, when the name of the partition is explicitly specified, both Hive and 
> Spark SQL do the right thing and function correctly.
> In hive:
> {code}
> INSERT INTO mytable_partitioned PARTITION (y='abc') SELECT * FROM mytable;
> {code}
> In Spark SQL:
> {code}
> val myDfTempTable = myDf.registerTempTable("my_df_temp_table")
> sqlContext.sql("INSERT INTO mytable_partitioned PARTITION (y='abc') SELECT * 
> FROM my_df_temp_table")
> {code}
> And, here are the definitions of my tables, as reference:
> {code}
> CREATE TABLE mytable(x INT);
> CREATE TABLE mytable_partitioned (x INT) PARTITIONED BY (y INT);
> {code}
> You will also need to insert some dummy data into mytable so you can verify that 
> the insertion is actually not happening:
> {code}
> #!/bin/bash
> rm -rf data.txt;
> for i in {0..9}; do
> echo $i >> data.txt
> done
> sudo -u hdfs hadoop fs -put data.txt /user/hive/warehouse/mytable
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8498) Fix NullPointerException in error-handling path in UnsafeShuffleWriter

2015-12-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15050613#comment-15050613
 ] 

Apache Spark commented on SPARK-8498:
-

User 'andrewor14' has created a pull request for this issue:
https://github.com/apache/spark/pull/6909

> Fix NullPointerException in error-handling path in UnsafeShuffleWriter
> --
>
> Key: SPARK-8498
> URL: https://issues.apache.org/jira/browse/SPARK-8498
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 1.4.0
>Reporter: Josh Rosen
>Assignee: holdenk
> Fix For: 1.5.0
>
>
> This bug was reported by [~prudenko] on the dev list.  When the 
> {{tungsten-sort}} shuffle manager was enabled, an executor died with the 
> following exception:
> {code}
> 15/06/19 17:53:35 WARN TaskSetManager: Lost task 38.0 in stage 41.0 (TID 
> 3176, ip-10-50-225-214.ec2.internal): java.lang.NullPointerException
> at 
> org.apache.spark.shuffle.unsafe.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:151)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:70)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> I think that this is actually due to an error-handling issue.  In the stack 
> trace, the NPE is being thrown from an error-handling branch of a `finally` 
> block:
> {code}
> public void write(scala.collection.Iterator<Product2<K, V>> records) throws 
> IOException {
> boolean success = false;
> try {
>   while (records.hasNext()) {
> insertRecordIntoSorter(records.next());
>   }
>   closeAndWriteOutput();
>   success = true;
> } finally {
>   if (!success) {
> sorter.cleanupAfterError();  // <-- this is the line throwing the 
> error
>   }
> }
>   }
> {code}
> I suspect that what's happening is that an exception is being thrown from 
> user / upstream code in the initial call to records.next(), but the 
> error-handling block is failing because sorter == null since we haven't 
> initialized it yet.
> We should fix this bug with a {{sorter != null}} check and should also add a 
> regression test to ShuffleSuite to ensure that exceptions thrown by user code 
> at this step of the shuffle write path don't get masked by error-handling 
> bugs inside of the shuffle code.
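The regression-test idea above is easy to picture with a self-contained sketch of the guarded cleanup; the names below are stand-ins for illustration, not the real UnsafeShuffleWriter:

{code}
// Sketch only: cleanup in the finally block runs solely when the sorter was
// actually initialized, so a failure before the first successful insert no
// longer turns into a NullPointerException that masks the original error.
class SketchWriter {
  private var sorter: StringBuilder = null   // stands in for the external sorter

  private def insertRecord(rec: Int): Unit = {
    if (rec < 0) throw new IllegalArgumentException("bad record")  // may fail before init
    if (sorter == null) sorter = new StringBuilder
    sorter.append(rec)
  }

  def write(records: Iterator[Int]): Unit = {
    var success = false
    try {
      records.foreach(insertRecord)
      success = true
    } finally {
      if (!success && sorter != null) {      // the added null check
        sorter.clear()                       // stands in for cleanupAfterError()
      }
    }
  }
}
{code}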



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12261) pyspark crash for large dataset

2015-12-10 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-12261.
---
Resolution: Not A Problem

I think it's pretty clear: you pulled a lot of data to your driver and it 
failed. You pulled a little data and it didn't. Increase driver memory or else 
don't do that. This is not a bug.
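For anyone landing here with the same symptom, a couple of standard ways to raise driver memory (the 4g value and script name are only illustrative):

{code}
# one-off, on the command line:
spark-submit --driver-memory 4g your_script.py

# or persistently, in conf/spark-defaults.conf:
spark.driver.memory 4g
{code}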

> pyspark crash for large dataset
> ---
>
> Key: SPARK-12261
> URL: https://issues.apache.org/jira/browse/SPARK-12261
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.5.2
> Environment: windows
>Reporter: zihao
>
> I tried to import a local text file (over 100MB) via textFile in PySpark. When 
> I ran data.take(), it failed and gave error messages including:
> 15/12/10 17:17:43 ERROR TaskSetManager: Task 0 in stage 0.0 failed 1 times; 
> aborting job
> Traceback (most recent call last):
>   File "E:/spark_python/test3.py", line 9, in 
> lines.take(5)
>   File "D:\spark\spark-1.5.2-bin-hadoop2.6\python\pyspark\rdd.py", line 1299, 
> in take
> res = self.context.runJob(self, takeUpToNumLeft, p)
>   File "D:\spark\spark-1.5.2-bin-hadoop2.6\python\pyspark\context.py", line 
> 916, in runJob
> port = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, 
> partitions)
>   File "C:\Anaconda2\lib\site-packages\py4j\java_gateway.py", line 813, in 
> __call__
> answer, self.gateway_client, self.target_id, self.name)
>   File "D:\spark\spark-1.5.2-bin-hadoop2.6\python\pyspark\sql\utils.py", line 
> 36, in deco
> return f(*a, **kw)
>   File "C:\Anaconda2\lib\site-packages\py4j\protocol.py", line 308, in 
> get_return_value
> format(target_id, ".", name), value)
> py4j.protocol.Py4JJavaError: An error occurred while calling 
> z:org.apache.spark.api.python.PythonRDD.runJob.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 
> in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 
> (TID 0, localhost): java.net.SocketException: Connection reset by peer: 
> socket write error
> Then I ran the same code on a small text file, and this time .take() worked fine.
> How can I solve this problem?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-3461) Support external groupByKey using repartitionAndSortWithinPartitions

2015-12-10 Thread Pere Ferrera Bertran (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pere Ferrera Bertran updated SPARK-3461:

Comment: was deleted

(was: Hi [~rxin], does this mean that current DataFrames already have no 
memory limitation per key when doing a groupBy? Is the "scalable" group by + 
secondary sort achieved by dataFrame.orderBy(...).groupBy(...)? I am trying to 
find some more detailed information about this.)

> Support external groupByKey using repartitionAndSortWithinPartitions
> 
>
> Key: SPARK-3461
> URL: https://issues.apache.org/jira/browse/SPARK-3461
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Patrick Wendell
>Assignee: Reynold Xin
>Priority: Critical
> Fix For: 1.6.0
>
>
> Given that we have SPARK-2978, it seems like we could support an external 
> group by operator pretty easily. We'd just have to wrap the existing iterator 
> exposed by SPARK-2978 with a lookahead iterator that detects the group 
> boundaries. Also, we'd have to override the cache() operator to cache the 
> parent RDD so that if this object is cached it doesn't wind through the 
> iterator.
> I haven't totally followed all the sort-shuffle internals, but just given the 
> stated semantics of SPARK-2978 it seems like this would be possible.
> It would be really nice to externalize this because many beginner users write 
> jobs in terms of groupByKey.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12257) Non partitioned insert into a partitioned Hive table doesn't fail

2015-12-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12257:


Assignee: Apache Spark

> Non partitioned insert into a partitioned Hive table doesn't fail
> -
>
> Key: SPARK-12257
> URL: https://issues.apache.org/jira/browse/SPARK-12257
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Mark Grover
>Assignee: Apache Spark
>Priority: Minor
>
> I am using Spark 1.5.1 but I anticipate this to be a problem with master as 
> well (will check later).
> I have a dataframe, and a partitioned Hive table that I want to insert the 
> contents of the data frame into.
> Let's say mytable is a non-partitioned Hive table and mytable_partitioned is 
> a partitioned Hive table. In Hive, if you try to insert from the 
> non-partitioned mytable table into mytable_partitioned without specifying the 
> partition, the query fails, as expected:
> {quote}
> INSERT INTO mytable_partitioned SELECT * FROM mytable;
> {quote}
> {quote}
> Error: Error while compiling statement: FAILED: SemanticException 1:12 Need 
> to specify partition columns because the destination table is partitioned. 
> Error encountered near token 'mytable_partitioned' (state=42000,code=4)
> {quote}
> However, if I do the same in Spark SQL:
> {code}
> val myDfTempTable = myDf.registerTempTable("my_df_temp_table")
> sqlContext.sql("INSERT INTO mytable_partitioned SELECT * FROM 
> my_df_temp_table")
> {code}
> This appears to succeed but does no insertion. This should fail with an error 
> stating the data is being inserted into a partitioned table without 
> specifying the name of the partition.
> Of course, when the name of the partition is explicitly specified, both Hive and 
> Spark SQL do the right thing and function correctly.
> In hive:
> {code}
> INSERT INTO mytable_partitioned PARTITION (y='abc') SELECT * FROM mytable;
> {code}
> In Spark SQL:
> {code}
> val myDfTempTable = myDf.registerTempTable("my_df_temp_table")
> sqlContext.sql("INSERT INTO mytable_partitioned PARTITION (y='abc') SELECT * 
> FROM my_df_temp_table")
> {code}
> And, here are the definitions of my tables, as reference:
> {code}
> CREATE TABLE mytable(x INT);
> CREATE TABLE mytable_partitioned (x INT) PARTITIONED BY (y INT);
> {code}
> You will also need to insert some dummy data into mytable so you can verify that 
> the insertion is actually not happening:
> {code}
> #!/bin/bash
> rm -rf data.txt;
> for i in {0..9}; do
> echo $i >> data.txt
> done
> sudo -u hdfs hadoop fs -put data.txt /user/hive/warehouse/mytable
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


