[jira] [Assigned] (SPARK-20590) Map default input data source formats to inlined classes

2017-05-09 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-20590:
---

Assignee: Hyukjin Kwon

> Map default input data source formats to inlined classes
> 
>
> Key: SPARK-20590
> URL: https://issues.apache.org/jira/browse/SPARK-20590
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Sameer Agarwal
>Assignee: Hyukjin Kwon
> Fix For: 2.2.1, 2.3.0
>
>
> One of the common usability problems around reading data in Spark 
> (particularly CSV) is that there can often be a conflict between different 
> readers on the classpath.
> As an example, if someone launches a 2.x Spark shell with the spark-csv 
> package on the classpath, Spark currently fails in an extremely unfriendly way:
> {code}
> ./bin/spark-shell --packages com.databricks:spark-csv_2.11:1.5.0
> scala> val df = spark.read.csv("/foo/bar.csv")
> java.lang.RuntimeException: Multiple sources found for csv 
> (org.apache.spark.sql.execution.datasources.csv.CSVFileFormat, 
> com.databricks.spark.csv.DefaultSource15), please specify the fully qualified 
> class name.
>   at scala.sys.package$.error(package.scala:27)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:574)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.providingClass$lzycompute(DataSource.scala:85)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.providingClass(DataSource.scala:85)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:295)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
>   at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:533)
>   at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:412)
>   ... 48 elided
> {code}
> This JIRA proposes a simple way of fixing this error by always mapping 
> default input data source formats to inlined classes (that exist in Spark).
> {code}
> ./bin/spark-shell --packages com.databricks:spark-csv_2.11:1.5.0
> scala> val df = spark.read.csv("/foo/bar.csv")
> df: org.apache.spark.sql.DataFrame = [_c0: string]
> {code}
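
A rough sketch of the idea, for illustration only (the class list and helper names below are illustrative, not the actual patch): keep a small map from the default short format names to Spark's inlined implementations and consult it before the generic classpath lookup, so external packages can no longer shadow the built-in readers.

{code}
// Illustrative sketch only; the real lookup lives in DataSource.lookupDataSource.
object DefaultSourceMapping {
  // Hypothetical map of default short names to the classes inlined in Spark.
  private val builtinSources = Map(
    "csv" -> "org.apache.spark.sql.execution.datasources.csv.CSVFileFormat",
    "json" -> "org.apache.spark.sql.execution.datasources.json.JsonFileFormat",
    "parquet" -> "org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat"
  )

  /** Resolve a short format name, preferring the built-in class when one exists. */
  def resolve(name: String, classpathLookup: String => Class[_]): Class[_] =
    builtinSources.get(name.toLowerCase) match {
      case Some(fqcn) => Class.forName(fqcn)     // built-in implementation wins
      case None       => classpathLookup(name)   // unchanged behaviour otherwise
    }
}
{code}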






[jira] [Resolved] (SPARK-20590) Map default input data source formats to inlined classes

2017-05-09 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-20590.
-
   Resolution: Fixed
Fix Version/s: 2.3.0
   2.2.1

Issue resolved by pull request 17916
[https://github.com/apache/spark/pull/17916]

> Map default input data source formats to inlined classes
> 
>
> Key: SPARK-20590
> URL: https://issues.apache.org/jira/browse/SPARK-20590
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Sameer Agarwal
> Fix For: 2.2.1, 2.3.0
>
>
> One of the common usability problems around reading data in Spark 
> (particularly CSV) is that there can often be a conflict between different 
> readers on the classpath.
> As an example, if someone launches a 2.x Spark shell with the spark-csv 
> package on the classpath, Spark currently fails in an extremely unfriendly way:
> {code}
> ./bin/spark-shell --packages com.databricks:spark-csv_2.11:1.5.0
> scala> val df = spark.read.csv("/foo/bar.csv")
> java.lang.RuntimeException: Multiple sources found for csv 
> (org.apache.spark.sql.execution.datasources.csv.CSVFileFormat, 
> com.databricks.spark.csv.DefaultSource15), please specify the fully qualified 
> class name.
>   at scala.sys.package$.error(package.scala:27)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:574)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.providingClass$lzycompute(DataSource.scala:85)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.providingClass(DataSource.scala:85)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:295)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
>   at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:533)
>   at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:412)
>   ... 48 elided
> {code}
> This JIRA proposes a simple way of fixing this error by always mapping 
> default input data source formats to inlined classes (that exist in Spark).
> {code}
> ./bin/spark-shell --packages com.databricks:spark-csv_2.11:1.5.0
> scala> val df = spark.read.csv("/foo/bar.csv")
> df: org.apache.spark.sql.DataFrame = [_c0: string]
> {code}






[jira] [Commented] (SPARK-12837) Spark driver requires large memory space for serialized results even there are no data collected to the driver

2017-05-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16004103#comment-16004103
 ] 

Apache Spark commented on SPARK-12837:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/17931

> Spark driver requires large memory space for serialized results even there 
> are no data collected to the driver
> --
>
> Key: SPARK-12837
> URL: https://issues.apache.org/jira/browse/SPARK-12837
> Project: Spark
>  Issue Type: Question
>  Components: SQL
>Affects Versions: 1.5.2, 1.6.0
>Reporter: Tien-Dung LE
>Assignee: Wenchen Fan
>Priority: Critical
> Fix For: 2.2.1, 2.3.0
>
>
> Executing a SQL statement with a large number of partitions requires a large 
> amount of driver memory even when there are no requests to collect data back 
> to the driver.
> Here are the steps to reproduce the issue.
> 1. Start a Spark shell with a spark.driver.maxResultSize setting:
> {code:java}
> bin/spark-shell --driver-memory=1g --conf spark.driver.maxResultSize=1m
> {code}
> 2. Execute the code 
> {code:java}
> case class Toto( a: Int, b: Int)
> val df = sc.parallelize( 1 to 1e6.toInt).map( i => Toto( i, i)).toDF
> sqlContext.setConf( "spark.sql.shuffle.partitions", "200" )
> df.groupBy("a").count().saveAsParquetFile( "toto1" ) // OK
> sqlContext.setConf( "spark.sql.shuffle.partitions", 1e3.toInt.toString )
> df.repartition(1e3.toInt).groupBy("a").count().repartition(1e3.toInt).saveAsParquetFile(
>  "toto2" ) // ERROR
> {code}
> The error message is 
> {code:java}
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Total size of serialized results of 393 tasks (1025.9 KB) is bigger than 
> spark.driver.maxResultSize (1024.0 KB)
> {code}
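
As a side note, until the accounting itself is fixed, a common workaround (not a fix) is to raise the limit or disable it entirely; setting spark.driver.maxResultSize to 0 means unlimited:

{code}
bin/spark-shell --driver-memory=1g --conf spark.driver.maxResultSize=0
{code}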






[jira] [Assigned] (SPARK-20688) correctly check analysis for scalar sub-queries

2017-05-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20688:


Assignee: Apache Spark  (was: Wenchen Fan)

> correctly check analysis for scalar sub-queries
> ---
>
> Key: SPARK-20688
> URL: https://issues.apache.org/jira/browse/SPARK-20688
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>







[jira] [Assigned] (SPARK-20688) correctly check analysis for scalar sub-queries

2017-05-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20688:


Assignee: Wenchen Fan  (was: Apache Spark)

> correctly check analysis for scalar sub-queries
> ---
>
> Key: SPARK-20688
> URL: https://issues.apache.org/jira/browse/SPARK-20688
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>







[jira] [Commented] (SPARK-20688) correctly check analysis for scalar sub-queries

2017-05-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16004060#comment-16004060
 ] 

Apache Spark commented on SPARK-20688:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/17930

> correctly check analysis for scalar sub-queries
> ---
>
> Key: SPARK-20688
> URL: https://issues.apache.org/jira/browse/SPARK-20688
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>







[jira] [Created] (SPARK-20688) correctly check analysis for scalar sub-queries

2017-05-09 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-20688:
---

 Summary: correctly check analysis for scalar sub-queries
 Key: SPARK-20688
 URL: https://issues.apache.org/jira/browse/SPARK-20688
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
Reporter: Wenchen Fan
Assignee: Wenchen Fan









[jira] [Issue Comment Deleted] (SPARK-20687) mllib.Matrices.fromBreeze may crash when converting breeze CSCMatrix

2017-05-09 Thread Ignacio Bermudez Corrales (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ignacio Bermudez Corrales updated SPARK-20687:
--
Comment: was deleted

(was:   test("breeze conversion bug") {
// (2, 0, 0)
// (2, 0, 0)
val mat1Brz = Matrices.sparse(2, 3, Array(0, 2, 2, 2), Array(0, 1), 
Array(2, 2)).asBreeze
// (2, 1E-15, 1E-15)
// (2, 1E-15, 1E-15
val mat2Brz = Matrices.sparse(2, 3, Array(0, 2, 4, 6), Array(0, 0, 0, 1, 1, 
1), Array(2, 1E-15, 1E-15, 2, 1E-15, 1E-15)).asBreeze
// The following shouldn't break
val t01 = mat1Brz - mat1Brz
val t02 = mat2Brz - mat2Brz
val t02Brz = Matrices.fromBreeze(t02)
val t01Brz = Matrices.fromBreeze(t01)

val t1Brz = mat1Brz - mat2Brz
val t2Brz = mat2Brz - mat1Brz
// The following ones should break
val t1 = Matrices.fromBreeze(t1Brz)
val t2 = Matrices.fromBreeze(t2Brz)

  })

> mllib.Matrices.fromBreeze may crash when converting breeze CSCMatrix
> 
>
> Key: SPARK-20687
> URL: https://issues.apache.org/jira/browse/SPARK-20687
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.1.1
>Reporter: Ignacio Bermudez Corrales
>Priority: Critical
>
> Conversion of Breeze sparse matrices to Matrix is broken when matrices are 
> the product of certain operations. I think this problem is caused by the 
> update method in Breeze CSCMatrix, which adds provisional zeros to the data 
> for efficiency.
> This bug is serious and may affect at least BlockMatrix addition and 
> subtraction:
> http://stackoverflow.com/questions/33528555/error-thrown-when-using-blockmatrix-add/43883458#43883458
> The following code reproduces the bug.
>   test("breeze conversion bug") {
> // (2, 0, 0)
> // (2, 0, 0)
> val mat1Brz = Matrices.sparse(2, 3, Array(0, 2, 2, 2), Array(0, 1), 
> Array(2, 2)).asBreeze
> // (2, 1E-15, 1E-15)
> // (2, 1E-15, 1E-15)
> val mat2Brz = Matrices.sparse(2, 3, Array(0, 2, 4, 6), Array(0, 0, 0, 1, 
> 1, 1), Array(2, 1E-15, 1E-15, 2, 1E-15, 1E-15)).asBreeze
> // The following shouldn't break
> val t01 = mat1Brz - mat1Brz
> val t02 = mat2Brz - mat2Brz
> val t02Brz = Matrices.fromBreeze(t02)
> val t01Brz = Matrices.fromBreeze(t01)
> val t1Brz = mat1Brz - mat2Brz
> val t2Brz = mat2Brz - mat1Brz
> // The following ones should break
> val t1 = Matrices.fromBreeze(t1Brz)
> val t2 = Matrices.fromBreeze(t2Brz)
>   }






[jira] [Created] (SPARK-20687) mllib.Matrices.fromBreeze may crash when converting breeze CSCMatrix

2017-05-09 Thread Ignacio Bermudez Corrales (JIRA)
Ignacio Bermudez Corrales created SPARK-20687:
-

 Summary: mllib.Matrices.fromBreeze may crash when converting 
breeze CSCMatrix
 Key: SPARK-20687
 URL: https://issues.apache.org/jira/browse/SPARK-20687
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 2.1.1
Reporter: Ignacio Bermudez Corrales
Priority: Critical


Conversion of Breeze sparse matrices to Matrix is broken when matrices are 
the product of certain operations. I think this problem is caused by the update 
method in Breeze CSCMatrix, which adds provisional zeros to the data for 
efficiency.

This bug is serious and may affect at least BlockMatrix addition and 
subtraction:

http://stackoverflow.com/questions/33528555/error-thrown-when-using-blockmatrix-add/43883458#43883458

The following code reproduces the bug.

  test("breeze conversion bug") {
// (2, 0, 0)
// (2, 0, 0)
val mat1Brz = Matrices.sparse(2, 3, Array(0, 2, 2, 2), Array(0, 1), 
Array(2, 2)).asBreeze
// (2, 1E-15, 1E-15)
// (2, 1E-15, 1E-15)
val mat2Brz = Matrices.sparse(2, 3, Array(0, 2, 4, 6), Array(0, 0, 0, 1, 1, 
1), Array(2, 1E-15, 1E-15, 2, 1E-15, 1E-15)).asBreeze
// The following shouldn't break
val t01 = mat1Brz - mat1Brz
val t02 = mat2Brz - mat2Brz
val t02Brz = Matrices.fromBreeze(t02)
val t01Brz = Matrices.fromBreeze(t01)

val t1Brz = mat1Brz - mat2Brz
val t2Brz = mat2Brz - mat1Brz
// The following ones should break
val t1 = Matrices.fromBreeze(t1Brz)
val t2 = Matrices.fromBreeze(t2Brz)

  }
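
A hedged sketch of one possible defensive conversion (an illustration only, not the actual Spark fix): rebuild the matrix from the active entries instead of trusting the raw internal arrays, so any provisional zero slots Breeze keeps for efficiency are ignored.

{code}
import breeze.linalg.{CSCMatrix => BreezeCSC}
import org.apache.spark.mllib.linalg.{Matrix, SparseMatrix}

// Sketch only: walk the structurally non-zero entries and rebuild via COO,
// dropping any explicit zeros that may be present in the Breeze data array.
def fromBreezeSafe(m: BreezeCSC[Double]): Matrix = {
  val entries = m.activeIterator.collect {
    case ((row, col), value) if value != 0.0 => (row, col, value)
  }.toSeq
  SparseMatrix.fromCOO(m.rows, m.cols, entries)
}
{code}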






[jira] [Commented] (SPARK-20554) Remove usage of scala.language.reflectiveCalls

2017-05-09 Thread Umesh Chaudhary (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16004020#comment-16004020
 ] 

Umesh Chaudhary commented on SPARK-20554:
-

[~srowen] I have not had time recently to go through all the warnings. Feel free 
to take it over.

> Remove usage of scala.language.reflectiveCalls
> --
>
> Key: SPARK-20554
> URL: https://issues.apache.org/jira/browse/SPARK-20554
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, Spark Core, SQL, Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Sean Owen
>Priority: Minor
>
> In several parts of the code we have imported 
> {{scala.language.reflectiveCalls}} to suppress a warning about, well, 
> reflective calls. I know from cleaning up build warnings in 2.2 that almost 
> all cases of this are inadvertent and mask a type problem.
> Example, in HiveDDLSuite:
> {code}
> val expectedTablePath =
>   if (dbPath.isEmpty) {
> hiveContext.sessionState.catalog.defaultTablePath(tableIdentifier)
>   } else {
> new Path(new Path(dbPath.get), tableIdentifier.table)
>   }
> val filesystemPath = new Path(expectedTablePath.toString)
> {code}
> This shouldn't really work, because one branch returns a URI and the other a 
> Path. In this case only an object with a toString method is needed, and Scala 
> makes this work with structural types and reflection.
> Obviously, the intent was to add ".toURI" to the second branch to make both 
> branches return a URI!
> I think we should probably clean this up by taking out all imports of 
> reflectiveCalls, and re-evaluating all of the warnings. There may be a few 
> legit usages.
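
For reference, a hedged sketch of the cleanup the description suggests, reusing the names from the snippet above (hiveContext, dbPath and tableIdentifier come from the original test; this is not the actual change): give both branches the same type so no structural typing or reflective call is needed.

{code}
import org.apache.hadoop.fs.Path

// Sketch only: defaultTablePath already yields a URI (per the description above),
// so the other branch just needs .toUri (the Hadoop Path method) to match it.
val expectedTablePath: java.net.URI =
  if (dbPath.isEmpty) {
    hiveContext.sessionState.catalog.defaultTablePath(tableIdentifier)
  } else {
    new Path(new Path(dbPath.get), tableIdentifier.table).toUri
  }
val filesystemPath = new Path(expectedTablePath.toString)
{code}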






[jira] [Comment Edited] (SPARK-14584) Improve recognition of non-nullability in Dataset transformations

2017-05-09 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16003996#comment-16003996
 ] 

Hyukjin Kwon edited comment on SPARK-14584 at 5/10/17 3:56 AM:
---

[~joshrosen] SPARK-18284 fixes this.

Before this PR:

{code}
scala> case class MyCaseClass(foo: Int)
defined class MyCaseClass

scala> sc.parallelize(Seq(0)).toDS.map(MyCaseClass).printSchema
root
 |-- foo: integer (nullable = true)
{code}


After this PR:

{code}
scala> case class MyCaseClass(foo: Int)
defined class MyCaseClass

scala> sc.parallelize(Seq(0)).toDS.map(MyCaseClass).printSchema
root
 |-- foo: integer (nullable = false)
{code}



was (Author: hyukjin.kwon):
[~joshrosen] SPARK-18284 fixes this.

Before this PR:

{code}
scala> case class MyCaseClass(foo: Int)
defined class MyCaseClass

scala> sc.parallelize(Seq(0)).toDS.map(MyCaseClass).printSchema
root
 |-- foo: integer (nullable = false)
{code}


After this PR:

{code}
scala> case class MyCaseClass(foo: Int)
defined class MyCaseClass

scala> sc.parallelize(Seq(0)).toDS.map(MyCaseClass).printSchema
root
 |-- foo: integer (nullable = false)
{code}


> Improve recognition of non-nullability in Dataset transformations
> -
>
> Key: SPARK-14584
> URL: https://issues.apache.org/jira/browse/SPARK-14584
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Josh Rosen
>
> There are many cases where we can statically know that a field will never be 
> null. For instance, a field in a case class with a primitive type will never 
> return null. However, there are currently several cases in the Dataset API 
> where we do not properly recognize this non-nullability. For instance:
> {code}
> case class MyCaseClass(foo: Int)
> sc.parallelize(Seq(0)).toDS.map(MyCaseClass).printSchema
> {code}
> claims that the {{foo}} field is nullable even though this is impossible.
> I believe that this is due to the way that we reason about nullability when 
> constructing serializer expressions in ExpressionEncoders. The following 
> assertion will currently fail if added to ExpressionEncoder:
> {code}
>   require(schema.size == serializer.size)
>   schema.fields.zip(serializer).foreach { case (field, fieldSerializer) =>
> require(field.dataType == fieldSerializer.dataType, s"Field 
> ${field.name}'s data type is " +
>   s"${field.dataType} in the schema but ${fieldSerializer.dataType} in 
> its serializer")
> require(field.nullable == fieldSerializer.nullable, s"Field 
> ${field.name}'s nullability is " +
>   s"${field.nullable} in the schema but ${fieldSerializer.nullable} in 
> its serializer")
>   }
> {code}
> Most often, the schema claims that a field is non-nullable while the encoder 
> allows for nullability, but occasionally we see a mismatch in the datatypes 
> due to disagreements over the nullability of nested structs' fields (or 
> fields of structs in arrays).
> I think the problem is that when we're reasoning about nullability in a 
> struct's schema we consider its fields' nullability to be independent of the 
> nullability of the struct itself, whereas in the serializer expressions we 
> are considering those field extraction expressions to be nullable if the 
> input objects themselves can be nullable.
> I'm not sure what's the simplest way to fix this. One proposal would be to 
> leave the serializers unchanged and have ObjectOperator derive its output 
> attributes from an explicitly-passed schema rather than using the 
> serializers' attributes. However, I worry that this might introduce bugs in 
> case the serializer and schema disagree.






[jira] [Commented] (SPARK-14584) Improve recognition of non-nullability in Dataset transformations

2017-05-09 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16003996#comment-16003996
 ] 

Hyukjin Kwon commented on SPARK-14584:
--

[~joshrosen] SPARK-18284 fixes this.

Before this PR:

{code}
scala> case class MyCaseClass(foo: Int)
defined class MyCaseClass

scala> sc.parallelize(Seq(0)).toDS.map(MyCaseClass).printSchema
root
 |-- foo: integer (nullable = false)
{code}


After this PR:

{code}
scala> case class MyCaseClass(foo: Int)
defined class MyCaseClass

scala> sc.parallelize(Seq(0)).toDS.map(MyCaseClass).printSchema
root
 |-- foo: integer (nullable = false)
{code}


> Improve recognition of non-nullability in Dataset transformations
> -
>
> Key: SPARK-14584
> URL: https://issues.apache.org/jira/browse/SPARK-14584
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Josh Rosen
>
> There are many cases where we can statically know that a field will never be 
> null. For instance, a field in a case class with a primitive type will never 
> return null. However, there are currently several cases in the Dataset API 
> where we do not properly recognize this non-nullability. For instance:
> {code}
> case class MyCaseClass(foo: Int)
> sc.parallelize(Seq(0)).toDS.map(MyCaseClass).printSchema
> {code}
> claims that the {{foo}} field is nullable even though this is impossible.
> I believe that this is due to the way that we reason about nullability when 
> constructing serializer expressions in ExpressionEncoders. The following 
> assertion will currently fail if added to ExpressionEncoder:
> {code}
>   require(schema.size == serializer.size)
>   schema.fields.zip(serializer).foreach { case (field, fieldSerializer) =>
> require(field.dataType == fieldSerializer.dataType, s"Field 
> ${field.name}'s data type is " +
>   s"${field.dataType} in the schema but ${fieldSerializer.dataType} in 
> its serializer")
> require(field.nullable == fieldSerializer.nullable, s"Field 
> ${field.name}'s nullability is " +
>   s"${field.nullable} in the schema but ${fieldSerializer.nullable} in 
> its serializer")
>   }
> {code}
> Most often, the schema claims that a field is non-nullable while the encoder 
> allows for nullability, but occasionally we see a mismatch in the datatypes 
> due to disagreements over the nullability of nested structs' fields (or 
> fields of structs in arrays).
> I think the problem is that when we're reasoning about nullability in a 
> struct's schema we consider its fields' nullability to be independent of the 
> nullability of the struct itself, whereas in the serializer expressions we 
> are considering those field extraction expressions to be nullable if the 
> input objects themselves can be nullable.
> I'm not sure what's the simplest way to fix this. One proposal would be to 
> leave the serializers unchanged and have ObjectOperator derive its output 
> attributes from an explicitly-passed schema rather than using the 
> serializers' attributes. However, I worry that this might introduce bugs in 
> case the serializer and schema disagree.
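
For anyone who wants to inspect the statically derived schema directly, a minimal check (a sketch; assumes a spark-shell session) is to look at the product encoder's schema, which should report the primitive field as non-nullable:

{code}
import org.apache.spark.sql.Encoders

case class MyCaseClass(foo: Int)
// A primitive Int field can never be null, so the schema side should report:
//  |-- foo: integer (nullable = false)
println(Encoders.product[MyCaseClass].schema.treeString)
{code}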






[jira] [Updated] (SPARK-20665) Spark-sql, "Bround" and "Round" function return NULL

2017-05-09 Thread liuxian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liuxian updated SPARK-20665:

Summary: Spark-sql, "Bround" and "Round" function return NULL  (was: 
Spark-sql, "Bround" function return NULL)

> Spark-sql, "Bround" and "Round" function return NULL
> 
>
> Key: SPARK-20665
> URL: https://issues.apache.org/jira/browse/SPARK-20665
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: liuxian
>
> >select bround(12.3, 2);
> >NULL
> For this case, the expected result is 12.3, but it is null.
> "Round" has the same problem:
> >select round(12.3, 2);
> >NULL






[jira] [Resolved] (SPARK-17685) WholeStageCodegenExec throws IndexOutOfBoundsException

2017-05-09 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell resolved SPARK-17685.
---
   Resolution: Fixed
 Assignee: Yuming Wang
Fix Version/s: 2.3.0
   2.2.1
   2.1.2

> WholeStageCodegenExec throws IndexOutOfBoundsException
> --
>
> Key: SPARK-17685
> URL: https://issues.apache.org/jira/browse/SPARK-17685
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Minor
> Fix For: 2.1.2, 2.2.1, 2.3.0
>
>
> The following SQL query reproduces this issue:
> {code:sql}
> CREATE TABLE tab1(int int, int2 int, str string);
> CREATE TABLE tab2(int int, int2 int, str string);
> INSERT INTO tab1 values(1,1,'str');
> INSERT INTO tab1 values(2,2,'str');
> INSERT INTO tab2 values(1,1,'str');
> INSERT INTO tab2 values(2,3,'str');
> SELECT
>   count(*)
> FROM
>   (
> SELECT t1.int, t2.int2 
> FROM (SELECT * FROM tab1 LIMIT 1310721) t1
> INNER JOIN (SELECT * FROM tab2 LIMIT 1310721) t2 
> ON (t1.int = t2.int AND t1.int2 = t2.int2)
>   ) t;
> {code}
> Exception thrown:
> {noformat}
> java.lang.IndexOutOfBoundsException: 1
>   at 
> scala.collection.LinearSeqOptimized$class.apply(LinearSeqOptimized.scala:65)
>   at scala.collection.immutable.List.apply(List.scala:84)
>   at 
> org.apache.spark.sql.catalyst.expressions.BoundReference.doGenCode(BoundAttribute.scala:64)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:104)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:101)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:101)
>   at 
> org.apache.spark.sql.execution.joins.SortMergeJoinExec$$anonfun$createJoinKey$1.apply(SortMergeJoinExec.scala:334)
>   at 
> org.apache.spark.sql.execution.joins.SortMergeJoinExec$$anonfun$createJoinKey$1.apply(SortMergeJoinExec.scala:334)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.immutable.List.map(List.scala:285)
>   at 
> org.apache.spark.sql.execution.joins.SortMergeJoinExec.createJoinKey(SortMergeJoinExec.scala:334)
>   at 
> org.apache.spark.sql.execution.joins.SortMergeJoinExec.genScanner(SortMergeJoinExec.scala:369)
>   at 
> org.apache.spark.sql.execution.joins.SortMergeJoinExec.doProduce(SortMergeJoinExec.scala:512)
>   at 
> org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:83)
>   at 
> org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:78)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
>   at 
> org.apache.spark.sql.execution.CodegenSupport$class.produce(WholeStageCodegenExec.scala:78)
>   at 
> org.apache.spark.sql.execution.joins.SortMergeJoinExec.produce(SortMergeJoinExec.scala:35)
>   at 
> org.apache.spark.sql.execution.ProjectExec.doProduce(basicPhysicalOperators.scala:40)
>   at 
> org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:83)
>   at 
> org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:78)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
>   at 
> org.apache.spark.sql.execution.CodegenSupport$class.produce(WholeStageCodegenExec.scala:78)
>   at 
> org.apache.spark.sql.execution.ProjectExec.produce(basicPhysicalOperators.scala:30)
>   at 
> org.apache.spark.sql.execution.aggregate.HashAggregateExec.doProduceWithoutKeys(HashAggregateExec.scala:215)
>   at 
> org.apache.spark.sql.execution.aggregate.HashAggregateExec.doProduce(HashAggregateExec.scala:143)
>   at 
> org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:83)
>   

[jira] [Commented] (SPARK-18528) limit + groupBy leads to java.lang.NullPointerException

2017-05-09 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16003882#comment-16003882
 ] 

Takeshi Yamamuro commented on SPARK-18528:
--

Have you tried Spark 2.1.1? This issue is fixed in 2.0.3, 2.1.1, and 2.2.0 (see 
Fix Versions).

> limit + groupBy leads to java.lang.NullPointerException
> ---
>
> Key: SPARK-18528
> URL: https://issues.apache.org/jira/browse/SPARK-18528
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.0.1
> Environment: CentOS release 6.6, Linux 2.6.32-504.el6.x86_64
>Reporter: Corey
>Assignee: Takeshi Yamamuro
> Fix For: 2.0.3, 2.1.1, 2.2.0
>
>
> Using limit on a DataFrame prior to groupBy will lead to a crash. 
> Repartitioning will avoid the crash.
> *will crash:* {{df.limit(3).groupBy("user_id").count().show()}}
> *will work:* {{df.limit(3).coalesce(1).groupBy('user_id').count().show()}}
> *will work:* 
> {{df.limit(3).repartition('user_id').groupBy('user_id').count().show()}}
> Here is a reproducible example along with the error message:
> {quote}
> >>> df = spark.createDataFrame([ (1, 1), (1, 3), (2, 1), (3, 2), (3, 3) ], 
> >>> ["user_id", "genre_id"])
> >>>
> >>> df.show()
> +-------+--------+
> |user_id|genre_id|
> +-------+--------+
> |      1|       1|
> |      1|       3|
> |      2|       1|
> |      3|       2|
> |      3|       3|
> +-------+--------+
> >>> df.groupBy("user_id").count().show()
> +-------+-----+
> |user_id|count|
> +-------+-----+
> |      1|    2|
> |      3|    2|
> |      2|    1|
> +-------+-----+
> >>> df.limit(3).groupBy("user_id").count().show()
> [Stage 8:===>(1964 + 24) / 
> 2000]16/11/21 01:59:27 WARN TaskSetManager: Lost task 0.0 in stage 9.0 (TID 
> 8204, lvsp20hdn012.stubprod.com): java.lang.NullPointerException
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown
>  Source)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
> at org.apache.spark.scheduler.Task.run(Task.scala:86)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {quote}






[jira] [Commented] (SPARK-20683) Make table uncache chaining optional

2017-05-09 Thread Shea Parkes (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16003864#comment-16003864
 ] 

Shea Parkes commented on SPARK-20683:
-

For anyone that found this issue and just wants to revert to the old behavior 
in their own fork, the following change in 
{{sql/core/src/main/scala/org/apache/spark/sql/execution/CacheManager.scala}} 
worked for us:

{code}
-  if (cd.plan.find(_.sameResult(plan)).isDefined) {
+  if (cd.plan.sameResult(plan)) {
{code}
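
A hypothetical way to surface this as a switch, reusing {{cd}} and {{plan}} from the snippet above ({{sparkSession}} stands for whatever session handle is available there; the config key below is made up for illustration and does not exist in Spark):

{code}
// Hypothetical sketch only: gate the chaining behaviour behind a config flag.
val cascade = sparkSession.conf.get("spark.sql.cacheManager.cascadeUncache", "true").toBoolean
val matchesEntry =
  if (cascade) cd.plan.find(_.sameResult(plan)).isDefined  // current behaviour: also drop dependent caches
  else cd.plan.sameResult(plan)                            // previous behaviour: only the exact plan
{code}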

> Make table uncache chaining optional
> 
>
> Key: SPARK-20683
> URL: https://issues.apache.org/jira/browse/SPARK-20683
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1
> Environment: Not particularly environment sensitive.  
> Encountered/tested on Linux and Windows.
>Reporter: Shea Parkes
>
> A recent change was made in SPARK-19765 that causes table uncaching to chain. 
>  That is, if table B is a child of table A, and they are both cached, now 
> uncaching table A will automatically uncache table B.
> At first I did not understand the need for this, but when reading the unit 
> tests, I see that it is likely that many people do not keep named references 
> to the child table (e.g. B).  Perhaps B is just made and cached as some part 
> of data exploration.  In that situation, it makes sense for B to 
> automatically be uncached when you are finished with A.
> However, we commonly utilize a different design pattern that is now harmed by 
> this automatic uncaching.  It is common for us to cache table A to then make 
> two, independent children tables (e.g. B and C).  Once those two child tables 
> are realized and cached, we'd then uncache table A (as it was no longer 
> needed and could be quite large).  After this change now, when we uncache 
> table A, we suddenly lose our cached status on both table B and C (which is 
> quite frustrating).  All of these tables are often quite large, and we view 
> what we're doing as mindful memory management.  We are maintaining named 
> references to B and C at all times, so we can always uncache them ourselves 
> when it makes sense.
> Would it be acceptable/feasible to make this table uncache chaining optional? 
>  I would be fine if the default is for the chaining to happen, as long as we 
> can turn it off via parameters.
> If acceptable, I can try to work towards making the required changes.  I am 
> most comfortable in Python (and would want the optional parameter surfaced in 
> Python), but have found the places required to make this change in Scala 
> (since I reverted the functionality in a private fork already).  Any help 
> would be greatly appreciated however.






[jira] [Updated] (SPARK-20665) Spark-sql, "Bround" function return NULL

2017-05-09 Thread liuxian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liuxian updated SPARK-20665:

Description: 
>select bround(12.3, 2);
>NULL
For  this case, the expected result is 12.3, but it is null

"Round" has the same problem:
>select round(12.3, 2);
>NULL


  was:
>select bround(12.3, 2);
>NULL
For  this case, the expected result is 12.3, but it is null


> Spark-sql, "Bround" function return NULL
> 
>
> Key: SPARK-20665
> URL: https://issues.apache.org/jira/browse/SPARK-20665
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: liuxian
>
> >select bround(12.3, 2);
> >NULL
> For this case, the expected result is 12.3, but it is null.
> "Round" has the same problem:
> >select round(12.3, 2);
> >NULL






[jira] [Assigned] (SPARK-20686) PropagateEmptyRelation incorrectly handles aggregate without grouping expressions

2017-05-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20686:


Assignee: Josh Rosen  (was: Apache Spark)

> PropagateEmptyRelation incorrectly handles aggregate without grouping 
> expressions
> -
>
> Key: SPARK-20686
> URL: https://issues.apache.org/jira/browse/SPARK-20686
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, SQL
>Affects Versions: 2.1.0
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>  Labels: correctness
>
> The query
> {code}
> SELECT 1 FROM (SELECT COUNT(*) WHERE FALSE) t1
> {code}
> should return a single row of output because the subquery is an aggregate 
> without a group-by and thus should return a single row. However, Spark 
> incorrectly returns zero rows.
> This is caused by SPARK-16208, a patch which added an optimizer rule to 
> propagate EmptyRelation through operators. The logic for handling aggregates 
> is wrong: it checks whether aggregate expressions are non-empty for deciding 
> whether the output should be empty, whereas it should be checking grouping 
> expressions instead:
> An aggregate with non-empty group expression will return one output row per 
> group. If the input to the grouped aggregate is empty then all groups will be 
> empty and thus the output will be empty. It doesn't matter whether the SELECT 
> statement includes aggregate expressions since that won't affect the number 
> of output rows.
> If the grouping expressions are empty, however, then the aggregate will 
> always produce a single output row and thus we cannot propagate the 
> EmptyRelation.
> The current implementation is incorrect (since it returns a wrong answer) and 
> also misses an optimization opportunity by not propagating EmptyRelation in 
> the case where a grouped aggregate has aggregate expressions.






[jira] [Assigned] (SPARK-20686) PropagateEmptyRelation incorrectly handles aggregate without grouping expressions

2017-05-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20686:


Assignee: Apache Spark  (was: Josh Rosen)

> PropagateEmptyRelation incorrectly handles aggregate without grouping 
> expressions
> -
>
> Key: SPARK-20686
> URL: https://issues.apache.org/jira/browse/SPARK-20686
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, SQL
>Affects Versions: 2.1.0
>Reporter: Josh Rosen
>Assignee: Apache Spark
>  Labels: correctness
>
> The query
> {code}
> SELECT 1 FROM (SELECT COUNT(*) WHERE FALSE) t1
> {code}
> should return a single row of output because the subquery is an aggregate 
> without a group-by and thus should return a single row. However, Spark 
> incorrectly returns zero rows.
> This is caused by SPARK-16208, a patch which added an optimizer rule to 
> propagate EmptyRelation through operators. The logic for handling aggregates 
> is wrong: it checks whether aggregate expressions are non-empty for deciding 
> whether the output should be empty, whereas it should be checking grouping 
> expressions instead:
> An aggregate with non-empty group expression will return one output row per 
> group. If the input to the grouped aggregate is empty then all groups will be 
> empty and thus the output will be empty. It doesn't matter whether the SELECT 
> statement includes aggregate expressions since that won't affect the number 
> of output rows.
> If the grouping expressions are empty, however, then the aggregate will 
> always produce a single output row and thus we cannot propagate the 
> EmptyRelation.
> The current implementation is incorrect (since it returns a wrong answer) and 
> also misses an optimization opportunity by not propagating EmptyRelation in 
> the case where a grouped aggregate has aggregate expressions.






[jira] [Commented] (SPARK-20686) PropagateEmptyRelation incorrectly handles aggregate without grouping expressions

2017-05-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16003848#comment-16003848
 ] 

Apache Spark commented on SPARK-20686:
--

User 'JoshRosen' has created a pull request for this issue:
https://github.com/apache/spark/pull/17929

> PropagateEmptyRelation incorrectly handles aggregate without grouping 
> expressions
> -
>
> Key: SPARK-20686
> URL: https://issues.apache.org/jira/browse/SPARK-20686
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, SQL
>Affects Versions: 2.1.0
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>  Labels: correctness
>
> The query
> {code}
> SELECT 1 FROM (SELECT COUNT(*) WHERE FALSE) t1
> {code}
> should return a single row of output because the subquery is an aggregate 
> without a group-by and thus should return a single row. However, Spark 
> incorrectly returns zero rows.
> This is caused by SPARK-16208, a patch which added an optimizer rule to 
> propagate EmptyRelation through operators. The logic for handling aggregates 
> is wrong: it checks whether aggregate expressions are non-empty for deciding 
> whether the output should be empty, whereas it should be checking grouping 
> expressions instead:
> An aggregate with non-empty group expression will return one output row per 
> group. If the input to the grouped aggregate is empty then all groups will be 
> empty and thus the output will be empty. It doesn't matter whether the SELECT 
> statement includes aggregate expressions since that won't affect the number 
> of output rows.
> If the grouping expressions are empty, however, then the aggregate will 
> always produce a single output row and thus we cannot propagate the 
> EmptyRelation.
> The current implementation is incorrect (since it returns a wrong answer) and 
> also misses an optimization opportunity by not propagating EmptyRelation in 
> the case where a grouped aggregate has aggregate expressions.






[jira] [Created] (SPARK-20686) PropagateEmptyRelation incorrectly handles aggregate without grouping expressions

2017-05-09 Thread Josh Rosen (JIRA)
Josh Rosen created SPARK-20686:
--

 Summary: PropagateEmptyRelation incorrectly handles aggregate 
without grouping expressions
 Key: SPARK-20686
 URL: https://issues.apache.org/jira/browse/SPARK-20686
 Project: Spark
  Issue Type: Bug
  Components: Optimizer, SQL
Affects Versions: 2.1.0
Reporter: Josh Rosen
Assignee: Josh Rosen


The query

{code}
SELECT 1 FROM (SELECT COUNT(*) WHERE FALSE) t1
{code}

should return a single row of output because the subquery is an aggregate 
without a group-by and thus should return a single row. However, Spark 
incorrectly returns zero rows.

This is caused by SPARK-16208, a patch which added an optimizer rule to 
propagate EmptyRelation through operators. The logic for handling aggregates is 
wrong: it checks whether aggregate expressions are non-empty for deciding 
whether the output should be empty, whereas it should be checking grouping 
expressions instead:

An aggregate with non-empty group expression will return one output row per 
group. If the input to the grouped aggregate is empty then all groups will be 
empty and thus the output will be empty. It doesn't matter whether the SELECT 
statement includes aggregate expressions since that won't affect the number of 
output rows.

If the grouping expressions are empty, however, then the aggregate will always 
produce a single output row and thus we cannot propagate the EmptyRelation.

The current implementation is incorrect (since it returns a wrong answer) and 
also misses an optimization opportunity by not propagating EmptyRelation in the 
case where a grouped aggregate has aggregate expressions.
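
The distinction is easy to verify from a shell. A minimal sanity check of the semantics described above (assumes a SparkSession named {{spark}}, e.g. in spark-shell; the comments show the correct, expected results):

{code}
import spark.implicits._

val emptyDf = Seq.empty[Int].toDF("x")
emptyDf.selectExpr("count(*)").show()  // expected: a single row containing 0
emptyDf.groupBy("x").count().show()    // expected: no rows
{code}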






[jira] [Commented] (SPARK-20311) SQL "range(N) as alias" or "range(N) alias" doesn't work

2017-05-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16003841#comment-16003841
 ] 

Apache Spark commented on SPARK-20311:
--

User 'maropu' has created a pull request for this issue:
https://github.com/apache/spark/pull/17928

> SQL "range(N) as alias" or "range(N) alias" doesn't work
> 
>
> Key: SPARK-20311
> URL: https://issues.apache.org/jira/browse/SPARK-20311
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Juliusz Sompolski
>Assignee: Takeshi Yamamuro
>Priority: Minor
>
> `select * from range(10) as A;` or `select * from range(10) A;`
> does not work.
> As a workaround, a subquery has to be used:
> `select * from (select * from range(10)) as A;`



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20500) ML, Graph 2.2 QA: API: Binary incompatible changes

2017-05-09 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-20500.
---
   Resolution: Done
Fix Version/s: 2.2.0

> ML, Graph 2.2 QA: API: Binary incompatible changes
> --
>
> Key: SPARK-20500
> URL: https://issues.apache.org/jira/browse/SPARK-20500
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Blocker
> Fix For: 2.2.0
>
>
> Generate a list of binary incompatible changes using MiMa and create new 
> JIRAs for issues found. Filter out false positives as needed.
> If you want to take this task, look at the analogous task from the previous 
> release QA, and ping the Assignee for advice.






[jira] [Commented] (SPARK-20500) ML, Graph 2.2 QA: API: Binary incompatible changes

2017-05-09 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16003813#comment-16003813
 ] 

Joseph K. Bradley commented on SPARK-20500:
---

I checked the full MiMa output for MLlib and GraphX, as well as the new 
MimaExcludes items.  All looked fine.  Closing.

> ML, Graph 2.2 QA: API: Binary incompatible changes
> --
>
> Key: SPARK-20500
> URL: https://issues.apache.org/jira/browse/SPARK-20500
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Blocker
> Fix For: 2.2.0
>
>
> Generate a list of binary incompatible changes using MiMa and create new 
> JIRAs for issues found. Filter out false positives as needed.
> If you want to take this task, look at the analogous task from the previous 
> release QA, and ping the Assignee for advice.






[jira] [Commented] (SPARK-20376) Make StateStoreProvider plugable

2017-05-09 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16003729#comment-16003729
 ] 

Michael Armbrust commented on SPARK-20376:
--

/cc [~tdas]

> Make StateStoreProvider plugable
> 
>
> Key: SPARK-20376
> URL: https://issues.apache.org/jira/browse/SPARK-20376
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Yogesh Mahajan
>
> Currently only option available for StateStore and StateStoreProvider is 
> HDFSBackedStateStoreProvider. There is no way to provide your own custom 
> implementation of StateStoreProvider for users of structured streaming. We 
> have built a performant in memory StateStore which we would like to plugin 
> for our use cases. 






[jira] [Commented] (SPARK-20685) BatchPythonEvaluation UDF evaluator fails for case of single UDF with repeated argument

2017-05-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16003662#comment-16003662
 ] 

Apache Spark commented on SPARK-20685:
--

User 'JoshRosen' has created a pull request for this issue:
https://github.com/apache/spark/pull/17927

> BatchPythonEvaluation UDF evaluator fails for case of single UDF with 
> repeated argument
> ---
>
> Key: SPARK-20685
> URL: https://issues.apache.org/jira/browse/SPARK-20685
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.0
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>
> There's a latent corner-case bug in PySpark UDF evaluation where executing a 
> stage with a single UDF that takes more than one argument, _where an argument 
> is repeated_, will crash at execution with a confusing error.
> Here's a repro:
> {code}
> from pyspark.sql.types import *
> spark.catalog.registerFunction("add", lambda x, y: x + y, IntegerType())
> spark.sql("SELECT add(1, 1)").first()
> {code}
> This fails with
> {code}
> Caused by: org.apache.spark.api.python.PythonException: Traceback (most 
> recent call last):
>   File 
> "/Users/joshrosen/Documents/spark/python/lib/pyspark.zip/pyspark/worker.py", 
> line 180, in main
> process()
>   File 
> "/Users/joshrosen/Documents/spark/python/lib/pyspark.zip/pyspark/worker.py", 
> line 175, in process
> serializer.dump_stream(func(split_index, iterator), outfile)
>   File 
> "/Users/joshrosen/Documents/spark/python/lib/pyspark.zip/pyspark/worker.py", 
> line 107, in <lambda>
> func = lambda _, it: map(mapper, it)
>   File 
> "/Users/joshrosen/Documents/spark/python/lib/pyspark.zip/pyspark/worker.py", 
> line 93, in <lambda>
> mapper = lambda a: udf(*a)
>   File 
> "/Users/joshrosen/Documents/spark/python/lib/pyspark.zip/pyspark/worker.py", 
> line 71, in <lambda>
> return lambda *a: f(*a)
> TypeError: <lambda>() takes exactly 2 arguments (1 given)
> {code}
> The problem was introduced by SPARK-14267: the code there has a fast path 
> for handling a batch UDF evaluation consisting of a single Python UDF, but 
> that branch incorrectly assumes that a single UDF won't have repeated 
> arguments and therefore skips the code for unpacking arguments from the input 
> row (whose schema may not necessarily match the UDF inputs).
> I have a simple fix for this which I'll submit now.






[jira] [Assigned] (SPARK-20685) BatchPythonEvaluation UDF evaluator fails for case of single UDF with repeated argument

2017-05-09 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen reassigned SPARK-20685:
--

Assignee: Josh Rosen

> BatchPythonEvaluation UDF evaluator fails for case of single UDF with 
> repeated argument
> ---
>
> Key: SPARK-20685
> URL: https://issues.apache.org/jira/browse/SPARK-20685
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.0
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>
> There's a latent corner-case bug in PySpark UDF evaluation where executing a 
> stage with a single UDF that takes more than one argument, _where an argument 
> is repeated_, will crash at execution with a confusing error.
> Here's a repro:
> {code}
> from pyspark.sql.types import *
> spark.catalog.registerFunction("add", lambda x, y: x + y, IntegerType())
> spark.sql("SELECT add(1, 1)").first()
> {code}
> This fails with
> {code}
> Caused by: org.apache.spark.api.python.PythonException: Traceback (most 
> recent call last):
>   File 
> "/Users/joshrosen/Documents/spark/python/lib/pyspark.zip/pyspark/worker.py", 
> line 180, in main
> process()
>   File 
> "/Users/joshrosen/Documents/spark/python/lib/pyspark.zip/pyspark/worker.py", 
> line 175, in process
> serializer.dump_stream(func(split_index, iterator), outfile)
>   File 
> "/Users/joshrosen/Documents/spark/python/lib/pyspark.zip/pyspark/worker.py", 
> line 107, in <lambda>
> func = lambda _, it: map(mapper, it)
>   File 
> "/Users/joshrosen/Documents/spark/python/lib/pyspark.zip/pyspark/worker.py", 
> line 93, in <lambda>
> mapper = lambda a: udf(*a)
>   File 
> "/Users/joshrosen/Documents/spark/python/lib/pyspark.zip/pyspark/worker.py", 
> line 71, in <lambda>
> return lambda *a: f(*a)
> TypeError: <lambda>() takes exactly 2 arguments (1 given)
> {code}
> The problem was introduced by SPARK-14267: the code there has a fast path 
> for handling a batch UDF evaluation consisting of a single Python UDF, but 
> that branch incorrectly assumes that a single UDF won't have repeated 
> arguments and therefore skips the code for unpacking arguments from the input 
> row (whose schema may not necessarily match the UDF inputs).
> I have a simple fix for this which I'll submit now.






[jira] [Created] (SPARK-20685) BatchPythonEvaluation UDF evaluator fails for case of single UDF with repeated argument

2017-05-09 Thread Josh Rosen (JIRA)
Josh Rosen created SPARK-20685:
--

 Summary: BatchPythonEvaluation UDF evaluator fails for case of 
single UDF with repeated argument
 Key: SPARK-20685
 URL: https://issues.apache.org/jira/browse/SPARK-20685
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.0.0
Reporter: Josh Rosen


There's a latent corner-case bug in PySpark UDF evaluation where executing a 
stage with a single UDF that takes more than one argument, _where an argument 
is repeated_, will crash at execution with a confusing error.

Here's a repro:

{code}
from pyspark.sql.types import *
spark.catalog.registerFunction("add", lambda x, y: x + y, IntegerType())
spark.sql("SELECT add(1, 1)").first()
{code}

This fails with

{code}
Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent 
call last):
  File 
"/Users/joshrosen/Documents/spark/python/lib/pyspark.zip/pyspark/worker.py", 
line 180, in main
process()
  File 
"/Users/joshrosen/Documents/spark/python/lib/pyspark.zip/pyspark/worker.py", 
line 175, in process
serializer.dump_stream(func(split_index, iterator), outfile)
  File 
"/Users/joshrosen/Documents/spark/python/lib/pyspark.zip/pyspark/worker.py", 
line 107, in <lambda>
func = lambda _, it: map(mapper, it)
  File 
"/Users/joshrosen/Documents/spark/python/lib/pyspark.zip/pyspark/worker.py", 
line 93, in <lambda>
mapper = lambda a: udf(*a)
  File 
"/Users/joshrosen/Documents/spark/python/lib/pyspark.zip/pyspark/worker.py", 
line 71, in <lambda>
return lambda *a: f(*a)
TypeError: <lambda>() takes exactly 2 arguments (1 given)
{code}

The problem was introduced by SPARK-14267: the code there has a fast path for 
handling a batch UDF evaluation consisting of a single Python UDF, but that 
branch incorrectly assumes that a single UDF won't have repeated arguments and 
therefore skips the code for unpacking arguments from the input row (whose 
schema may not necessarily match the UDF inputs).

I have a simple fix for this which I'll submit now.
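
As a rough illustration of the failure mode (a hand-rolled sketch outside 
Spark, not the actual worker code or the actual patch), the error above is what 
you get when a two-argument UDF is invoked without its arguments being unpacked 
to match its signature:

{code}
# Hypothetical stand-ins for the registered UDF and the worker's row-mapping
# lambda; these are not Spark APIs.
add = lambda x, y: x + y        # the UDF from the repro
mapper = lambda a: add(*a)      # mirrors `mapper = lambda a: udf(*a)` above

# If the repeated argument is not re-expanded from the input row, the mapper
# effectively receives a one-element row for a two-argument UDF:
try:
    mapper((1,))
except TypeError as e:
    print(e)  # "<lambda>() takes exactly 2 arguments (1 given)" on Python 2

# Once the arguments are expanded to match the UDF's signature, the call works:
print(mapper((1, 1)))  # 2
{code}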



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20373) Batch queries with `Dataset/DataFrame.withWatermark()` do not execute

2017-05-09 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu reassigned SPARK-20373:


Assignee: Genmao Yu

> Batch queries with `Dataset/DataFrame.withWatermark()` do not execute
> ---
>
> Key: SPARK-20373
> URL: https://issues.apache.org/jira/browse/SPARK-20373
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Tathagata Das
>Assignee: Genmao Yu
>Priority: Minor
> Fix For: 2.2.1, 2.3.0
>
>
> Any Dataset/DataFrame batch query with the operation `withWatermark` does not 
> execute because the batch planner does not have any rule to explicitly handle 
> the EventTimeWatermark logical plan. The right solution is to simply remove 
> the plan node, as the watermark should not affect any batch query in any way.
> {code}
> from pyspark.sql.functions import *
> eventsDF = spark.createDataFrame([("2016-03-11 09:00:07", "dev1", 
> 123)]).toDF("eventTime", "deviceId", 
> "signal").select(col("eventTime").cast("timestamp").alias("eventTime"), 
> "deviceId", "signal")
> windowedCountsDF = \
>   eventsDF \
> .withWatermark("eventTime", "10 minutes") \
> .groupBy(
>   "deviceId",
>   window("eventTime", "5 minutes")) \
> .count()
> windowedCountsDF.collect()
> {code}
> This throws the following error:
> {code}
> java.lang.AssertionError: assertion failed: No plan for EventTimeWatermark 
> eventTime#3762657: timestamp, interval 10 minutes
> +- Project [cast(_1#3762643 as timestamp) AS eventTime#3762657, _2#3762644 AS 
> deviceId#3762651]
>+- LogicalRDD [_1#3762643, _2#3762644, _3#3762645L]
>   at scala.Predef$.assert(Predef.scala:170)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:92)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:77)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:74)
>   at 
> scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
>   at 
> scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
>   at 
> scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:157)
>   at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1336)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:74)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:66)
>   at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
>   at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:92)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:77)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:74)
>   at 
> scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
>   at 
> scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
>   at 
> scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:157)
>   at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1336)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:74)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:66)
>   at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
>   at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:92)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:77)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:74)
>   at 
> scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
>   at 
> scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>   at 

[jira] [Updated] (SPARK-20373) Batch queries with `Dataset/DataFrame.withWatermark()` do not execute

2017-05-09 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-20373:
-
Fix Version/s: 2.3.0
   2.2.1

> Batch queries with `Dataset/DataFrame.withWatermark()` do not execute
> ---
>
> Key: SPARK-20373
> URL: https://issues.apache.org/jira/browse/SPARK-20373
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Tathagata Das
>Priority: Minor
> Fix For: 2.2.1, 2.3.0
>
>
> Any Dataset/DataFrame batch query with the operation `withWatermark` does not 
> execute because the batch planner does not have any rule to explicitly handle 
> the EventTimeWatermark logical plan. The right solution is to simply remove 
> the plan node, as the watermark should not affect any batch query in any way.
> {code}
> from pyspark.sql.functions import *
> eventsDF = spark.createDataFrame([("2016-03-11 09:00:07", "dev1", 
> 123)]).toDF("eventTime", "deviceId", 
> "signal").select(col("eventTime").cast("timestamp").alias("eventTime"), 
> "deviceId", "signal")
> windowedCountsDF = \
>   eventsDF \
> .withWatermark("eventTime", "10 minutes") \
> .groupBy(
>   "deviceId",
>   window("eventTime", "5 minutes")) \
> .count()
> windowedCountsDF.collect()
> {code}
> This throws the following error:
> {code}
> java.lang.AssertionError: assertion failed: No plan for EventTimeWatermark 
> eventTime#3762657: timestamp, interval 10 minutes
> +- Project [cast(_1#3762643 as timestamp) AS eventTime#3762657, _2#3762644 AS 
> deviceId#3762651]
>+- LogicalRDD [_1#3762643, _2#3762644, _3#3762645L]
>   at scala.Predef$.assert(Predef.scala:170)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:92)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:77)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:74)
>   at 
> scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
>   at 
> scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
>   at 
> scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:157)
>   at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1336)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:74)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:66)
>   at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
>   at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:92)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:77)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:74)
>   at 
> scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
>   at 
> scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
>   at 
> scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:157)
>   at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1336)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:74)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:66)
>   at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
>   at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:92)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:77)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:74)
>   at 
> scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
>   at 
> scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>   at 
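
Until the batch planner removes the EventTimeWatermark node, a minimal 
workaround sketch (using the same sample data as the repro above and assuming 
an existing `spark` session) is to simply omit `withWatermark` for batch 
execution, since the watermark should not affect a batch query in any way:

{code}
from pyspark.sql.functions import col, window

eventsDF = spark.createDataFrame(
    [("2016-03-11 09:00:07", "dev1", 123)],
    ["eventTime", "deviceId", "signal"]
).select(col("eventTime").cast("timestamp").alias("eventTime"),
         "deviceId", "signal")

# Batch workaround: skip withWatermark(); it is only meaningful for streaming.
windowedCountsDF = (
    eventsDF
        .groupBy("deviceId", window("eventTime", "5 minutes"))
        .count()
)

windowedCountsDF.collect()
{code}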

[jira] [Assigned] (SPARK-20311) SQL "range(N) as alias" or "range(N) alias" doesn't work

2017-05-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20311:


Assignee: Apache Spark  (was: Takeshi Yamamuro)

> SQL "range(N) as alias" or "range(N) alias" doesn't work
> 
>
> Key: SPARK-20311
> URL: https://issues.apache.org/jira/browse/SPARK-20311
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Juliusz Sompolski
>Assignee: Apache Spark
>Priority: Minor
>
> `select * from range(10) as A;` or `select * from range(10) A;`
> does not work.
> As a workaround, a subquery has to be used:
> `select * from (select * from range(10)) as A;`



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20311) SQL "range(N) as alias" or "range(N) alias" doesn't work

2017-05-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20311:


Assignee: Takeshi Yamamuro  (was: Apache Spark)

> SQL "range(N) as alias" or "range(N) alias" doesn't work
> 
>
> Key: SPARK-20311
> URL: https://issues.apache.org/jira/browse/SPARK-20311
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Juliusz Sompolski
>Assignee: Takeshi Yamamuro
>Priority: Minor
>
> `select * from range(10) as A;` or `select * from range(10) A;`
> does not work.
> As a workaround, a subquery has to be used:
> `select * from (select * from range(10)) as A;`



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-20311) SQL "range(N) as alias" or "range(N) alias" doesn't work

2017-05-09 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai reopened SPARK-20311:
--

> SQL "range(N) as alias" or "range(N) alias" doesn't work
> 
>
> Key: SPARK-20311
> URL: https://issues.apache.org/jira/browse/SPARK-20311
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Juliusz Sompolski
>Assignee: Takeshi Yamamuro
>Priority: Minor
>
> `select * from range(10) as A;` or `select * from range(10) A;`
> does not work.
> As a workaround, a subquery has to be used:
> `select * from (select * from range(10)) as A;`



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20311) SQL "range(N) as alias" or "range(N) alias" doesn't work

2017-05-09 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16003626#comment-16003626
 ] 

Yin Huai commented on SPARK-20311:
--

It introduced a regression 
(https://github.com/apache/spark/pull/17666#issuecomment-300309896). I have 
reverted the change.
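
For reference, the workaround from the description expressed through the SQL 
interface in PySpark (assuming an existing `spark` session):

{code}
# Fails as described in this issue (aliasing the table-valued range() directly):
#   spark.sql("select * from range(10) as A")

# Workaround: wrap range() in a subquery and alias the subquery instead.
df = spark.sql("select * from (select * from range(10)) as A")
df.show()
{code}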

> SQL "range(N) as alias" or "range(N) alias" doesn't work
> 
>
> Key: SPARK-20311
> URL: https://issues.apache.org/jira/browse/SPARK-20311
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Juliusz Sompolski
>Assignee: Takeshi Yamamuro
>Priority: Minor
>
> `select * from range(10) as A;` or `select * from range(10) A;`
> does not work.
> As a workaround, a subquery has to be used:
> `select * from (select * from range(10)) as A;`



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20311) SQL "range(N) as alias" or "range(N) alias" doesn't work

2017-05-09 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-20311:
-
Fix Version/s: (was: 2.2.1)
   (was: 2.3.0)

> SQL "range(N) as alias" or "range(N) alias" doesn't work
> 
>
> Key: SPARK-20311
> URL: https://issues.apache.org/jira/browse/SPARK-20311
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Juliusz Sompolski
>Assignee: Takeshi Yamamuro
>Priority: Minor
>
> `select * from range(10) as A;` or `select * from range(10) A;`
> does not work.
> As a workaround, a subquery has to be used:
> `select * from (select * from range(10)) as A;`



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20684) expose createGlobalTempView in SparkR

2017-05-09 Thread Hossein Falaki (JIRA)
Hossein Falaki created SPARK-20684:
--

 Summary: expose createGlobalTempView in SparkR
 Key: SPARK-20684
 URL: https://issues.apache.org/jira/browse/SPARK-20684
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Affects Versions: 2.2.0
Reporter: Hossein Falaki


This is a useful API that is not exposed in SparkR. It will help with moving 
data between languages within a single Spark application.
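
For comparison, a sketch of how the existing API is used from Python today; the 
request is to surface the equivalent call in SparkR (the `people` view name and 
sample data below are illustrative):

{code}
# Register a DataFrame as a global temp view so other sessions and languages on
# the same Spark application can query it via the global_temp database.
df = spark.createDataFrame([(1, "alice"), (2, "bob")], ["id", "name"])
df.createGlobalTempView("people")

# Another session on the same application (e.g. one driving R or Scala code)
# can then read it back:
spark.sql("SELECT * FROM global_temp.people").show()
{code}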



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20205) DAGScheduler posts SparkListenerStageSubmitted before updating stage

2017-05-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20205:


Assignee: Apache Spark

> DAGScheduler posts SparkListenerStageSubmitted before updating stage
> 
>
> Key: SPARK-20205
> URL: https://issues.apache.org/jira/browse/SPARK-20205
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Marcelo Vanzin
>Assignee: Apache Spark
>
> Probably affects other versions, haven't checked.
> The code that submits the event to the bus is around line 991:
> {code}
> stage.makeNewStageAttempt(partitionsToCompute.size, 
> taskIdToLocations.values.toSeq)
> listenerBus.post(SparkListenerStageSubmitted(stage.latestInfo, 
> properties))
> {code}
> Later in the same method, the stage information is updated (around line 1057):
> {code}
> if (tasks.size > 0) {
>   logInfo(s"Submitting ${tasks.size} missing tasks from $stage 
> (${stage.rdd}) (first 15 " +
> s"tasks are for partitions ${tasks.take(15).map(_.partitionId)})")
>   taskScheduler.submitTasks(new TaskSet(
> tasks.toArray, stage.id, stage.latestInfo.attemptId, jobId, 
> properties))
>   stage.latestInfo.submissionTime = Some(clock.getTimeMillis())
> {code}
> That means an event handler might get a stage submitted event with an unset 
> submission time.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20205) DAGScheduler posts SparkListenerStageSubmitted before updating stage

2017-05-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20205:


Assignee: (was: Apache Spark)

> DAGScheduler posts SparkListenerStageSubmitted before updating stage
> 
>
> Key: SPARK-20205
> URL: https://issues.apache.org/jira/browse/SPARK-20205
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Marcelo Vanzin
>
> Probably affects other versions, haven't checked.
> The code that submits the event to the bus is around line 991:
> {code}
> stage.makeNewStageAttempt(partitionsToCompute.size, 
> taskIdToLocations.values.toSeq)
> listenerBus.post(SparkListenerStageSubmitted(stage.latestInfo, 
> properties))
> {code}
> Later in the same method, the stage information is updated (around line 1057):
> {code}
> if (tasks.size > 0) {
>   logInfo(s"Submitting ${tasks.size} missing tasks from $stage 
> (${stage.rdd}) (first 15 " +
> s"tasks are for partitions ${tasks.take(15).map(_.partitionId)})")
>   taskScheduler.submitTasks(new TaskSet(
> tasks.toArray, stage.id, stage.latestInfo.attemptId, jobId, 
> properties))
>   stage.latestInfo.submissionTime = Some(clock.getTimeMillis())
> {code}
> That means an event handler might get a stage submitted event with an unset 
> submission time.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20205) DAGScheduler posts SparkListenerStageSubmitted before updating stage

2017-05-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16003471#comment-16003471
 ] 

Apache Spark commented on SPARK-20205:
--

User 'vanzin' has created a pull request for this issue:
https://github.com/apache/spark/pull/17925

> DAGScheduler posts SparkListenerStageSubmitted before updating stage
> 
>
> Key: SPARK-20205
> URL: https://issues.apache.org/jira/browse/SPARK-20205
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Marcelo Vanzin
>
> Probably affects other versions, haven't checked.
> The code that submits the event to the bus is around line 991:
> {code}
> stage.makeNewStageAttempt(partitionsToCompute.size, 
> taskIdToLocations.values.toSeq)
> listenerBus.post(SparkListenerStageSubmitted(stage.latestInfo, 
> properties))
> {code}
> Later in the same method, the stage information is updated (around line 1057):
> {code}
> if (tasks.size > 0) {
>   logInfo(s"Submitting ${tasks.size} missing tasks from $stage 
> (${stage.rdd}) (first 15 " +
> s"tasks are for partitions ${tasks.take(15).map(_.partitionId)})")
>   taskScheduler.submitTasks(new TaskSet(
> tasks.toArray, stage.id, stage.latestInfo.attemptId, jobId, 
> properties))
>   stage.latestInfo.submissionTime = Some(clock.getTimeMillis())
> {code}
> That means an event handler might get a stage submitted event with an unset 
> submission time.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20683) Make table uncache chaining optional

2017-05-09 Thread Robert Kruszewski (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16003464#comment-16003464
 ] 

Robert Kruszewski commented on SPARK-20683:
---

+1, we have a similar use case where we do cache management ourselves and don't 
want to invalidate dependent caches.

> Make table uncache chaining optional
> 
>
> Key: SPARK-20683
> URL: https://issues.apache.org/jira/browse/SPARK-20683
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1
> Environment: Not particularly environment sensitive.  
> Encountered/tested on Linux and Windows.
>Reporter: Shea Parkes
>
> A recent change was made in SPARK-19765 that causes table uncaching to chain. 
>  That is, if table B is a child of table A, and they are both cached, now 
> uncaching table A will automatically uncache table B.
> At first I did not understand the need for this, but when reading the unit 
> tests, I see that it is likely that many people do not keep named references 
> to the child table (e.g. B).  Perhaps B is just made and cached as some part 
> of data exploration.  In that situation, it makes sense for B to 
> automatically be uncached when you are finished with A.
> However, we commonly utilize a different design pattern that is now harmed by 
> this automatic uncaching.  It is common for us to cache table A and then make 
> two independent child tables (e.g. B and C).  Once those two child tables 
> are realized and cached, we'd then uncache table A (as it was no longer 
> needed and could be quite large).  After this change, when we uncache 
> table A, we suddenly lose the cached status on both tables B and C (which is 
> quite frustrating).  All of these tables are often quite large, and we view 
> what we're doing as mindful memory management.  We are maintaining named 
> references to B and C at all times, so we can always uncache them ourselves 
> when it makes sense.
> Would it be acceptable/feasible to make this table uncache chaining optional? 
>  I would be fine if the default is for the chaining to happen, as long as we 
> can turn it off via parameters.
> If acceptable, I can try to work towards making the required changes.  I am 
> most comfortable in Python (and would want the optional parameter surfaced in 
> Python), but have found the places required to make this change in Scala 
> (since I reverted the functionality in a private fork already).  Any help 
> would be greatly appreciated however.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20683) Make table uncache chaining optional

2017-05-09 Thread Shea Parkes (JIRA)
Shea Parkes created SPARK-20683:
---

 Summary: Make table uncache chaining optional
 Key: SPARK-20683
 URL: https://issues.apache.org/jira/browse/SPARK-20683
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.1
 Environment: Not particularly environment sensitive.  
Encountered/tested on Linux and Windows.
Reporter: Shea Parkes


A recent change was made in SPARK-19765 that causes table uncaching to chain.  
That is, if table B is a child of table A, and they are both cached, now 
uncaching table A will automatically uncache table B.

At first I did not understand the need for this, but when reading the unit 
tests, I see that it is likely that many people do not keep named references to 
the child table (e.g. B).  Perhaps B is just made and cached as some part of 
data exploration.  In that situation, it makes sense for B to automatically be 
uncached when you are finished with A.

However, we commonly utilize a different design pattern that is now harmed by 
this automatic uncaching.  It is common for us to cache table A and then make 
two independent child tables (e.g. B and C).  Once those two child tables are 
realized and cached, we'd then uncache table A (as it was no longer needed and 
could be quite large).  After this change, when we uncache table A, we suddenly 
lose the cached status on both tables B and C (which is quite frustrating).  
All of these tables are often quite large, and we view what we're doing as 
mindful memory management.  We are maintaining named references to B and C at 
all times, so we can always uncache them ourselves when it makes sense.

Would it be acceptable/feasible to make this table uncache chaining optional?  
I would be fine if the default is for the chaining to happen, as long as we can 
turn it off via parameters.

If acceptable, I can try to work towards making the required changes.  I am 
most comfortable in Python (and would want the optional parameter surfaced in 
Python), but have found the places required to make this change in Scala (since 
I reverted the functionality in a private fork already).  Any help would be 
greatly appreciated however.
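
A minimal PySpark sketch of the pattern described above (table and column names 
are illustrative, and `spark` is an existing session); prior to the SPARK-19765 
change, the final uncache left the two children cached:

{code}
# Parent table A is cached, two independent child tables B and C are derived,
# cached, and materialized, then A is uncached to free memory.
spark.range(0, 1000000).createOrReplaceTempView("a")
spark.catalog.cacheTable("a")

spark.sql("SELECT * FROM a WHERE id % 2 = 0").createOrReplaceTempView("b")
spark.sql("SELECT * FROM a WHERE id % 2 = 1").createOrReplaceTempView("c")
spark.catalog.cacheTable("b")
spark.catalog.cacheTable("c")
spark.table("b").count()  # materialize B's cache
spark.table("c").count()  # materialize C's cache

# With uncache chaining, this now also drops the cached data for b and c; the
# request in this ticket is for an optional way to keep the children cached.
spark.catalog.uncacheTable("a")
{code}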




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20682) Support a new faster ORC data source based on Apache ORC

2017-05-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20682:


Assignee: Apache Spark

> Support a new faster ORC data source based on Apache ORC
> 
>
> Key: SPARK-20682
> URL: https://issues.apache.org/jira/browse/SPARK-20682
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1, 1.5.2, 1.6.3, 2.1.1
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>
> Since SPARK-2883, Apache Spark supports Apache ORC inside `sql/hive` module 
> with Hive dependency. This issue aims to add a new and faster ORC data source 
> inside `sql/core` and to replace the old ORC data source eventually. In this 
> issue, the latest Apache ORC 1.4.0 (released yesterday) is used.
> There are four key benefits.
> - Speed: Use both Spark `ColumnarBatch` and ORC `RowBatch` together. This is 
> faster than the current implementation in Spark.
> - Stability: Apache ORC 1.4.0 has many fixes and we can depend on ORC 
> community more.
> - Usability: User can use `ORC` data sources without hive module, i.e, 
> `-Phive`.
> - Maintainability: Reduce the Hive dependency and can remove old legacy code 
> later.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20682) Support a new faster ORC data source based on Apache ORC

2017-05-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20682:


Assignee: (was: Apache Spark)

> Support a new faster ORC data source based on Apache ORC
> 
>
> Key: SPARK-20682
> URL: https://issues.apache.org/jira/browse/SPARK-20682
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1, 1.5.2, 1.6.3, 2.1.1
>Reporter: Dongjoon Hyun
>
> Since SPARK-2883, Apache Spark supports Apache ORC inside `sql/hive` module 
> with Hive dependency. This issue aims to add a new and faster ORC data source 
> inside `sql/core` and to replace the old ORC data source eventually. In this 
> issue, the latest Apache ORC 1.4.0 (released yesterday) is used.
> There are four key benefits.
> - Speed: Use both Spark `ColumnarBatch` and ORC `RowBatch` together. This is 
> faster than the current implementation in Spark.
> - Stability: Apache ORC 1.4.0 has many fixes and we can depend on ORC 
> community more.
> - Usability: User can use `ORC` data sources without hive module, i.e, 
> `-Phive`.
> - Maintainability: Reduce the Hive dependency and can remove old legacy code 
> later.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20682) Support a new faster ORC data source based on Apache ORC

2017-05-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16003369#comment-16003369
 ] 

Apache Spark commented on SPARK-20682:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/17924

> Support a new faster ORC data source based on Apache ORC
> 
>
> Key: SPARK-20682
> URL: https://issues.apache.org/jira/browse/SPARK-20682
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1, 1.5.2, 1.6.3, 2.1.1
>Reporter: Dongjoon Hyun
>
> Since SPARK-2883, Apache Spark supports Apache ORC inside `sql/hive` module 
> with Hive dependency. This issue aims to add a new and faster ORC data source 
> inside `sql/core` and to replace the old ORC data source eventually. In this 
> issue, the latest Apache ORC 1.4.0 (released yesterday) is used.
> There are four key benefits.
> - Speed: Use both Spark `ColumnarBatch` and ORC `RowBatch` together. This is 
> faster than the current implementation in Spark.
> - Stability: Apache ORC 1.4.0 has many fixes and we can depend on ORC 
> community more.
> - Usability: User can use `ORC` data sources without hive module, i.e, 
> `-Phive`.
> - Maintainability: Reduce the Hive dependency and can remove old legacy code 
> later.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20682) Support a new faster ORC data source based on Apache ORC

2017-05-09 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16003344#comment-16003344
 ] 

Dongjoon Hyun commented on SPARK-20682:
---

Yes. I think it's worthwhile for Spark to have it.

> Support a new faster ORC data source based on Apache ORC
> 
>
> Key: SPARK-20682
> URL: https://issues.apache.org/jira/browse/SPARK-20682
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1, 1.5.2, 1.6.3, 2.1.1
>Reporter: Dongjoon Hyun
>
> Since SPARK-2883, Apache Spark supports Apache ORC inside `sql/hive` module 
> with Hive dependency. This issue aims to add a new and faster ORC data source 
> inside `sql/core` and to replace the old ORC data source eventually. In this 
> issue, the latest Apache ORC 1.4.0 (released yesterday) is used.
> There are four key benefits.
> - Speed: Use both Spark `ColumnarBatch` and ORC `RowBatch` together. This is 
> faster than the current implementation in Spark.
> - Stability: Apache ORC 1.4.0 has many fixes and we can depend on ORC 
> community more.
> - Usability: User can use `ORC` data sources without hive module, i.e, 
> `-Phive`.
> - Maintainability: Reduce the Hive dependency and can remove old legacy code 
> later.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20682) Support a new faster ORC data source based on Apache ORC

2017-05-09 Thread Robert Kruszewski (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16003339#comment-16003339
 ] 

Robert Kruszewski commented on SPARK-20682:
---

Awesome, I didn't know that there are nohive artifacts; sounds like we can make 
it work without -Phive then.

> Support a new faster ORC data source based on Apache ORC
> 
>
> Key: SPARK-20682
> URL: https://issues.apache.org/jira/browse/SPARK-20682
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1, 1.5.2, 1.6.3, 2.1.1
>Reporter: Dongjoon Hyun
>
> Since SPARK-2883, Apache Spark supports Apache ORC inside `sql/hive` module 
> with Hive dependency. This issue aims to add a new and faster ORC data source 
> inside `sql/core` and to replace the old ORC data source eventually. In this 
> issue, the latest Apache ORC 1.4.0 (released yesterday) is used.
> There are four key benefits.
> - Speed: Use both Spark `ColumnarBatch` and ORC `RowBatch` together. This is 
> faster than the current implementation in Spark.
> - Stability: Apache ORC 1.4.0 has many fixes and we can depend on ORC 
> community more.
> - Usability: User can use `ORC` data sources without hive module, i.e, 
> `-Phive`.
> - Maintainability: Reduce the Hive dependency and can remove old legacy code 
> later.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20682) Support a new faster ORC data source based on Apache ORC

2017-05-09 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-20682:
--
Description: 
Since SPARK-2883, Apache Spark supports Apache ORC inside `sql/hive` module 
with Hive dependency. This issue aims to add a new and faster ORC data source 
inside `sql/core` and to replace the old ORC data source eventually. In this 
issue, the latest Apache ORC 1.4.0 (released yesterday) is used.

There are four key benefits.

- Speed: Use both Spark `ColumnarBatch` and ORC `RowBatch` together. This is 
faster than the current implementation in Spark.
- Stability: Apache ORC 1.4.0 has many fixes and we can depend on ORC community 
more.
- Usability: User can use `ORC` data sources without hive module, i.e, `-Phive`.
- Maintainability: Reduce the Hive dependency and can remove old legacy code 
later.

  was:
Since SPARK-2883, Apache Spark supports Apache ORC inside `sql/hive` module 
with Hive dependency. This issue aims to add a new and faster ORC data source 
inside `sql/core` and to replace the old ORC data source eventually. In this 
issue, the latest Apache ORC 1.4.0 (released yesterday) is used.

There are four key benefits.

- Speed: Use both Spark `ColumnarBatch` and ORC `RowBatch` together. This is 
faster than the current implementation in Spark.
- Stability: Apache ORC 1.4.0 has many fixes and we can depend on ORC community 
more.
- Usability: User can use `ORC` data sources with hive module, i.e, `-Phive`.
- Maintainability: Reduce the Hive dependency and can remove old legacy code 
later.


> Support a new faster ORC data source based on Apache ORC
> 
>
> Key: SPARK-20682
> URL: https://issues.apache.org/jira/browse/SPARK-20682
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1, 1.5.2, 1.6.3, 2.1.1
>Reporter: Dongjoon Hyun
>
> Since SPARK-2883, Apache Spark supports Apache ORC inside `sql/hive` module 
> with Hive dependency. This issue aims to add a new and faster ORC data source 
> inside `sql/core` and to replace the old ORC data source eventually. In this 
> issue, the latest Apache ORC 1.4.0 (released yesterday) is used.
> There are four key benefits.
> - Speed: Use both Spark `ColumnarBatch` and ORC `RowBatch` together. This is 
> faster than the current implementation in Spark.
> - Stability: Apache ORC 1.4.0 has many fixes and we can depend on ORC 
> community more.
> - Usability: User can use `ORC` data sources without hive module, i.e, 
> `-Phive`.
> - Maintainability: Reduce the Hive dependency and can remove old legacy code 
> later.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-20682) Support a new faster ORC data source based on Apache ORC

2017-05-09 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16003326#comment-16003326
 ] 

Dongjoon Hyun edited comment on SPARK-20682 at 5/9/17 7:10 PM:
---

I fixed the description. Yep. `hive-storage-api` was an old hurdle before 1.4.0. 
1.4.0 provides `nohive` artifacts, too.

https://repo.maven.apache.org/maven2/org/apache/orc/orc-core/1.4.0/

I'll make a PR soon as a POC.


was (Author: dongjoon):
Yep. That was an old hurdle before 1.4.0. 1.4.0 provides `nohive` artifacts, too.

https://repo.maven.apache.org/maven2/org/apache/orc/orc-core/1.4.0/

I'll make a PR soon as a POC.

> Support a new faster ORC data source based on Apache ORC
> 
>
> Key: SPARK-20682
> URL: https://issues.apache.org/jira/browse/SPARK-20682
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1, 1.5.2, 1.6.3, 2.1.1
>Reporter: Dongjoon Hyun
>
> Since SPARK-2883, Apache Spark supports Apache ORC inside `sql/hive` module 
> with Hive dependency. This issue aims to add a new and faster ORC data source 
> inside `sql/core` and to replace the old ORC data source eventually. In this 
> issue, the latest Apache ORC 1.4.0 (released yesterday) is used.
> There are four key benefits.
> - Speed: Use both Spark `ColumnarBatch` and ORC `RowBatch` together. This is 
> faster than the current implementation in Spark.
> - Stability: Apache ORC 1.4.0 has many fixes and we can depend on ORC 
> community more.
> - Usability: User can use `ORC` data sources without hive module, i.e, 
> `-Phive`.
> - Maintainability: Reduce the Hive dependency and can remove old legacy code 
> later.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20682) Support a new faster ORC data source based on Apache ORC

2017-05-09 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16003326#comment-16003326
 ] 

Dongjoon Hyun commented on SPARK-20682:
---

Yep. That was an old hurdle before 1.4.0. 1.4.0 provides `nohive` artifacts, too.

https://repo.maven.apache.org/maven2/org/apache/orc/orc-core/1.4.0/

I'll make a PR soon as a POC.

> Support a new faster ORC data source based on Apache ORC
> 
>
> Key: SPARK-20682
> URL: https://issues.apache.org/jira/browse/SPARK-20682
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1, 1.5.2, 1.6.3, 2.1.1
>Reporter: Dongjoon Hyun
>
> Since SPARK-2883, Apache Spark supports Apache ORC inside `sql/hive` module 
> with Hive dependency. This issue aims to add a new and faster ORC data source 
> inside `sql/core` and to replace the old ORC data source eventually. In this 
> issue, the latest Apache ORC 1.4.0 (released yesterday) is used.
> There are four key benefits.
> - Speed: Use both Spark `ColumnarBatch` and ORC `RowBatch` together. This is 
> faster than the current implementation in Spark.
> - Stability: Apache ORC 1.4.0 has many fixes and we can depend on ORC 
> community more.
> - Usability: User can use `ORC` data sources with hive module, i.e, `-Phive`.
> - Maintainability: Reduce the Hive dependency and can remove old legacy code 
> later.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20682) Support a new faster ORC data source based on Apache ORC

2017-05-09 Thread Robert Kruszewski (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16003321#comment-16003321
 ] 

Robert Kruszewski commented on SPARK-20682:
---

Would it make sense to bring hive-storage-api to sql/core and allow people to use 
orc without -Phive? There's a ton of cruft that hive brings like hive-exec 
which technically isn't necessary for orc. Looking at hive-storage-api it's 
mostly bean classes to define types and filters.

> Support a new faster ORC data source based on Apache ORC
> 
>
> Key: SPARK-20682
> URL: https://issues.apache.org/jira/browse/SPARK-20682
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1, 1.5.2, 1.6.3, 2.1.1
>Reporter: Dongjoon Hyun
>
> Since SPARK-2883, Apache Spark supports Apache ORC inside `sql/hive` module 
> with Hive dependency. This issue aims to add a new and faster ORC data source 
> inside `sql/core` and to replace the old ORC data source eventually. In this 
> issue, the latest Apache ORC 1.4.0 (released yesterday) is used.
> There are four key benefits.
> - Speed: Use both Spark `ColumnarBatch` and ORC `RowBatch` together. This is 
> faster than the current implementation in Spark.
> - Stability: Apache ORC 1.4.0 has many fixes and we can depend on ORC 
> community more.
> - Usability: User can use `ORC` data sources with hive module, i.e, `-Phive`.
> - Maintainability: Reduce the Hive dependency and can remove old legacy code 
> later.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20682) Support a new faster ORC data source based on Apache ORC

2017-05-09 Thread Dongjoon Hyun (JIRA)
Dongjoon Hyun created SPARK-20682:
-

 Summary: Support a new faster ORC data source based on Apache ORC
 Key: SPARK-20682
 URL: https://issues.apache.org/jira/browse/SPARK-20682
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.1, 1.6.3, 1.5.2, 1.4.1
Reporter: Dongjoon Hyun


Since SPARK-2883, Apache Spark supports Apache ORC inside `sql/hive` module 
with Hive dependency. This issue aims to add a new and faster ORC data source 
inside `sql/core` and to replace the old ORC data source eventually. In this 
issue, the latest Apache ORC 1.4.0 (released yesterday) is used.

There are four key benefits.

- Speed: Use both Spark `ColumnarBatch` and ORC `RowBatch` together. This is 
faster than the current implementation in Spark.
- Stability: Apache ORC 1.4.0 has many fixes and we can depend on ORC community 
more.
- Usability: User can use `ORC` data sources with hive module, i.e, `-Phive`.
- Maintainability: Reduce the Hive dependency and can remove old legacy code 
later.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-18528) limit + groupBy leads to java.lang.NullPointerException

2017-05-09 Thread Esther Kundin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16003265#comment-16003265
 ] 

Esther Kundin edited comment on SPARK-18528 at 5/9/17 6:43 PM:
---

I have just found the same bug with Spark 2.1.0, even though it's marked as 
fixed.  I believe you need to add 2.1.0 to the "Affected Versions" list.

I have a Dataframe df with a few columns that I read in from json, and running:
df.select('foo').distinct().count()
works fine.
df.limit(100).select('foo').distinct().count()
throws a NPE.
df.limit(100).repartition('foo').select('foo').distinct().count()
works fine too.  Seems like the same bug, still broken.


was (Author: ekundin):
I have just found the same bug with Spark 2.1, even though it's marked as fixed.

I have a Dataframe df with a few columns that I read in from json, and running:
df.select('foo').distinct().count()
works fine.
df.limit(100).select('foo').distinct().count()
throws a NPE.
df.limit(100).repartition('foo').select('foo').distinct().count()
works fine too.  Seems like the same bug, still broken.
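
Consolidating the comment above into a runnable sketch (the JSON path and the 
`foo` column are placeholders, and `spark` is an existing session):

{code}
# df has a few columns read in from JSON; 'foo' is one of them.
df = spark.read.json("/path/to/data.json")

df.select("foo").distinct().count()                                  # works
# df.limit(100).select("foo").distinct().count()                     # reported NPE
df.limit(100).repartition("foo").select("foo").distinct().count()    # works
{code}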

> limit + groupBy leads to java.lang.NullPointerException
> ---
>
> Key: SPARK-18528
> URL: https://issues.apache.org/jira/browse/SPARK-18528
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.0.1
> Environment: CentOS release 6.6, Linux 2.6.32-504.el6.x86_64
>Reporter: Corey
>Assignee: Takeshi Yamamuro
> Fix For: 2.0.3, 2.1.1, 2.2.0
>
>
> Using limit on a DataFrame prior to groupBy will lead to a crash. 
> Repartitioning will avoid the crash.
> *will crash:* {{df.limit(3).groupBy("user_id").count().show()}}
> *will work:* {{df.limit(3).coalesce(1).groupBy('user_id').count().show()}}
> *will work:* 
> {{df.limit(3).repartition('user_id').groupBy('user_id').count().show()}}
> Here is a reproducible example along with the error message:
> {quote}
> >>> df = spark.createDataFrame([ (1, 1), (1, 3), (2, 1), (3, 2), (3, 3) ], 
> >>> ["user_id", "genre_id"])
> >>>
> >>> df.show()
> +---++
> |user_id|genre_id|
> +---++
> |  1|   1|
> |  1|   3|
> |  2|   1|
> |  3|   2|
> |  3|   3|
> +---++
> >>> df.groupBy("user_id").count().show()
> +---+-+
> |user_id|count|
> +---+-+
> |  1|2|
> |  3|2|
> |  2|1|
> +---+-+
> >>> df.limit(3).groupBy("user_id").count().show()
> [Stage 8:===>(1964 + 24) / 
> 2000]16/11/21 01:59:27 WARN TaskSetManager: Lost task 0.0 in stage 9.0 (TID 
> 8204, lvsp20hdn012.stubprod.com): java.lang.NullPointerException
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown
>  Source)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
> at org.apache.spark.scheduler.Task.run(Task.scala:86)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-12297) Add work-around for Parquet/Hive int96 timestamp bug.

2017-05-09 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16003267#comment-16003267
 ] 

Reynold Xin edited comment on SPARK-12297 at 5/9/17 6:40 PM:
-

I looked at the issue again and reverted the patch. If we want to resolve this 
issue, we need to look at the fundamental incompatibility (that is - the two 
data types have different semantics: timestamp without timezone and timestamp 
with timezone). The two data types have different semantics when parsing data.

Also this is not just a Parquet issue. The same issue could happen to all data 
formats. It is going to be really confusing to have something that only works 
for Parquet simply because Impala cares more about Parquet.

It seems like the purpose of this patch can be accomplished by just setting the 
session local timezone to UTC?


was (Author: rxin):
I looked at the issue again and reverted the patch. If we want to resolve this 
issue, we need to look at the fundamental incompatibility (that is - the two 
data types have different semantics: timestamp without timezone and timestamp 
with timezone). The two data types have different semantics when parsing data.

Also this is not just a Parquet issue. The same issue could happen to all data 
types. It is going to be really confusing to have something that only works for 
Parquet simply because Impala cares more about Parquet.

It seems like the purpose of this patch can be accomplished by just setting the 
session local timezone to UTC?

> Add work-around for Parquet/Hive int96 timestamp bug.
> -
>
> Key: SPARK-12297
> URL: https://issues.apache.org/jira/browse/SPARK-12297
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Reporter: Ryan Blue
>
> Spark copied Hive's behavior for parquet, but this was inconsistent with 
> other file formats, and inconsistent with Impala (which is the original 
> source of putting a timestamp as an int96 in parquet, I believe).  This made 
> timestamps in parquet act more like timestamps with timezones, while in other 
> file formats, timestamps have no time zone, they are a "floating time".
> The easiest way to see this issue is to write out a table with timestamps in 
> multiple different formats from one timezone, then try to read them back in 
> another timezone.  E.g., here I write out a few timestamps to parquet and 
> textfile hive tables, and also just as a json file, all in the 
> "America/Los_Angeles" timezone:
> {code}
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.types._
> val tblPrefix = args(0)
> val schema = new StructType().add("ts", TimestampType)
> val rows = sc.parallelize(Seq(
>   "2015-12-31 23:50:59.123",
>   "2015-12-31 22:49:59.123",
>   "2016-01-01 00:39:59.123",
>   "2016-01-01 01:29:59.123"
> ).map { x => Row(java.sql.Timestamp.valueOf(x)) })
> val rawData = spark.createDataFrame(rows, schema).toDF()
> rawData.show()
> Seq("parquet", "textfile").foreach { format =>
>   val tblName = s"${tblPrefix}_$format"
>   spark.sql(s"DROP TABLE IF EXISTS $tblName")
>   spark.sql(
> raw"""CREATE TABLE $tblName (
>   |  ts timestamp
>   | )
>   | STORED AS $format
>  """.stripMargin)
>   rawData.write.insertInto(tblName)
> }
> rawData.write.json(s"${tblPrefix}_json")
> {code}
> Then I start a spark-shell in "America/New_York" timezone, and read the data 
> back from each table:
> {code}
> scala> spark.sql("select * from la_parquet").collect().foreach{println}
> [2016-01-01 02:50:59.123]
> [2016-01-01 01:49:59.123]
> [2016-01-01 03:39:59.123]
> [2016-01-01 04:29:59.123]
> scala> spark.sql("select * from la_textfile").collect().foreach{println}
> [2015-12-31 23:50:59.123]
> [2015-12-31 22:49:59.123]
> [2016-01-01 00:39:59.123]
> [2016-01-01 01:29:59.123]
> scala> spark.read.json("la_json").collect().foreach{println}
> [2015-12-31 23:50:59.123]
> [2015-12-31 22:49:59.123]
> [2016-01-01 00:39:59.123]
> [2016-01-01 01:29:59.123]
> scala> spark.read.json("la_json").join(spark.sql("select * from 
> la_textfile"), "ts").show()
> ++
> |  ts|
> ++
> |2015-12-31 23:50:...|
> |2015-12-31 22:49:...|
> |2016-01-01 00:39:...|
> |2016-01-01 01:29:...|
> ++
> scala> spark.read.json("la_json").join(spark.sql("select * from la_parquet"), 
> "ts").show()
> +---+
> | ts|
> +---+
> +---+
> {code}
> The textfile and json based data shows the same times, and can be joined 
> against each other, while the times from the parquet data have changed (and 
> obviously joins fail).
> This is a big problem for any organization that may try to read the same data 
> (say in S3) with clusters in multiple timezones.  It can also be a nasty 
> surprise as an organization tries to migrate file formats.  

[jira] [Comment Edited] (SPARK-12297) Add work-around for Parquet/Hive int96 timestamp bug.

2017-05-09 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16003267#comment-16003267
 ] 

Reynold Xin edited comment on SPARK-12297 at 5/9/17 6:39 PM:
-

I looked at the issue again and reverted the patch. If we want to resolve this 
issue, we need to look at the fundamental incompatibility (that is - the two 
data types have different semantics: timestamp without timezone and timestamp 
with timezone). The two data types have different semantics when parsing data.

Also this is not just a Parquet issue. The same issue could happen to all data 
types. It is going to be really confusing to have something that only works for 
Parquet simply because Impala cares more about Parquet.

It seems like the purpose of this patch can be accomplished by just setting the 
session local timezone to UTC?


was (Author: rxin):
I looked at the issue again and reverted the patch. If we want to resolve this 
issue, we need to look at the fundamental incompatibility (that is - the two 
data types have different semantics: timestamp without timezone and timestamp 
with timezone). The two data types have different semantics when parsing data.

It seems like the purpose of this patch can be accomplished by just setting the 
session local timezone to UTC?

> Add work-around for Parquet/Hive int96 timestamp bug.
> -
>
> Key: SPARK-12297
> URL: https://issues.apache.org/jira/browse/SPARK-12297
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Reporter: Ryan Blue
>
> Spark copied Hive's behavior for parquet, but this was inconsistent with 
> other file formats, and inconsistent with Impala (which is the original 
> source of putting a timestamp as an int96 in parquet, I believe).  This made 
> timestamps in parquet act more like timestamps with timezones, while in other 
> file formats, timestamps have no time zone, they are a "floating time".
> The easiest way to see this issue is to write out a table with timestamps in 
> multiple different formats from one timezone, then try to read them back in 
> another timezone.  E.g., here I write out a few timestamps to parquet and 
> textfile hive tables, and also just as a json file, all in the 
> "America/Los_Angeles" timezone:
> {code}
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.types._
> val tblPrefix = args(0)
> val schema = new StructType().add("ts", TimestampType)
> val rows = sc.parallelize(Seq(
>   "2015-12-31 23:50:59.123",
>   "2015-12-31 22:49:59.123",
>   "2016-01-01 00:39:59.123",
>   "2016-01-01 01:29:59.123"
> ).map { x => Row(java.sql.Timestamp.valueOf(x)) })
> val rawData = spark.createDataFrame(rows, schema).toDF()
> rawData.show()
> Seq("parquet", "textfile").foreach { format =>
>   val tblName = s"${tblPrefix}_$format"
>   spark.sql(s"DROP TABLE IF EXISTS $tblName")
>   spark.sql(
> raw"""CREATE TABLE $tblName (
>   |  ts timestamp
>   | )
>   | STORED AS $format
>  """.stripMargin)
>   rawData.write.insertInto(tblName)
> }
> rawData.write.json(s"${tblPrefix}_json")
> {code}
> Then I start a spark-shell in "America/New_York" timezone, and read the data 
> back from each table:
> {code}
> scala> spark.sql("select * from la_parquet").collect().foreach{println}
> [2016-01-01 02:50:59.123]
> [2016-01-01 01:49:59.123]
> [2016-01-01 03:39:59.123]
> [2016-01-01 04:29:59.123]
> scala> spark.sql("select * from la_textfile").collect().foreach{println}
> [2015-12-31 23:50:59.123]
> [2015-12-31 22:49:59.123]
> [2016-01-01 00:39:59.123]
> [2016-01-01 01:29:59.123]
> scala> spark.read.json("la_json").collect().foreach{println}
> [2015-12-31 23:50:59.123]
> [2015-12-31 22:49:59.123]
> [2016-01-01 00:39:59.123]
> [2016-01-01 01:29:59.123]
> scala> spark.read.json("la_json").join(spark.sql("select * from 
> la_textfile"), "ts").show()
> ++
> |  ts|
> ++
> |2015-12-31 23:50:...|
> |2015-12-31 22:49:...|
> |2016-01-01 00:39:...|
> |2016-01-01 01:29:...|
> ++
> scala> spark.read.json("la_json").join(spark.sql("select * from la_parquet"), 
> "ts").show()
> +---+
> | ts|
> +---+
> +---+
> {code}
> The textfile and JSON data show the same times and can be joined 
> against each other, while the times from the Parquet data have changed (and 
> the joins obviously fail).
> This is a big problem for any organization that tries to read the same data 
> (say, in S3) with clusters in multiple time zones.  It can also be a nasty 
> surprise as an organization migrates file formats.  Finally, it's a 
> source of incompatibility between Hive, Impala, and Spark.
> HIVE-12767 aims to fix this by introducing a table property that indicates 
> the "storage timezone" for the table.  Spark should add the 

[jira] [Commented] (SPARK-12297) Add work-around for Parquet/Hive int96 timestamp bug.

2017-05-09 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16003267#comment-16003267
 ] 

Reynold Xin commented on SPARK-12297:
-

I looked at the issue again and reverted the patch. If we want to resolve this 
issue, we need to look at the fundamental incompatibility (that is, the two 
data types have different semantics: timestamp without time zone vs. timestamp 
with time zone), and the two types behave differently when parsing data.

It seems like the purpose of this patch can be accomplished by just setting the 
session local timezone to UTC?
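
For illustration, a minimal spark-shell sketch of that suggestion (assuming the 
{{spark.sql.session.timeZone}} conf available in 2.2, and the {{la_parquet}} 
table from the description; a sketch of the suggestion, not of the reverted 
patch):
{code}
// Pin the session-local time zone to UTC so timestamp parsing and rendering
// no longer depend on the JVM default time zone of each cluster.
spark.conf.set("spark.sql.session.timeZone", "UTC")
spark.sql("select * from la_parquet").show(false)
{code}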

> Add work-around for Parquet/Hive int96 timestamp bug.
> -
>
> Key: SPARK-12297
> URL: https://issues.apache.org/jira/browse/SPARK-12297
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Reporter: Ryan Blue
>
> Spark copied Hive's behavior for Parquet, but this was inconsistent with 
> other file formats, and inconsistent with Impala (which is, I believe, the 
> original source of putting a timestamp as an int96 in Parquet).  This made 
> timestamps in Parquet act more like timestamps with time zones, while in 
> other file formats timestamps have no time zone; they are a "floating time".
> The easiest way to see this issue is to write out a table with timestamps in 
> multiple different formats from one time zone, then try to read them back in 
> another time zone.  E.g., here I write out a few timestamps to Parquet and 
> textfile Hive tables, and also just as a JSON file, all in the 
> "America/Los_Angeles" timezone:
> {code}
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.types._
> val tblPrefix = args(0)
> val schema = new StructType().add("ts", TimestampType)
> val rows = sc.parallelize(Seq(
>   "2015-12-31 23:50:59.123",
>   "2015-12-31 22:49:59.123",
>   "2016-01-01 00:39:59.123",
>   "2016-01-01 01:29:59.123"
> ).map { x => Row(java.sql.Timestamp.valueOf(x)) })
> val rawData = spark.createDataFrame(rows, schema).toDF()
> rawData.show()
> Seq("parquet", "textfile").foreach { format =>
>   val tblName = s"${tblPrefix}_$format"
>   spark.sql(s"DROP TABLE IF EXISTS $tblName")
>   spark.sql(
> raw"""CREATE TABLE $tblName (
>   |  ts timestamp
>   | )
>   | STORED AS $format
>  """.stripMargin)
>   rawData.write.insertInto(tblName)
> }
> rawData.write.json(s"${tblPrefix}_json")
> {code}
> Then I start a spark-shell in "America/New_York" timezone, and read the data 
> back from each table:
> {code}
> scala> spark.sql("select * from la_parquet").collect().foreach{println}
> [2016-01-01 02:50:59.123]
> [2016-01-01 01:49:59.123]
> [2016-01-01 03:39:59.123]
> [2016-01-01 04:29:59.123]
> scala> spark.sql("select * from la_textfile").collect().foreach{println}
> [2015-12-31 23:50:59.123]
> [2015-12-31 22:49:59.123]
> [2016-01-01 00:39:59.123]
> [2016-01-01 01:29:59.123]
> scala> spark.read.json("la_json").collect().foreach{println}
> [2015-12-31 23:50:59.123]
> [2015-12-31 22:49:59.123]
> [2016-01-01 00:39:59.123]
> [2016-01-01 01:29:59.123]
> scala> spark.read.json("la_json").join(spark.sql("select * from 
> la_textfile"), "ts").show()
> ++
> |  ts|
> ++
> |2015-12-31 23:50:...|
> |2015-12-31 22:49:...|
> |2016-01-01 00:39:...|
> |2016-01-01 01:29:...|
> ++
> scala> spark.read.json("la_json").join(spark.sql("select * from la_parquet"), 
> "ts").show()
> +---+
> | ts|
> +---+
> +---+
> {code}
> The textfile and JSON data show the same times and can be joined 
> against each other, while the times from the Parquet data have changed (and 
> the joins obviously fail).
> This is a big problem for any organization that tries to read the same data 
> (say, in S3) with clusters in multiple time zones.  It can also be a nasty 
> surprise as an organization migrates file formats.  Finally, it's a 
> source of incompatibility between Hive, Impala, and Spark.
> HIVE-12767 aims to fix this by introducing a table property that indicates 
> the "storage timezone" for the table.  Spark should add the same to ensure 
> consistency between file formats, and with Hive & Impala.
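
For illustration only, a sketch of what such a table property could look like 
from Spark SQL (the property name below is hypothetical, not necessarily the 
one HIVE-12767 or Spark would adopt):
{code}
// Hypothetical property recording the zone the int96 timestamps were written
// in, so that readers in other zones could convert them consistently.
spark.sql(
  "ALTER TABLE la_parquet SET TBLPROPERTIES ('storage.timezone' = 'America/Los_Angeles')")
{code}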



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18528) limit + groupBy leads to java.lang.NullPointerException

2017-05-09 Thread Esther Kundin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16003265#comment-16003265
 ] 

Esther Kundin commented on SPARK-18528:
---

I have just found the same bug with Spark 2.1, even though it's marked as fixed.

I have a DataFrame {{df}} with a few columns that I read in from JSON, and:
{{df.select('foo').distinct().count()}} works fine.
{{df.limit(100).select('foo').distinct().count()}} throws an NPE.
{{df.limit(100).repartition('foo').select('foo').distinct().count()}} works fine too.
Seems like the same bug, still broken.

> limit + groupBy leads to java.lang.NullPointerException
> ---
>
> Key: SPARK-18528
> URL: https://issues.apache.org/jira/browse/SPARK-18528
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.0.1
> Environment: CentOS release 6.6, Linux 2.6.32-504.el6.x86_64
>Reporter: Corey
>Assignee: Takeshi Yamamuro
> Fix For: 2.0.3, 2.1.1, 2.2.0
>
>
> Using limit on a DataFrame prior to groupBy will lead to a crash. 
> Repartitioning will avoid the crash.
> *will crash:* {{df.limit(3).groupBy("user_id").count().show()}}
> *will work:* {{df.limit(3).coalesce(1).groupBy('user_id').count().show()}}
> *will work:* 
> {{df.limit(3).repartition('user_id').groupBy('user_id').count().show()}}
> Here is a reproducible example along with the error message:
> {quote}
> >>> df = spark.createDataFrame([ (1, 1), (1, 3), (2, 1), (3, 2), (3, 3) ], 
> >>> ["user_id", "genre_id"])
> >>>
> >>> df.show()
> +---++
> |user_id|genre_id|
> +---++
> |  1|   1|
> |  1|   3|
> |  2|   1|
> |  3|   2|
> |  3|   3|
> +---++
> >>> df.groupBy("user_id").count().show()
> +---+-+
> |user_id|count|
> +---+-+
> |  1|2|
> |  3|2|
> |  2|1|
> +---+-+
> >>> df.limit(3).groupBy("user_id").count().show()
> [Stage 8:===>(1964 + 24) / 
> 2000]16/11/21 01:59:27 WARN TaskSetManager: Lost task 0.0 in stage 9.0 (TID 
> 8204, lvsp20hdn012.stubprod.com): java.lang.NullPointerException
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown
>  Source)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
> at org.apache.spark.scheduler.Task.run(Task.scala:86)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12297) Add work-around for Parquet/Hive int96 timestamp bug.

2017-05-09 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-12297:

Fix Version/s: (was: 2.3.0)

> Add work-around for Parquet/Hive int96 timestamp bug.
> -
>
> Key: SPARK-12297
> URL: https://issues.apache.org/jira/browse/SPARK-12297
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Reporter: Ryan Blue
>
> Spark copied Hive's behavior for Parquet, but this was inconsistent with 
> other file formats, and inconsistent with Impala (which is, I believe, the 
> original source of putting a timestamp as an int96 in Parquet).  This made 
> timestamps in Parquet act more like timestamps with time zones, while in 
> other file formats timestamps have no time zone; they are a "floating time".
> The easiest way to see this issue is to write out a table with timestamps in 
> multiple different formats from one time zone, then try to read them back in 
> another time zone.  E.g., here I write out a few timestamps to Parquet and 
> textfile Hive tables, and also just as a JSON file, all in the 
> "America/Los_Angeles" timezone:
> {code}
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.types._
> val tblPrefix = args(0)
> val schema = new StructType().add("ts", TimestampType)
> val rows = sc.parallelize(Seq(
>   "2015-12-31 23:50:59.123",
>   "2015-12-31 22:49:59.123",
>   "2016-01-01 00:39:59.123",
>   "2016-01-01 01:29:59.123"
> ).map { x => Row(java.sql.Timestamp.valueOf(x)) })
> val rawData = spark.createDataFrame(rows, schema).toDF()
> rawData.show()
> Seq("parquet", "textfile").foreach { format =>
>   val tblName = s"${tblPrefix}_$format"
>   spark.sql(s"DROP TABLE IF EXISTS $tblName")
>   spark.sql(
> raw"""CREATE TABLE $tblName (
>   |  ts timestamp
>   | )
>   | STORED AS $format
>  """.stripMargin)
>   rawData.write.insertInto(tblName)
> }
> rawData.write.json(s"${tblPrefix}_json")
> {code}
> Then I start a spark-shell in "America/New_York" timezone, and read the data 
> back from each table:
> {code}
> scala> spark.sql("select * from la_parquet").collect().foreach{println}
> [2016-01-01 02:50:59.123]
> [2016-01-01 01:49:59.123]
> [2016-01-01 03:39:59.123]
> [2016-01-01 04:29:59.123]
> scala> spark.sql("select * from la_textfile").collect().foreach{println}
> [2015-12-31 23:50:59.123]
> [2015-12-31 22:49:59.123]
> [2016-01-01 00:39:59.123]
> [2016-01-01 01:29:59.123]
> scala> spark.read.json("la_json").collect().foreach{println}
> [2015-12-31 23:50:59.123]
> [2015-12-31 22:49:59.123]
> [2016-01-01 00:39:59.123]
> [2016-01-01 01:29:59.123]
> scala> spark.read.json("la_json").join(spark.sql("select * from 
> la_textfile"), "ts").show()
> ++
> |  ts|
> ++
> |2015-12-31 23:50:...|
> |2015-12-31 22:49:...|
> |2016-01-01 00:39:...|
> |2016-01-01 01:29:...|
> ++
> scala> spark.read.json("la_json").join(spark.sql("select * from la_parquet"), 
> "ts").show()
> +---+
> | ts|
> +---+
> +---+
> {code}
> The textfile and JSON data show the same times and can be joined 
> against each other, while the times from the Parquet data have changed (and 
> the joins obviously fail).
> This is a big problem for any organization that tries to read the same data 
> (say, in S3) with clusters in multiple time zones.  It can also be a nasty 
> surprise as an organization migrates file formats.  Finally, it's a 
> source of incompatibility between Hive, Impala, and Spark.
> HIVE-12767 aims to fix this by introducing a table property that indicates 
> the "storage timezone" for the table.  Spark should add the same to ensure 
> consistency between file formats, and with Hive & Impala.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-12297) Add work-around for Parquet/Hive int96 timestamp bug.

2017-05-09 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin reopened SPARK-12297:
-
  Assignee: (was: Imran Rashid)

> Add work-around for Parquet/Hive int96 timestamp bug.
> -
>
> Key: SPARK-12297
> URL: https://issues.apache.org/jira/browse/SPARK-12297
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Reporter: Ryan Blue
>
> Spark copied Hive's behavior for Parquet, but this was inconsistent with 
> other file formats, and inconsistent with Impala (which is, I believe, the 
> original source of putting a timestamp as an int96 in Parquet).  This made 
> timestamps in Parquet act more like timestamps with time zones, while in 
> other file formats timestamps have no time zone; they are a "floating time".
> The easiest way to see this issue is to write out a table with timestamps in 
> multiple different formats from one time zone, then try to read them back in 
> another time zone.  E.g., here I write out a few timestamps to Parquet and 
> textfile Hive tables, and also just as a JSON file, all in the 
> "America/Los_Angeles" timezone:
> {code}
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.types._
> val tblPrefix = args(0)
> val schema = new StructType().add("ts", TimestampType)
> val rows = sc.parallelize(Seq(
>   "2015-12-31 23:50:59.123",
>   "2015-12-31 22:49:59.123",
>   "2016-01-01 00:39:59.123",
>   "2016-01-01 01:29:59.123"
> ).map { x => Row(java.sql.Timestamp.valueOf(x)) })
> val rawData = spark.createDataFrame(rows, schema).toDF()
> rawData.show()
> Seq("parquet", "textfile").foreach { format =>
>   val tblName = s"${tblPrefix}_$format"
>   spark.sql(s"DROP TABLE IF EXISTS $tblName")
>   spark.sql(
> raw"""CREATE TABLE $tblName (
>   |  ts timestamp
>   | )
>   | STORED AS $format
>  """.stripMargin)
>   rawData.write.insertInto(tblName)
> }
> rawData.write.json(s"${tblPrefix}_json")
> {code}
> Then I start a spark-shell in "America/New_York" timezone, and read the data 
> back from each table:
> {code}
> scala> spark.sql("select * from la_parquet").collect().foreach{println}
> [2016-01-01 02:50:59.123]
> [2016-01-01 01:49:59.123]
> [2016-01-01 03:39:59.123]
> [2016-01-01 04:29:59.123]
> scala> spark.sql("select * from la_textfile").collect().foreach{println}
> [2015-12-31 23:50:59.123]
> [2015-12-31 22:49:59.123]
> [2016-01-01 00:39:59.123]
> [2016-01-01 01:29:59.123]
> scala> spark.read.json("la_json").collect().foreach{println}
> [2015-12-31 23:50:59.123]
> [2015-12-31 22:49:59.123]
> [2016-01-01 00:39:59.123]
> [2016-01-01 01:29:59.123]
> scala> spark.read.json("la_json").join(spark.sql("select * from 
> la_textfile"), "ts").show()
> ++
> |  ts|
> ++
> |2015-12-31 23:50:...|
> |2015-12-31 22:49:...|
> |2016-01-01 00:39:...|
> |2016-01-01 01:29:...|
> ++
> scala> spark.read.json("la_json").join(spark.sql("select * from la_parquet"), 
> "ts").show()
> +---+
> | ts|
> +---+
> +---+
> {code}
> The textfile and JSON data show the same times and can be joined 
> against each other, while the times from the Parquet data have changed (and 
> the joins obviously fail).
> This is a big problem for any organization that tries to read the same data 
> (say, in S3) with clusters in multiple time zones.  It can also be a nasty 
> surprise as an organization migrates file formats.  Finally, it's a 
> source of incompatibility between Hive, Impala, and Spark.
> HIVE-12767 aims to fix this by introducing a table property that indicates 
> the "storage timezone" for the table.  Spark should add the same to ensure 
> consistency between file formats, and with Hive & Impala.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-12297) Add work-around for Parquet/Hive int96 timestamp bug.

2017-05-09 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16001214#comment-16001214
 ] 

Reynold Xin edited comment on SPARK-12297 at 5/9/17 6:35 PM:
-

Sorry, I'm going to revert this. I think this requires further discussion, since 
it completely changes the behavior of one of the most important data types.

It'd be great to consider this more holistically and think about alternatives 
for fixing it (the current fix might end up being the best, but we should at 
least explore other ones).

One of the fundamental problems is that Spark treats timestamp as timestamp with 
time zone, whereas Impala treats timestamp as timestamp without time zone. The 
Parquet storage is only a small piece here. Other statements such as cast would 
return different results for these types. Just fixing these piecemeal is a 
really bad idea when it comes to data type semantics.
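
To make the cast point concrete, a minimal sketch (assuming a 2.2 spark-shell 
where string-to-timestamp casts follow the session time zone; the epoch values 
in the comments are expected results, not captured output):
{code}
// The same literal maps to different instants (epoch seconds) depending on the
// session time zone, i.e. timestamp-with-time-zone semantics.
spark.conf.set("spark.sql.session.timeZone", "UTC")
spark.sql("select cast(cast('2016-01-01 00:00:00' as timestamp) as long)").show()
// expected: 1451606400

spark.conf.set("spark.sql.session.timeZone", "America/New_York")
spark.sql("select cast(cast('2016-01-01 00:00:00' as timestamp) as long)").show()
// expected: 1451624400 (local midnight in New York is 05:00 UTC)
{code}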




was (Author: rxin):
Sorry, I'm going to revert this. I think this requires further discussion, since 
it completely changes the behavior of one of the most important data types.

It'd be great to consider this more holistically and think about alternatives 
for fixing it (the current fix might end up being the best, but we should at 
least explore other ones).

One of the fundamental problems is that Spark treats timestamp as timestamp with 
time zone, whereas both Impala and Hive treat timestamp as timestamp without 
time zone. The Parquet storage is only a small piece here. Other statements such 
as cast would return different results for these types. Just fixing these 
piecemeal is a really bad idea when it comes to data type semantics.



> Add work-around for Parquet/Hive int96 timestamp bug.
> -
>
> Key: SPARK-12297
> URL: https://issues.apache.org/jira/browse/SPARK-12297
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Reporter: Ryan Blue
>Assignee: Imran Rashid
> Fix For: 2.3.0
>
>
> Spark copied Hive's behavior for Parquet, but this was inconsistent with 
> other file formats, and inconsistent with Impala (which is, I believe, the 
> original source of putting a timestamp as an int96 in Parquet).  This made 
> timestamps in Parquet act more like timestamps with time zones, while in 
> other file formats timestamps have no time zone; they are a "floating time".
> The easiest way to see this issue is to write out a table with timestamps in 
> multiple different formats from one time zone, then try to read them back in 
> another time zone.  E.g., here I write out a few timestamps to Parquet and 
> textfile Hive tables, and also just as a JSON file, all in the 
> "America/Los_Angeles" timezone:
> {code}
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.types._
> val tblPrefix = args(0)
> val schema = new StructType().add("ts", TimestampType)
> val rows = sc.parallelize(Seq(
>   "2015-12-31 23:50:59.123",
>   "2015-12-31 22:49:59.123",
>   "2016-01-01 00:39:59.123",
>   "2016-01-01 01:29:59.123"
> ).map { x => Row(java.sql.Timestamp.valueOf(x)) })
> val rawData = spark.createDataFrame(rows, schema).toDF()
> rawData.show()
> Seq("parquet", "textfile").foreach { format =>
>   val tblName = s"${tblPrefix}_$format"
>   spark.sql(s"DROP TABLE IF EXISTS $tblName")
>   spark.sql(
> raw"""CREATE TABLE $tblName (
>   |  ts timestamp
>   | )
>   | STORED AS $format
>  """.stripMargin)
>   rawData.write.insertInto(tblName)
> }
> rawData.write.json(s"${tblPrefix}_json")
> {code}
> Then I start a spark-shell in "America/New_York" timezone, and read the data 
> back from each table:
> {code}
> scala> spark.sql("select * from la_parquet").collect().foreach{println}
> [2016-01-01 02:50:59.123]
> [2016-01-01 01:49:59.123]
> [2016-01-01 03:39:59.123]
> [2016-01-01 04:29:59.123]
> scala> spark.sql("select * from la_textfile").collect().foreach{println}
> [2015-12-31 23:50:59.123]
> [2015-12-31 22:49:59.123]
> [2016-01-01 00:39:59.123]
> [2016-01-01 01:29:59.123]
> scala> spark.read.json("la_json").collect().foreach{println}
> [2015-12-31 23:50:59.123]
> [2015-12-31 22:49:59.123]
> [2016-01-01 00:39:59.123]
> [2016-01-01 01:29:59.123]
> scala> spark.read.json("la_json").join(spark.sql("select * from 
> la_textfile"), "ts").show()
> ++
> |  ts|
> ++
> |2015-12-31 23:50:...|
> |2015-12-31 22:49:...|
> |2016-01-01 00:39:...|
> |2016-01-01 01:29:...|
> ++
> scala> spark.read.json("la_json").join(spark.sql("select * from la_parquet"), 
> "ts").show()
> +---+
> | ts|
> +---+
> +---+
> {code}
> The textfile and JSON data show the same times and can be joined 
> against each other, while the times from the Parquet data have changed (and 
> the joins obviously fail).
> 

[jira] [Resolved] (SPARK-20627) Remove pip local version string (PEP440)

2017-05-09 Thread holdenk (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk resolved SPARK-20627.
-
   Resolution: Fixed
Fix Version/s: 2.1.2
   2.3.0
   2.2.1

Issue resolved by pull request 17885
[https://github.com/apache/spark/pull/17885]

> Remove pip local version string (PEP440)
> 
>
> Key: SPARK-20627
> URL: https://issues.apache.org/jira/browse/SPARK-20627
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.1.0, 2.1.1, 2.1.2, 2.2.0, 2.3.0
>Reporter: holdenk
> Fix For: 2.2.1, 2.3.0, 2.1.2
>
>
> In the make-distribution script we currently append the Hadoop version 
> string, but this makes uploading to PyPI difficult, and we don't cross-build 
> for multiple versions anymore.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20627) Remove pip local version string (PEP440)

2017-05-09 Thread holdenk (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk reassigned SPARK-20627:
---

Assignee: holdenk

> Remove pip local version string (PEP440)
> 
>
> Key: SPARK-20627
> URL: https://issues.apache.org/jira/browse/SPARK-20627
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.1.0, 2.1.1, 2.1.2, 2.2.0, 2.3.0
>Reporter: holdenk
>Assignee: holdenk
> Fix For: 2.1.2, 2.2.1, 2.3.0
>
>
> In the make-distribution script we currently append the Hadoop version 
> string, but this makes uploading to PyPI difficult, and we don't cross-build 
> for multiple versions anymore.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20666) Flaky test - SparkListenerBus randomly failing java.lang.IllegalAccessError

2017-05-09 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin updated SPARK-20666:
---
Summary: Flaky test - SparkListenerBus randomly failing 
java.lang.IllegalAccessError  (was: Flaky test - SparkListenerBus randomly 
failing java.lang.IllegalAccessError on Windows)

> Flaky test - SparkListenerBus randomly failing java.lang.IllegalAccessError
> ---
>
> Key: SPARK-20666
> URL: https://issues.apache.org/jira/browse/SPARK-20666
> Project: Spark
>  Issue Type: Bug
>  Components: ML, Spark Core, SparkR
>Affects Versions: 2.3.0
>Reporter: Felix Cheung
>Priority: Critical
>
> Seeing quite a bit of this on AppVeyor (i.e., Windows only), and it seems to 
> happen only when running ML tests:
> {code}
> Exception in thread "SparkListenerBus" java.lang.IllegalAccessError: 
> Attempted to access garbage collected accumulator 159454
>   at 
> org.apache.spark.util.AccumulatorContext$$anonfun$get$1.apply(AccumulatorV2.scala:265)
>   at 
> org.apache.spark.util.AccumulatorContext$$anonfun$get$1.apply(AccumulatorV2.scala:261)
>   at scala.Option.map(Option.scala:146)
>   at 
> org.apache.spark.util.AccumulatorContext$.get(AccumulatorV2.scala:261)
>   at org.apache.spark.util.AccumulatorV2.name(AccumulatorV2.scala:88)
>   at 
> org.apache.spark.sql.execution.metric.SQLMetric.toInfo(SQLMetrics.scala:67)
>   at 
> org.apache.spark.sql.execution.ui.SQLListener$$anonfun$onTaskEnd$1.apply(SQLListener.scala:216)
>   at 
> org.apache.spark.sql.execution.ui.SQLListener$$anonfun$onTaskEnd$1.apply(SQLListener.scala:216)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>   at 
> org.apache.spark.sql.execution.ui.SQLListener.onTaskEnd(SQLListener.scala:216)
>   at 
> org.apache.spark.scheduler.SparkListenerBus$class.doPostEvent(SparkListenerBus.scala:45)
>   at 
> org.apache.spark.scheduler.LiveListenerBus.doPostEvent(LiveListenerBus.scala:36)
>   at 
> org.apache.spark.scheduler.LiveListenerBus.doPostEvent(LiveListenerBus.scala:36)
>   at 
> org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:63)
>   at 
> org.apache.spark.scheduler.LiveListenerBus.postToAll(LiveListenerBus.scala:36)
>   at 
> org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(LiveListenerBus.scala:94)
>   at 
> org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:79)
>   at 
> org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:79)
>   at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
>   at 
> org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(LiveListenerBus.scala:78)
>   at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1268)
>   at 
> org.apache.spark.scheduler.LiveListenerBus$$anon$1.run(LiveListenerBus.scala:77)
> 1
> MLlib recommendation algorithms: Spark package found in SPARK_HOME: 
> C:\projects\spark\bin\..
> {code}
> {code}
> java.lang.IllegalStateException: SparkContext has been shutdown
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2015)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2044)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2063)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2088)
>   at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:936)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
>   at org.apache.spark.rdd.RDD.collect(RDD.scala:935)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:275)
>   at 
> org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:2923)
>   at 
> org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2474)
>   at 
> org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2474)
>   at 

[jira] [Commented] (SPARK-20666) Flaky test - SparkListenerBus randomly failing java.lang.IllegalAccessError on Windows

2017-05-09 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16003116#comment-16003116
 ] 

Marcelo Vanzin commented on SPARK-20666:


I'm raising the severity because lots of PRs are failing due to these kinds of 
errors...

> Flaky test - SparkListenerBus randomly failing java.lang.IllegalAccessError 
> on Windows
> --
>
> Key: SPARK-20666
> URL: https://issues.apache.org/jira/browse/SPARK-20666
> Project: Spark
>  Issue Type: Bug
>  Components: ML, Spark Core, SparkR
>Affects Versions: 2.3.0
>Reporter: Felix Cheung
>
> Seeing quite a bit of this on AppVeyor (i.e., Windows only), and it seems to 
> happen only when running ML tests:
> {code}
> Exception in thread "SparkListenerBus" java.lang.IllegalAccessError: 
> Attempted to access garbage collected accumulator 159454
>   at 
> org.apache.spark.util.AccumulatorContext$$anonfun$get$1.apply(AccumulatorV2.scala:265)
>   at 
> org.apache.spark.util.AccumulatorContext$$anonfun$get$1.apply(AccumulatorV2.scala:261)
>   at scala.Option.map(Option.scala:146)
>   at 
> org.apache.spark.util.AccumulatorContext$.get(AccumulatorV2.scala:261)
>   at org.apache.spark.util.AccumulatorV2.name(AccumulatorV2.scala:88)
>   at 
> org.apache.spark.sql.execution.metric.SQLMetric.toInfo(SQLMetrics.scala:67)
>   at 
> org.apache.spark.sql.execution.ui.SQLListener$$anonfun$onTaskEnd$1.apply(SQLListener.scala:216)
>   at 
> org.apache.spark.sql.execution.ui.SQLListener$$anonfun$onTaskEnd$1.apply(SQLListener.scala:216)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>   at 
> org.apache.spark.sql.execution.ui.SQLListener.onTaskEnd(SQLListener.scala:216)
>   at 
> org.apache.spark.scheduler.SparkListenerBus$class.doPostEvent(SparkListenerBus.scala:45)
>   at 
> org.apache.spark.scheduler.LiveListenerBus.doPostEvent(LiveListenerBus.scala:36)
>   at 
> org.apache.spark.scheduler.LiveListenerBus.doPostEvent(LiveListenerBus.scala:36)
>   at 
> org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:63)
>   at 
> org.apache.spark.scheduler.LiveListenerBus.postToAll(LiveListenerBus.scala:36)
>   at 
> org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(LiveListenerBus.scala:94)
>   at 
> org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:79)
>   at 
> org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:79)
>   at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
>   at 
> org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(LiveListenerBus.scala:78)
>   at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1268)
>   at 
> org.apache.spark.scheduler.LiveListenerBus$$anon$1.run(LiveListenerBus.scala:77)
> 1
> MLlib recommendation algorithms: Spark package found in SPARK_HOME: 
> C:\projects\spark\bin\..
> {code}
> {code}
> java.lang.IllegalStateException: SparkContext has been shutdown
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2015)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2044)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2063)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2088)
>   at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:936)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
>   at org.apache.spark.rdd.RDD.collect(RDD.scala:935)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:275)
>   at 
> org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:2923)
>   at 
> org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2474)
>   at 
> org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2474)
>   at org.apache.spark.sql.Dataset$$anonfun$57.apply(Dataset.scala:2907)
>   at 
> 

[jira] [Updated] (SPARK-20666) Flaky test - SparkListenerBus randomly failing java.lang.IllegalAccessError on Windows

2017-05-09 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin updated SPARK-20666:
---
Priority: Critical  (was: Major)

> Flaky test - SparkListenerBus randomly failing java.lang.IllegalAccessError 
> on Windows
> --
>
> Key: SPARK-20666
> URL: https://issues.apache.org/jira/browse/SPARK-20666
> Project: Spark
>  Issue Type: Bug
>  Components: ML, Spark Core, SparkR
>Affects Versions: 2.3.0
>Reporter: Felix Cheung
>Priority: Critical
>
> Seeing quite a bit of this on AppVeyor (i.e., Windows only), and it seems to 
> happen only when running ML tests:
> {code}
> Exception in thread "SparkListenerBus" java.lang.IllegalAccessError: 
> Attempted to access garbage collected accumulator 159454
>   at 
> org.apache.spark.util.AccumulatorContext$$anonfun$get$1.apply(AccumulatorV2.scala:265)
>   at 
> org.apache.spark.util.AccumulatorContext$$anonfun$get$1.apply(AccumulatorV2.scala:261)
>   at scala.Option.map(Option.scala:146)
>   at 
> org.apache.spark.util.AccumulatorContext$.get(AccumulatorV2.scala:261)
>   at org.apache.spark.util.AccumulatorV2.name(AccumulatorV2.scala:88)
>   at 
> org.apache.spark.sql.execution.metric.SQLMetric.toInfo(SQLMetrics.scala:67)
>   at 
> org.apache.spark.sql.execution.ui.SQLListener$$anonfun$onTaskEnd$1.apply(SQLListener.scala:216)
>   at 
> org.apache.spark.sql.execution.ui.SQLListener$$anonfun$onTaskEnd$1.apply(SQLListener.scala:216)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>   at 
> org.apache.spark.sql.execution.ui.SQLListener.onTaskEnd(SQLListener.scala:216)
>   at 
> org.apache.spark.scheduler.SparkListenerBus$class.doPostEvent(SparkListenerBus.scala:45)
>   at 
> org.apache.spark.scheduler.LiveListenerBus.doPostEvent(LiveListenerBus.scala:36)
>   at 
> org.apache.spark.scheduler.LiveListenerBus.doPostEvent(LiveListenerBus.scala:36)
>   at 
> org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:63)
>   at 
> org.apache.spark.scheduler.LiveListenerBus.postToAll(LiveListenerBus.scala:36)
>   at 
> org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(LiveListenerBus.scala:94)
>   at 
> org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:79)
>   at 
> org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:79)
>   at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
>   at 
> org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(LiveListenerBus.scala:78)
>   at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1268)
>   at 
> org.apache.spark.scheduler.LiveListenerBus$$anon$1.run(LiveListenerBus.scala:77)
> 1
> MLlib recommendation algorithms: Spark package found in SPARK_HOME: 
> C:\projects\spark\bin\..
> {code}
> {code}
> java.lang.IllegalStateException: SparkContext has been shutdown
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2015)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2044)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2063)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2088)
>   at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:936)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
>   at org.apache.spark.rdd.RDD.collect(RDD.scala:935)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:275)
>   at 
> org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:2923)
>   at 
> org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2474)
>   at 
> org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2474)
>   at org.apache.spark.sql.Dataset$$anonfun$57.apply(Dataset.scala:2907)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65)
>   at 

[jira] [Commented] (SPARK-10408) Autoencoder

2017-05-09 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16003110#comment-16003110
 ] 

Alexander Ulanov commented on SPARK-10408:
--

The autoencoder is implemented in the referenced pull request. I will be glad to 
follow up if anyone can do the code review.

> Autoencoder
> ---
>
> Key: SPARK-10408
> URL: https://issues.apache.org/jira/browse/SPARK-10408
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 1.5.0
>Reporter: Alexander Ulanov
>Assignee: Alexander Ulanov
>
> Goal: Implement various types of autoencoders 
> Requirements:
> 1) Basic (deep) autoencoder that supports different types of inputs: binary, 
> real in [0..1], real in [-inf, +inf] 
> 2) Sparse autoencoder, i.e. L1 regularization. It should be added as a feature 
> to the MLP and then used here 
> 3) Denoising autoencoder 
> 4) Stacked autoencoder for pre-training of deep networks. It should support 
> arbitrary network layers
> References: 
> 1. Vincent, Pascal, et al. "Extracting and composing robust features with 
> denoising autoencoders." Proceedings of the 25th international conference on 
> Machine learning. ACM, 2008. 
> http://www.iro.umontreal.ca/~vincentp/Publications/denoising_autoencoders_tr1316.pdf
>  
> 2. 
> http://machinelearning.wustl.edu/mlpapers/paper_files/ICML2011Rifai_455.pdf, 
> 3. Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., and Manzagol, P.-A. 
> (2010). Stacked denoising autoencoders: Learning useful representations in a 
> deep network with a local denoising criterion. Journal of Machine Learning 
> Research, 11(3371–3408). 
> http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.297.3484=rep1=pdf
> 4, 5, 6. Bengio, Yoshua, et al. "Greedy layer-wise training of deep 
> networks." Advances in neural information processing systems 19 (2007): 153. 
> http://www.iro.umontreal.ca/~lisa/pointeurs/dbn_supervised_tr1282.pdf



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14584) Improve recognition of non-nullability in Dataset transformations

2017-05-09 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16003082#comment-16003082
 ] 

Josh Rosen commented on SPARK-14584:


[~hyukjin.kwon], yeah, this does appear to be fixed. My only question is 
whether we have a test to prevent this behavior from regressing. Do you 
remember where it was fixed / changed?
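
For reference, a minimal spark-shell-style sketch of the kind of check such a 
regression test would make (a hypothetical assertion, reusing the 
{{MyCaseClass}} example from the description and assuming a SparkSession 
{{spark}}):
{code}
import spark.implicits._

case class MyCaseClass(foo: Int)

// After the fix, a primitive-typed case class field should be reported as
// non-nullable in the schema produced by map().
val schema = spark.sparkContext.parallelize(Seq(0)).toDS().map(MyCaseClass).schema
assert(!schema("foo").nullable, "foo is a primitive Int and should be non-nullable")
{code}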

> Improve recognition of non-nullability in Dataset transformations
> -
>
> Key: SPARK-14584
> URL: https://issues.apache.org/jira/browse/SPARK-14584
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Josh Rosen
>
> There are many cases where we can statically know that a field will never be 
> null. For instance, a field in a case class with a primitive type will never 
> return null. However, there are currently several cases in the Dataset API 
> where we do not properly recognize this non-nullability. For instance:
> {code}
> case class MyCaseClass(foo: Int)
> sc.parallelize(Seq(0)).toDS.map(MyCaseClass).printSchema
> {code}
> claims that the {{foo}} field is nullable even though this is impossible.
> I believe that this is due to the way that we reason about nullability when 
> constructing serializer expressions in ExpressionEncoders. The following 
> assertion will currently fail if added to ExpressionEncoder:
> {code}
>   require(schema.size == serializer.size)
>   schema.fields.zip(serializer).foreach { case (field, fieldSerializer) =>
> require(field.dataType == fieldSerializer.dataType, s"Field 
> ${field.name}'s data type is " +
>   s"${field.dataType} in the schema but ${fieldSerializer.dataType} in 
> its serializer")
> require(field.nullable == fieldSerializer.nullable, s"Field 
> ${field.name}'s nullability is " +
>   s"${field.nullable} in the schema but ${fieldSerializer.nullable} in 
> its serializer")
>   }
> {code}
> Most often, the schema claims that a field is non-nullable while the encoder 
> allows for nullability, but occasionally we see a mismatch in the datatypes 
> due to disagreements over the nullability of nested structs' fields (or 
> fields of structs in arrays).
> I think the problem is that when we're reasoning about nullability in a 
> struct's schema we consider its fields' nullability to be independent of the 
> nullability of the struct itself, whereas in the serializer expressions we 
> are considering those field extraction expressions to be nullable if the 
> input objects themselves can be nullable.
> I'm not sure what's the simplest way to fix this. One proposal would be to 
> leave the serializers unchanged and have ObjectOperator derive its output 
> attributes from an explicitly-passed schema rather than using the 
> serializers' attributes. However, I worry that this might introduce bugs in 
> case the serializer and schema disagree.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20666) Flaky test - SparkListenerBus randomly failing java.lang.IllegalAccessError on Windows

2017-05-09 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16003055#comment-16003055
 ] 

Marcelo Vanzin commented on SPARK-20666:


[~cloud_fan] do you think your fix for SPARK-12837 could have caused this? I've 
only noticed this error recently, and that's the only recent change in 
accumulator code...

A quick look at it didn't flag anything, though.

> Flaky test - SparkListenerBus randomly failing java.lang.IllegalAccessError 
> on Windows
> --
>
> Key: SPARK-20666
> URL: https://issues.apache.org/jira/browse/SPARK-20666
> Project: Spark
>  Issue Type: Bug
>  Components: ML, Spark Core, SparkR
>Affects Versions: 2.3.0
>Reporter: Felix Cheung
>
> Seeing quite a bit of this on AppVeyor (i.e., Windows only), and it seems to 
> happen only when running ML tests:
> {code}
> Exception in thread "SparkListenerBus" java.lang.IllegalAccessError: 
> Attempted to access garbage collected accumulator 159454
>   at 
> org.apache.spark.util.AccumulatorContext$$anonfun$get$1.apply(AccumulatorV2.scala:265)
>   at 
> org.apache.spark.util.AccumulatorContext$$anonfun$get$1.apply(AccumulatorV2.scala:261)
>   at scala.Option.map(Option.scala:146)
>   at 
> org.apache.spark.util.AccumulatorContext$.get(AccumulatorV2.scala:261)
>   at org.apache.spark.util.AccumulatorV2.name(AccumulatorV2.scala:88)
>   at 
> org.apache.spark.sql.execution.metric.SQLMetric.toInfo(SQLMetrics.scala:67)
>   at 
> org.apache.spark.sql.execution.ui.SQLListener$$anonfun$onTaskEnd$1.apply(SQLListener.scala:216)
>   at 
> org.apache.spark.sql.execution.ui.SQLListener$$anonfun$onTaskEnd$1.apply(SQLListener.scala:216)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>   at 
> org.apache.spark.sql.execution.ui.SQLListener.onTaskEnd(SQLListener.scala:216)
>   at 
> org.apache.spark.scheduler.SparkListenerBus$class.doPostEvent(SparkListenerBus.scala:45)
>   at 
> org.apache.spark.scheduler.LiveListenerBus.doPostEvent(LiveListenerBus.scala:36)
>   at 
> org.apache.spark.scheduler.LiveListenerBus.doPostEvent(LiveListenerBus.scala:36)
>   at 
> org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:63)
>   at 
> org.apache.spark.scheduler.LiveListenerBus.postToAll(LiveListenerBus.scala:36)
>   at 
> org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(LiveListenerBus.scala:94)
>   at 
> org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:79)
>   at 
> org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:79)
>   at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
>   at 
> org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(LiveListenerBus.scala:78)
>   at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1268)
>   at 
> org.apache.spark.scheduler.LiveListenerBus$$anon$1.run(LiveListenerBus.scala:77)
> 1
> MLlib recommendation algorithms: Spark package found in SPARK_HOME: 
> C:\projects\spark\bin\..
> {code}
> {code}
> java.lang.IllegalStateException: SparkContext has been shutdown
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2015)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2044)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2063)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2088)
>   at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:936)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
>   at org.apache.spark.rdd.RDD.collect(RDD.scala:935)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:275)
>   at 
> org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:2923)
>   at 
> org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2474)
>   at 
> org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2474)
> 

[jira] [Comment Edited] (SPARK-20666) Flaky test - SparkListenerBus randomly failing java.lang.IllegalAccessError on Windows

2017-05-09 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16002962#comment-16002962
 ] 

Marcelo Vanzin edited comment on SPARK-20666 at 5/9/17 4:52 PM:


This doesn't fail just on Windows. It's been failing in PR builders also (e.g. 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76605/testReport).

In a private branch of mine I ran into this problem and have this workaround:
https://github.com/vanzin/spark/pull/22/files#diff-edd374dbb96bc16363b65dab1e554793R114

But that's not cleanly "backportable" at the moment, and I don't know if it's 
the right solution either.

(BTW my fix is in SQL, which also suffers from this, so it would not help these 
mllib tests...)


was (Author: vanzin):
This doesn't fail just on Windows. It's been failing in PR builders also (e.g. 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76605/testReport).

In a private branch of mine I ran into this problem and have this workaround:
https://github.com/vanzin/spark/pull/22/files#diff-edd374dbb96bc16363b65dab1e554793R114

But that's not cleanly "backportable" at the moment, and I don't know if it's 
the right solution either.

(BTW my fix is in SQL, which also suffers from this, so it would help these 
mllib tests...)

> Flaky test - SparkListenerBus randomly failing java.lang.IllegalAccessError 
> on Windows
> --
>
> Key: SPARK-20666
> URL: https://issues.apache.org/jira/browse/SPARK-20666
> Project: Spark
>  Issue Type: Bug
>  Components: ML, Spark Core, SparkR
>Affects Versions: 2.3.0
>Reporter: Felix Cheung
>
> Seeing quite a bit of this on AppVeyor (i.e., Windows only), and it seems to 
> happen only when running ML tests:
> {code}
> Exception in thread "SparkListenerBus" java.lang.IllegalAccessError: 
> Attempted to access garbage collected accumulator 159454
>   at 
> org.apache.spark.util.AccumulatorContext$$anonfun$get$1.apply(AccumulatorV2.scala:265)
>   at 
> org.apache.spark.util.AccumulatorContext$$anonfun$get$1.apply(AccumulatorV2.scala:261)
>   at scala.Option.map(Option.scala:146)
>   at 
> org.apache.spark.util.AccumulatorContext$.get(AccumulatorV2.scala:261)
>   at org.apache.spark.util.AccumulatorV2.name(AccumulatorV2.scala:88)
>   at 
> org.apache.spark.sql.execution.metric.SQLMetric.toInfo(SQLMetrics.scala:67)
>   at 
> org.apache.spark.sql.execution.ui.SQLListener$$anonfun$onTaskEnd$1.apply(SQLListener.scala:216)
>   at 
> org.apache.spark.sql.execution.ui.SQLListener$$anonfun$onTaskEnd$1.apply(SQLListener.scala:216)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>   at 
> org.apache.spark.sql.execution.ui.SQLListener.onTaskEnd(SQLListener.scala:216)
>   at 
> org.apache.spark.scheduler.SparkListenerBus$class.doPostEvent(SparkListenerBus.scala:45)
>   at 
> org.apache.spark.scheduler.LiveListenerBus.doPostEvent(LiveListenerBus.scala:36)
>   at 
> org.apache.spark.scheduler.LiveListenerBus.doPostEvent(LiveListenerBus.scala:36)
>   at 
> org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:63)
>   at 
> org.apache.spark.scheduler.LiveListenerBus.postToAll(LiveListenerBus.scala:36)
>   at 
> org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(LiveListenerBus.scala:94)
>   at 
> org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:79)
>   at 
> org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:79)
>   at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
>   at 
> org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(LiveListenerBus.scala:78)
>   at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1268)
>   at 
> org.apache.spark.scheduler.LiveListenerBus$$anon$1.run(LiveListenerBus.scala:77)
> 1
> MLlib recommendation algorithms: Spark package found in SPARK_HOME: 
> C:\projects\spark\bin\..
> {code}
> {code}
> java.lang.IllegalStateException: SparkContext has been shutdown
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2015)
>   at 

[jira] [Resolved] (SPARK-20674) Support registering UserDefinedFunction as named UDF

2017-05-09 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-20674.
-
   Resolution: Fixed
Fix Version/s: 2.2.0

> Support registering UserDefinedFunction as named UDF
> 
>
> Key: SPARK-20674
> URL: https://issues.apache.org/jira/browse/SPARK-20674
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 2.2.0
>
>
> For some reason we don't have an API to register a UserDefinedFunction as a 
> named UDF. It is a no-brainer to add one, in addition to the existing register 
> functions we have.
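
For illustration, a minimal sketch of the kind of API this adds (assuming the 
2.2.0 overload looks like {{spark.udf.register(name, udf)}}):
{code}
import org.apache.spark.sql.functions.udf

// Build an unnamed UserDefinedFunction, then register it under a name so it is
// also callable from SQL.
val toUpper = udf((s: String) => s.toUpperCase)
spark.udf.register("to_upper", toUpper)
spark.sql("SELECT to_upper('spark')").show()
{code}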



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20666) Flaky test - SparkListenerBus randomly failing java.lang.IllegalAccessError on Windows

2017-05-09 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16002962#comment-16002962
 ] 

Marcelo Vanzin commented on SPARK-20666:


This doesn't fail just on Windows. It's been failing in PR builders also (e.g. 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76605/testReport).

In a private branch of mine I ran into this problem and have this workaround:
https://github.com/vanzin/spark/pull/22/files#diff-edd374dbb96bc16363b65dab1e554793R114

But that's not cleanly "backportable" at the moment, and I don't know if it's 
the right solution either.

> Flaky test - SparkListenerBus randomly failing java.lang.IllegalAccessError 
> on Windows
> --
>
> Key: SPARK-20666
> URL: https://issues.apache.org/jira/browse/SPARK-20666
> Project: Spark
>  Issue Type: Bug
>  Components: ML, Spark Core, SparkR
>Affects Versions: 2.3.0
>Reporter: Felix Cheung
>
> Seeing quite a bit of this on AppVeyor (i.e., Windows only), and it seems to 
> happen only when running ML tests:
> {code}
> Exception in thread "SparkListenerBus" java.lang.IllegalAccessError: 
> Attempted to access garbage collected accumulator 159454
>   at 
> org.apache.spark.util.AccumulatorContext$$anonfun$get$1.apply(AccumulatorV2.scala:265)
>   at 
> org.apache.spark.util.AccumulatorContext$$anonfun$get$1.apply(AccumulatorV2.scala:261)
>   at scala.Option.map(Option.scala:146)
>   at 
> org.apache.spark.util.AccumulatorContext$.get(AccumulatorV2.scala:261)
>   at org.apache.spark.util.AccumulatorV2.name(AccumulatorV2.scala:88)
>   at 
> org.apache.spark.sql.execution.metric.SQLMetric.toInfo(SQLMetrics.scala:67)
>   at 
> org.apache.spark.sql.execution.ui.SQLListener$$anonfun$onTaskEnd$1.apply(SQLListener.scala:216)
>   at 
> org.apache.spark.sql.execution.ui.SQLListener$$anonfun$onTaskEnd$1.apply(SQLListener.scala:216)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>   at 
> org.apache.spark.sql.execution.ui.SQLListener.onTaskEnd(SQLListener.scala:216)
>   at 
> org.apache.spark.scheduler.SparkListenerBus$class.doPostEvent(SparkListenerBus.scala:45)
>   at 
> org.apache.spark.scheduler.LiveListenerBus.doPostEvent(LiveListenerBus.scala:36)
>   at 
> org.apache.spark.scheduler.LiveListenerBus.doPostEvent(LiveListenerBus.scala:36)
>   at 
> org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:63)
>   at 
> org.apache.spark.scheduler.LiveListenerBus.postToAll(LiveListenerBus.scala:36)
>   at 
> org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(LiveListenerBus.scala:94)
>   at 
> org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:79)
>   at 
> org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:79)
>   at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
>   at 
> org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(LiveListenerBus.scala:78)
>   at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1268)
>   at 
> org.apache.spark.scheduler.LiveListenerBus$$anon$1.run(LiveListenerBus.scala:77)
> 1
> MLlib recommendation algorithms: Spark package found in SPARK_HOME: 
> C:\projects\spark\bin\..
> {code}
> {code}
> java.lang.IllegalStateException: SparkContext has been shutdown
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2015)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2044)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2063)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2088)
>   at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:936)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
>   at org.apache.spark.rdd.RDD.collect(RDD.scala:935)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:275)
>   at 
> 

[jira] [Comment Edited] (SPARK-20666) Flaky test - SparkListenerBus randomly failing java.lang.IllegalAccessError on Windows

2017-05-09 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16002962#comment-16002962
 ] 

Marcelo Vanzin edited comment on SPARK-20666 at 5/9/17 4:18 PM:


This doesn't fail just on Windows. It's been failing in PR builders also (e.g. 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76605/testReport).

In a private branch of mine I ran into this problem and have this workaround:
https://github.com/vanzin/spark/pull/22/files#diff-edd374dbb96bc16363b65dab1e554793R114

But that's not cleanly "backportable" at the moment, and I don't know if it's 
the right solution either.

(BTW my fix is in SQL, which also suffers from this, so it would help these 
mllib tests...)


was (Author: vanzin):
This doesn't fail just on Windows. It's been failing in PR builders also (e.g. 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76605/testReport).

In a private branch of mine I ran into this problem and have this workaround:
https://github.com/vanzin/spark/pull/22/files#diff-edd374dbb96bc16363b65dab1e554793R114

But that's not cleanly "backportable" at the moment, and I don't know if it's 
the right solution either.

> Flaky test - SparkListenerBus randomly failing java.lang.IllegalAccessError 
> on Windows
> --
>
> Key: SPARK-20666
> URL: https://issues.apache.org/jira/browse/SPARK-20666
> Project: Spark
>  Issue Type: Bug
>  Components: ML, Spark Core, SparkR
>Affects Versions: 2.3.0
>Reporter: Felix Cheung
>
> Seeing quite a bit of this on AppVeyor (i.e. Windows only), and seemingly only 
> when running ML tests:
> {code}
> Exception in thread "SparkListenerBus" java.lang.IllegalAccessError: 
> Attempted to access garbage collected accumulator 159454
>   at 
> org.apache.spark.util.AccumulatorContext$$anonfun$get$1.apply(AccumulatorV2.scala:265)
>   at 
> org.apache.spark.util.AccumulatorContext$$anonfun$get$1.apply(AccumulatorV2.scala:261)
>   at scala.Option.map(Option.scala:146)
>   at 
> org.apache.spark.util.AccumulatorContext$.get(AccumulatorV2.scala:261)
>   at org.apache.spark.util.AccumulatorV2.name(AccumulatorV2.scala:88)
>   at 
> org.apache.spark.sql.execution.metric.SQLMetric.toInfo(SQLMetrics.scala:67)
>   at 
> org.apache.spark.sql.execution.ui.SQLListener$$anonfun$onTaskEnd$1.apply(SQLListener.scala:216)
>   at 
> org.apache.spark.sql.execution.ui.SQLListener$$anonfun$onTaskEnd$1.apply(SQLListener.scala:216)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>   at 
> org.apache.spark.sql.execution.ui.SQLListener.onTaskEnd(SQLListener.scala:216)
>   at 
> org.apache.spark.scheduler.SparkListenerBus$class.doPostEvent(SparkListenerBus.scala:45)
>   at 
> org.apache.spark.scheduler.LiveListenerBus.doPostEvent(LiveListenerBus.scala:36)
>   at 
> org.apache.spark.scheduler.LiveListenerBus.doPostEvent(LiveListenerBus.scala:36)
>   at 
> org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:63)
>   at 
> org.apache.spark.scheduler.LiveListenerBus.postToAll(LiveListenerBus.scala:36)
>   at 
> org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(LiveListenerBus.scala:94)
>   at 
> org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:79)
>   at 
> org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:79)
>   at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
>   at 
> org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(LiveListenerBus.scala:78)
>   at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1268)
>   at 
> org.apache.spark.scheduler.LiveListenerBus$$anon$1.run(LiveListenerBus.scala:77)
> 1
> MLlib recommendation algorithms: Spark package found in SPARK_HOME: 
> C:\projects\spark\bin\..
> {code}
> {code}
> java.lang.IllegalStateException: SparkContext has been shutdown
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2015)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2044)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2063)
>   at 

[jira] [Resolved] (SPARK-20548) Flaky Test: ReplSuite.newProductSeqEncoder with REPL defined class

2017-05-09 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-20548.
-
   Resolution: Fixed
Fix Version/s: 2.3.0
   2.2.1

Issue resolved by pull request 17844
[https://github.com/apache/spark/pull/17844]

> Flaky Test:  ReplSuite.newProductSeqEncoder with REPL defined class
> ---
>
> Key: SPARK-20548
> URL: https://issues.apache.org/jira/browse/SPARK-20548
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Sameer Agarwal
>Assignee: Sameer Agarwal
> Fix For: 2.2.1, 2.3.0
>
>
> {{newProductSeqEncoder with REPL defined class}} in {{ReplSuite}} has been 
> failing non-deterministically over the last few days: 
> https://spark-tests.appspot.com/failed-tests
> https://spark.test.databricks.com/job/spark-master-test-sbt-hadoop-2.7/176/testReport/junit/org.apache.spark.repl/ReplSuite/newProductSeqEncoder_with_REPL_defined_class/history/



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20608) Standby namenodes should be allowed to included in yarn.spark.access.namenodes to support HDFS HA

2017-05-09 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16002956#comment-16002956
 ] 

Marcelo Vanzin commented on SPARK-20608:


bq. I think it is unreasonable that spark application fails when one of 
yarn.spark.access.namenodes cannot be accessed.

Yes, that's unreasonable. But they cannot be accessed because *you're providing 
the wrong configuration*. With HA, you should use a namespace address, not a 
namenode address. I can't think of any way to make that clearer.
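To make the suggestion concrete, a hedged sketch of the HA-friendly configuration (assumptions: the Spark property is spark.yarn.access.namenodes and the HA logical nameservice is called "mycluster"; both are illustrative, not taken from this issue):

{code}
import org.apache.spark.SparkConf

// Point Spark at the HA logical nameservice URI instead of individual namenode hosts,
// so failover is handled by the HDFS client rather than by listing standby namenodes.
val conf = new SparkConf()
  .set("spark.yarn.access.namenodes", "hdfs://mycluster")
{code}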

> Standby namenodes should be allowed to included in 
> yarn.spark.access.namenodes to support HDFS HA
> -
>
> Key: SPARK-20608
> URL: https://issues.apache.org/jira/browse/SPARK-20608
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Submit, YARN
>Affects Versions: 2.0.1, 2.1.0
>Reporter: Yuechen Chen
>Priority: Minor
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> If a Spark application needs to access remote namenodes, 
> yarn.spark.access.namenodes only needs to be configured in the spark-submit 
> script, and the Spark client (on YARN) will fetch HDFS credentials periodically.
> If a Hadoop cluster is configured for HA, there is one active namenode and at 
> least one standby namenode. 
> However, if yarn.spark.access.namenodes includes both active and standby 
> namenodes, the Spark application fails because the standby namenode cannot be 
> accessed by Spark, which raises org.apache.hadoop.ipc.StandbyException.
> I think configuring standby namenodes in yarn.spark.access.namenodes would do 
> no harm, and it would let my Spark application survive a Hadoop namenode 
> failover.
> HA Examples:
> Spark-submit script: 
> yarn.spark.access.namenodes=hdfs://namenode01,hdfs://namenode02
> Spark Application Codes:
> dataframe.write.parquet(getActiveNameNode(...) + hdfsPath)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20681) DataFrame.drop doesn't take effect, nor does it raise an error

2017-05-09 Thread Xi Wang (JIRA)
Xi Wang created SPARK-20681:
---

 Summary: DataFrame.drop doesn't take effect, nor does it raise an error
 Key: SPARK-20681
 URL: https://issues.apache.org/jira/browse/SPARK-20681
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.0
Reporter: Xi Wang
Priority: Critical


I am running the following code trying to drop a nested column, but it doesn't 
work, and it doesn't return an error either.

*I read the DF from this JSON:*
{'parent':{'child':{'grandchild':{'val':'1','val_to_be_deleted':'0'}}}}
scala> spark.read.format("json").load("c:/tmp/spark_issue.json")
res0: org.apache.spark.sql.DataFrame = [father: struct>>]

*read the df:*
scala> res0.printSchema
root
|-- parent: struct (nullable = true)
||-- child: struct (nullable = true)
|||-- grandchild: struct (nullable = true)
||||-- val: long (nullable = true)
||||-- val_to_be_deleted: long (nullable = true)


*drop the column (I tried different ways: "quotes", `back-ticks`, col(object), 
...); the column remains anyway:*
scala> res0.drop(col("father.child.grandchild.val_to_be_deleted")).printSchema
root
|-- father: struct (nullable = true)
||-- child: struct (nullable = true)
|||-- grandchild: struct (nullable = true)
||||-- val: long (nullable = true)
||||-- val_to_be_deleted: long (nullable = true)


scala> res0.drop("father.child.grandchild.val_to_be_deleted").printSchema
root
|-- father: struct (nullable = true)
||-- child: struct (nullable = true)
|||-- grandchild: struct (nullable = true)
||||-- val: long (nullable = true)
||||-- val_to_be_deleted: long (nullable = true)



Any help is appreciated.
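A hedged workaround sketch, not a fix: drop() only removes top-level columns, so a 
nested field has to be removed by rebuilding the enclosing struct without it 
(column names follow the schema printed above; "pruned" is an illustrative name):

{code}
import org.apache.spark.sql.functions.{col, struct}

// Rebuild father.child.grandchild keeping only the fields we want; the field to be
// dropped is simply not listed.
val pruned = res0.withColumn(
  "father",
  struct(
    struct(
      struct(col("father.child.grandchild.val").as("val")).as("grandchild")
    ).as("child")
  )
)
pruned.printSchema()
{code}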



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13747) Concurrent execution in SQL doesn't work with Scala ForkJoinPool

2017-05-09 Thread Dmitry Naumenko (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16002852#comment-16002852
 ] 

Dmitry Naumenko commented on SPARK-13747:
-

[~revolucion09] It's a bit of an off-topic discussion. My two cents: from my 
understanding, the fork-join pool in Akka helps keep all processors busy, so you 
can achieve a high message throughput per second (as long as you don't use 
blocking operations). But in a Spark driver program it's not as critical, because 
most of the time it will be waiting for worker nodes anyway. 

> Concurrent execution in SQL doesn't work with Scala ForkJoinPool
> 
>
> Key: SPARK-13747
> URL: https://issues.apache.org/jira/browse/SPARK-13747
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>
> Running the following code may fail
> {code}
> (1 to 100).par.foreach { _ =>
>   println(sc.parallelize(1 to 5).map { i => (i, i) }.toDF("a", "b").count())
> }
> java.lang.IllegalArgumentException: spark.sql.execution.id is already set 
> at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:87)
>  
> at 
> org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:1904) 
> at org.apache.spark.sql.DataFrame.collect(DataFrame.scala:1385) 
> {code}
> This is because SparkContext.runJob can be suspended when using a 
> ForkJoinPool (e.g.,scala.concurrent.ExecutionContext.Implicits.global) as it 
> calls Await.ready (introduced by https://github.com/apache/spark/pull/9264).
> So when SparkContext.runJob is suspended, ForkJoinPool will run another task 
> in the same thread; however, the local properties have been polluted.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13747) Concurrent execution in SQL doesn't work with Scala ForkJoinPool

2017-05-09 Thread Saif Addin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16002814#comment-16002814
 ] 

Saif Addin commented on SPARK-13747:


[~dnaumenko]
Nonetheless, if I am not mistaken, there is evidence that fork-join pools 
provide a significant performance boost in scalable environments, which is why 
Akka uses them by default. Fixed or cached thread pools are considered 
dangerous for production environments.

> Concurrent execution in SQL doesn't work with Scala ForkJoinPool
> 
>
> Key: SPARK-13747
> URL: https://issues.apache.org/jira/browse/SPARK-13747
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>
> Running the following code may fail
> {code}
> (1 to 100).par.foreach { _ =>
>   println(sc.parallelize(1 to 5).map { i => (i, i) }.toDF("a", "b").count())
> }
> java.lang.IllegalArgumentException: spark.sql.execution.id is already set 
> at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:87)
>  
> at 
> org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:1904) 
> at org.apache.spark.sql.DataFrame.collect(DataFrame.scala:1385) 
> {code}
> This is because SparkContext.runJob can be suspended when using a 
> ForkJoinPool (e.g.,scala.concurrent.ExecutionContext.Implicits.global) as it 
> calls Await.ready (introduced by https://github.com/apache/spark/pull/9264).
> So when SparkContext.runJob is suspended, ForkJoinPool will run another task 
> in the same thread; however, the local properties have been polluted.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20355) Display Spark version on history page

2017-05-09 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves resolved SPARK-20355.
---
   Resolution: Fixed
 Assignee: Sanket Reddy
Fix Version/s: 2.3.0

> Display Spark version on history page
> -
>
> Key: SPARK-20355
> URL: https://issues.apache.org/jira/browse/SPARK-20355
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Web UI
>Affects Versions: 2.1.0
>Reporter: Sanket Reddy
>Assignee: Sanket Reddy
>Priority: Minor
> Fix For: 2.3.0
>
>
> The Spark version for a specific application is not displayed on the history 
> page now. It would be nice to show the Spark version in the UI when we click on 
> a specific application.
> Currently there seems to be a way, as SparkListenerLogStart records the 
> application version. So it should be trivial to listen to this event and 
> surface this information in the UI.
> {"Event":"SparkListenerLogStart","Spark 
> Version":"1.6.2.0_2.7.2.7.1604210306_161643"}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20489) Different results in local mode and yarn mode when working with dates (silent corruption due to system timezone setting)

2017-05-09 Thread Rick Moritz (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rick Moritz updated SPARK-20489:

Summary: Different results in local mode and yarn mode when working with 
dates (silent corruption due to system timezone setting)  (was: Different 
results in local mode and yarn mode when working with dates (race condition 
with SimpleDateFormat?))

> Different results in local mode and yarn mode when working with dates (silent 
> corruption due to system timezone setting)
> 
>
> Key: SPARK-20489
> URL: https://issues.apache.org/jira/browse/SPARK-20489
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1, 2.0.2
> Environment: yarn-client mode in Zeppelin, Cloudera 
> Spark2-distribution
>Reporter: Rick Moritz
>Priority: Critical
>
> Running the following code (in Zeppelin, or spark-shell), I get different 
> results, depending on whether I am using local[*] -mode or yarn-client mode:
> {code:title=test case|borderStyle=solid}
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.types._
> import spark.implicits._
> val counter = 1 to 2
> val size = 1 to 3
> val sampleText = spark.createDataFrame(
> sc.parallelize(size)
> .map(Row(_)),
> StructType(Array(StructField("id", IntegerType, nullable=false))
> )
> )
> .withColumn("loadDTS",lit("2017-04-25T10:45:02.2"))
> 
> val rddList = counter.map(
> count => sampleText
> .withColumn("loadDTS2", 
> date_format(date_add(col("loadDTS"),count),"-MM-dd'T'HH:mm:ss.SSS"))
> .drop(col("loadDTS"))
> .withColumnRenamed("loadDTS2","loadDTS")
> .coalesce(4)
> .rdd
> )
> val resultText = spark.createDataFrame(
> spark.sparkContext.union(rddList),
> sampleText.schema
> )
> val testGrouped = resultText.groupBy("id")
> val timestamps = testGrouped.agg(
> max(unix_timestamp($"loadDTS", "-MM-dd'T'HH:mm:ss.SSS")) as 
> "timestamp"
> )
> val loadDateResult = resultText.join(timestamps, "id")
> val filteredresult = loadDateResult.filter($"timestamp" === 
> unix_timestamp($"loadDTS", "-MM-dd'T'HH:mm:ss.SSS"))
> filteredresult.count
> {code}
> The expected result, *3*, is what I obtain in local mode, but as soon as I run 
> fully distributed, I get *0*. If I increase {{size}} to {{1 to 32000}}, I do get 
> some results (depending on the size of {{counter}}) - none of which makes any 
> sense.
> Up to the application of the last filter, at first glance everything looks 
> okay, but then something goes wrong. Potentially this is due to lingering 
> re-use of SimpleDateFormats, but I can't get it to happen in a 
> non-distributed mode. The generated execution plan is the same in each case, 
> as expected.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20489) Different results in local mode and yarn mode when working with dates (race condition with SimpleDateFormat?)

2017-05-09 Thread Rick Moritz (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16002762#comment-16002762
 ] 

Rick Moritz commented on SPARK-20489:
-

Okay, I had a look at timezones, and it appears my driver was in a different 
timezone than my executors. Having put all machines into UTC, I now get the 
expected result.
It helped when I realized that unix timestamps aren't always in UTC, for some 
reason (thanks Jacek, 
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-sql-Expression-UnixTimestamp.html).

I wonder what could be done to avoid these quite bizarre error situations. 
Especially in big data, where in theory clusters could reasonably span 
timezones, and in particular drivers (in client mode, for example) could well be 
"outside" the actual cluster, relying on local timezone settings 
appears to me to be quite fragile.

As an initial measure, I wonder how much effort it would be to make sure that, 
when dates are used, any discrepancy in timezone settings results in an 
exception being thrown. I would consider that a reasonable assertion, with not 
too much overhead.
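As a stop-gap, a hedged sketch of the kind of configuration that keeps driver and executors in one timezone (assumption: pinning every JVM to UTC via standard Spark settings in a driver application; the UTC choice is illustrative):

{code}
import java.util.TimeZone
import org.apache.spark.sql.SparkSession

// Pin the driver JVM...
TimeZone.setDefault(TimeZone.getTimeZone("UTC"))

// ...and the executor JVMs to the same timezone, so date parsing and formatting agree
// across the cluster regardless of each machine's system setting.
val spark = SparkSession.builder()
  .appName("timezone-consistency-sketch")
  .config("spark.executor.extraJavaOptions", "-Duser.timezone=UTC")
  .getOrCreate()
{code}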

> Different results in local mode and yarn mode when working with dates (race 
> condition with SimpleDateFormat?)
> -
>
> Key: SPARK-20489
> URL: https://issues.apache.org/jira/browse/SPARK-20489
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1, 2.0.2
> Environment: yarn-client mode in Zeppelin, Cloudera 
> Spark2-distribution
>Reporter: Rick Moritz
>Priority: Critical
>
> Running the following code (in Zeppelin, or spark-shell), I get different 
> results, depending on whether I am using local[*] -mode or yarn-client mode:
> {code:title=test case|borderStyle=solid}
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.types._
> import spark.implicits._
> val counter = 1 to 2
> val size = 1 to 3
> val sampleText = spark.createDataFrame(
> sc.parallelize(size)
> .map(Row(_)),
> StructType(Array(StructField("id", IntegerType, nullable=false))
> )
> )
> .withColumn("loadDTS",lit("2017-04-25T10:45:02.2"))
> 
> val rddList = counter.map(
> count => sampleText
> .withColumn("loadDTS2", 
> date_format(date_add(col("loadDTS"),count),"-MM-dd'T'HH:mm:ss.SSS"))
> .drop(col("loadDTS"))
> .withColumnRenamed("loadDTS2","loadDTS")
> .coalesce(4)
> .rdd
> )
> val resultText = spark.createDataFrame(
> spark.sparkContext.union(rddList),
> sampleText.schema
> )
> val testGrouped = resultText.groupBy("id")
> val timestamps = testGrouped.agg(
> max(unix_timestamp($"loadDTS", "-MM-dd'T'HH:mm:ss.SSS")) as 
> "timestamp"
> )
> val loadDateResult = resultText.join(timestamps, "id")
> val filteredresult = loadDateResult.filter($"timestamp" === 
> unix_timestamp($"loadDTS", "-MM-dd'T'HH:mm:ss.SSS"))
> filteredresult.count
> {code}
> The expected result, *3*, is what I obtain in local mode, but as soon as I run 
> fully distributed, I get *0*. If I increase {{size}} to {{1 to 32000}}, I do get 
> some results (depending on the size of {{counter}}) - none of which makes any 
> sense.
> Up to the application of the last filter, at first glance everything looks 
> okay, but then something goes wrong. Potentially this is due to lingering 
> re-use of SimpleDateFormats, but I can't get it to happen in a 
> non-distributed mode. The generated execution plan is the same in each case, 
> as expected.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20591) Succeeded tasks num not equal in job page and job detail page on spark web ui when speculative task(s) exist

2017-05-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20591:


Assignee: (was: Apache Spark)

> Succeeded tasks num not equal in job page and job detail page on spark web ui 
> when speculative task(s) exist
> 
>
> Key: SPARK-20591
> URL: https://issues.apache.org/jira/browse/SPARK-20591
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.0.2
>Reporter: Jinhua Fu
>Priority: Minor
> Attachments: job detail page(stages).png, job page.png
>
>
> When spark.speculation is enabled and there are speculative tasks, the 
> succeeded-task count on the job page includes the speculative tasks, but the 
> count on the job detail page (job stages page) does not.
> When the job page shows more succeeded tasks than total tasks, which suggests 
> some tasks ran slowly and were re-executed speculatively, I want to know which 
> tasks and why; I have to check every stage to find the speculative tasks, 
> because they are not included in each stage's succeeded-task count.
> Can it be improved?
> Update: two screenshots attached; the succeeded-task count is 557 on the job 
> page but 550 (by sum) on the job detail page (stages); the extra 7 tasks are 
> speculative tasks.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20591) Succeeded tasks num not equal in job page and job detail page on spark web ui when speculative task(s) exist

2017-05-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20591:


Assignee: Apache Spark

> Succeeded tasks num not equal in job page and job detail page on spark web ui 
> when speculative task(s) exist
> 
>
> Key: SPARK-20591
> URL: https://issues.apache.org/jira/browse/SPARK-20591
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.0.2
>Reporter: Jinhua Fu
>Assignee: Apache Spark
>Priority: Minor
> Attachments: job detail page(stages).png, job page.png
>
>
> When spark.speculation is enabled and there are speculative tasks, the 
> succeeded-task count on the job page includes the speculative tasks, but the 
> count on the job detail page (job stages page) does not.
> When the job page shows more succeeded tasks than total tasks, which suggests 
> some tasks ran slowly and were re-executed speculatively, I want to know which 
> tasks and why; I have to check every stage to find the speculative tasks, 
> because they are not included in each stage's succeeded-task count.
> Can it be improved?
> Update: two screenshots attached; the succeeded-task count is 557 on the job 
> page but 550 (by sum) on the job detail page (stages); the extra 7 tasks are 
> speculative tasks.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20591) Succeeded tasks num not equal in job page and job detail page on spark web ui when speculative task(s) exist

2017-05-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16002686#comment-16002686
 ] 

Apache Spark commented on SPARK-20591:
--

User 'fjh100456' has created a pull request for this issue:
https://github.com/apache/spark/pull/17923

> Succeeded tasks num not equal in job page and job detail page on spark web ui 
> when speculative task(s) exist
> 
>
> Key: SPARK-20591
> URL: https://issues.apache.org/jira/browse/SPARK-20591
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.0.2
>Reporter: Jinhua Fu
>Priority: Minor
> Attachments: job detail page(stages).png, job page.png
>
>
> When spark.speculation is enabled and there are speculative tasks, the 
> succeeded-task count on the job page includes the speculative tasks, but the 
> count on the job detail page (job stages page) does not.
> When the job page shows more succeeded tasks than total tasks, which suggests 
> some tasks ran slowly and were re-executed speculatively, I want to know which 
> tasks and why; I have to check every stage to find the speculative tasks, 
> because they are not included in each stage's succeeded-task count.
> Can it be improved?
> Update: two screenshots attached; the succeeded-task count is 557 on the job 
> page but 550 (by sum) on the job detail page (stages); the extra 7 tasks are 
> speculative tasks.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13747) Concurrent execution in SQL doesn't work with Scala ForkJoinPool

2017-05-09 Thread Dmitry Naumenko (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16002674#comment-16002674
 ] 

Dmitry Naumenko commented on SPARK-13747:
-

[~zsxwing] I've tried to build Spark from the branch in the pull request. I 
didn't manage to make a complete build (had some problems with R dependencies), 
so I've replaced only spark-core.jar, and it seems like the issue still occurs. 
Could you please provide a jar/dist for a re-test? 

As a side note, the fixed-thread-pool solution works well for us. We will 
probably stick with it.
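For readers hitting the same issue, a hedged sketch of the fixed-thread-pool workaround mentioned above (assumptions: a spark-shell session, and the goal is only to avoid the global ForkJoinPool when submitting concurrent Spark SQL jobs; the pool size of 8 is illustrative):

{code}
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration
import spark.implicits._

// A plain fixed-size pool: unlike a ForkJoinPool, a thread blocked in Await does not
// pick up another queued job, so spark.sql.execution.id is not clobbered across jobs.
implicit val ec: ExecutionContext =
  ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(8))

val jobs = (1 to 100).map { _ =>
  Future {
    spark.sparkContext.parallelize(1 to 5).map(i => (i, i)).toDF("a", "b").count()
  }
}
Await.result(Future.sequence(jobs), Duration.Inf)
{code}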

> Concurrent execution in SQL doesn't work with Scala ForkJoinPool
> 
>
> Key: SPARK-13747
> URL: https://issues.apache.org/jira/browse/SPARK-13747
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>
> Running the following code may fail
> {code}
> (1 to 100).par.foreach { _ =>
>   println(sc.parallelize(1 to 5).map { i => (i, i) }.toDF("a", "b").count())
> }
> java.lang.IllegalArgumentException: spark.sql.execution.id is already set 
> at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:87)
>  
> at 
> org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:1904) 
> at org.apache.spark.sql.DataFrame.collect(DataFrame.scala:1385) 
> {code}
> This is because SparkContext.runJob can be suspended when using a 
> ForkJoinPool (e.g.,scala.concurrent.ExecutionContext.Implicits.global) as it 
> calls Await.ready (introduced by https://github.com/apache/spark/pull/9264).
> So when SparkContext.runJob is suspended, ForkJoinPool will run another task 
> in the same thread; however, the local properties have been polluted.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20542) Add an API into Bucketizer that can bin a lot of columns all at once

2017-05-09 Thread Barry Becker (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16002609#comment-16002609
 ] 

Barry Becker commented on SPARK-20542:
--

This is a great improvement, @viirya! According to your results, my case should 
go from about a minute down to 3 seconds. Probably StringIndexer would benefit 
from a similar approach.
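For context, a minimal sketch of the current one-Bucketizer-per-column approach that the proposed multi-column API would replace (column names and split points are illustrative):

{code}
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.feature.Bucketizer

val inputCols = Seq("c1", "c2", "c3")  // imagine thousands of these
val splits = Array(Double.NegativeInfinity, 0.0, 10.0, Double.PositiveInfinity)

// One ML stage per column: this is what blows up query planning and execution when
// the number of continuous columns reaches the thousands.
val stages = inputCols.map { c =>
  new Bucketizer()
    .setInputCol(c)
    .setOutputCol(s"${c}_binned")
    .setSplits(splits)
}
val pipeline = new Pipeline().setStages(stages.toArray[PipelineStage])
{code}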

> Add an API into Bucketizer that can bin a lot of columns all at once
> 
>
> Key: SPARK-20542
> URL: https://issues.apache.org/jira/browse/SPARK-20542
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Liang-Chi Hsieh
>
> Currently ML's Bucketizer can only bin a single column of continuous features. 
> If a dataset has thousands of continuous columns that need to be binned, we end 
> up with thousands of ML stages. That is very inefficient for query planning 
> and execution.
> We should have a type of bucketizer that can bin many columns all at 
> once. It would need to accept a list of arrays of split points corresponding 
> to the columns to bin, but it could make things more efficient by replacing 
> thousands of stages with just one.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20311) SQL "range(N) as alias" or "range(N) alias" doesn't work

2017-05-09 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-20311:
---

Assignee: Takeshi Yamamuro

> SQL "range(N) as alias" or "range(N) alias" doesn't work
> 
>
> Key: SPARK-20311
> URL: https://issues.apache.org/jira/browse/SPARK-20311
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Juliusz Sompolski
>Assignee: Takeshi Yamamuro
>Priority: Minor
> Fix For: 2.2.1, 2.3.0
>
>
> `select * from range(10) as A;` or `select * from range(10) A;`
> does not work.
> As a workaround, a subquery has to be used:
> `select * from (select * from range(10)) as A;`



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20311) SQL "range(N) as alias" or "range(N) alias" doesn't work

2017-05-09 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-20311.
-
   Resolution: Fixed
Fix Version/s: 2.3.0
   2.2.1

Issue resolved by pull request 17666
[https://github.com/apache/spark/pull/17666]

> SQL "range(N) as alias" or "range(N) alias" doesn't work
> 
>
> Key: SPARK-20311
> URL: https://issues.apache.org/jira/browse/SPARK-20311
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Juliusz Sompolski
>Priority: Minor
> Fix For: 2.2.1, 2.3.0
>
>
> `select * from range(10) as A;` or `select * from range(10) A;`
> does not work.
> As a workaround, a subquery has to be used:
> `select * from (select * from range(10)) as A;`



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19581) running NaiveBayes model with 0 features can crash the executor with a DGEMV error

2017-05-09 Thread Barry Becker (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16002582#comment-16002582
 ] 

Barry Becker commented on SPARK-19581:
--

I think it's just a matter of sending a feature vector of length 0 to the 
NaiveBayes model inducer. If you do that, you will see the error 
regardless of what data you use.
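A hedged reproduction sketch of what the comment above describes (assumptions: a spark-shell session, the ML NaiveBayes estimator with default column names, and all-empty feature vectors; the labels themselves are arbitrary):

{code}
import org.apache.spark.ml.classification.NaiveBayes
import org.apache.spark.ml.linalg.Vectors
import spark.implicits._

// Every row carries a zero-length feature vector, i.e. the "0 features" case.
val emptyFeatures = Seq(
  (0.0, Vectors.dense(Array.empty[Double])),
  (1.0, Vectors.dense(Array.empty[Double]))
).toDF("label", "features")

// Fitting and then transforming with zero-length vectors is expected to hit the vague
// BLAS DGEMV failure quoted in this issue (in the reporter's environment it took down
// the executor / SparkContext).
val model = new NaiveBayes().fit(emptyFeatures)
model.transform(emptyFeatures).show()
{code}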

> running NaiveBayes model with 0 features can crash the executor with a DGEMV 
> error
> 
>
> Key: SPARK-19581
> URL: https://issues.apache.org/jira/browse/SPARK-19581
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.1.0
> Environment: spark development or standalone mode on windows or linux.
>Reporter: Barry Becker
>Priority: Minor
>
> The severity of this bug is high (because nothing should cause spark to crash 
> like this) but the priority may be low (because there is an easy workaround).
> In our application, a user can select features and a target to run the 
> NaiveBayes inducer. If columns have too many values or all one value, they 
> will be removed before we call the inducer to create the model. As a result, 
> there are some cases, where all the features may get removed. When this 
> happens, executors will crash and get restarted (if on a cluster) or spark 
> will crash and need to be manually restarted (if in development mode).
> It looks like NaiveBayes uses BLAS, and BLAS does not handle this case well 
> when it is encountered. It emits this vague error:
> ** On entry to DGEMV  parameter number  6 had an illegal value
> and terminates.
> My code looks like this:
> {code}
>val predictions = model.transform(testData)  // Make predictions
> // figure out how many were correctly predicted
> val numCorrect = predictions.filter(new Column(actualTarget) === new 
> Column(PREDICTION_LABEL_COLUMN)).count()
> val numIncorrect = testRowCount - numCorrect
> {code}
> The failure is at the line that does the count, but it is not the count that 
> causes the problem, it is the model.transform step (where the model contains 
> the NaiveBayes classifier).
> Here is the stack trace (in development mode):
> {code}
> [2017-02-13 06:28:39,946] TRACE evidence.EvidenceVizModel$ [] 
> [akka://JobServer/user/context-supervisor/sql-context] -  done making 
> predictions in 232
>  ** On entry to DGEMV  parameter number  6 had an illegal value
>  ** On entry to DGEMV  parameter number  6 had an illegal value
>  ** On entry to DGEMV  parameter number  6 had an illegal value
> [2017-02-13 06:28:40,506] ERROR .scheduler.LiveListenerBus [] 
> [akka://JobServer/user/context-supervisor/sql-context] - SparkListenerBus has 
> already stopped! Dropping event SparkListenerSQLExecutionEnd(9,1486996120505)
> [2017-02-13 06:28:40,506] ERROR .scheduler.LiveListenerBus [] 
> [akka://JobServer/user/context-supervisor/sql-context] - SparkListenerBus has 
> already stopped! Dropping event 
> SparkListenerStageCompleted(org.apache.spark.scheduler.StageInfo@1f6c4a29)
> [2017-02-13 06:28:40,508] ERROR .scheduler.LiveListenerBus [] 
> [akka://JobServer/user/context-supervisor/sql-context] - SparkListenerBus has 
> already stopped! Dropping event 
> SparkListenerJobEnd(12,1486996120507,JobFailed(org.apache.spark.SparkException:
>  Job 12 cancelled because SparkContext was shut down))
> [2017-02-13 06:28:40,509] ERROR .jobserver.JobManagerActor [] 
> [akka://JobServer/user/context-supervisor/sql-context] - Got Throwable
> org.apache.spark.SparkException: Job 12 cancelled because SparkContext was 
> shut down
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:808)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:806)
> at scala.collection.mutable.HashSet.foreach(HashSet.scala:78)
> at 
> org.apache.spark.scheduler.DAGScheduler.cleanUpAfterSchedulerStop(DAGScheduler.scala:806)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onStop(DAGScheduler.scala:1668)
> at org.apache.spark.util.EventLoop.stop(EventLoop.scala:83)
> at 
> org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1587)
> at 
> org.apache.spark.SparkContext$$anonfun$stop$8.apply$mcV$sp(SparkContext.scala:1826)
> at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1283)
> at org.apache.spark.SparkContext.stop(SparkContext.scala:1825)
> at 
> org.apache.spark.SparkContext$$anonfun$2.apply$mcV$sp(SparkContext.scala:581)
> at 
> org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:216)
> at 
> 

[jira] [Resolved] (SPARK-20667) Cleanup the cataloged metadata after completing the package of sql/core and sql/hive

2017-05-09 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-20667.
-
   Resolution: Fixed
Fix Version/s: 2.3.0
   2.2.1

Issue resolved by pull request 17908
[https://github.com/apache/spark/pull/17908]

> Cleanup the cataloged metadata after completing the package of sql/core and 
> sql/hive
> 
>
> Key: SPARK-20667
> URL: https://issues.apache.org/jira/browse/SPARK-20667
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Xiao Li
>Assignee: Xiao Li
> Fix For: 2.2.1, 2.3.0
>
>
> So far, we do not drop all the cataloged tables after each package. 
> Sometimes, we might hit strange test case errors because the previous test 
> suite did not drop the tables/functions/database. At least, we can first 
> clean up the environment when completing the package of sql/core and sql/hive.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20601) Python API Changes for Constrained Logistic Regression Params

2017-05-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20601:


Assignee: Apache Spark

> Python API Changes for Constrained Logistic Regression Params
> -
>
> Key: SPARK-20601
> URL: https://issues.apache.org/jira/browse/SPARK-20601
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 2.2.0
>Reporter: Bryan Cutler
>Assignee: Apache Spark
>
> With the addition of SPARK-20047 for constrained logistic regression, there 
> are 4 new params not in PySpark lr
> * lowerBoundsOnCoefficients
> * upperBoundsOnCoefficients
> * lowerBoundsOnIntercepts
> * upperBoundsOnIntercepts



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20601) Python API Changes for Constrained Logistic Regression Params

2017-05-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20601:


Assignee: (was: Apache Spark)

> Python API Changes for Constrained Logistic Regression Params
> -
>
> Key: SPARK-20601
> URL: https://issues.apache.org/jira/browse/SPARK-20601
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 2.2.0
>Reporter: Bryan Cutler
>
> With the addition of SPARK-20047 for constrained logistic regression, there 
> are 4 new params not in PySpark lr
> * lowerBoundsOnCoefficients
> * upperBoundsOnCoefficients
> * lowerBoundsOnIntercepts
> * upperBoundsOnIntercepts



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20601) Python API Changes for Constrained Logistic Regression Params

2017-05-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16002493#comment-16002493
 ] 

Apache Spark commented on SPARK-20601:
--

User 'zero323' has created a pull request for this issue:
https://github.com/apache/spark/pull/17922

> Python API Changes for Constrained Logistic Regression Params
> -
>
> Key: SPARK-20601
> URL: https://issues.apache.org/jira/browse/SPARK-20601
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 2.2.0
>Reporter: Bryan Cutler
>
> With the addition of SPARK-20047 for constrained logistic regression, there 
> are 4 new params not in PySpark lr
> * lowerBoundsOnCoefficients
> * upperBoundsOnCoefficients
> * lowerBoundsOnIntercepts
> * upperBoundsOnIntercepts
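For reference, a hedged Scala sketch of the corresponding setters that SPARK-20047 added on the Scala side and that the PySpark API in this ticket would mirror (the bound matrices and vectors below are illustrative values for a binomial model over two features):

{code}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.linalg.{Matrices, Vectors}

// For the binomial case the coefficient bounds are 1 x numFeatures matrices and the
// intercept bounds are length-1 vectors; the 0.0/1.0 values here are illustrative.
val lr = new LogisticRegression()
  .setLowerBoundsOnCoefficients(Matrices.dense(1, 2, Array(0.0, 0.0)))
  .setUpperBoundsOnCoefficients(Matrices.dense(1, 2, Array(1.0, 1.0)))
  .setLowerBoundsOnIntercepts(Vectors.dense(0.0))
  .setUpperBoundsOnIntercepts(Vectors.dense(1.0))
{code}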



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2060) Querying JSON Datasets with SQL and DSL in Spark SQL

2017-05-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16002490#comment-16002490
 ] 

Apache Spark commented on SPARK-2060:
-

User 'zero323' has created a pull request for this issue:
https://github.com/apache/spark/pull/17922

> Querying JSON Datasets with SQL and DSL in Spark SQL
> 
>
> Key: SPARK-2060
> URL: https://issues.apache.org/jira/browse/SPARK-2060
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Yin Huai
> Fix For: 1.0.1, 1.1.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20680) Spark-sql do not support for void column datatype of view

2017-05-09 Thread Lantao Jin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lantao Jin updated SPARK-20680:
---
Summary: Spark-sql do not support for void column datatype of view  (was: 
Spark-sql do not support for column datatype of void in a HIVE view)

> Spark-sql do not support for void column datatype of view
> -
>
> Key: SPARK-20680
> URL: https://issues.apache.org/jira/browse/SPARK-20680
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0, 2.1.1
>Reporter: Lantao Jin
>
> Create a HIVE view:
> {quote}
> hive> create table bad as select 1 x, null z from dual;
> {quote}
> Because there's no type, Hive gives it the VOID type:
> {quote}
> hive> describe bad;
> OK
> x int 
> z void
> {quote}
> In Spark 2.0.x, reading this view behaves normally:
> {quote}
> spark-sql> describe bad;
> x   int NULL
> z   voidNULL
> Time taken: 4.431 seconds, Fetched 2 row(s)
> {quote}
> But in Spark 2.1.x, it fails with SparkException: Cannot recognize hive type 
> string: void
> {quote}
> spark-sql> describe bad;
> 17/05/09 03:12:08 INFO execution.SparkSqlParser: Parsing command: describe bad
> 17/05/09 03:12:08 INFO parser.CatalystSqlParser: Parsing command: int
> 17/05/09 03:12:08 INFO parser.CatalystSqlParser: Parsing command: void
> 17/05/09 03:12:08 ERROR thriftserver.SparkSQLDriver: Failed in [describe bad]
> org.apache.spark.SparkException: Cannot recognize hive type string: void
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.org$apache$spark$sql$hive$client$HiveClientImpl$$fromHiveColumn(HiveClientImpl.scala:789)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11$$anonfun$7.apply(HiveClientImpl.scala:365)
>   
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11$$anonfun$7.apply(HiveClientImpl.scala:365)
>   
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at scala.collection.Iterator$class.foreach(Iterator.scala:893)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
> at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
> at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
> at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
> at scala.collection.AbstractTraversable.map(Traversable.scala:104)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11.apply(HiveClientImpl.scala:365)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11.apply(HiveClientImpl.scala:361)
> Caused by: org.apache.spark.sql.catalyst.parser.ParseException:
> DataType void() is not supported.(line 1, pos 0)
> == SQL ==  
> void   
> ^^^
> ... 61 more
> org.apache.spark.SparkException: Cannot recognize hive type string: void
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20680) Spark-sql do not support for column datatype of void in a HIVE view

2017-05-09 Thread Lantao Jin (JIRA)
Lantao Jin created SPARK-20680:
--

 Summary: Spark-sql do not support for column datatype of void in a 
HIVE view
 Key: SPARK-20680
 URL: https://issues.apache.org/jira/browse/SPARK-20680
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.1, 2.1.0
Reporter: Lantao Jin


Create a HIVE view:
{quote}
hive> create table bad as select 1 x, null z from dual;
{quote}

Because there's no type, Hive gives it the VOID type:
{quote}
hive> describe bad;
OK
x   int 
z   void
{quote}

In Spark 2.0.x, reading this view behaves normally:
{quote}
spark-sql> describe bad;
x   int NULL
z   voidNULL
Time taken: 4.431 seconds, Fetched 2 row(s)
{quote}

But in Spark 2.1.x, it fails with SparkException: Cannot recognize hive type 
string: void
{quote}
spark-sql> describe bad;
17/05/09 03:12:08 INFO execution.SparkSqlParser: Parsing command: describe bad
17/05/09 03:12:08 INFO parser.CatalystSqlParser: Parsing command: int
17/05/09 03:12:08 INFO parser.CatalystSqlParser: Parsing command: void
17/05/09 03:12:08 ERROR thriftserver.SparkSQLDriver: Failed in [describe bad]
org.apache.spark.SparkException: Cannot recognize hive type string: void
at 
org.apache.spark.sql.hive.client.HiveClientImpl.org$apache$spark$sql$hive$client$HiveClientImpl$$fromHiveColumn(HiveClientImpl.scala:789)
at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11$$anonfun$7.apply(HiveClientImpl.scala:365)
  
at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11$$anonfun$7.apply(HiveClientImpl.scala:365)
  
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.AbstractTraversable.map(Traversable.scala:104)
at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11.apply(HiveClientImpl.scala:365)
at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11.apply(HiveClientImpl.scala:361)

Caused by: org.apache.spark.sql.catalyst.parser.ParseException:
DataType void() is not supported.(line 1, pos 0)

== SQL ==  
void   
^^^

... 61 more
org.apache.spark.SparkException: Cannot recognize hive type string: void


{quote}
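A hedged workaround sketch (an assumption on my part, not from the issue): give the NULL column an explicit type when the source table is created, so Hive never records a VOID column and Spark 2.1.x can parse the schema. The table name below is illustrative.

{code}
// CAST(NULL AS STRING) keeps the value NULL but gives the column a concrete type,
// avoiding the unparseable "void" entry in the Hive metastore.
spark.sql("CREATE TABLE bad_typed AS SELECT 1 AS x, CAST(NULL AS STRING) AS z")
spark.sql("DESCRIBE bad_typed").show()
{code}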




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19364) Stream Blocks in Storage Persists Forever when Kinesis Checkpoints are enabled and an exception is thrown

2017-05-09 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-19364:
-
Priority: Major  (was: Blocker)

I am lowering the priority to {{Major}}, as priorities higher than this are 
usually reserved for committers, and I guess this was not set by a committer.

> Stream Blocks in Storage Persists Forever when Kinesis Checkpoints are 
> enabled and an exception is thrown 
> --
>
> Key: SPARK-19364
> URL: https://issues.apache.org/jira/browse/SPARK-19364
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.0.2
> Environment: ubuntu unix
> spark 2.0.2
> application is java
>Reporter: Andrew Milkowski
>
> --- update --- we found that the situation below occurs when we encounter
> "com.amazonaws.services.kinesis.clientlibrary.exceptions.ShutdownException: 
> Can't update checkpoint - instance doesn't hold the lease for this shard"
> https://github.com/awslabs/amazon-kinesis-client/issues/108
> We use an S3 directory (and DynamoDB) to store checkpoints, but when this 
> occurs, blocks should not get stuck; they should continue to be evicted 
> gracefully from memory. Obviously the Kinesis library race condition is a 
> problem unto itself...
> -- exception leading to a block not being freed up --
> SLF4J: Class path contains multiple SLF4J bindings.
> SLF4J: Found binding in 
> [jar:file:/mnt/yarn/usercache/hadoop/filecache/24/__spark_libs__7928020266533182031.zip/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in 
> [jar:file:/usr/lib/hadoop/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an 
> explanation.
> SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
> 17/01/26 13:52:00 ERROR KinesisRecordProcessor: ShutdownException:  Caught 
> shutdown exception, skipping checkpoint.
> com.amazonaws.services.kinesis.clientlibrary.exceptions.ShutdownException: 
> Can't update checkpoint - instance doesn't hold the lease for this shard
>   at 
> com.amazonaws.services.kinesis.clientlibrary.lib.worker.KinesisClientLibLeaseCoordinator.setCheckpoint(KinesisClientLibLeaseCoordinator.java:120)
>   at 
> com.amazonaws.services.kinesis.clientlibrary.lib.worker.RecordProcessorCheckpointer.advancePosition(RecordProcessorCheckpointer.java:216)
>   at 
> com.amazonaws.services.kinesis.clientlibrary.lib.worker.RecordProcessorCheckpointer.checkpoint(RecordProcessorCheckpointer.java:137)
>   at 
> com.amazonaws.services.kinesis.clientlibrary.lib.worker.RecordProcessorCheckpointer.checkpoint(RecordProcessorCheckpointer.java:103)
>   at 
> org.apache.spark.streaming.kinesis.KinesisCheckpointer$$anonfun$checkpoint$1$$anonfun$apply$1.apply$mcV$sp(KinesisCheckpointer.scala:81)
>   at 
> org.apache.spark.streaming.kinesis.KinesisCheckpointer$$anonfun$checkpoint$1$$anonfun$apply$1.apply(KinesisCheckpointer.scala:81)
>   at 
> org.apache.spark.streaming.kinesis.KinesisCheckpointer$$anonfun$checkpoint$1$$anonfun$apply$1.apply(KinesisCheckpointer.scala:81)
>   at scala.util.Try$.apply(Try.scala:192)
>   at 
> org.apache.spark.streaming.kinesis.KinesisRecordProcessor$.retryRandom(KinesisRecordProcessor.scala:144)
>   at 
> org.apache.spark.streaming.kinesis.KinesisCheckpointer$$anonfun$checkpoint$1.apply(KinesisCheckpointer.scala:81)
>   at 
> org.apache.spark.streaming.kinesis.KinesisCheckpointer$$anonfun$checkpoint$1.apply(KinesisCheckpointer.scala:75)
>   at scala.Option.foreach(Option.scala:257)
>   at 
> org.apache.spark.streaming.kinesis.KinesisCheckpointer.checkpoint(KinesisCheckpointer.scala:75)
>   at 
> org.apache.spark.streaming.kinesis.KinesisCheckpointer.org$apache$spark$streaming$kinesis$KinesisCheckpointer$$checkpointAll(KinesisCheckpointer.scala:103)
>   at 
> org.apache.spark.streaming.kinesis.KinesisCheckpointer$$anonfun$1.apply$mcVJ$sp(KinesisCheckpointer.scala:117)
>   at 
> org.apache.spark.streaming.util.RecurringTimer.triggerActionForNextInterval(RecurringTimer.scala:94)
>   at 
> org.apache.spark.streaming.util.RecurringTimer.org$apache$spark$streaming$util$RecurringTimer$$loop(RecurringTimer.scala:106)
>   at 
> org.apache.spark.streaming.util.RecurringTimer$$anon$1.run(RecurringTimer.scala:29)
> We are running standard Kinesis stream ingestion with a Java Spark app and 
> creating a DStream. After running for some time, some stream blocks seem to 
> persist forever and are never cleaned up, which eventually leads to memory 
> depletion on the workers.
> We even tried cleaning RDDs with the following:
> cleaner = ssc.sparkContext().sc().cleaner().get();
> filtered.foreachRDD(new 

[jira] [Updated] (SPARK-20415) SPARK job hangs while writing DataFrame to HDFS

2017-05-09 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-20415:
-
Priority: Major  (was: Blocker)

I am lowering the priority to {{Major}}, as priorities above it are usually 
reserved for committers, and I guess this one was not set by a committer.

> SPARK job hangs while writing DataFrame to HDFS
> ---
>
> Key: SPARK-20415
> URL: https://issues.apache.org/jira/browse/SPARK-20415
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, YARN
>Affects Versions: 2.1.0
> Environment: EMR 5.4.0
>Reporter: P K
>
> We are in the POC phase with Spark. One of the steps reads compressed JSON 
> files that come from sources, "explodes" them into tabular format, and then 
> writes them to HDFS (a minimal sketch of this workload appears after this 
> message). This worked for about three weeks until, a few days ago, the writer 
> started to hang for a particular dataset. I logged in to the worker machines 
> and see this stack trace:
> "Executor task launch worker-0" #39 daemon prio=5 os_prio=0 
> tid=0x7f6210352800 nid=0x4542 runnable [0x7f61f52b3000]
>java.lang.Thread.State: RUNNABLE
> at org.apache.spark.unsafe.Platform.copyMemory(Platform.java:210)
> at 
> org.apache.spark.sql.catalyst.expressions.UnsafeArrayData.writeToMemory(UnsafeArrayData.java:311)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply6_2$(Unknown
>  Source)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.GenerateExec$$anonfun$doExecute$1$$anonfun$apply$9.apply(GenerateExec.scala:111)
> at 
> org.apache.spark.sql.execution.GenerateExec$$anonfun$doExecute$1$$anonfun$apply$9.apply(GenerateExec.scala:109)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
> at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
> at scala.collection.Iterator$JoinIterator.hasNext(Iterator.scala:211)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:243)
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:190)
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:188)
> at 
> org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1341)
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:193)
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:129)
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:128)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
> at org.apache.spark.scheduler.Task.run(Task.scala:99)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> The last messages ever printed in stderr before the hang are:
> 17/04/18 01:41:14 INFO DAGScheduler: Final stage: ResultStage 4 (save at 
> NativeMethodAccessorImpl.java:0)
> 17/04/18 01:41:14 INFO DAGScheduler: Parents of final stage: List()
> 17/04/18 01:41:14 INFO DAGScheduler: Missing parents: List()
> 17/04/18 01:41:14 INFO DAGScheduler: Submitting ResultStage 4 
> (MapPartitionsRDD[31] at save at NativeMethodAccessorImpl.java:0), which has 
> no missing parents
> 17/04/18 01:41:14 INFO MemoryStore: Block broadcast_9 stored as values in 
> 
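For context, a minimal sketch (in Scala; the reporter uses PySpark) of the kind of read-explode-write workload described above. Paths, column names, and the nested-array layout are placeholders, not taken from the report:

{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, explode}

object ExplodeAndWriteSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("explode-and-write-sketch").getOrCreate()

    // Spark picks the codec of compressed JSON (e.g. .gz) from the file extension.
    val raw = spark.read.json("hdfs:///data/incoming/*.json.gz")

    // "Explode" a nested array column into one row per element, i.e. tabular form.
    val tabular = raw
      .withColumn("item", explode(col("items")))
      .selectExpr("id", "item.*")

    tabular.write.mode("overwrite").parquet("hdfs:///data/output/table")

    spark.stop()
  }
}
{code}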

[jira] [Commented] (SPARK-19876) Add OneTime trigger executor

2017-05-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16002393#comment-16002393
 ] 

Apache Spark commented on SPARK-19876:
--

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/17921

> Add OneTime trigger executor
> 
>
> Key: SPARK-19876
> URL: https://issues.apache.org/jira/browse/SPARK-19876
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Tyson Condie
>Assignee: Tyson Condie
> Fix For: 2.2.0
>
>
> The goal is to add a new trigger executor that will process a single trigger 
> then stop. 
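For context, this executor backs the one-time trigger exposed in the 2.2 Structured Streaming API as {{Trigger.Once()}}. A minimal usage sketch, with placeholder paths and schema handling:

{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

object TriggerOnceSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("trigger-once-sketch").getOrCreate()

    // File streams need an explicit schema; here it is borrowed from a static sample.
    val schema = spark.read.json("hdfs:///data/sample").schema

    val input = spark.readStream.schema(schema).json("hdfs:///data/stream-in")

    val query = input.writeStream
      .format("parquet")
      .option("checkpointLocation", "hdfs:///checkpoints/one-time")
      .option("path", "hdfs:///data/stream-out")
      .trigger(Trigger.Once())   // process whatever is available in a single trigger, then stop
      .start()

    query.awaitTermination()
  }
}
{code}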



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20554) Remove usage of scala.language.reflectiveCalls

2017-05-09 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16002378#comment-16002378
 ] 

Sean Owen commented on SPARK-20554:
---

[~umesh9...@gmail.com] are you working on this?

> Remove usage of scala.language.reflectiveCalls
> --
>
> Key: SPARK-20554
> URL: https://issues.apache.org/jira/browse/SPARK-20554
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, Spark Core, SQL, Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Sean Owen
>Priority: Minor
>
> In several parts of the code we have imported 
> {{scala.language.reflectiveCalls}} to suppress a warning about, well, 
> reflective calls. I know from cleaning up build warnings in 2.2 that almost 
> all instances of this are inadvertent and are masking a type problem.
> For example, in HiveDDLSuite:
> {code}
> val expectedTablePath =
>   if (dbPath.isEmpty) {
> hiveContext.sessionState.catalog.defaultTablePath(tableIdentifier)
>   } else {
> new Path(new Path(dbPath.get), tableIdentifier.table)
>   }
> val filesystemPath = new Path(expectedTablePath.toString)
> {code}
> This shouldn't really work, because one branch returns a URI and the other a 
> Path. In this case the code only needs an object with a toString method, so 
> Scala makes it work with a structural type and reflection.
> Obviously, the intent was to add ".toURI" to the second branch to make both 
> branches return a URI!
> I think we should probably clean this up by removing all imports of 
> reflectiveCalls and re-evaluating all of the warnings. There may be a few 
> legitimate usages.
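For illustration, a sketch (not the actual patch) of the kind of fix described above: make both branches of the HiveDDLSuite snippet return the same type, so no structural type, and therefore no {{reflectiveCalls}} import, is involved. Hadoop's {{Path}} exposes the conversion as {{toUri()}}:

{code}
// Both branches now yield a java.net.URI, so the if/else has a concrete type
// and the reflective structural-type call on toString goes away.
val expectedTablePath: java.net.URI =
  if (dbPath.isEmpty) {
    hiveContext.sessionState.catalog.defaultTablePath(tableIdentifier)
  } else {
    new Path(new Path(dbPath.get), tableIdentifier.table).toUri
  }
val filesystemPath = new Path(expectedTablePath.toString)
{code}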



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20580) Allow RDD cache with unserializable objects

2017-05-09 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-20580.
---
Resolution: Not A Problem

Although map() wouldn't generally entail serializing anything, it could cause 
other computations to trigger that do. Think of earlier stages that might 
entail shuffles, caching, or checkpointing. This is generally an 
implementation detail. I don't think it would be possible to support 
unserializable objects in Spark, as most operations wouldn't work, and 
committing to make even a subset work would be difficult and would probably 
still leave confusing semantics for the user. In Java-land you can always 
implement your own serialization of complex third-party types anyway, if you 
must.
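To illustrate that last point, a sketch (not from this issue) of the usual wrapper approach: serialize only a small, serializable description of the object and rebuild the heavy, non-serializable instance lazily after deserialization. {{HeavyNativeModel}} and its contents are hypothetical.

{code}
// Hypothetical non-serializable third-party type.
class HeavyNativeModel(val configPath: String) {
  def score(x: Double): Double = x   // placeholder logic
}

// Serializable wrapper: only the small config string is written out; the heavy
// object is rebuilt lazily wherever the wrapper is deserialized (e.g. after the
// RDD is cached with a serialized storage level and read back).
class ModelWrapper(val configPath: String) extends Serializable {
  @transient lazy val model: HeavyNativeModel = new HeavyNativeModel(configPath)
}

// Usage sketch:
//   val wrapped = sc.parallelize(Seq("/models/a.bin", "/models/b.bin")).map(new ModelWrapper(_))
//   wrapped.cache()
//   wrapped.map(w => w.model.score(1.0)).collect()
{code}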

> Allow RDD cache with unserializable objects
> ---
>
> Key: SPARK-20580
> URL: https://issues.apache.org/jira/browse/SPARK-20580
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.3.0
>Reporter: Fernando Pereira
>Priority: Minor
>
> In my current scenario we load, on the worker nodes, complex Python objects 
> that are not completely serializable. We then apply certain map operations to 
> the RDD, which at some point we collect. In this basic usage everything works 
> well.
> However, if we cache() the RDD (which defaults to memory), it suddenly fails 
> to execute the transformations after the caching step. Apparently caching 
> serializes the RDD data and deserializes it whenever more transformations are 
> required.
> It would be nice to avoid serialization of the objects if they are to be 
> cached to memory, and to keep the original objects.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20606) ML 2.2 QA: Remove deprecated methods for ML

2017-05-09 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang resolved SPARK-20606.
-
   Resolution: Fixed
Fix Version/s: 2.2.0

> ML 2.2 QA: Remove deprecated methods for ML
> ---
>
> Key: SPARK-20606
> URL: https://issues.apache.org/jira/browse/SPARK-20606
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.1.0
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
>Priority: Minor
> Fix For: 2.2.0
>
>
> Remove ML methods we deprecated in 2.1.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


