[jira] [Assigned] (SPARK-20590) Map default input data source formats to inlined classes
[ https://issues.apache.org/jira/browse/SPARK-20590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan reassigned SPARK-20590:
-----------------------------------
    Assignee: Hyukjin Kwon

> Map default input data source formats to inlined classes
> --------------------------------------------------------
>
>                 Key: SPARK-20590
>                 URL: https://issues.apache.org/jira/browse/SPARK-20590
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.2.0
>            Reporter: Sameer Agarwal
>            Assignee: Hyukjin Kwon
>             Fix For: 2.2.1, 2.3.0
>
> A common usability problem when reading data in Spark (particularly CSV) is a conflict between different readers on the classpath.
> For example, if someone launches a 2.x spark-shell with the spark-csv package on the classpath, Spark currently fails in an extremely unfriendly way:
> {code}
> ./bin/spark-shell --packages com.databricks:spark-csv_2.11:1.5.0
> scala> val df = spark.read.csv("/foo/bar.csv")
> java.lang.RuntimeException: Multiple sources found for csv
> (org.apache.spark.sql.execution.datasources.csv.CSVFileFormat,
> com.databricks.spark.csv.DefaultSource15), please specify the fully qualified
> class name.
>   at scala.sys.package$.error(package.scala:27)
>   at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:574)
>   at org.apache.spark.sql.execution.datasources.DataSource.providingClass$lzycompute(DataSource.scala:85)
>   at org.apache.spark.sql.execution.datasources.DataSource.providingClass(DataSource.scala:85)
>   at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:295)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
>   at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:533)
>   at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:412)
>   ... 48 elided
> {code}
> This JIRA proposes a simple fix for this error: always map the default input data source formats to the inlined classes that ship with Spark.
> {code}
> ./bin/spark-shell --packages com.databricks:spark-csv_2.11:1.5.0
> scala> val df = spark.read.csv("/foo/bar.csv")
> df: org.apache.spark.sql.DataFrame = [_c0: string]
> {code}

--
This message was sent by Atlassian JIRA (v6.3.15#6346)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
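The proposed resolution (implemented in PR 17916) can be illustrated with a hedged sketch: keep a fixed map from the default short names to the classes inlined in Spark and consult it before any classpath scan, so an external package registering the same short name can no longer make the lookup ambiguous. The object and method names below are illustrative, not Spark's actual `DataSource.lookupDataSource` internals:

```scala
// Illustrative sketch of the idea only, not Spark's real lookup code.
object DefaultSourceMapping {
  // Short names of the built-in formats, mapped to classes shipped with Spark.
  private val builtin: Map[String, String] = Map(
    "csv"     -> "org.apache.spark.sql.execution.datasources.csv.CSVFileFormat",
    "json"    -> "org.apache.spark.sql.execution.datasources.json.JsonFileFormat",
    "parquet" -> "org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat"
  )

  // Prefer the inlined class for default formats; otherwise fall back to the
  // (possibly ambiguous) providers discovered on the classpath.
  def resolve(provider: String, discovered: Seq[String]): String =
    builtin.getOrElse(provider.toLowerCase, discovered match {
      case Seq(only) => only
      case many      => sys.error(s"Multiple sources found for $provider: ${many.mkString(", ")}")
    })
}
```

With such a mapping, `spark.read.csv(...)` resolves to the built-in `CSVFileFormat` even when `com.databricks:spark-csv` is also on the classpath.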
[jira] [Resolved] (SPARK-20590) Map default input data source formats to inlined classes
[ https://issues.apache.org/jira/browse/SPARK-20590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan resolved SPARK-20590.
---------------------------------
       Resolution: Fixed
    Fix Version/s: 2.3.0
                   2.2.1

Issue resolved by pull request 17916
[https://github.com/apache/spark/pull/17916]

> Map default input data source formats to inlined classes
> --------------------------------------------------------
>
>                 Key: SPARK-20590
>                 URL: https://issues.apache.org/jira/browse/SPARK-20590
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.2.0
>            Reporter: Sameer Agarwal
>             Fix For: 2.2.1, 2.3.0
[jira] [Commented] (SPARK-12837) Spark driver requires large memory space for serialized results even there are no data collected to the driver
[ https://issues.apache.org/jira/browse/SPARK-12837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16004103#comment-16004103 ]

Apache Spark commented on SPARK-12837:
--------------------------------------

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/17931

> Spark driver requires large memory space for serialized results even when no data is collected to the driver
> -------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-12837
>                 URL: https://issues.apache.org/jira/browse/SPARK-12837
>             Project: Spark
>          Issue Type: Question
>          Components: SQL
>    Affects Versions: 1.5.2, 1.6.0
>            Reporter: Tien-Dung LE
>            Assignee: Wenchen Fan
>            Priority: Critical
>             Fix For: 2.2.1, 2.3.0
>
> Executing a SQL statement with a large number of partitions requires a large amount of driver memory, even when no data is requested back at the driver.
> Here are the steps to reproduce the issue.
> 1. Start spark-shell with a spark.driver.maxResultSize setting:
> {code:java}
> bin/spark-shell --driver-memory=1g --conf spark.driver.maxResultSize=1m
> {code}
> 2. Execute the code:
> {code:java}
> case class Toto(a: Int, b: Int)
> val df = sc.parallelize(1 to 1e6.toInt).map(i => Toto(i, i)).toDF
> sqlContext.setConf("spark.sql.shuffle.partitions", "200")
> df.groupBy("a").count().saveAsParquetFile("toto1") // OK
> sqlContext.setConf("spark.sql.shuffle.partitions", 1e3.toInt.toString)
> df.repartition(1e3.toInt).groupBy("a").count().repartition(1e3.toInt).saveAsParquetFile("toto2") // ERROR
> {code}
> The error message is:
> {code:java}
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure:
> Total size of serialized results of 393 tasks (1025.9 KB) is bigger than
> spark.driver.maxResultSize (1024.0 KB)
> {code}
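Two mitigations follow directly from the report, sketched below as an illustrative spark-shell fragment (`spark` is the session provided by the shell; the values are arbitrary): keep the shuffle partition count modest, or raise `spark.driver.maxResultSize` at launch time, since every task ships a small serialized result to the driver even when no rows are collected.

```scala
// Illustrative sketch, not a definitive recommendation.
// Fewer shuffle partitions means fewer per-task results serialized to the driver.
spark.conf.set("spark.sql.shuffle.partitions", "200")
// Alternatively, launch with a larger cap on the aggregate serialized result size:
//   bin/spark-shell --conf spark.driver.maxResultSize=128m
```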
[jira] [Assigned] (SPARK-20688) correctly check analysis for scalar sub-queries
[ https://issues.apache.org/jira/browse/SPARK-20688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-20688:
------------------------------------
    Assignee: Apache Spark  (was: Wenchen Fan)

> correctly check analysis for scalar sub-queries
> -----------------------------------------------
>
>                 Key: SPARK-20688
>                 URL: https://issues.apache.org/jira/browse/SPARK-20688
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.2.0
>            Reporter: Wenchen Fan
>            Assignee: Apache Spark
[jira] [Assigned] (SPARK-20688) correctly check analysis for scalar sub-queries
[ https://issues.apache.org/jira/browse/SPARK-20688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-20688:
------------------------------------
    Assignee: Wenchen Fan  (was: Apache Spark)
[jira] [Commented] (SPARK-20688) correctly check analysis for scalar sub-queries
[ https://issues.apache.org/jira/browse/SPARK-20688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16004060#comment-16004060 ]

Apache Spark commented on SPARK-20688:
--------------------------------------

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/17930
[jira] [Created] (SPARK-20688) correctly check analysis for scalar sub-queries
Wenchen Fan created SPARK-20688:
-----------------------------------

             Summary: correctly check analysis for scalar sub-queries
                 Key: SPARK-20688
                 URL: https://issues.apache.org/jira/browse/SPARK-20688
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.2.0
            Reporter: Wenchen Fan
            Assignee: Wenchen Fan
[jira] [Issue Comment Deleted] (SPARK-20687) mllib.Matrices.fromBreeze may crash when converting breeze CSCMatrix
[ https://issues.apache.org/jira/browse/SPARK-20687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ignacio Bermudez Corrales updated SPARK-20687:
----------------------------------------------
    Comment: was deleted

(was:
{code}
test("breeze conversion bug") {
  // (2, 0, 0)
  // (2, 0, 0)
  val mat1Brz = Matrices.sparse(2, 3, Array(0, 2, 2, 2), Array(0, 1), Array(2, 2)).asBreeze
  // (2, 1E-15, 1E-15)
  // (2, 1E-15, 1E-15)
  val mat2Brz = Matrices.sparse(2, 3, Array(0, 2, 4, 6), Array(0, 0, 0, 1, 1, 1), Array(2, 1E-15, 1E-15, 2, 1E-15, 1E-15)).asBreeze
  // The following shouldn't break
  val t01 = mat1Brz - mat1Brz
  val t02 = mat2Brz - mat2Brz
  val t02Brz = Matrices.fromBreeze(t02)
  val t01Brz = Matrices.fromBreeze(t01)
  val t1Brz = mat1Brz - mat2Brz
  val t2Brz = mat2Brz - mat1Brz
  // The following ones should break
  val t1 = Matrices.fromBreeze(t1Brz)
  val t2 = Matrices.fromBreeze(t2Brz)
}
{code})

> mllib.Matrices.fromBreeze may crash when converting breeze CSCMatrix
> --------------------------------------------------------------------
>
>                 Key: SPARK-20687
>                 URL: https://issues.apache.org/jira/browse/SPARK-20687
>             Project: Spark
>          Issue Type: Bug
>          Components: MLlib
>    Affects Versions: 2.1.1
>            Reporter: Ignacio Bermudez Corrales
>            Priority: Critical
>
> Conversion of Breeze sparse matrices to Matrix is broken when the matrices are the product of certain operations. I think this problem is caused by the update method in Breeze's CSCMatrix, which adds provisional zeros to the data array for efficiency.
> This bug is serious and may affect at least BlockMatrix addition and subtraction:
> http://stackoverflow.com/questions/33528555/error-thrown-when-using-blockmatrix-add/43883458#43883458
> The following code reproduces the bug.
> {code}
> test("breeze conversion bug") {
>   // (2, 0, 0)
>   // (2, 0, 0)
>   val mat1Brz = Matrices.sparse(2, 3, Array(0, 2, 2, 2), Array(0, 1), Array(2, 2)).asBreeze
>   // (2, 1E-15, 1E-15)
>   // (2, 1E-15, 1E-15)
>   val mat2Brz = Matrices.sparse(2, 3, Array(0, 2, 4, 6), Array(0, 0, 0, 1, 1, 1), Array(2, 1E-15, 1E-15, 2, 1E-15, 1E-15)).asBreeze
>   // The following shouldn't break
>   val t01 = mat1Brz - mat1Brz
>   val t02 = mat2Brz - mat2Brz
>   val t02Brz = Matrices.fromBreeze(t02)
>   val t01Brz = Matrices.fromBreeze(t01)
>   val t1Brz = mat1Brz - mat2Brz
>   val t2Brz = mat2Brz - mat1Brz
>   // The following ones should break
>   val t1 = Matrices.fromBreeze(t1Brz)
>   val t2 = Matrices.fromBreeze(t2Brz)
> }
> {code}
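Until the conversion itself is fixed, one possible workaround is to rebuild the breeze result from its explicitly stored non-zero entries before calling `Matrices.fromBreeze`, so the provisional zeros left in the data array cannot trip the conversion. This is a hedged sketch, not the upstream fix, and it assumes breeze's `CSCMatrix.Builder` and `activeIterator` behave as documented:

```scala
import breeze.linalg.CSCMatrix
import org.apache.spark.mllib.linalg.{Matrices, Matrix}

// Workaround sketch (untested): copy only genuinely non-zero entries into a
// fresh CSCMatrix, so stored "provisional" zeros cannot confuse fromBreeze.
def fromBreezeCompacted(m: CSCMatrix[Double]): Matrix = {
  val builder = new CSCMatrix.Builder[Double](m.rows, m.cols)
  m.activeIterator.foreach { case ((i, j), v) =>
    if (v != 0.0) builder.add(i, j, v)
  }
  Matrices.fromBreeze(builder.result())
}
```

Applied to the repro above, `fromBreezeCompacted(t1Brz)` would be used in place of `Matrices.fromBreeze(t1Brz)`.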
[jira] [Created] (SPARK-20687) mllib.Matrices.fromBreeze may crash when converting breeze CSCMatrix
Ignacio Bermudez Corrales created SPARK-20687:
-------------------------------------------------

             Summary: mllib.Matrices.fromBreeze may crash when converting breeze CSCMatrix
                 Key: SPARK-20687
                 URL: https://issues.apache.org/jira/browse/SPARK-20687
             Project: Spark
          Issue Type: Bug
          Components: MLlib
    Affects Versions: 2.1.1
            Reporter: Ignacio Bermudez Corrales
            Priority: Critical
[jira] [Commented] (SPARK-20554) Remove usage of scala.language.reflectiveCalls
[ https://issues.apache.org/jira/browse/SPARK-20554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16004020#comment-16004020 ]

Umesh Chaudhary commented on SPARK-20554:
-----------------------------------------

[~srowen] I have not recently had time to go through all the warnings. Feel free to take it over.

> Remove usage of scala.language.reflectiveCalls
> ----------------------------------------------
>
>                 Key: SPARK-20554
>                 URL: https://issues.apache.org/jira/browse/SPARK-20554
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML, Spark Core, SQL, Structured Streaming
>    Affects Versions: 2.1.0
>            Reporter: Sean Owen
>            Priority: Minor
>
> In several parts of the code we have imported {{scala.language.reflectiveCalls}} to suppress a warning about, well, reflective calls. I know from cleaning up build warnings in 2.2 that almost all cases of this are inadvertent and are masking a type problem.
> Example, in HiveDDLSuite:
> {code}
> val expectedTablePath =
>   if (dbPath.isEmpty) {
>     hiveContext.sessionState.catalog.defaultTablePath(tableIdentifier)
>   } else {
>     new Path(new Path(dbPath.get), tableIdentifier.table)
>   }
> val filesystemPath = new Path(expectedTablePath.toString)
> {code}
> This shouldn't really work, because one branch returns a URI and the other a Path. It compiles only because the code needs nothing more than an object with a toString method, which Scala can express with a structural type and reflection.
> Obviously, the intent was to add {{.toURI}} to the second branch to make both branches a URI!
> I think we should clean this up by removing all imports of reflectiveCalls and re-evaluating all of the warnings. There may be a few legitimate usages.
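The repair the description has in mind can be sketched as follows. This is an illustrative fragment in the context of HiveDDLSuite, not the exact patch: `hiveContext`, `dbPath` and `tableIdentifier` come from that suite, and Hadoop's `Path` spells the conversion `toUri`:

```scala
// Illustrative sketch: unify both branches to a URI so `expectedTablePath`
// has a concrete type and no structural type (hence no
// scala.language.reflectiveCalls import) is needed.
val expectedTablePath: java.net.URI =
  if (dbPath.isEmpty) {
    hiveContext.sessionState.catalog.defaultTablePath(tableIdentifier) // already a URI
  } else {
    new Path(new Path(dbPath.get), tableIdentifier.table).toUri       // Path -> URI
  }
val filesystemPath = new Path(expectedTablePath.toString)
```

With both branches returning `java.net.URI`, the compiler no longer infers a structural type, and the reflective-call warning disappears without any suppression.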
[jira] [Comment Edited] (SPARK-14584) Improve recognition of non-nullability in Dataset transformations
[ https://issues.apache.org/jira/browse/SPARK-14584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16003996#comment-16003996 ]

Hyukjin Kwon edited comment on SPARK-14584 at 5/10/17 3:56 AM:
---------------------------------------------------------------

[~joshrosen] SPARK-18284 fixes this.

Before this PR:
{code}
scala> case class MyCaseClass(foo: Int)
defined class MyCaseClass

scala> sc.parallelize(Seq(0)).toDS.map(MyCaseClass).printSchema
root
 |-- foo: integer (nullable = true)
{code}

After this PR:
{code}
scala> case class MyCaseClass(foo: Int)
defined class MyCaseClass

scala> sc.parallelize(Seq(0)).toDS.map(MyCaseClass).printSchema
root
 |-- foo: integer (nullable = false)
{code}

was (Author: hyukjin.kwon):
[~joshrosen] SPARK-18284 fixes this.

Before this PR:
{code}
scala> case class MyCaseClass(foo: Int)
defined class MyCaseClass

scala> sc.parallelize(Seq(0)).toDS.map(MyCaseClass).printSchema
root
 |-- foo: integer (nullable = false)
{code}

After this PR:
{code}
scala> case class MyCaseClass(foo: Int)
defined class MyCaseClass

scala> sc.parallelize(Seq(0)).toDS.map(MyCaseClass).printSchema
root
 |-- foo: integer (nullable = false)
{code}

> Improve recognition of non-nullability in Dataset transformations
> -----------------------------------------------------------------
>
>                 Key: SPARK-14584
>                 URL: https://issues.apache.org/jira/browse/SPARK-14584
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>            Reporter: Josh Rosen
>
> There are many cases where we can statically know that a field will never be null. For instance, a field in a case class with a primitive type will never return null. However, there are currently several cases in the Dataset API where we do not properly recognize this non-nullability. For instance:
> {code}
> case class MyCaseClass(foo: Int)
> sc.parallelize(Seq(0)).toDS.map(MyCaseClass).printSchema
> {code}
> claims that the {{foo}} field is nullable even though this is impossible.
> I believe that this is due to the way that we reason about nullability when constructing serializer expressions in ExpressionEncoders. The following assertion will currently fail if added to ExpressionEncoder:
> {code}
> require(schema.size == serializer.size)
> schema.fields.zip(serializer).foreach { case (field, fieldSerializer) =>
>   require(field.dataType == fieldSerializer.dataType, s"Field ${field.name}'s data type is " +
>     s"${field.dataType} in the schema but ${fieldSerializer.dataType} in its serializer")
>   require(field.nullable == fieldSerializer.nullable, s"Field ${field.name}'s nullability is " +
>     s"${field.nullable} in the schema but ${fieldSerializer.nullable} in its serializer")
> }
> {code}
> Most often, the schema claims that a field is non-nullable while the encoder allows for nullability, but occasionally we see a mismatch in the datatypes due to disagreements over the nullability of nested structs' fields (or fields of structs in arrays).
> I think the problem is that when we're reasoning about nullability in a struct's schema we consider its fields' nullability to be independent of the nullability of the struct itself, whereas in the serializer expressions we are considering those field extraction expressions to be nullable if the input objects themselves can be nullable.
> I'm not sure what's the simplest way to fix this. One proposal would be to leave the serializers unchanged and have ObjectOperator derive its output attributes from an explicitly-passed schema rather than using the serializers' attributes. However, I worry that this might introduce bugs in case the serializer and schema disagree.
[jira] [Commented] (SPARK-14584) Improve recognition of non-nullability in Dataset transformations
[ https://issues.apache.org/jira/browse/SPARK-14584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16003996#comment-16003996 ]

Hyukjin Kwon commented on SPARK-14584:
--------------------------------------

[~joshrosen] SPARK-18284 fixes this.

Before this PR:
{code}
scala> case class MyCaseClass(foo: Int)
defined class MyCaseClass

scala> sc.parallelize(Seq(0)).toDS.map(MyCaseClass).printSchema
root
 |-- foo: integer (nullable = false)
{code}

After this PR:
{code}
scala> case class MyCaseClass(foo: Int)
defined class MyCaseClass

scala> sc.parallelize(Seq(0)).toDS.map(MyCaseClass).printSchema
root
 |-- foo: integer (nullable = false)
{code}
[jira] [Updated] (SPARK-20665) Spark-sql, "Bround" and "Round" function return NULL
[ https://issues.apache.org/jira/browse/SPARK-20665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

liuxian updated SPARK-20665:
----------------------------
    Summary: Spark-sql, "Bround" and "Round" function return NULL  (was: Spark-sql, "Bround" function return NULL)

> Spark-sql, "Bround" and "Round" function return NULL
> ----------------------------------------------------
>
>                 Key: SPARK-20665
>                 URL: https://issues.apache.org/jira/browse/SPARK-20665
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.2.0
>            Reporter: liuxian
>
> {code}
> > select bround(12.3, 2);
> NULL
> {code}
> For this case, the expected result is 12.3, but NULL is returned.
> "Round" has the same problem:
> {code}
> > select round(12.3, 2);
> NULL
> {code}
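For reference, the semantics the report expects can be checked without Spark: `bround` is documented as HALF_EVEN ("banker's") rounding and `round` as HALF_UP, and both leave 12.3 unchanged at scale 2. A minimal sketch using `java.math.BigDecimal`:

```scala
import java.math.{BigDecimal => JBigDecimal, RoundingMode}

// What bround(12.3, 2) and round(12.3, 2) should evaluate to: 12.30, not NULL.
val v = new JBigDecimal("12.3")
val brounded = v.setScale(2, RoundingMode.HALF_EVEN) // HALF_EVEN backs bround
val rounded  = v.setScale(2, RoundingMode.HALF_UP)   // HALF_UP backs round
```

Since the input has fewer decimal places than the requested scale, neither rounding mode changes the value; returning NULL is clearly wrong in both cases.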
[jira] [Resolved] (SPARK-17685) WholeStageCodegenExec throws IndexOutOfBoundsException
[ https://issues.apache.org/jira/browse/SPARK-17685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Herman van Hovell resolved SPARK-17685.
---------------------------------------
       Resolution: Fixed
         Assignee: Yuming Wang
    Fix Version/s: 2.3.0
                   2.2.1
                   2.1.2

> WholeStageCodegenExec throws IndexOutOfBoundsException
> ------------------------------------------------------
>
>                 Key: SPARK-17685
>                 URL: https://issues.apache.org/jira/browse/SPARK-17685
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Yuming Wang
>            Assignee: Yuming Wang
>            Priority: Minor
>             Fix For: 2.1.2, 2.2.1, 2.3.0
>
> The following SQL query reproduces this issue:
> {code:sql}
> CREATE TABLE tab1(int int, int2 int, str string);
> CREATE TABLE tab2(int int, int2 int, str string);
> INSERT INTO tab1 values(1,1,'str');
> INSERT INTO tab1 values(2,2,'str');
> INSERT INTO tab2 values(1,1,'str');
> INSERT INTO tab2 values(2,3,'str');
> SELECT
>   count(*)
> FROM
> (
>   SELECT t1.int, t2.int2
>   FROM (SELECT * FROM tab1 LIMIT 1310721) t1
>   INNER JOIN (SELECT * FROM tab2 LIMIT 1310721) t2
>   ON (t1.int = t2.int AND t1.int2 = t2.int2)
> ) t;
> {code}
> Exception thrown:
> {noformat}
> java.lang.IndexOutOfBoundsException: 1
>   at scala.collection.LinearSeqOptimized$class.apply(LinearSeqOptimized.scala:65)
>   at scala.collection.immutable.List.apply(List.scala:84)
>   at org.apache.spark.sql.catalyst.expressions.BoundReference.doGenCode(BoundAttribute.scala:64)
>   at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:104)
>   at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:101)
>   at scala.Option.getOrElse(Option.scala:121)
>   at org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:101)
>   at org.apache.spark.sql.execution.joins.SortMergeJoinExec$$anonfun$createJoinKey$1.apply(SortMergeJoinExec.scala:334)
>   at org.apache.spark.sql.execution.joins.SortMergeJoinExec$$anonfun$createJoinKey$1.apply(SortMergeJoinExec.scala:334)
>   at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.immutable.List.map(List.scala:285)
>   at org.apache.spark.sql.execution.joins.SortMergeJoinExec.createJoinKey(SortMergeJoinExec.scala:334)
>   at org.apache.spark.sql.execution.joins.SortMergeJoinExec.genScanner(SortMergeJoinExec.scala:369)
>   at org.apache.spark.sql.execution.joins.SortMergeJoinExec.doProduce(SortMergeJoinExec.scala:512)
>   at org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:83)
>   at org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:78)
>   at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
>   at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
>   at org.apache.spark.sql.execution.CodegenSupport$class.produce(WholeStageCodegenExec.scala:78)
>   at org.apache.spark.sql.execution.joins.SortMergeJoinExec.produce(SortMergeJoinExec.scala:35)
>   at org.apache.spark.sql.execution.ProjectExec.doProduce(basicPhysicalOperators.scala:40)
>   at org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:83)
>   at org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:78)
>   at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
>   at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
>   at org.apache.spark.sql.execution.CodegenSupport$class.produce(WholeStageCodegenExec.scala:78)
>   at org.apache.spark.sql.execution.ProjectExec.produce(basicPhysicalOperators.scala:30)
>   at org.apache.spark.sql.execution.aggregate.HashAggregateExec.doProduceWithoutKeys(HashAggregateExec.scala:215)
>   at org.apache.spark.sql.execution.aggregate.HashAggregateExec.doProduce(HashAggregateExec.scala:143)
>   at org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:83)
[jira] [Commented] (SPARK-18528) limit + groupBy leads to java.lang.NullPointerException
[ https://issues.apache.org/jira/browse/SPARK-18528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16003882#comment-16003882 ]

Takeshi Yamamuro commented on SPARK-18528:
------------------------------------------

Have you tried Spark 2.1.1? This issue is fixed in 2.0.3, 2.1.1, and 2.2.0 (see Fix Versions).

> limit + groupBy leads to java.lang.NullPointerException
> -------------------------------------------------------
>
>                 Key: SPARK-18528
>                 URL: https://issues.apache.org/jira/browse/SPARK-18528
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, SQL
>    Affects Versions: 2.0.1
>        Environment: CentOS release 6.6, Linux 2.6.32-504.el6.x86_64
>            Reporter: Corey
>            Assignee: Takeshi Yamamuro
>             Fix For: 2.0.3, 2.1.1, 2.2.0
>
> Using limit on a DataFrame prior to groupBy will lead to a crash. Repartitioning will avoid the crash.
> *will crash:* {{df.limit(3).groupBy("user_id").count().show()}}
> *will work:* {{df.limit(3).coalesce(1).groupBy('user_id').count().show()}}
> *will work:* {{df.limit(3).repartition('user_id').groupBy('user_id').count().show()}}
> Here is a reproducible example along with the error message:
> {quote}
> >>> df = spark.createDataFrame([ (1, 1), (1, 3), (2, 1), (3, 2), (3, 3) ], ["user_id", "genre_id"])
> >>>
> >>> df.show()
> +-------+--------+
> |user_id|genre_id|
> +-------+--------+
> |      1|       1|
> |      1|       3|
> |      2|       1|
> |      3|       2|
> |      3|       3|
> +-------+--------+
> >>> df.groupBy("user_id").count().show()
> +-------+-----+
> |user_id|count|
> +-------+-----+
> |      1|    2|
> |      3|    2|
> |      2|    1|
> +-------+-----+
> >>> df.limit(3).groupBy("user_id").count().show()
> [Stage 8:===>(1964 + 24) / 2000]16/11/21 01:59:27 WARN TaskSetManager: Lost task 0.0 in stage 9.0 (TID 8204, lvsp20hdn012.stubprod.com): java.lang.NullPointerException
>   at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown Source)
>   at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
>   at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
>   at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
>   at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
>   at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
>   at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>   at org.apache.spark.scheduler.Task.run(Task.scala:86)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {quote}
[jira] [Commented] (SPARK-20683) Make table uncache chaining optional
[ https://issues.apache.org/jira/browse/SPARK-20683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16003864#comment-16003864 ] Shea Parkes commented on SPARK-20683: - For anyone that found this issue and just wants to revert to the old behavior in their own fork, the following change in {{sql/core/src/main/scala/org/apache/spark/sql/execution/CacheManager.scala}} worked for us: {code} - if (cd.plan.find(_.sameResult(plan)).isDefined) { + if (cd.plan.sameResult(plan)) { {code} > Make table uncache chaining optional > > > Key: SPARK-20683 > URL: https://issues.apache.org/jira/browse/SPARK-20683 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.1 > Environment: Not particularly environment sensitive. > Encountered/tested on Linux and Windows. >Reporter: Shea Parkes > > A recent change was made in SPARK-19765 that causes table uncaching to chain. > That is, if table B is a child of table A, and they are both cached, now > uncaching table A will automatically uncache table B. > At first I did not understand the need for this, but when reading the unit > tests, I see that it is likely that many people do not keep named references > to the child table (e.g. B). Perhaps B is just made and cached as some part > of data exploration. In that situation, it makes sense for B to > automatically be uncached when you are finished with A. > However, we commonly utilize a different design pattern that is now harmed by > this automatic uncaching. It is common for us to cache table A to then make > two, independent children tables (e.g. B and C). Once those two child tables > are realized and cached, we'd then uncache table A (as it was no longer > needed and could be quite large). After this change now, when we uncache > table A, we suddenly lose our cached status on both table B and C (which is > quite frustrating). All of these tables are often quite large, and we view > what we're doing as mindful memory management. 
We are maintaining named > references to B and C at all times, so we can always uncache them ourselves > when it makes sense. > Would it be acceptable/feasible to make this table uncache chaining optional? > I would be fine if the default is for the chaining to happen, as long as we > can turn it off via parameters. > If acceptable, I can try to work towards making the required changes. I am > most comfortable in Python (and would want the optional parameter surfaced in > Python), but have found the places required to make this change in Scala > (since I reverted the functionality in a private fork already). Any help > would be greatly appreciated however. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
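Editorial note: the one-line revert quoted above changes a subtree search ({{cd.plan.find(...)}}) into a root-only comparison ({{cd.plan.sameResult(plan)}}). The difference can be illustrated with a toy plan tree in plain Python (classes invented for illustration; Catalyst's actual TreeNode API is Scala):

```python
class Plan:
    def __init__(self, name, *children):
        self.name, self.children = name, children

    def same_result(self, other):
        # stand-in for Catalyst's sameResult: compare by name here
        return self.name == other.name

    def find(self, predicate):
        # subtree search, analogous to TreeNode.find
        if predicate(self):
            return self
        for child in self.children:
            hit = child.find(predicate)
            if hit is not None:
                return hit
        return None

a = Plan("A")
b_over_a = Plan("B", a)  # cached plan for table B, built on top of A

# Chaining behavior (current): uncaching A also evicts B, because A
# appears somewhere inside B's plan tree.
chained = b_over_a.find(lambda p: p.same_result(a)) is not None

# Reverted behavior: only a cache entry whose *root* matches A is evicted.
root_only = b_over_a.same_result(a)

print(chained, root_only)  # True False
```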
[jira] [Updated] (SPARK-20665) Spark-sql, "Bround" function return NULL
[ https://issues.apache.org/jira/browse/SPARK-20665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liuxian updated SPARK-20665: Description: >select bround(12.3, 2); >NULL For this case, the expected result is 12.3, but it is null "Round" has the same problem: >select round(12.3, 2); >NULL was: >select bround(12.3, 2); >NULL For this case, the expected result is 12.3, but it is null > Spark-sql, "Bround" function return NULL > > > Key: SPARK-20665 > URL: https://issues.apache.org/jira/browse/SPARK-20665 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: liuxian > > >select bround(12.3, 2); > >NULL > For this case, the expected result is 12.3, but it is null > "Round" has the same problem: > >select round(12.3, 2); > >NULL -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
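Editorial note: the expected behavior is ordinary decimal rounding, so bround(12.3, 2) should return 12.3 unchanged, and bround should apply half-even ("banker's") rounding only when a digit is actually dropped. Python's decimal module shows the expected values (a sketch of the semantics, not Spark's implementation):

```python
from decimal import Decimal, ROUND_HALF_EVEN

def bround(value: str, scale: int) -> Decimal:
    # Half-even rounding at the requested scale, matching the documented
    # behavior of Spark SQL's bround.
    exp = Decimal(1).scaleb(-scale)
    return Decimal(value).quantize(exp, rounding=ROUND_HALF_EVEN)

print(bround("12.3", 2))  # 12.30 -- not NULL
print(bround("2.5", 0))   # 2  (tie rounds to the even neighbor)
print(bround("3.5", 0))   # 4
```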
[jira] [Assigned] (SPARK-20686) PropagateEmptyRelation incorrectly handles aggregate without grouping expressions
[ https://issues.apache.org/jira/browse/SPARK-20686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20686: Assignee: Josh Rosen (was: Apache Spark) > PropagateEmptyRelation incorrectly handles aggregate without grouping > expressions > - > > Key: SPARK-20686 > URL: https://issues.apache.org/jira/browse/SPARK-20686 > Project: Spark > Issue Type: Bug > Components: Optimizer, SQL >Affects Versions: 2.1.0 >Reporter: Josh Rosen >Assignee: Josh Rosen > Labels: correctness > > The query > {code} > SELECT 1 FROM (SELECT COUNT(*) WHERE FALSE) t1 > {code} > should return a single row of output because the subquery is an aggregate > without a group-by and thus should return a single row. However, Spark > incorrectly returns zero rows. > This is caused by SPARK-16208, a patch which added an optimizer rule to > propagate EmptyRelation through operators. The logic for handling aggregates > is wrong: it checks whether aggregate expressions are non-empty for deciding > whether the output should be empty, whereas it should be checking grouping > expressions instead: > An aggregate with non-empty group expression will return one output row per > group. If the input to the grouped aggregate is empty then all groups will be > empty and thus the output will be empty. It doesn't matter whether the SELECT > statement includes aggregate expressions since that won't affect the number > of output rows. > If the grouping expressions are empty, however, then the aggregate will > always produce a single output row and thus we cannot propagate the > EmptyRelation. > The current implementation is incorrect (since it returns a wrong answer) and > also misses an optimization opportunity by not propagating EmptyRelation in > the case where a grouped aggregate has aggregate expressions. 
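Editorial note: the intended semantics can be checked outside Spark. An ungrouped aggregate over empty input still yields exactly one row, while a grouped aggregate over empty input yields zero rows. A plain-Python sketch of both cases:

```python
from collections import Counter

empty_input = []

# Ungrouped aggregate: COUNT(*) over empty input -> exactly one output row.
ungrouped_result = [len(empty_input)]  # one row containing 0
# So SELECT 1 FROM (SELECT COUNT(*) WHERE FALSE) t1 must see one inner row.
outer_query = [1 for _ in ungrouped_result]

# Grouped aggregate: empty input means no groups, hence zero output rows,
# so EmptyRelation *can* be propagated in this case.
grouped_result = Counter(key for key, _ in empty_input)

print(outer_query, len(grouped_result))  # [1] 0
```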
[jira] [Assigned] (SPARK-20686) PropagateEmptyRelation incorrectly handles aggregate without grouping expressions
[ https://issues.apache.org/jira/browse/SPARK-20686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20686: Assignee: Apache Spark (was: Josh Rosen) > PropagateEmptyRelation incorrectly handles aggregate without grouping > expressions > - > > Key: SPARK-20686 > URL: https://issues.apache.org/jira/browse/SPARK-20686 > Project: Spark > Issue Type: Bug > Components: Optimizer, SQL >Affects Versions: 2.1.0 >Reporter: Josh Rosen >Assignee: Apache Spark > Labels: correctness > > The query > {code} > SELECT 1 FROM (SELECT COUNT(*) WHERE FALSE) t1 > {code} > should return a single row of output because the subquery is an aggregate > without a group-by and thus should return a single row. However, Spark > incorrectly returns zero rows. > This is caused by SPARK-16208, a patch which added an optimizer rule to > propagate EmptyRelation through operators. The logic for handling aggregates > is wrong: it checks whether aggregate expressions are non-empty for deciding > whether the output should be empty, whereas it should be checking grouping > expressions instead: > An aggregate with non-empty group expression will return one output row per > group. If the input to the grouped aggregate is empty then all groups will be > empty and thus the output will be empty. It doesn't matter whether the SELECT > statement includes aggregate expressions since that won't affect the number > of output rows. > If the grouping expressions are empty, however, then the aggregate will > always produce a single output row and thus we cannot propagate the > EmptyRelation. > The current implementation is incorrect (since it returns a wrong answer) and > also misses an optimization opportunity by not propagating EmptyRelation in > the case where a grouped aggregate has aggregate expressions. 
[jira] [Commented] (SPARK-20686) PropagateEmptyRelation incorrectly handles aggregate without grouping expressions
[ https://issues.apache.org/jira/browse/SPARK-20686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16003848#comment-16003848 ] Apache Spark commented on SPARK-20686: -- User 'JoshRosen' has created a pull request for this issue: https://github.com/apache/spark/pull/17929 > PropagateEmptyRelation incorrectly handles aggregate without grouping > expressions > - > > Key: SPARK-20686 > URL: https://issues.apache.org/jira/browse/SPARK-20686 > Project: Spark > Issue Type: Bug > Components: Optimizer, SQL >Affects Versions: 2.1.0 >Reporter: Josh Rosen >Assignee: Josh Rosen > Labels: correctness > > The query > {code} > SELECT 1 FROM (SELECT COUNT(*) WHERE FALSE) t1 > {code} > should return a single row of output because the subquery is an aggregate > without a group-by and thus should return a single row. However, Spark > incorrectly returns zero rows. > This is caused by SPARK-16208, a patch which added an optimizer rule to > propagate EmptyRelation through operators. The logic for handling aggregates > is wrong: it checks whether aggregate expressions are non-empty for deciding > whether the output should be empty, whereas it should be checking grouping > expressions instead: > An aggregate with non-empty group expression will return one output row per > group. If the input to the grouped aggregate is empty then all groups will be > empty and thus the output will be empty. It doesn't matter whether the SELECT > statement includes aggregate expressions since that won't affect the number > of output rows. > If the grouping expressions are empty, however, then the aggregate will > always produce a single output row and thus we cannot propagate the > EmptyRelation. > The current implementation is incorrect (since it returns a wrong answer) and > also misses an optimization opportunity by not propagating EmptyRelation in > the case where a grouped aggregate has aggregate expressions. 
[jira] [Created] (SPARK-20686) PropagateEmptyRelation incorrectly handles aggregate without grouping expressions
Josh Rosen created SPARK-20686: -- Summary: PropagateEmptyRelation incorrectly handles aggregate without grouping expressions Key: SPARK-20686 URL: https://issues.apache.org/jira/browse/SPARK-20686 Project: Spark Issue Type: Bug Components: Optimizer, SQL Affects Versions: 2.1.0 Reporter: Josh Rosen Assignee: Josh Rosen The query {code} SELECT 1 FROM (SELECT COUNT(*) WHERE FALSE) t1 {code} should return a single row of output because the subquery is an aggregate without a group-by and thus should return a single row. However, Spark incorrectly returns zero rows. This is caused by SPARK-16208, a patch which added an optimizer rule to propagate EmptyRelation through operators. The logic for handling aggregates is wrong: it checks whether aggregate expressions are non-empty for deciding whether the output should be empty, whereas it should be checking grouping expressions instead: An aggregate with non-empty group expression will return one output row per group. If the input to the grouped aggregate is empty then all groups will be empty and thus the output will be empty. It doesn't matter whether the SELECT statement includes aggregate expressions since that won't affect the number of output rows. If the grouping expressions are empty, however, then the aggregate will always produce a single output row and thus we cannot propagate the EmptyRelation. The current implementation is incorrect (since it returns a wrong answer) and also misses an optimization opportunity by not propagating EmptyRelation in the case where a grouped aggregate has aggregate expressions. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20311) SQL "range(N) as alias" or "range(N) alias" doesn't work
[ https://issues.apache.org/jira/browse/SPARK-20311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16003841#comment-16003841 ] Apache Spark commented on SPARK-20311: -- User 'maropu' has created a pull request for this issue: https://github.com/apache/spark/pull/17928 > SQL "range(N) as alias" or "range(N) alias" doesn't work > > > Key: SPARK-20311 > URL: https://issues.apache.org/jira/browse/SPARK-20311 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Juliusz Sompolski >Assignee: Takeshi Yamamuro >Priority: Minor > > `select * from range(10) as A;` or `select * from range(10) A;` > does not work. > As a workaround, a subquery has to be used: > `select * from (select * from range(10)) as A;` -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20500) ML, Graph 2.2 QA: API: Binary incompatible changes
[ https://issues.apache.org/jira/browse/SPARK-20500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-20500. --- Resolution: Done Fix Version/s: 2.2.0 > ML, Graph 2.2 QA: API: Binary incompatible changes > -- > > Key: SPARK-20500 > URL: https://issues.apache.org/jira/browse/SPARK-20500 > Project: Spark > Issue Type: Sub-task > Components: Documentation, GraphX, ML, MLlib >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley >Priority: Blocker > Fix For: 2.2.0 > > > Generate a list of binary incompatible changes using MiMa and create new > JIRAs for issues found. Filter out false positives as needed. > If you want to take this task, look at the analogous task from the previous > release QA, and ping the Assignee for advice. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20500) ML, Graph 2.2 QA: API: Binary incompatible changes
[ https://issues.apache.org/jira/browse/SPARK-20500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16003813#comment-16003813 ] Joseph K. Bradley commented on SPARK-20500: --- I checked the full MiMa output for MLlib and GraphX, as well as the new MimaExcludes items. All looked fine. Closing. > ML, Graph 2.2 QA: API: Binary incompatible changes > -- > > Key: SPARK-20500 > URL: https://issues.apache.org/jira/browse/SPARK-20500 > Project: Spark > Issue Type: Sub-task > Components: Documentation, GraphX, ML, MLlib >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley >Priority: Blocker > Fix For: 2.2.0 > > > Generate a list of binary incompatible changes using MiMa and create new > JIRAs for issues found. Filter out false positives as needed. > If you want to take this task, look at the analogous task from the previous > release QA, and ping the Assignee for advice. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20376) Make StateStoreProvider plugable
[ https://issues.apache.org/jira/browse/SPARK-20376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16003729#comment-16003729 ] Michael Armbrust commented on SPARK-20376: -- /cc [~tdas] > Make StateStoreProvider plugable > > > Key: SPARK-20376 > URL: https://issues.apache.org/jira/browse/SPARK-20376 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.1.0 >Reporter: Yogesh Mahajan > > Currently only option available for StateStore and StateStoreProvider is > HDFSBackedStateStoreProvider. There is no way to provide your own custom > implementation of StateStoreProvider for users of structured streaming. We > have built a performant in memory StateStore which we would like to plugin > for our use cases. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
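Editorial note: a pluggable provider is typically a small factory interface resolved from configuration by class name. A hypothetical Python sketch of the shape being requested (all names invented for illustration; Spark's actual StateStore interfaces are Scala):

```python
from abc import ABC, abstractmethod

class StateStoreProvider(ABC):
    """Hypothetical plug-in point: streaming would resolve an implementation
    from a config value instead of hard-coding the HDFS-backed provider."""

    @abstractmethod
    def get_store(self, version: int) -> dict:
        ...

class InMemoryStateStoreProvider(StateStoreProvider):
    # Illustrative in-memory implementation: one dict per state version.
    def __init__(self):
        self._versions = {}

    def get_store(self, version: int) -> dict:
        return self._versions.setdefault(version, {})

def load_provider(class_name: str) -> StateStoreProvider:
    # Resolve a provider by name, mimicking a providerClass-style setting.
    registry = {"InMemoryStateStoreProvider": InMemoryStateStoreProvider}
    return registry[class_name]()

provider = load_provider("InMemoryStateStoreProvider")
provider.get_store(version=1)["key"] = 42
print(provider.get_store(version=1)["key"])  # 42
```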
[jira] [Commented] (SPARK-20685) BatchPythonEvaluation UDF evaluator fails for case of single UDF with repeated argument
[ https://issues.apache.org/jira/browse/SPARK-20685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16003662#comment-16003662 ] Apache Spark commented on SPARK-20685: -- User 'JoshRosen' has created a pull request for this issue: https://github.com/apache/spark/pull/17927 > BatchPythonEvaluation UDF evaluator fails for case of single UDF with > repeated argument > --- > > Key: SPARK-20685 > URL: https://issues.apache.org/jira/browse/SPARK-20685 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.0.0 >Reporter: Josh Rosen >Assignee: Josh Rosen > > There's a latent corner-case bug in PYSpark UDF evaluation where executing a > stage with a single UDF that takes more than one argument _where that > argument is repeated_ will crash at execution with a confusing error. > Here's a repro: > {code} > from pyspark.sql.types import * > spark.catalog.registerFunction("add", lambda x, y: x + y, IntegerType()) > spark.sql("SELECT add(1, 1)").first() > {code} > This fails with > {code} > Caused by: org.apache.spark.api.python.PythonException: Traceback (most > recent call last): > File > "/Users/joshrosen/Documents/spark/python/lib/pyspark.zip/pyspark/worker.py", > line 180, in main > process() > File > "/Users/joshrosen/Documents/spark/python/lib/pyspark.zip/pyspark/worker.py", > line 175, in process > serializer.dump_stream(func(split_index, iterator), outfile) > File > "/Users/joshrosen/Documents/spark/python/lib/pyspark.zip/pyspark/worker.py", > line 107, in > func = lambda _, it: map(mapper, it) > File > "/Users/joshrosen/Documents/spark/python/lib/pyspark.zip/pyspark/worker.py", > line 93, in > mapper = lambda a: udf(*a) > File > "/Users/joshrosen/Documents/spark/python/lib/pyspark.zip/pyspark/worker.py", > line 71, in > return lambda *a: f(*a) > TypeError: () takes exactly 2 arguments (1 given) > {code} > The problem was introduced by SPARK-14267: there code there has a fast path > for handling a "batch UDF evaluation 
consisting of a single Python UDF, but > that branch incorrectly assumes that a single UDF won't have repeated > arguments and therefore skips the code for unpacking arguments from the input > row (whose schema may not necessarily match the UDF inputs). > I have a simple fix for this which I'll submit now. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20685) BatchPythonEvaluation UDF evaluator fails for case of single UDF with repeated argument
[ https://issues.apache.org/jira/browse/SPARK-20685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen reassigned SPARK-20685: -- Assignee: Josh Rosen > BatchPythonEvaluation UDF evaluator fails for case of single UDF with > repeated argument > --- > > Key: SPARK-20685 > URL: https://issues.apache.org/jira/browse/SPARK-20685 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.0.0 >Reporter: Josh Rosen >Assignee: Josh Rosen > > There's a latent corner-case bug in PYSpark UDF evaluation where executing a > stage with a single UDF that takes more than one argument _where that > argument is repeated_ will crash at execution with a confusing error. > Here's a repro: > {code} > from pyspark.sql.types import * > spark.catalog.registerFunction("add", lambda x, y: x + y, IntegerType()) > spark.sql("SELECT add(1, 1)").first() > {code} > This fails with > {code} > Caused by: org.apache.spark.api.python.PythonException: Traceback (most > recent call last): > File > "/Users/joshrosen/Documents/spark/python/lib/pyspark.zip/pyspark/worker.py", > line 180, in main > process() > File > "/Users/joshrosen/Documents/spark/python/lib/pyspark.zip/pyspark/worker.py", > line 175, in process > serializer.dump_stream(func(split_index, iterator), outfile) > File > "/Users/joshrosen/Documents/spark/python/lib/pyspark.zip/pyspark/worker.py", > line 107, in > func = lambda _, it: map(mapper, it) > File > "/Users/joshrosen/Documents/spark/python/lib/pyspark.zip/pyspark/worker.py", > line 93, in > mapper = lambda a: udf(*a) > File > "/Users/joshrosen/Documents/spark/python/lib/pyspark.zip/pyspark/worker.py", > line 71, in > return lambda *a: f(*a) > TypeError: () takes exactly 2 arguments (1 given) > {code} > The problem was introduced by SPARK-14267: there code there has a fast path > for handling a "batch UDF evaluation consisting of a single Python UDF, but > that branch incorrectly assumes that a single UDF won't have repeated > arguments 
and therefore skips the code for unpacking arguments from the input > row (whose schema may not necessarily match the UDF inputs). > I have a simple fix for this which I'll submit now. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20685) BatchPythonEvaluation UDF evaluator fails for case of single UDF with repeated argument
Josh Rosen created SPARK-20685: -- Summary: BatchPythonEvaluation UDF evaluator fails for case of single UDF with repeated argument Key: SPARK-20685 URL: https://issues.apache.org/jira/browse/SPARK-20685 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 2.0.0 Reporter: Josh Rosen There's a latent corner-case bug in PySpark UDF evaluation where executing a stage with a single UDF that takes more than one argument _where that argument is repeated_ will crash at execution with a confusing error. Here's a repro: {code} from pyspark.sql.types import * spark.catalog.registerFunction("add", lambda x, y: x + y, IntegerType()) spark.sql("SELECT add(1, 1)").first() {code} This fails with {code} Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last): File "/Users/joshrosen/Documents/spark/python/lib/pyspark.zip/pyspark/worker.py", line 180, in main process() File "/Users/joshrosen/Documents/spark/python/lib/pyspark.zip/pyspark/worker.py", line 175, in process serializer.dump_stream(func(split_index, iterator), outfile) File "/Users/joshrosen/Documents/spark/python/lib/pyspark.zip/pyspark/worker.py", line 107, in <lambda> func = lambda _, it: map(mapper, it) File "/Users/joshrosen/Documents/spark/python/lib/pyspark.zip/pyspark/worker.py", line 93, in <lambda> mapper = lambda a: udf(*a) File "/Users/joshrosen/Documents/spark/python/lib/pyspark.zip/pyspark/worker.py", line 71, in <lambda> return lambda *a: f(*a) TypeError: <lambda>() takes exactly 2 arguments (1 given) {code} The problem was introduced by SPARK-14267: the code there has a fast path for handling a batch UDF evaluation consisting of a single Python UDF, but that branch incorrectly assumes that a single UDF won't have repeated arguments and therefore skips the code for unpacking arguments from the input row (whose schema may not necessarily match the UDF inputs). I have a simple fix for this which I'll submit now. 
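Editorial note: the failure reduces to an argument-count mismatch. The deduplicated input row carries the repeated column only once, so splatting the row straight into the UDF supplies too few arguments. A plain-Python illustration (the offset list is illustrative of the kind of mapping needed, not Spark's internal layout):

```python
def add(x, y):
    return x + y

# For add(col, col), the projected input row holds the column only once.
row = (1,)

# Buggy fast path: assume one row field per UDF argument and splat the row.
try:
    add(*row)
    error = ""
except TypeError as err:
    error = str(err)  # too few arguments, analogous to the worker crash

# Correct path: map each UDF argument to its column offset, so a repeated
# argument reuses the same row field.
arg_offsets = [0, 0]
result = add(*[row[i] for i in arg_offsets])
print(result)  # 2
```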
[jira] [Assigned] (SPARK-20373) Batch queries with 'Dataset/DataFrame.withWatermark()` does not execute
[ https://issues.apache.org/jira/browse/SPARK-20373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu reassigned SPARK-20373: Assignee: Genmao Yu > Batch queries with 'Dataset/DataFrame.withWatermark()` does not execute > --- > > Key: SPARK-20373 > URL: https://issues.apache.org/jira/browse/SPARK-20373 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.1.0, 2.2.0 >Reporter: Tathagata Das >Assignee: Genmao Yu >Priority: Minor > Fix For: 2.2.1, 2.3.0 > > > Any Dataset/DataFrame batch query with the operation `withWatermark` does not > execute because the batch planner does not have any rule to explicitly handle > the EventTimeWatermark logical plan. The right solution is to simply remove > the plan node, as the watermark should not affect any batch query in any way. > {code} > from pyspark.sql.functions import * > eventsDF = spark.createDataFrame([("2016-03-11 09:00:07", "dev1", > 123)]).toDF("eventTime", "deviceId", > "signal").select(col("eventTime").cast("timestamp").alias("eventTime"), > "deviceId", "signal") > windowedCountsDF = \ > eventsDF \ > .withWatermark("eventTime", "10 minutes") \ > .groupBy( > "deviceId", > window("eventTime", "5 minutes")) \ > .count() > windowedCountsDF.collect() > {code} > This throws as an error > {code} > java.lang.AssertionError: assertion failed: No plan for EventTimeWatermark > eventTime#3762657: timestamp, interval 10 minutes > +- Project [cast(_1#3762643 as timestamp) AS eventTime#3762657, _2#3762644 AS > deviceId#3762651] >+- LogicalRDD [_1#3762643, _2#3762644, _3#3762645L] > at scala.Predef$.assert(Predef.scala:170) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:92) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:77) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:74) > at > 
scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157) > at > scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > at > scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:157) > at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1336) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:74) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:66) > at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:92) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:77) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:74) > at > scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157) > at > scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > at > scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:157) > at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1336) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:74) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:66) > at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) > at > 
org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:92) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:77) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:74) > at > scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157) > at > scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at
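Editorial note: the fix described above ("simply remove the plan node") is, stripped to its essence, a recursive tree rewrite that replaces an EventTimeWatermark node with its child. A toy sketch in plain Python (node class invented for illustration; the real rule operates on Catalyst logical plans):

```python
class Node:
    def __init__(self, name, *children):
        self.name, self.children = name, list(children)

def strip_watermark(plan: Node) -> Node:
    # A watermark has no effect on a batch query, so replace the
    # EventTimeWatermark node with its single child; recurse elsewhere.
    if plan.name == "EventTimeWatermark":
        return strip_watermark(plan.children[0])
    plan.children = [strip_watermark(c) for c in plan.children]
    return plan

plan = Node("Aggregate",
            Node("EventTimeWatermark",
                 Node("Project", Node("LogicalRDD"))))
stripped = strip_watermark(plan)
print(stripped.name, stripped.children[0].name)  # Aggregate Project
```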
[jira] [Updated] (SPARK-20373) Batch queries with 'Dataset/DataFrame.withWatermark()` does not execute
[ https://issues.apache.org/jira/browse/SPARK-20373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-20373: - Fix Version/s: 2.3.0 2.2.1 > Batch queries with 'Dataset/DataFrame.withWatermark()` does not execute > --- > > Key: SPARK-20373 > URL: https://issues.apache.org/jira/browse/SPARK-20373 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.1.0, 2.2.0 >Reporter: Tathagata Das >Priority: Minor > Fix For: 2.2.1, 2.3.0 > > > Any Dataset/DataFrame batch query with the operation `withWatermark` does not > execute because the batch planner does not have any rule to explicitly handle > the EventTimeWatermark logical plan. The right solution is to simply remove > the plan node, as the watermark should not affect any batch query in any way. > {code} > from pyspark.sql.functions import * > eventsDF = spark.createDataFrame([("2016-03-11 09:00:07", "dev1", > 123)]).toDF("eventTime", "deviceId", > "signal").select(col("eventTime").cast("timestamp").alias("eventTime"), > "deviceId", "signal") > windowedCountsDF = \ > eventsDF \ > .withWatermark("eventTime", "10 minutes") \ > .groupBy( > "deviceId", > window("eventTime", "5 minutes")) \ > .count() > windowedCountsDF.collect() > {code} > This throws as an error > {code} > java.lang.AssertionError: assertion failed: No plan for EventTimeWatermark > eventTime#3762657: timestamp, interval 10 minutes > +- Project [cast(_1#3762643 as timestamp) AS eventTime#3762657, _2#3762644 AS > deviceId#3762651] >+- LogicalRDD [_1#3762643, _2#3762644, _3#3762645L] > at scala.Predef$.assert(Predef.scala:170) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:92) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:77) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:74) > at > 
scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157) > at > scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > at > scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:157) > at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1336) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:74) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:66) > at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:92) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:77) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:74) > at > scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157) > at > scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > at > scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:157) > at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1336) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:74) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:66) > at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) > at > 
org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:92) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:77) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:74) > at > scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157) > at > scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at
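The fix proposed above for SPARK-20373, removing the EventTimeWatermark node before batch planning, can be sketched in plain Python. The `Node` class and `strip_watermark` helper below are hypothetical toy constructs, not Spark's actual Catalyst types:

```python
# Toy sketch of the proposed fix (plain Python, hypothetical Node class; not
# Spark's actual Catalyst types): before planning a batch query, splice any
# EventTimeWatermark node out of the logical plan, since a watermark should
# not affect a batch query in any way.

class Node:
    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)

def strip_watermark(plan):
    """Return a copy of the plan with EventTimeWatermark nodes removed,
    each one replaced by its single child."""
    children = [strip_watermark(c) for c in plan.children]
    if plan.name == "EventTimeWatermark":
        return children[0]
    return Node(plan.name, children)

# Mirrors the failing plan in the assertion error above:
# Aggregate <- EventTimeWatermark <- Project <- LogicalRDD
plan = Node("Aggregate",
            [Node("EventTimeWatermark",
                  [Node("Project", [Node("LogicalRDD")])])])

fixed = strip_watermark(plan)

def flatten(n):
    return [n.name] + [name for c in n.children for name in flatten(c)]

print(flatten(fixed))  # ['Aggregate', 'Project', 'LogicalRDD']
```

After the node is removed the batch planner only sees operators it has rules for, so the `No plan for EventTimeWatermark` assertion can no longer trigger.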
[jira] [Assigned] (SPARK-20311) SQL "range(N) as alias" or "range(N) alias" doesn't work
[ https://issues.apache.org/jira/browse/SPARK-20311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20311: Assignee: Apache Spark (was: Takeshi Yamamuro) > SQL "range(N) as alias" or "range(N) alias" doesn't work > > > Key: SPARK-20311 > URL: https://issues.apache.org/jira/browse/SPARK-20311 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Juliusz Sompolski >Assignee: Apache Spark >Priority: Minor > > `select * from range(10) as A;` or `select * from range(10) A;` > does not work. > As a workaround, a subquery has to be used: > `select * from (select * from range(10)) as A;` -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20311) SQL "range(N) as alias" or "range(N) alias" doesn't work
[ https://issues.apache.org/jira/browse/SPARK-20311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20311: Assignee: Takeshi Yamamuro (was: Apache Spark) > SQL "range(N) as alias" or "range(N) alias" doesn't work > > > Key: SPARK-20311 > URL: https://issues.apache.org/jira/browse/SPARK-20311 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Juliusz Sompolski >Assignee: Takeshi Yamamuro >Priority: Minor > > `select * from range(10) as A;` or `select * from range(10) A;` > does not work. > As a workaround, a subquery has to be used: > `select * from (select * from range(10)) as A;` -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-20311) SQL "range(N) as alias" or "range(N) alias" doesn't work
[ https://issues.apache.org/jira/browse/SPARK-20311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai reopened SPARK-20311: -- > SQL "range(N) as alias" or "range(N) alias" doesn't work > > > Key: SPARK-20311 > URL: https://issues.apache.org/jira/browse/SPARK-20311 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Juliusz Sompolski >Assignee: Takeshi Yamamuro >Priority: Minor > > `select * from range(10) as A;` or `select * from range(10) A;` > does not work. > As a workaround, a subquery has to be used: > `select * from (select * from range(10)) as A;` -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20311) SQL "range(N) as alias" or "range(N) alias" doesn't work
[ https://issues.apache.org/jira/browse/SPARK-20311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16003626#comment-16003626 ] Yin Huai commented on SPARK-20311: -- It introduced a regression (https://github.com/apache/spark/pull/17666#issuecomment-300309896). I have reverted the change. > SQL "range(N) as alias" or "range(N) alias" doesn't work > > > Key: SPARK-20311 > URL: https://issues.apache.org/jira/browse/SPARK-20311 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Juliusz Sompolski >Assignee: Takeshi Yamamuro >Priority: Minor > > `select * from range(10) as A;` or `select * from range(10) A;` > does not work. > As a workaround, a subquery has to be used: > `select * from (select * from range(10)) as A;` -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20311) SQL "range(N) as alias" or "range(N) alias" doesn't work
[ https://issues.apache.org/jira/browse/SPARK-20311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-20311: - Fix Version/s: (was: 2.2.1) (was: 2.3.0) > SQL "range(N) as alias" or "range(N) alias" doesn't work > > > Key: SPARK-20311 > URL: https://issues.apache.org/jira/browse/SPARK-20311 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Juliusz Sompolski >Assignee: Takeshi Yamamuro >Priority: Minor > > `select * from range(10) as A;` or `select * from range(10) A;` > does not work. > As a workaround, a subquery has to be used: > `select * from (select * from range(10)) as A;` -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20684) expose createGlobalTempView in SparkR
Hossein Falaki created SPARK-20684: -- Summary: expose createGlobalTempView in SparkR Key: SPARK-20684 URL: https://issues.apache.org/jira/browse/SPARK-20684 Project: Spark Issue Type: Bug Components: SparkR Affects Versions: 2.2.0 Reporter: Hossein Falaki This is a useful API that is not exposed in SparkR. It will help with moving data between languages on a single Spark application. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20205) DAGScheduler posts SparkListenerStageSubmitted before updating stage
[ https://issues.apache.org/jira/browse/SPARK-20205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20205: Assignee: Apache Spark > DAGScheduler posts SparkListenerStageSubmitted before updating stage > > > Key: SPARK-20205 > URL: https://issues.apache.org/jira/browse/SPARK-20205 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Marcelo Vanzin >Assignee: Apache Spark > > Probably affects other versions, haven't checked. > The code that submits the event to the bus is around line 991: > {code} > stage.makeNewStageAttempt(partitionsToCompute.size, > taskIdToLocations.values.toSeq) > listenerBus.post(SparkListenerStageSubmitted(stage.latestInfo, > properties)) > {code} > Later in the same method, the stage information is updated (around line 1057): > {code} > if (tasks.size > 0) { > logInfo(s"Submitting ${tasks.size} missing tasks from $stage > (${stage.rdd}) (first 15 " + > s"tasks are for partitions ${tasks.take(15).map(_.partitionId)})") > taskScheduler.submitTasks(new TaskSet( > tasks.toArray, stage.id, stage.latestInfo.attemptId, jobId, > properties)) > stage.latestInfo.submissionTime = Some(clock.getTimeMillis()) > {code} > That means an event handler might get a stage submitted event with an unset > submission time. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20205) DAGScheduler posts SparkListenerStageSubmitted before updating stage
[ https://issues.apache.org/jira/browse/SPARK-20205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20205: Assignee: (was: Apache Spark) > DAGScheduler posts SparkListenerStageSubmitted before updating stage > > > Key: SPARK-20205 > URL: https://issues.apache.org/jira/browse/SPARK-20205 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Marcelo Vanzin > > Probably affects other versions, haven't checked. > The code that submits the event to the bus is around line 991: > {code} > stage.makeNewStageAttempt(partitionsToCompute.size, > taskIdToLocations.values.toSeq) > listenerBus.post(SparkListenerStageSubmitted(stage.latestInfo, > properties)) > {code} > Later in the same method, the stage information is updated (around line 1057): > {code} > if (tasks.size > 0) { > logInfo(s"Submitting ${tasks.size} missing tasks from $stage > (${stage.rdd}) (first 15 " + > s"tasks are for partitions ${tasks.take(15).map(_.partitionId)})") > taskScheduler.submitTasks(new TaskSet( > tasks.toArray, stage.id, stage.latestInfo.attemptId, jobId, > properties)) > stage.latestInfo.submissionTime = Some(clock.getTimeMillis()) > {code} > That means an event handler might get a stage submitted event with an unset > submission time. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20205) DAGScheduler posts SparkListenerStageSubmitted before updating stage
[ https://issues.apache.org/jira/browse/SPARK-20205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16003471#comment-16003471 ] Apache Spark commented on SPARK-20205: -- User 'vanzin' has created a pull request for this issue: https://github.com/apache/spark/pull/17925 > DAGScheduler posts SparkListenerStageSubmitted before updating stage > > > Key: SPARK-20205 > URL: https://issues.apache.org/jira/browse/SPARK-20205 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Marcelo Vanzin > > Probably affects other versions, haven't checked. > The code that submits the event to the bus is around line 991: > {code} > stage.makeNewStageAttempt(partitionsToCompute.size, > taskIdToLocations.values.toSeq) > listenerBus.post(SparkListenerStageSubmitted(stage.latestInfo, > properties)) > {code} > Later in the same method, the stage information is updated (around line 1057): > {code} > if (tasks.size > 0) { > logInfo(s"Submitting ${tasks.size} missing tasks from $stage > (${stage.rdd}) (first 15 " + > s"tasks are for partitions ${tasks.take(15).map(_.partitionId)})") > taskScheduler.submitTasks(new TaskSet( > tasks.toArray, stage.id, stage.latestInfo.attemptId, jobId, > properties)) > stage.latestInfo.submissionTime = Some(clock.getTimeMillis()) > {code} > That means an event handler might get a stage submitted event with an unset > submission time. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
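The event-ordering race in SPARK-20205 can be reduced to a plain-Python toy sketch. The class and function names below are hypothetical, not the real DAGScheduler API; the sketch only shows why posting the event before setting the submission time leaks an unset value to listeners:

```python
# Plain-Python sketch (hypothetical names; not Spark's real DAGScheduler) of
# the ordering bug: SparkListenerStageSubmitted is posted before
# submissionTime is set, so a listener can observe an unset submission time.

class StageInfo:
    def __init__(self):
        self.submission_time = None  # unset until the stage is submitted

events = []

def post(stage_info):
    # A listener snapshots whatever it sees at post time.
    events.append(stage_info.submission_time)

def submit_stage_buggy(stage_info, now):
    post(stage_info)                  # event posted first (around "line 991")
    stage_info.submission_time = now  # ...time set later (around "line 1057")

def submit_stage_fixed(stage_info, now):
    stage_info.submission_time = now  # set the time BEFORE posting
    post(stage_info)

submit_stage_buggy(StageInfo(), 100)
submit_stage_fixed(StageInfo(), 200)
print(events)  # [None, 200]: the buggy path leaks an unset submission time
```

Reordering the two statements, as the fix does, guarantees every listener sees a populated submission time.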
[jira] [Commented] (SPARK-20683) Make table uncache chaining optional
[ https://issues.apache.org/jira/browse/SPARK-20683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16003464#comment-16003464 ] Robert Kruszewski commented on SPARK-20683: --- +1, have the similar use case where we do cache management ourselves and don't want to invalidate dependent caches. > Make table uncache chaining optional > > > Key: SPARK-20683 > URL: https://issues.apache.org/jira/browse/SPARK-20683 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.1 > Environment: Not particularly environment sensitive. > Encountered/tested on Linux and Windows. >Reporter: Shea Parkes > > A recent change was made in SPARK-19765 that causes table uncaching to chain. > That is, if table B is a child of table A, and they are both cached, now > uncaching table A will automatically uncache table B. > At first I did not understand the need for this, but when reading the unit > tests, I see that it is likely that many people do not keep named references > to the child table (e.g. B). Perhaps B is just made and cached as some part > of data exploration. In that situation, it makes sense for B to > automatically be uncached when you are finished with A. > However, we commonly utilize a different design pattern that is now harmed by > this automatic uncaching. It is common for us to cache table A to then make > two, independent children tables (e.g. B and C). Once those two child tables > are realized and cached, we'd then uncache table A (as it was no longer > needed and could be quite large). After this change now, when we uncache > table A, we suddenly lose our cached status on both table B and C (which is > quite frustrating). All of these tables are often quite large, and we view > what we're doing as mindful memory management. We are maintaining named > references to B and C at all times, so we can always uncache them ourselves > when it makes sense. > Would it be acceptable/feasible to make this table uncache chaining optional? 
> I would be fine if the default is for the chaining to happen, as long as we > can turn it off via parameters. > If acceptable, I can try to work towards making the required changes. I am > most comfortable in Python (and would want the optional parameter surfaced in > Python), but have found the places required to make this change in Scala > (since I reverted the functionality in a private fork already). Any help > would be greatly appreciated however. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20683) Make table uncache chaining optional
Shea Parkes created SPARK-20683: --- Summary: Make table uncache chaining optional Key: SPARK-20683 URL: https://issues.apache.org/jira/browse/SPARK-20683 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.1.1 Environment: Not particularly environment sensitive. Encountered/tested on Linux and Windows. Reporter: Shea Parkes A recent change was made in SPARK-19765 that causes table uncaching to chain. That is, if table B is a child of table A, and they are both cached, now uncaching table A will automatically uncache table B. At first I did not understand the need for this, but when reading the unit tests, I see that it is likely that many people do not keep named references to the child table (e.g. B). Perhaps B is just made and cached as some part of data exploration. In that situation, it makes sense for B to automatically be uncached when you are finished with A. However, we commonly utilize a different design pattern that is now harmed by this automatic uncaching. It is common for us to cache table A to then make two, independent children tables (e.g. B and C). Once those two child tables are realized and cached, we'd then uncache table A (as it was no longer needed and could be quite large). After this change now, when we uncache table A, we suddenly lose our cached status on both table B and C (which is quite frustrating). All of these tables are often quite large, and we view what we're doing as mindful memory management. We are maintaining named references to B and C at all times, so we can always uncache them ourselves when it makes sense. Would it be acceptable/feasible to make this table uncache chaining optional? I would be fine if the default is for the chaining to happen, as long as we can turn it off via parameters. If acceptable, I can try to work towards making the required changes. 
I am most comfortable in Python (and would want the optional parameter surfaced in Python), but have found the places required to make this change in Scala (since I reverted the functionality in a private fork already). Any help would be greatly appreciated however. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
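The optional-chaining behaviour requested in SPARK-20683 can be sketched in plain Python. The `CacheManager` class and `cascade` flag below are hypothetical illustrations of the proposal, not Spark's actual cache-manager API:

```python
# Plain-Python sketch (hypothetical API; not Spark's CacheManager) of the
# requested behaviour: uncaching a parent table cascades to dependent caches
# by default, but a cascade=False flag preserves the children.

class CacheManager:
    def __init__(self):
        self.cached = set()
        self.children = {}  # parent table -> dependent tables

    def cache(self, table, parent=None):
        self.cached.add(table)
        if parent is not None:
            self.children.setdefault(parent, []).append(table)

    def uncache(self, table, cascade=True):
        self.cached.discard(table)
        if cascade:
            for child in self.children.get(table, []):
                self.uncache(child, cascade=True)

# The pattern from the report: A feeds both B and C; uncaching A with
# cascade=False keeps the two children cached.
cm = CacheManager()
cm.cache("A")
cm.cache("B", parent="A")
cm.cache("C", parent="A")
cm.uncache("A", cascade=False)
print(sorted(cm.cached))  # ['B', 'C']

# The SPARK-19765 default: uncaching A also uncaches its dependent B.
cm2 = CacheManager()
cm2.cache("A")
cm2.cache("B", parent="A")
cm2.uncache("A")
print(sorted(cm2.cached))  # []
```

Defaulting `cascade` to `True` keeps the SPARK-19765 behaviour for users who drop references to child tables, while the flag supports the explicit memory-management pattern described above.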
[jira] [Assigned] (SPARK-20682) Support a new faster ORC data source based on Apache ORC
[ https://issues.apache.org/jira/browse/SPARK-20682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20682: Assignee: Apache Spark > Support a new faster ORC data source based on Apache ORC > > > Key: SPARK-20682 > URL: https://issues.apache.org/jira/browse/SPARK-20682 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.1, 1.5.2, 1.6.3, 2.1.1 >Reporter: Dongjoon Hyun >Assignee: Apache Spark > > Since SPARK-2883, Apache Spark supports Apache ORC inside `sql/hive` module > with Hive dependency. This issue aims to add a new and faster ORC data source > inside `sql/core` and to replace the old ORC data source eventually. In this > issue, the latest Apache ORC 1.4.0 (released yesterday) is used. > There are four key benefits. > - Speed: Use both Spark `ColumnarBatch` and ORC `RowBatch` together. This is > faster than the current implementation in Spark. > - Stability: Apache ORC 1.4.0 has many fixes and we can depend on ORC > community more. > - Usability: User can use `ORC` data sources without hive module, i.e, > `-Phive`. > - Maintainability: Reduce the Hive dependency and can remove old legacy code > later. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20682) Support a new faster ORC data source based on Apache ORC
[ https://issues.apache.org/jira/browse/SPARK-20682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20682: Assignee: (was: Apache Spark) > Support a new faster ORC data source based on Apache ORC > > > Key: SPARK-20682 > URL: https://issues.apache.org/jira/browse/SPARK-20682 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.1, 1.5.2, 1.6.3, 2.1.1 >Reporter: Dongjoon Hyun > > Since SPARK-2883, Apache Spark supports Apache ORC inside `sql/hive` module > with Hive dependency. This issue aims to add a new and faster ORC data source > inside `sql/core` and to replace the old ORC data source eventually. In this > issue, the latest Apache ORC 1.4.0 (released yesterday) is used. > There are four key benefits. > - Speed: Use both Spark `ColumnarBatch` and ORC `RowBatch` together. This is > faster than the current implementation in Spark. > - Stability: Apache ORC 1.4.0 has many fixes and we can depend on ORC > community more. > - Usability: User can use `ORC` data sources without hive module, i.e, > `-Phive`. > - Maintainability: Reduce the Hive dependency and can remove old legacy code > later. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20682) Support a new faster ORC data source based on Apache ORC
[ https://issues.apache.org/jira/browse/SPARK-20682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16003369#comment-16003369 ] Apache Spark commented on SPARK-20682: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/17924 > Support a new faster ORC data source based on Apache ORC > > > Key: SPARK-20682 > URL: https://issues.apache.org/jira/browse/SPARK-20682 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.1, 1.5.2, 1.6.3, 2.1.1 >Reporter: Dongjoon Hyun > > Since SPARK-2883, Apache Spark supports Apache ORC inside `sql/hive` module > with Hive dependency. This issue aims to add a new and faster ORC data source > inside `sql/core` and to replace the old ORC data source eventually. In this > issue, the latest Apache ORC 1.4.0 (released yesterday) is used. > There are four key benefits. > - Speed: Use both Spark `ColumnarBatch` and ORC `RowBatch` together. This is > faster than the current implementation in Spark. > - Stability: Apache ORC 1.4.0 has many fixes and we can depend on ORC > community more. > - Usability: User can use `ORC` data sources without hive module, i.e, > `-Phive`. > - Maintainability: Reduce the Hive dependency and can remove old legacy code > later. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20682) Support a new faster ORC data source based on Apache ORC
[ https://issues.apache.org/jira/browse/SPARK-20682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16003344#comment-16003344 ] Dongjoon Hyun commented on SPARK-20682: --- Yes. I think it's worthwhile for Spark to have it. > Support a new faster ORC data source based on Apache ORC > > > Key: SPARK-20682 > URL: https://issues.apache.org/jira/browse/SPARK-20682 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.1, 1.5.2, 1.6.3, 2.1.1 >Reporter: Dongjoon Hyun > > Since SPARK-2883, Apache Spark supports Apache ORC inside `sql/hive` module > with Hive dependency. This issue aims to add a new and faster ORC data source > inside `sql/core` and to replace the old ORC data source eventually. In this > issue, the latest Apache ORC 1.4.0 (released yesterday) is used. > There are four key benefits. > - Speed: Use both Spark `ColumnarBatch` and ORC `RowBatch` together. This is > faster than the current implementation in Spark. > - Stability: Apache ORC 1.4.0 has many fixes and we can depend on ORC > community more. > - Usability: User can use `ORC` data sources without hive module, i.e, > `-Phive`. > - Maintainability: Reduce the Hive dependency and can remove old legacy code > later. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20682) Support a new faster ORC data source based on Apache ORC
[ https://issues.apache.org/jira/browse/SPARK-20682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16003339#comment-16003339 ] Robert Kruszewski commented on SPARK-20682: --- Awesome, didn't know that there are nohive artifacts, sounds like we can make it work without -Phive then. > Support a new faster ORC data source based on Apache ORC > > > Key: SPARK-20682 > URL: https://issues.apache.org/jira/browse/SPARK-20682 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.1, 1.5.2, 1.6.3, 2.1.1 >Reporter: Dongjoon Hyun > > Since SPARK-2883, Apache Spark supports Apache ORC inside `sql/hive` module > with Hive dependency. This issue aims to add a new and faster ORC data source > inside `sql/core` and to replace the old ORC data source eventually. In this > issue, the latest Apache ORC 1.4.0 (released yesterday) is used. > There are four key benefits. > - Speed: Use both Spark `ColumnarBatch` and ORC `RowBatch` together. This is > faster than the current implementation in Spark. > - Stability: Apache ORC 1.4.0 has many fixes and we can depend on ORC > community more. > - Usability: User can use `ORC` data sources without hive module, i.e, > `-Phive`. > - Maintainability: Reduce the Hive dependency and can remove old legacy code > later. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20682) Support a new faster ORC data source based on Apache ORC
[ https://issues.apache.org/jira/browse/SPARK-20682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-20682: -- Description: Since SPARK-2883, Apache Spark supports Apache ORC inside `sql/hive` module with Hive dependency. This issue aims to add a new and faster ORC data source inside `sql/core` and to replace the old ORC data source eventually. In this issue, the latest Apache ORC 1.4.0 (released yesterday) is used. There are four key benefits. - Speed: Use both Spark `ColumnarBatch` and ORC `RowBatch` together. This is faster than the current implementation in Spark. - Stability: Apache ORC 1.4.0 has many fixes and we can depend on ORC community more. - Usability: User can use `ORC` data sources without hive module, i.e, `-Phive`. - Maintainability: Reduce the Hive dependency and can remove old legacy code later. was: Since SPARK-2883, Apache Spark supports Apache ORC inside `sql/hive` module with Hive dependency. This issue aims to add a new and faster ORC data source inside `sql/core` and to replace the old ORC data source eventually. In this issue, the latest Apache ORC 1.4.0 (released yesterday) is used. There are four key benefits. - Speed: Use both Spark `ColumnarBatch` and ORC `RowBatch` together. This is faster than the current implementation in Spark. - Stability: Apache ORC 1.4.0 has many fixes and we can depend on ORC community more. - Usability: User can use `ORC` data sources with hive module, i.e, `-Phive`. - Maintainability: Reduce the Hive dependency and can remove old legacy code later. > Support a new faster ORC data source based on Apache ORC > > > Key: SPARK-20682 > URL: https://issues.apache.org/jira/browse/SPARK-20682 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.1, 1.5.2, 1.6.3, 2.1.1 >Reporter: Dongjoon Hyun > > Since SPARK-2883, Apache Spark supports Apache ORC inside `sql/hive` module > with Hive dependency. 
This issue aims to add a new and faster ORC data source > inside `sql/core` and to replace the old ORC data source eventually. In this > issue, the latest Apache ORC 1.4.0 (released yesterday) is used. > There are four key benefits. > - Speed: Use both Spark `ColumnarBatch` and ORC `RowBatch` together. This is > faster than the current implementation in Spark. > - Stability: Apache ORC 1.4.0 has many fixes and we can depend on ORC > community more. > - Usability: User can use `ORC` data sources without hive module, i.e, > `-Phive`. > - Maintainability: Reduce the Hive dependency and can remove old legacy code > later. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-20682) Support a new faster ORC data source based on Apache ORC
[ https://issues.apache.org/jira/browse/SPARK-20682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16003326#comment-16003326 ] Dongjoon Hyun edited comment on SPARK-20682 at 5/9/17 7:10 PM: --- I fixed the description. Yep. `hive-storage-api` was an old hurdle before 1.4.0. 1.4.0 provides `nohive` artifacts, too. https://repo.maven.apache.org/maven2/org/apache/orc/orc-core/1.4.0/ I'll make a PR soon as a POC. was (Author: dongjoon): Yep. That was an old hurdle before 1.4.0. 1.4.0 provides `nohive` artifacts, too. https://repo.maven.apache.org/maven2/org/apache/orc/orc-core/1.4.0/ I'll make a PR soon as a POC. > Support a new faster ORC data source based on Apache ORC > > > Key: SPARK-20682 > URL: https://issues.apache.org/jira/browse/SPARK-20682 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.1, 1.5.2, 1.6.3, 2.1.1 >Reporter: Dongjoon Hyun > > Since SPARK-2883, Apache Spark supports Apache ORC inside `sql/hive` module > with Hive dependency. This issue aims to add a new and faster ORC data source > inside `sql/core` and to replace the old ORC data source eventually. In this > issue, the latest Apache ORC 1.4.0 (released yesterday) is used. > There are four key benefits. > - Speed: Use both Spark `ColumnarBatch` and ORC `RowBatch` together. This is > faster than the current implementation in Spark. > - Stability: Apache ORC 1.4.0 has many fixes and we can depend on ORC > community more. > - Usability: User can use `ORC` data sources without hive module, i.e, > `-Phive`. > - Maintainability: Reduce the Hive dependency and can remove old legacy code > later. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20682) Support a new faster ORC data source based on Apache ORC
[ https://issues.apache.org/jira/browse/SPARK-20682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16003326#comment-16003326 ] Dongjoon Hyun commented on SPARK-20682: --- Yep. That was an old hurdle before 1.4.0. 1.4.0 provides `nohive` artifacts, too. https://repo.maven.apache.org/maven2/org/apache/orc/orc-core/1.4.0/ I'll make a PR soon as a POC. > Support a new faster ORC data source based on Apache ORC > > > Key: SPARK-20682 > URL: https://issues.apache.org/jira/browse/SPARK-20682 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.1, 1.5.2, 1.6.3, 2.1.1 >Reporter: Dongjoon Hyun > > Since SPARK-2883, Apache Spark supports Apache ORC inside `sql/hive` module > with Hive dependency. This issue aims to add a new and faster ORC data source > inside `sql/core` and to replace the old ORC data source eventually. In this > issue, the latest Apache ORC 1.4.0 (released yesterday) is used. > There are four key benefits. > - Speed: Use both Spark `ColumnarBatch` and ORC `RowBatch` together. This is > faster than the current implementation in Spark. > - Stability: Apache ORC 1.4.0 has many fixes and we can depend on ORC > community more. > - Usability: User can use `ORC` data sources with hive module, i.e, `-Phive`. > - Maintainability: Reduce the Hive dependency and can remove old legacy code > later. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
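For anyone following up on the `nohive` artifacts linked above: they are published under a Maven classifier, so a build wanting the Hive-free ORC core could declare something like the following. This is an illustrative sketch only; the coordinates and classifier are taken from the Maven Central listing cited in the comment.

```xml
<!-- Illustrative only: pulls in ORC 1.4.0 without the Hive shims,
     using the "nohive" classifier from the Maven Central listing above. -->
<dependency>
  <groupId>org.apache.orc</groupId>
  <artifactId>orc-core</artifactId>
  <version>1.4.0</version>
  <classifier>nohive</classifier>
</dependency>
```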
[jira] [Commented] (SPARK-20682) Support a new faster ORC data source based on Apache ORC
[ https://issues.apache.org/jira/browse/SPARK-20682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16003321#comment-16003321 ] Robert Kruszewski commented on SPARK-20682: --- Would it make sense to bring hive-storage-api to sql/core and allow people to use orc without -Phive? There's a ton of cruft that hive brings like hive-exec which technically isn't necessary for orc. Looking at hive-storage-api it's mostly bean classes to define types and filters. > Support a new faster ORC data source based on Apache ORC > > > Key: SPARK-20682 > URL: https://issues.apache.org/jira/browse/SPARK-20682 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.1, 1.5.2, 1.6.3, 2.1.1 >Reporter: Dongjoon Hyun > > Since SPARK-2883, Apache Spark supports Apache ORC inside `sql/hive` module > with Hive dependency. This issue aims to add a new and faster ORC data source > inside `sql/core` and to replace the old ORC data source eventually. In this > issue, the latest Apache ORC 1.4.0 (released yesterday) is used. > There are four key benefits. > - Speed: Use both Spark `ColumnarBatch` and ORC `RowBatch` together. This is > faster than the current implementation in Spark. > - Stability: Apache ORC 1.4.0 has many fixes and we can depend on ORC > community more. > - Usability: User can use `ORC` data sources with hive module, i.e, `-Phive`. > - Maintainability: Reduce the Hive dependency and can remove old legacy code > later. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20682) Support a new faster ORC data source based on Apache ORC
Dongjoon Hyun created SPARK-20682: - Summary: Support a new faster ORC data source based on Apache ORC Key: SPARK-20682 URL: https://issues.apache.org/jira/browse/SPARK-20682 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.1.1, 1.6.3, 1.5.2, 1.4.1 Reporter: Dongjoon Hyun Since SPARK-2883, Apache Spark supports Apache ORC inside `sql/hive` module with Hive dependency. This issue aims to add a new and faster ORC data source inside `sql/core` and to replace the old ORC data source eventually. In this issue, the latest Apache ORC 1.4.0 (released yesterday) is used. There are four key benefits. - Speed: Use both Spark `ColumnarBatch` and ORC `RowBatch` together. This is faster than the current implementation in Spark. - Stability: Apache ORC 1.4.0 has many fixes and we can depend on ORC community more. - Usability: User can use `ORC` data sources with hive module, i.e, `-Phive`. - Maintainability: Reduce the Hive dependency and can remove old legacy code later. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-18528) limit + groupBy leads to java.lang.NullPointerException
[ https://issues.apache.org/jira/browse/SPARK-18528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16003265#comment-16003265 ] Esther Kundin edited comment on SPARK-18528 at 5/9/17 6:43 PM: --- I have just found the same bug with Spark 2.1.0, even though it's marked as fixed. I believe you need to add 2.1.0 to the "Affected Versions" list. I have a DataFrame df with a few columns that I read in from json, and running: df.select('foo').distinct().count() works fine. df.limit(100).select('foo').distinct().count() throws an NPE. df.limit(100).repartition('foo').select('foo').distinct().count() works fine too. Seems like the same bug, still broken. was (Author: ekundin): I have just found the same bug with Spark 2.1, even though it's marked as fixed. I have a DataFrame df with a few columns that I read in from json, and running: df.select('foo').distinct().count() works fine. df.limit(100).select('foo').distinct().count() throws an NPE. df.limit(100).repartition('foo').select('foo').distinct().count() works fine too. Seems like the same bug, still broken. > limit + groupBy leads to java.lang.NullPointerException > --- > > Key: SPARK-18528 > URL: https://issues.apache.org/jira/browse/SPARK-18528 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 2.0.1 > Environment: CentOS release 6.6, Linux 2.6.32-504.el6.x86_64 >Reporter: Corey >Assignee: Takeshi Yamamuro > Fix For: 2.0.3, 2.1.1, 2.2.0 > > > Using limit on a DataFrame prior to groupBy will lead to a crash. > Repartitioning will avoid the crash. 
> *will crash:* {{df.limit(3).groupBy("user_id").count().show()}} > *will work:* {{df.limit(3).coalesce(1).groupBy('user_id').count().show()}} > *will work:* > {{df.limit(3).repartition('user_id').groupBy('user_id').count().show()}} > Here is a reproducible example along with the error message: > {quote} > >>> df = spark.createDataFrame([ (1, 1), (1, 3), (2, 1), (3, 2), (3, 3) ], > >>> ["user_id", "genre_id"]) > >>> > >>> df.show() > +-------+--------+ > |user_id|genre_id| > +-------+--------+ > |      1|       1| > |      1|       3| > |      2|       1| > |      3|       2| > |      3|       3| > +-------+--------+ > >>> df.groupBy("user_id").count().show() > +-------+-----+ > |user_id|count| > +-------+-----+ > |      1|    2| > |      3|    2| > |      2|    1| > +-------+-----+ > >>> df.limit(3).groupBy("user_id").count().show() > [Stage 8:===>(1964 + 24) / > 2000]16/11/21 01:59:27 WARN TaskSetManager: Lost task 0.0 in stage 9.0 (TID > 8204, lvsp20hdn012.stubprod.com): java.lang.NullPointerException > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at 
org.apache.spark.scheduler.Task.run(Task.scala:86) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {quote}
[jira] [Comment Edited] (SPARK-12297) Add work-around for Parquet/Hive int96 timestamp bug.
[ https://issues.apache.org/jira/browse/SPARK-12297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16003267#comment-16003267 ] Reynold Xin edited comment on SPARK-12297 at 5/9/17 6:40 PM: - I looked at the issue again and reverted the patch. If we want to resolve this issue, we need to look at the fundamental incompatibility (that is - the two data types have different semantics: timestamp without timezone and timestamp with timezone). The two data types have different semantics when parsing data. Also this is not just a Parquet issue. The same issue could happen to all data formats. It is going to be really confusing to have something that only works for Parquet simply because Impala cares more about Parquet. It seems like the purpose of this patch can be accomplished by just setting the session local timezone to UTC? was (Author: rxin): I looked at the issue again and reverted the patch. If we want to resolve this issue, we need to look at the fundamental incompatibility (that is - the two data types have different semantics: timestamp without timezone and timestamp with timezone). The two data types have different semantics when parsing data. Also this is not just a Parquet issue. The same issue could happen to all data types. It is going to be really confusing to have something that only works for Parquet simply because Impala cares more about Parquet. It seems like the purpose of this patch can be accomplished by just setting the session local timezone to UTC? > Add work-around for Parquet/Hive int96 timestamp bug. > - > > Key: SPARK-12297 > URL: https://issues.apache.org/jira/browse/SPARK-12297 > Project: Spark > Issue Type: Task > Components: Spark Core >Reporter: Ryan Blue > > Spark copied Hive's behavior for parquet, but this was inconsistent with > other file formats, and inconsistent with Impala (which is the original > source of putting a timestamp as an int96 in parquet, I believe). 
This made > timestamps in parquet act more like timestamps with timezones, while in other > file formats, timestamps have no time zone, they are a "floating time". > The easiest way to see this issue is to write out a table with timestamps in > multiple different formats from one timezone, then try to read them back in > another timezone. Eg., here I write out a few timestamps to parquet and > textfile hive tables, and also just as a json file, all in the > "America/Los_Angeles" timezone: > {code} > import org.apache.spark.sql.Row > import org.apache.spark.sql.types._ > val tblPrefix = args(0) > val schema = new StructType().add("ts", TimestampType) > val rows = sc.parallelize(Seq( > "2015-12-31 23:50:59.123", > "2015-12-31 22:49:59.123", > "2016-01-01 00:39:59.123", > "2016-01-01 01:29:59.123" > ).map { x => Row(java.sql.Timestamp.valueOf(x)) }) > val rawData = spark.createDataFrame(rows, schema).toDF() > rawData.show() > Seq("parquet", "textfile").foreach { format => > val tblName = s"${tblPrefix}_$format" > spark.sql(s"DROP TABLE IF EXISTS $tblName") > spark.sql( > raw"""CREATE TABLE $tblName ( > | ts timestamp > | ) > | STORED AS $format > """.stripMargin) > rawData.write.insertInto(tblName) > } > rawData.write.json(s"${tblPrefix}_json") > {code} > Then I start a spark-shell in "America/New_York" timezone, and read the data > back from each table: > {code} > scala> spark.sql("select * from la_parquet").collect().foreach{println} > [2016-01-01 02:50:59.123] > [2016-01-01 01:49:59.123] > [2016-01-01 03:39:59.123] > [2016-01-01 04:29:59.123] > scala> spark.sql("select * from la_textfile").collect().foreach{println} > [2015-12-31 23:50:59.123] > [2015-12-31 22:49:59.123] > [2016-01-01 00:39:59.123] > [2016-01-01 01:29:59.123] > scala> spark.read.json("la_json").collect().foreach{println} > [2015-12-31 23:50:59.123] > [2015-12-31 22:49:59.123] > [2016-01-01 00:39:59.123] > [2016-01-01 01:29:59.123] > scala> spark.read.json("la_json").join(spark.sql("select * from > 
la_textfile"), "ts").show() > +--------------------+ > |                  ts| > +--------------------+ > |2015-12-31 23:50:...| > |2015-12-31 22:49:...| > |2016-01-01 00:39:...| > |2016-01-01 01:29:...| > +--------------------+ > scala> spark.read.json("la_json").join(spark.sql("select * from la_parquet"), > "ts").show() > +---+ > | ts| > +---+ > +---+ > {code} > The textfile and json based data shows the same times, and can be joined > against each other, while the times from the parquet data have changed (and > obviously joins fail). > This is a big problem for any organization that may try to read the same data > (say in S3) with clusters in multiple timezones. It can also be a nasty > surprise as an organization tries to migrate file formats.
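The timezone shift in the report above can be illustrated outside Spark. The sketch below is plain Python (not Spark or Parquet code) contrasting "instant" semantics (what int96 parquet timestamps effectively get) with "floating time" semantics (text/JSON); the fixed -8/-5 winter offsets standing in for America/Los_Angeles and America/New_York are assumptions of the example.

```python
from datetime import datetime, timedelta, timezone

# Fixed standard-time offsets stand in for the LA and NY zones (both example
# dates fall in winter, so DST does not apply).
PST = timezone(timedelta(hours=-8))
EST = timezone(timedelta(hours=-5))

wall_clock = "2015-12-31 23:50:59"  # what the writer saw in Los Angeles

# Instant semantics: attach the writer's zone, normalize to UTC for storage.
instant = datetime.strptime(wall_clock, "%Y-%m-%d %H:%M:%S").replace(tzinfo=PST)
stored_utc = instant.astimezone(timezone.utc)

# A New York reader renders the same instant three hours later on the clock.
read_as_instant = stored_utc.astimezone(EST).strftime("%Y-%m-%d %H:%M:%S")

# Floating semantics: the literal wall-clock string survives the round trip.
read_as_floating = wall_clock

print(read_as_instant)   # 2016-01-01 02:50:59 (shifted, like la_parquet)
print(read_as_floating)  # 2015-12-31 23:50:59 (unchanged, like la_textfile)
```

The three-hour shift matches the `la_parquet` output above, while the floating value matches `la_textfile` and the JSON file.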
[jira] [Comment Edited] (SPARK-12297) Add work-around for Parquet/Hive int96 timestamp bug.
[ https://issues.apache.org/jira/browse/SPARK-12297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16003267#comment-16003267 ] Reynold Xin edited comment on SPARK-12297 at 5/9/17 6:39 PM: - I looked at the issue again and reverted the patch. If we want to resolve this issue, we need to look at the fundamental incompatibility (that is - the two data types have different semantics: timestamp without timezone and timestamp with timezone). The two data types have different semantics when parsing data. Also this is not just a Parquet issue. The same issue could happen to all data types. It is going to be really confusing to have something that only works for Parquet simply because Impala cares more about Parquet. It seems like the purpose of this patch can be accomplished by just setting the session local timezone to UTC? was (Author: rxin): I looked at the issue again and reverted the patch. If we want to resolve this issue, we need to look at the fundamental incompatibility (that is - the two data types have different semantics: timestamp without timezone and timestamp with timezone). The two data types have different semantics when parsing data. It seems like the purpose of this patch can be accomplished by just setting the session local timezone to UTC? > Add work-around for Parquet/Hive int96 timestamp bug. > - > > Key: SPARK-12297 > URL: https://issues.apache.org/jira/browse/SPARK-12297 > Project: Spark > Issue Type: Task > Components: Spark Core >Reporter: Ryan Blue > > Spark copied Hive's behavior for parquet, but this was inconsistent with > other file formats, and inconsistent with Impala (which is the original > source of putting a timestamp as an int96 in parquet, I believe). This made > timestamps in parquet act more like timestamps with timezones, while in other > file formats, timestamps have no time zone, they are a "floating time". 
> The easiest way to see this issue is to write out a table with timestamps in > multiple different formats from one timezone, then try to read them back in > another timezone. Eg., here I write out a few timestamps to parquet and > textfile hive tables, and also just as a json file, all in the > "America/Los_Angeles" timezone: > {code} > import org.apache.spark.sql.Row > import org.apache.spark.sql.types._ > val tblPrefix = args(0) > val schema = new StructType().add("ts", TimestampType) > val rows = sc.parallelize(Seq( > "2015-12-31 23:50:59.123", > "2015-12-31 22:49:59.123", > "2016-01-01 00:39:59.123", > "2016-01-01 01:29:59.123" > ).map { x => Row(java.sql.Timestamp.valueOf(x)) }) > val rawData = spark.createDataFrame(rows, schema).toDF() > rawData.show() > Seq("parquet", "textfile").foreach { format => > val tblName = s"${tblPrefix}_$format" > spark.sql(s"DROP TABLE IF EXISTS $tblName") > spark.sql( > raw"""CREATE TABLE $tblName ( > | ts timestamp > | ) > | STORED AS $format > """.stripMargin) > rawData.write.insertInto(tblName) > } > rawData.write.json(s"${tblPrefix}_json") > {code} > Then I start a spark-shell in "America/New_York" timezone, and read the data > back from each table: > {code} > scala> spark.sql("select * from la_parquet").collect().foreach{println} > [2016-01-01 02:50:59.123] > [2016-01-01 01:49:59.123] > [2016-01-01 03:39:59.123] > [2016-01-01 04:29:59.123] > scala> spark.sql("select * from la_textfile").collect().foreach{println} > [2015-12-31 23:50:59.123] > [2015-12-31 22:49:59.123] > [2016-01-01 00:39:59.123] > [2016-01-01 01:29:59.123] > scala> spark.read.json("la_json").collect().foreach{println} > [2015-12-31 23:50:59.123] > [2015-12-31 22:49:59.123] > [2016-01-01 00:39:59.123] > [2016-01-01 01:29:59.123] > scala> spark.read.json("la_json").join(spark.sql("select * from > la_textfile"), "ts").show() > +--------------------+ > |                  ts| > +--------------------+ > |2015-12-31 23:50:...| > |2015-12-31 22:49:...| > |2016-01-01 00:39:...| > |2016-01-01 01:29:...| > +--------------------+ > scala> 
spark.read.json("la_json").join(spark.sql("select * from la_parquet"), > "ts").show() > +---+ > | ts| > +---+ > +---+ > {code} > The textfile and json based data shows the same times, and can be joined > against each other, while the times from the parquet data have changed (and > obviously joins fail). > This is a big problem for any organization that may try to read the same data > (say in S3) with clusters in multiple timezones. It can also be a nasty > surprise as an organization tries to migrate file formats. Finally, it's a > source of incompatibility between Hive, Impala, and Spark. > HIVE-12767 aims to fix this by introducing a table property which indicates > the "storage timezone" for the table. Spark should add the same to ensure > consistency between file formats, and with Hive & Impala.
[jira] [Commented] (SPARK-12297) Add work-around for Parquet/Hive int96 timestamp bug.
[ https://issues.apache.org/jira/browse/SPARK-12297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16003267#comment-16003267 ] Reynold Xin commented on SPARK-12297: - I looked at the issue again and reverted the patch. If we want to resolve this issue, we need to look at the fundamental incompatibility (that is - the two data types have different semantics: timestamp without timezone and timestamp with timezone). The two data types have different semantics when parsing data. It seems like the purpose of this patch can be accomplished by just setting the session local timezone to UTC? > Add work-around for Parquet/Hive int96 timestamp bug. > - > > Key: SPARK-12297 > URL: https://issues.apache.org/jira/browse/SPARK-12297 > Project: Spark > Issue Type: Task > Components: Spark Core >Reporter: Ryan Blue > > Spark copied Hive's behavior for parquet, but this was inconsistent with > other file formats, and inconsistent with Impala (which is the original > source of putting a timestamp as an int96 in parquet, I believe). This made > timestamps in parquet act more like timestamps with timezones, while in other > file formats, timestamps have no time zone, they are a "floating time". > The easiest way to see this issue is to write out a table with timestamps in > multiple different formats from one timezone, then try to read them back in > another timezone. 
Eg., here I write out a few timestamps to parquet and > textfile hive tables, and also just as a json file, all in the > "America/Los_Angeles" timezone: > {code} > import org.apache.spark.sql.Row > import org.apache.spark.sql.types._ > val tblPrefix = args(0) > val schema = new StructType().add("ts", TimestampType) > val rows = sc.parallelize(Seq( > "2015-12-31 23:50:59.123", > "2015-12-31 22:49:59.123", > "2016-01-01 00:39:59.123", > "2016-01-01 01:29:59.123" > ).map { x => Row(java.sql.Timestamp.valueOf(x)) }) > val rawData = spark.createDataFrame(rows, schema).toDF() > rawData.show() > Seq("parquet", "textfile").foreach { format => > val tblName = s"${tblPrefix}_$format" > spark.sql(s"DROP TABLE IF EXISTS $tblName") > spark.sql( > raw"""CREATE TABLE $tblName ( > | ts timestamp > | ) > | STORED AS $format > """.stripMargin) > rawData.write.insertInto(tblName) > } > rawData.write.json(s"${tblPrefix}_json") > {code} > Then I start a spark-shell in "America/New_York" timezone, and read the data > back from each table: > {code} > scala> spark.sql("select * from la_parquet").collect().foreach{println} > [2016-01-01 02:50:59.123] > [2016-01-01 01:49:59.123] > [2016-01-01 03:39:59.123] > [2016-01-01 04:29:59.123] > scala> spark.sql("select * from la_textfile").collect().foreach{println} > [2015-12-31 23:50:59.123] > [2015-12-31 22:49:59.123] > [2016-01-01 00:39:59.123] > [2016-01-01 01:29:59.123] > scala> spark.read.json("la_json").collect().foreach{println} > [2015-12-31 23:50:59.123] > [2015-12-31 22:49:59.123] > [2016-01-01 00:39:59.123] > [2016-01-01 01:29:59.123] > scala> spark.read.json("la_json").join(spark.sql("select * from > la_textfile"), "ts").show() > +--------------------+ > |                  ts| > +--------------------+ > |2015-12-31 23:50:...| > |2015-12-31 22:49:...| > |2016-01-01 00:39:...| > |2016-01-01 01:29:...| > +--------------------+ > scala> spark.read.json("la_json").join(spark.sql("select * from la_parquet"), > "ts").show() > +---+ > | ts| > +---+ > +---+ > {code} > The textfile and json based data shows the same 
times, and can be joined > against each other, while the times from the parquet data have changed (and > obviously joins fail). > This is a big problem for any organization that may try to read the same data > (say in S3) with clusters in multiple timezones. It can also be a nasty > surprise as an organization tries to migrate file formats. Finally, it's a > source of incompatibility between Hive, Impala, and Spark. > HIVE-12767 aims to fix this by introducing a table property which indicates > the "storage timezone" for the table. Spark should add the same to ensure > consistency between file formats, and with Hive & Impala.
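The suggestion in the comment above — setting the session local timezone to UTC — can also be sketched in plain Python (not Spark code): if both the writing and the reading side pin their "session" zone to UTC, instant-based storage reproduces the wall clock exactly, so the parquet shift disappears. In Spark 2.2+ the analogous knob would be the session timezone configuration, e.g. `spark.conf.set("spark.sql.session.timeZone", "UTC")`; treat that mapping as an assumption of this sketch.

```python
from datetime import datetime, timezone

UTC = timezone.utc
wall_clock = "2015-12-31 23:50:59"

# Writer "session" in UTC: the stored instant equals the wall clock typed.
stored = datetime.strptime(wall_clock, "%Y-%m-%d %H:%M:%S").replace(tzinfo=UTC)

# Reader "session" also in UTC: rendering the instant reproduces the input,
# unlike the LA-write / NY-read round trip in the bug report.
read_back = stored.astimezone(UTC).strftime("%Y-%m-%d %H:%M:%S")

print(read_back)  # 2015-12-31 23:50:59 -- no shift when both sides use UTC
```

This only sidesteps the semantic question (timestamp with vs. without timezone); it does not resolve it.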
[jira] [Commented] (SPARK-18528) limit + groupBy leads to java.lang.NullPointerException
[ https://issues.apache.org/jira/browse/SPARK-18528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16003265#comment-16003265 ] Esther Kundin commented on SPARK-18528: --- I have just found the same bug with Spark 2.1, even though it's marked as fixed. I have a DataFrame df with a few columns that I read in from json, and running: df.select('foo').distinct().count() works fine. df.limit(100).select('foo').distinct().count() throws an NPE. df.limit(100).repartition('foo').select('foo').distinct().count() works fine too. Seems like the same bug, still broken. > limit + groupBy leads to java.lang.NullPointerException > --- > > Key: SPARK-18528 > URL: https://issues.apache.org/jira/browse/SPARK-18528 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 2.0.1 > Environment: CentOS release 6.6, Linux 2.6.32-504.el6.x86_64 >Reporter: Corey >Assignee: Takeshi Yamamuro > Fix For: 2.0.3, 2.1.1, 2.2.0 > > > Using limit on a DataFrame prior to groupBy will lead to a crash. > Repartitioning will avoid the crash. 
> *will crash:* {{df.limit(3).groupBy("user_id").count().show()}} > *will work:* {{df.limit(3).coalesce(1).groupBy('user_id').count().show()}} > *will work:* > {{df.limit(3).repartition('user_id').groupBy('user_id').count().show()}} > Here is a reproducible example along with the error message: > {quote} > >>> df = spark.createDataFrame([ (1, 1), (1, 3), (2, 1), (3, 2), (3, 3) ], > >>> ["user_id", "genre_id"]) > >>> > >>> df.show() > +-------+--------+ > |user_id|genre_id| > +-------+--------+ > |      1|       1| > |      1|       3| > |      2|       1| > |      3|       2| > |      3|       3| > +-------+--------+ > >>> df.groupBy("user_id").count().show() > +-------+-----+ > |user_id|count| > +-------+-----+ > |      1|    2| > |      3|    2| > |      2|    1| > +-------+-----+ > >>> df.limit(3).groupBy("user_id").count().show() > [Stage 8:===>(1964 + 24) / > 2000]16/11/21 01:59:27 WARN TaskSetManager: Lost task 0.0 in stage 9.0 (TID > 8204, lvsp20hdn012.stubprod.com): java.lang.NullPointerException > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at 
org.apache.spark.scheduler.Task.run(Task.scala:86) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {quote}
[jira] [Updated] (SPARK-12297) Add work-around for Parquet/Hive int96 timestamp bug.
[ https://issues.apache.org/jira/browse/SPARK-12297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-12297: Fix Version/s: (was: 2.3.0) > Add work-around for Parquet/Hive int96 timestamp bug. > - > > Key: SPARK-12297 > URL: https://issues.apache.org/jira/browse/SPARK-12297 > Project: Spark > Issue Type: Task > Components: Spark Core >Reporter: Ryan Blue > > Spark copied Hive's behavior for parquet, but this was inconsistent with > other file formats, and inconsistent with Impala (which is the original > source of putting a timestamp as an int96 in parquet, I believe). This made > timestamps in parquet act more like timestamps with timezones, while in other > file formats, timestamps have no time zone, they are a "floating time". > The easiest way to see this issue is to write out a table with timestamps in > multiple different formats from one timezone, then try to read them back in > another timezone. Eg., here I write out a few timestamps to parquet and > textfile hive tables, and also just as a json file, all in the > "America/Los_Angeles" timezone: > {code} > import org.apache.spark.sql.Row > import org.apache.spark.sql.types._ > val tblPrefix = args(0) > val schema = new StructType().add("ts", TimestampType) > val rows = sc.parallelize(Seq( > "2015-12-31 23:50:59.123", > "2015-12-31 22:49:59.123", > "2016-01-01 00:39:59.123", > "2016-01-01 01:29:59.123" > ).map { x => Row(java.sql.Timestamp.valueOf(x)) }) > val rawData = spark.createDataFrame(rows, schema).toDF() > rawData.show() > Seq("parquet", "textfile").foreach { format => > val tblName = s"${tblPrefix}_$format" > spark.sql(s"DROP TABLE IF EXISTS $tblName") > spark.sql( > raw"""CREATE TABLE $tblName ( > | ts timestamp > | ) > | STORED AS $format > """.stripMargin) > rawData.write.insertInto(tblName) > } > rawData.write.json(s"${tblPrefix}_json") > {code} > Then I start a spark-shell in "America/New_York" timezone, and read the data > back from each table: > 
{code} > scala> spark.sql("select * from la_parquet").collect().foreach{println} > [2016-01-01 02:50:59.123] > [2016-01-01 01:49:59.123] > [2016-01-01 03:39:59.123] > [2016-01-01 04:29:59.123] > scala> spark.sql("select * from la_textfile").collect().foreach{println} > [2015-12-31 23:50:59.123] > [2015-12-31 22:49:59.123] > [2016-01-01 00:39:59.123] > [2016-01-01 01:29:59.123] > scala> spark.read.json("la_json").collect().foreach{println} > [2015-12-31 23:50:59.123] > [2015-12-31 22:49:59.123] > [2016-01-01 00:39:59.123] > [2016-01-01 01:29:59.123] > scala> spark.read.json("la_json").join(spark.sql("select * from > la_textfile"), "ts").show() > +--------------------+ > |                  ts| > +--------------------+ > |2015-12-31 23:50:...| > |2015-12-31 22:49:...| > |2016-01-01 00:39:...| > |2016-01-01 01:29:...| > +--------------------+ > scala> spark.read.json("la_json").join(spark.sql("select * from la_parquet"), > "ts").show() > +---+ > | ts| > +---+ > +---+ > {code} > The textfile and json based data shows the same times, and can be joined > against each other, while the times from the parquet data have changed (and > obviously joins fail). > This is a big problem for any organization that may try to read the same data > (say in S3) with clusters in multiple timezones. It can also be a nasty > surprise as an organization tries to migrate file formats. Finally, it's a > source of incompatibility between Hive, Impala, and Spark. > HIVE-12767 aims to fix this by introducing a table property which indicates > the "storage timezone" for the table. Spark should add the same to ensure > consistency between file formats, and with Hive & Impala.
[jira] [Reopened] (SPARK-12297) Add work-around for Parquet/Hive int96 timestamp bug.
[ https://issues.apache.org/jira/browse/SPARK-12297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin reopened SPARK-12297: - Assignee: (was: Imran Rashid) > Add work-around for Parquet/Hive int96 timestamp bug. > - > > Key: SPARK-12297 > URL: https://issues.apache.org/jira/browse/SPARK-12297 > Project: Spark > Issue Type: Task > Components: Spark Core >Reporter: Ryan Blue > > Spark copied Hive's behavior for parquet, but this was inconsistent with > other file formats, and inconsistent with Impala (which is the original > source of putting a timestamp as an int96 in parquet, I believe). This made > timestamps in parquet act more like timestamps with timezones, while in other > file formats, timestamps have no time zone, they are a "floating time". > The easiest way to see this issue is to write out a table with timestamps in > multiple different formats from one timezone, then try to read them back in > another timezone. Eg., here I write out a few timestamps to parquet and > textfile hive tables, and also just as a json file, all in the > "America/Los_Angeles" timezone: > {code} > import org.apache.spark.sql.Row > import org.apache.spark.sql.types._ > val tblPrefix = args(0) > val schema = new StructType().add("ts", TimestampType) > val rows = sc.parallelize(Seq( > "2015-12-31 23:50:59.123", > "2015-12-31 22:49:59.123", > "2016-01-01 00:39:59.123", > "2016-01-01 01:29:59.123" > ).map { x => Row(java.sql.Timestamp.valueOf(x)) }) > val rawData = spark.createDataFrame(rows, schema).toDF() > rawData.show() > Seq("parquet", "textfile").foreach { format => > val tblName = s"${tblPrefix}_$format" > spark.sql(s"DROP TABLE IF EXISTS $tblName") > spark.sql( > raw"""CREATE TABLE $tblName ( > | ts timestamp > | ) > | STORED AS $format > """.stripMargin) > rawData.write.insertInto(tblName) > } > rawData.write.json(s"${tblPrefix}_json") > {code} > Then I start a spark-shell in "America/New_York" timezone, and read the data > back from each table: 
> {code} > scala> spark.sql("select * from la_parquet").collect().foreach{println} > [2016-01-01 02:50:59.123] > [2016-01-01 01:49:59.123] > [2016-01-01 03:39:59.123] > [2016-01-01 04:29:59.123] > scala> spark.sql("select * from la_textfile").collect().foreach{println} > [2015-12-31 23:50:59.123] > [2015-12-31 22:49:59.123] > [2016-01-01 00:39:59.123] > [2016-01-01 01:29:59.123] > scala> spark.read.json("la_json").collect().foreach{println} > [2015-12-31 23:50:59.123] > [2015-12-31 22:49:59.123] > [2016-01-01 00:39:59.123] > [2016-01-01 01:29:59.123] > scala> spark.read.json("la_json").join(spark.sql("select * from > la_textfile"), "ts").show() > +--------------------+ > |                  ts| > +--------------------+ > |2015-12-31 23:50:...| > |2015-12-31 22:49:...| > |2016-01-01 00:39:...| > |2016-01-01 01:29:...| > +--------------------+ > scala> spark.read.json("la_json").join(spark.sql("select * from la_parquet"), > "ts").show() > +---+ > | ts| > +---+ > +---+ > {code} > The textfile and json based data shows the same times, and can be joined > against each other, while the times from the parquet data have changed (and > obviously joins fail). > This is a big problem for any organization that may try to read the same data > (say in S3) with clusters in multiple timezones. It can also be a nasty > surprise as an organization tries to migrate file formats. Finally, it's a > source of incompatibility between Hive, Impala, and Spark. > HIVE-12767 aims to fix this by introducing a table property which indicates > the "storage timezone" for the table. Spark should add the same to ensure > consistency between file formats, and with Hive & Impala.
[jira] [Comment Edited] (SPARK-12297) Add work-around for Parquet/Hive int96 timestamp bug.
[ https://issues.apache.org/jira/browse/SPARK-12297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16001214#comment-16001214 ] Reynold Xin edited comment on SPARK-12297 at 5/9/17 6:35 PM: - Sorry I'm going to revert this. I think this requires further discussion since it completely changes the behavior of one of the most important data types. It'd be great to consider this more holistically and think about alternatives in fixing them (the current fix might end up being the best but we should at least explore other ones). One of the fundamental problems is that Spark treats timestamp as timestamp with timezone, whereas Impala treats timestamp as timestamp without timezone. The parquet storage is only a small piece here. Other statements such as cast would return different results for these types. Just fixing these piecemeal is a really bad idea when it comes to data type semantics. was (Author: rxin): Sorry I'm going to revert this. I think this requires further discussion since it completely changes the behavior of one of the most important data types. It'd be great to consider this more holistically and think about alternatives in fixing them (the current fix might end up being the best but we should at least explore other ones). One of the fundamental problems is that Spark treats timestamp as timestamp with timezone, whereas both Impala and Hive treat timestamp as timestamp without timezone. The parquet storage is only a small piece here. Other statements such as cast would return different results for these types. Just fixing these piecemeal is a really bad idea when it comes to data type semantics. > Add work-around for Parquet/Hive int96 timestamp bug. 
> - > > Key: SPARK-12297 > URL: https://issues.apache.org/jira/browse/SPARK-12297 > Project: Spark > Issue Type: Task > Components: Spark Core >Reporter: Ryan Blue >Assignee: Imran Rashid > Fix For: 2.3.0 > > > Spark copied Hive's behavior for parquet, but this was inconsistent with > other file formats, and inconsistent with Impala (which is the original > source of putting a timestamp as an int96 in parquet, I believe). This made > timestamps in parquet act more like timestamps with timezones, while in other > file formats, timestamps have no time zone, they are a "floating time". > The easiest way to see this issue is to write out a table with timestamps in > multiple different formats from one timezone, then try to read them back in > another timezone. Eg., here I write out a few timestamps to parquet and > textfile hive tables, and also just as a json file, all in the > "America/Los_Angeles" timezone: > {code} > import org.apache.spark.sql.Row > import org.apache.spark.sql.types._ > val tblPrefix = args(0) > val schema = new StructType().add("ts", TimestampType) > val rows = sc.parallelize(Seq( > "2015-12-31 23:50:59.123", > "2015-12-31 22:49:59.123", > "2016-01-01 00:39:59.123", > "2016-01-01 01:29:59.123" > ).map { x => Row(java.sql.Timestamp.valueOf(x)) }) > val rawData = spark.createDataFrame(rows, schema).toDF() > rawData.show() > Seq("parquet", "textfile").foreach { format => > val tblName = s"${tblPrefix}_$format" > spark.sql(s"DROP TABLE IF EXISTS $tblName") > spark.sql( > raw"""CREATE TABLE $tblName ( > | ts timestamp > | ) > | STORED AS $format > """.stripMargin) > rawData.write.insertInto(tblName) > } > rawData.write.json(s"${tblPrefix}_json") > {code} > Then I start a spark-shell in "America/New_York" timezone, and read the data > back from each table: > {code} > scala> spark.sql("select * from la_parquet").collect().foreach{println} > [2016-01-01 02:50:59.123] > [2016-01-01 01:49:59.123] > [2016-01-01 03:39:59.123] > [2016-01-01 04:29:59.123] > 
scala> spark.sql("select * from la_textfile").collect().foreach{println} > [2015-12-31 23:50:59.123] > [2015-12-31 22:49:59.123] > [2016-01-01 00:39:59.123] > [2016-01-01 01:29:59.123] > scala> spark.read.json("la_json").collect().foreach{println} > [2015-12-31 23:50:59.123] > [2015-12-31 22:49:59.123] > [2016-01-01 00:39:59.123] > [2016-01-01 01:29:59.123] > scala> spark.read.json("la_json").join(spark.sql("select * from > la_textfile"), "ts").show() > ++ > | ts| > ++ > |2015-12-31 23:50:...| > |2015-12-31 22:49:...| > |2016-01-01 00:39:...| > |2016-01-01 01:29:...| > ++ > scala> spark.read.json("la_json").join(spark.sql("select * from la_parquet"), > "ts").show() > +---+ > | ts| > +---+ > +---+ > {code} > The textfile and json based data shows the same times, and can be joined > against each other, while the times from the parquet data have changed (and > obviously joins fail). >
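The semantic gap described above (timestamp-with-timezone vs. "floating" wall-clock time) can be sketched outside Spark. A minimal Python illustration, assuming fixed winter UTC offsets for the two zones (DST ignored for simplicity), shows why the same stored instant that was written as 23:50:59 in Los Angeles reads back as 02:50:59 in New York:

```python
from datetime import datetime, timezone, timedelta

# Spark's TimestampType stores an absolute instant (microseconds since the
# epoch); how it renders depends on the session timezone -- effectively
# "timestamp with time zone" semantics. The fixed offsets below are
# assumptions for illustration (winter offsets, no DST handling).
instant = datetime(2016, 1, 1, 7, 50, 59, 123000, tzinfo=timezone.utc)
la = timezone(timedelta(hours=-8))  # America/Los_Angeles in winter
ny = timezone(timedelta(hours=-5))  # America/New_York in winter

print(instant.astimezone(la))  # 2015-12-31 23:50:59.123000-08:00
print(instant.astimezone(ny))  # 2016-01-01 02:50:59.123000-05:00

# A "floating" timestamp (Impala/Hive int96 intent, or the textfile/json
# tables above) is just the wall-clock fields, so it renders identically
# in every session timezone:
floating = "2015-12-31 23:50:59.123"
print(floating)
```

The join failure in the example follows directly: the Parquet values shift with the reader's timezone while the textfile/json values do not, so the rendered timestamps no longer match.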
[jira] [Resolved] (SPARK-20627) Remove pip local version string (PEP440)
[ https://issues.apache.org/jira/browse/SPARK-20627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] holdenk resolved SPARK-20627. - Resolution: Fixed Fix Version/s: 2.1.2, 2.2.1, 2.3.0 Issue resolved by pull request 17885 [https://github.com/apache/spark/pull/17885] > Remove pip local version string (PEP440) > > > Key: SPARK-20627 > URL: https://issues.apache.org/jira/browse/SPARK-20627 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.1.0, 2.1.1, 2.1.2, 2.2.0, 2.3.0 >Reporter: holdenk > Fix For: 2.1.2, 2.2.1, 2.3.0 > > > In the make-distribution script we currently append the Hadoop version string, > but this makes uploading to PyPI difficult, and we no longer cross-build for > multiple versions. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
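For context on the fix: PEP 440 calls a `+hadoop2.7`-style suffix a "local version label", and PyPI rejects uploads whose version carries one, so the suffix has to be stripped before publishing. A stdlib-only sketch of the stripping step (the helper name is invented for illustration; this is not the actual Spark build-script code):

```python
def strip_local_version(version: str) -> str:
    """Drop the PEP 440 local version label (everything after '+')."""
    # PEP 440: the local version label is separated from the public
    # version by a single '+', so splitting once is sufficient.
    return version.split("+", 1)[0]

# The PyPI-uploadable artifact keeps only the public version:
print(strip_local_version("2.1.1+hadoop2.7"))  # 2.1.1
print(strip_local_version("2.2.0"))            # 2.2.0
```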
[jira] [Assigned] (SPARK-20627) Remove pip local version string (PEP440)
[ https://issues.apache.org/jira/browse/SPARK-20627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] holdenk reassigned SPARK-20627: --- Assignee: holdenk > Remove pip local version string (PEP440) > > > Key: SPARK-20627 > URL: https://issues.apache.org/jira/browse/SPARK-20627 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.1.0, 2.1.1, 2.1.2, 2.2.0, 2.3.0 >Reporter: holdenk >Assignee: holdenk > Fix For: 2.1.2, 2.2.1, 2.3.0 > > > In the make-distribution script we currently append the Hadoop version string, > but this makes uploading to PyPI difficult, and we no longer cross-build for > multiple versions.
[jira] [Updated] (SPARK-20666) Flaky test - SparkListenerBus randomly failing java.lang.IllegalAccessError
[ https://issues.apache.org/jira/browse/SPARK-20666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin updated SPARK-20666: --- Summary: Flaky test - SparkListenerBus randomly failing java.lang.IllegalAccessError (was: Flaky test - SparkListenerBus randomly failing java.lang.IllegalAccessError on Windows) > Flaky test - SparkListenerBus randomly failing java.lang.IllegalAccessError > --- > > Key: SPARK-20666 > URL: https://issues.apache.org/jira/browse/SPARK-20666 > Project: Spark > Issue Type: Bug > Components: ML, Spark Core, SparkR >Affects Versions: 2.3.0 >Reporter: Felix Cheung >Priority: Critical > > seeing quite a bit of this on AppVeyor, aka Windows only, always only when > running ML tests, it seems > {code} > Exception in thread "SparkListenerBus" java.lang.IllegalAccessError: > Attempted to access garbage collected accumulator 159454 > at > org.apache.spark.util.AccumulatorContext$$anonfun$get$1.apply(AccumulatorV2.scala:265) > at > org.apache.spark.util.AccumulatorContext$$anonfun$get$1.apply(AccumulatorV2.scala:261) > at scala.Option.map(Option.scala:146) > at > org.apache.spark.util.AccumulatorContext$.get(AccumulatorV2.scala:261) > at org.apache.spark.util.AccumulatorV2.name(AccumulatorV2.scala:88) > at > org.apache.spark.sql.execution.metric.SQLMetric.toInfo(SQLMetrics.scala:67) > at > org.apache.spark.sql.execution.ui.SQLListener$$anonfun$onTaskEnd$1.apply(SQLListener.scala:216) > at > org.apache.spark.sql.execution.ui.SQLListener$$anonfun$onTaskEnd$1.apply(SQLListener.scala:216) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at 
scala.collection.AbstractTraversable.map(Traversable.scala:104) > at > org.apache.spark.sql.execution.ui.SQLListener.onTaskEnd(SQLListener.scala:216) > at > org.apache.spark.scheduler.SparkListenerBus$class.doPostEvent(SparkListenerBus.scala:45) > at > org.apache.spark.scheduler.LiveListenerBus.doPostEvent(LiveListenerBus.scala:36) > at > org.apache.spark.scheduler.LiveListenerBus.doPostEvent(LiveListenerBus.scala:36) > at > org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:63) > at > org.apache.spark.scheduler.LiveListenerBus.postToAll(LiveListenerBus.scala:36) > at > org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(LiveListenerBus.scala:94) > at > org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:79) > at > org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:79) > at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58) > at > org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(LiveListenerBus.scala:78) > at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1268) > at > org.apache.spark.scheduler.LiveListenerBus$$anon$1.run(LiveListenerBus.scala:77) > 1 > MLlib recommendation algorithms: Spark package found in SPARK_HOME: > C:\projects\spark\bin\.. 
> {code} > {code} > java.lang.IllegalStateException: SparkContext has been shutdown > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2015) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2044) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2063) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2088) > at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:936) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) > at org.apache.spark.rdd.RDD.withScope(RDD.scala:362) > at org.apache.spark.rdd.RDD.collect(RDD.scala:935) > at > org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:275) > at > org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:2923) > at > org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2474) > at > org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2474) > at
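The "garbage collected accumulator" failure mode above is easy to reproduce in miniature: AccumulatorContext holds only weak references, so a listener event that arrives after the driver has dropped its strong reference finds a dead entry. A rough Python analogue of the race (class and function names are invented for illustration, not Spark's actual API):

```python
import gc
import weakref

class Accumulator:
    def __init__(self, acc_id):
        self.id = acc_id

# Mirrors AccumulatorContext: the registry keeps only weak references so
# that finished tasks' accumulators can be garbage collected.
_registry = {}

def register(acc):
    _registry[acc.id] = weakref.ref(acc)

def lookup(acc_id):
    ref = _registry.get(acc_id)
    acc = ref() if ref is not None else None
    if acc is None:
        # Spark throws java.lang.IllegalAccessError at this point
        raise RuntimeError(
            f"Attempted to access garbage collected accumulator {acc_id}")
    return acc

acc = Accumulator(159454)
register(acc)
assert lookup(159454) is acc

del acc       # the job finishes; the driver drops the strong reference
gc.collect()  # GC runs before a late SparkListenerBus event is delivered
try:
    lookup(159454)
except RuntimeError as e:
    print(e)  # Attempted to access garbage collected accumulator 159454
```

This is why the failure is timing-dependent and flaky: it only appears when GC wins the race against the asynchronous listener bus.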
[jira] [Commented] (SPARK-20666) Flaky test - SparkListenerBus randomly failing java.lang.IllegalAccessError on Windows
[ https://issues.apache.org/jira/browse/SPARK-20666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16003116#comment-16003116 ] Marcelo Vanzin commented on SPARK-20666: I'm raising the severity because lots of PRs are failing because of these kinds of errors... > Flaky test - SparkListenerBus randomly failing java.lang.IllegalAccessError > on Windows > -- > > Key: SPARK-20666 > URL: https://issues.apache.org/jira/browse/SPARK-20666 > Project: Spark > Issue Type: Bug > Components: ML, Spark Core, SparkR >Affects Versions: 2.3.0 >Reporter: Felix Cheung > > seeing quite a bit of this on AppVeyor, aka Windows only, always only when > running ML tests, it seems > {code} > Exception in thread "SparkListenerBus" java.lang.IllegalAccessError: > Attempted to access garbage collected accumulator 159454 > at > org.apache.spark.util.AccumulatorContext$$anonfun$get$1.apply(AccumulatorV2.scala:265) > at > org.apache.spark.util.AccumulatorContext$$anonfun$get$1.apply(AccumulatorV2.scala:261) > at scala.Option.map(Option.scala:146) > at > org.apache.spark.util.AccumulatorContext$.get(AccumulatorV2.scala:261) > at org.apache.spark.util.AccumulatorV2.name(AccumulatorV2.scala:88) > at > org.apache.spark.sql.execution.metric.SQLMetric.toInfo(SQLMetrics.scala:67) > at > org.apache.spark.sql.execution.ui.SQLListener$$anonfun$onTaskEnd$1.apply(SQLListener.scala:216) > at > org.apache.spark.sql.execution.ui.SQLListener$$anonfun$onTaskEnd$1.apply(SQLListener.scala:216) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.AbstractTraversable.map(Traversable.scala:104) > at > 
org.apache.spark.sql.execution.ui.SQLListener.onTaskEnd(SQLListener.scala:216) > at > org.apache.spark.scheduler.SparkListenerBus$class.doPostEvent(SparkListenerBus.scala:45) > at > org.apache.spark.scheduler.LiveListenerBus.doPostEvent(LiveListenerBus.scala:36) > at > org.apache.spark.scheduler.LiveListenerBus.doPostEvent(LiveListenerBus.scala:36) > at > org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:63) > at > org.apache.spark.scheduler.LiveListenerBus.postToAll(LiveListenerBus.scala:36) > at > org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(LiveListenerBus.scala:94) > at > org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:79) > at > org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:79) > at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58) > at > org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(LiveListenerBus.scala:78) > at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1268) > at > org.apache.spark.scheduler.LiveListenerBus$$anon$1.run(LiveListenerBus.scala:77) > 1 > MLlib recommendation algorithms: Spark package found in SPARK_HOME: > C:\projects\spark\bin\.. 
> {code} > {code} > java.lang.IllegalStateException: SparkContext has been shutdown > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2015) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2044) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2063) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2088) > at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:936) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) > at org.apache.spark.rdd.RDD.withScope(RDD.scala:362) > at org.apache.spark.rdd.RDD.collect(RDD.scala:935) > at > org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:275) > at > org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:2923) > at > org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2474) > at > org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2474) > at org.apache.spark.sql.Dataset$$anonfun$57.apply(Dataset.scala:2907) > at >
[jira] [Updated] (SPARK-20666) Flaky test - SparkListenerBus randomly failing java.lang.IllegalAccessError on Windows
[ https://issues.apache.org/jira/browse/SPARK-20666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin updated SPARK-20666: --- Priority: Critical (was: Major) > Flaky test - SparkListenerBus randomly failing java.lang.IllegalAccessError > on Windows > -- > > Key: SPARK-20666 > URL: https://issues.apache.org/jira/browse/SPARK-20666 > Project: Spark > Issue Type: Bug > Components: ML, Spark Core, SparkR >Affects Versions: 2.3.0 >Reporter: Felix Cheung >Priority: Critical > > seeing quite a bit of this on AppVeyor, aka Windows only, always only when > running ML tests, it seems > {code} > Exception in thread "SparkListenerBus" java.lang.IllegalAccessError: > Attempted to access garbage collected accumulator 159454 > at > org.apache.spark.util.AccumulatorContext$$anonfun$get$1.apply(AccumulatorV2.scala:265) > at > org.apache.spark.util.AccumulatorContext$$anonfun$get$1.apply(AccumulatorV2.scala:261) > at scala.Option.map(Option.scala:146) > at > org.apache.spark.util.AccumulatorContext$.get(AccumulatorV2.scala:261) > at org.apache.spark.util.AccumulatorV2.name(AccumulatorV2.scala:88) > at > org.apache.spark.sql.execution.metric.SQLMetric.toInfo(SQLMetrics.scala:67) > at > org.apache.spark.sql.execution.ui.SQLListener$$anonfun$onTaskEnd$1.apply(SQLListener.scala:216) > at > org.apache.spark.sql.execution.ui.SQLListener$$anonfun$onTaskEnd$1.apply(SQLListener.scala:216) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.AbstractTraversable.map(Traversable.scala:104) > at > org.apache.spark.sql.execution.ui.SQLListener.onTaskEnd(SQLListener.scala:216) > at > 
org.apache.spark.scheduler.SparkListenerBus$class.doPostEvent(SparkListenerBus.scala:45) > at > org.apache.spark.scheduler.LiveListenerBus.doPostEvent(LiveListenerBus.scala:36) > at > org.apache.spark.scheduler.LiveListenerBus.doPostEvent(LiveListenerBus.scala:36) > at > org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:63) > at > org.apache.spark.scheduler.LiveListenerBus.postToAll(LiveListenerBus.scala:36) > at > org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(LiveListenerBus.scala:94) > at > org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:79) > at > org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:79) > at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58) > at > org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(LiveListenerBus.scala:78) > at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1268) > at > org.apache.spark.scheduler.LiveListenerBus$$anon$1.run(LiveListenerBus.scala:77) > 1 > MLlib recommendation algorithms: Spark package found in SPARK_HOME: > C:\projects\spark\bin\.. 
> {code} > {code} > java.lang.IllegalStateException: SparkContext has been shutdown > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2015) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2044) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2063) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2088) > at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:936) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) > at org.apache.spark.rdd.RDD.withScope(RDD.scala:362) > at org.apache.spark.rdd.RDD.collect(RDD.scala:935) > at > org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:275) > at > org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:2923) > at > org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2474) > at > org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2474) > at org.apache.spark.sql.Dataset$$anonfun$57.apply(Dataset.scala:2907) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65) > at
[jira] [Commented] (SPARK-10408) Autoencoder
[ https://issues.apache.org/jira/browse/SPARK-10408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16003110#comment-16003110 ] Alexander Ulanov commented on SPARK-10408: -- Autoencoder is implemented in the referenced pull request. I will be glad to follow up on the code review if anyone can do it. > Autoencoder > --- > > Key: SPARK-10408 > URL: https://issues.apache.org/jira/browse/SPARK-10408 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 1.5.0 >Reporter: Alexander Ulanov >Assignee: Alexander Ulanov > > Goal: Implement various types of autoencoders > Requirements: > 1)Basic (deep) autoencoder that supports different types of inputs: binary, > real in [0..1]. real in [-inf, +inf] > 2)Sparse autoencoder i.e. L1 regularization. It should be added as a feature > to the MLP and then used here > 3)Denoising autoencoder > 4)Stacked autoencoder for pre-training of deep networks. It should support > arbitrary network layers > References: > 1. Vincent, Pascal, et al. "Extracting and composing robust features with > denoising autoencoders." Proceedings of the 25th international conference on > Machine learning. ACM, 2008. > http://www.iro.umontreal.ca/~vincentp/Publications/denoising_autoencoders_tr1316.pdf > > 2. > http://machinelearning.wustl.edu/mlpapers/paper_files/ICML2011Rifai_455.pdf, > 3. Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., and Manzagol, P.-A. > (2010). Stacked denoising autoencoders: Learning useful representations in a > deep network with a local denoising criterion. Journal of Machine Learning > Research, 11(3371–3408). > http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.297.3484=rep1=pdf > 4, 5, 6. Bengio, Yoshua, et al. "Greedy layer-wise training of deep > networks." Advances in neural information processing systems 19 (2007): 153. 
> http://www.iro.umontreal.ca/~lisa/pointeurs/dbn_supervised_tr1282.pdf
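For readers unfamiliar with the model being proposed: a basic autoencoder is just an MLP trained to reproduce its input through a narrower hidden layer, which is why requirement 2 builds on the existing MLP. A dependency-free toy sketch (2 inputs -> 1 hidden unit -> 2 outputs, sigmoid units, squared error, plain SGD; all sizes and rates are arbitrary choices for illustration, not the proposed implementation):

```python
import math
import random

random.seed(0)

def sig(x):
    return 1.0 / (1.0 + math.exp(-x))

# Toy autoencoder: encode 2 inputs into 1 hidden unit, then reconstruct.
data = [[random.random(), random.random()] for _ in range(16)]
w1 = [random.gauss(0, 0.1) for _ in range(2)]; b1 = 0.0         # encoder
w2 = [random.gauss(0, 0.1) for _ in range(2)]; b2 = [0.0, 0.0]  # decoder
lr = 0.5

def loss():
    total = 0.0
    for x in data:
        h = sig(w1[0] * x[0] + w1[1] * x[1] + b1)
        y = [sig(w2[j] * h + b2[j]) for j in range(2)]
        total += sum((y[j] - x[j]) ** 2 for j in range(2))
    return total / len(data)

loss_before = loss()
for _ in range(300):  # plain SGD, one pass per sample
    for x in data:
        h = sig(w1[0] * x[0] + w1[1] * x[1] + b1)
        y = [sig(w2[j] * h + b2[j]) for j in range(2)]
        dy = [(y[j] - x[j]) * y[j] * (1 - y[j]) for j in range(2)]
        dh = sum(dy[j] * w2[j] for j in range(2)) * h * (1 - h)
        for j in range(2):
            w2[j] -= lr * dy[j] * h
            b2[j] -= lr * dy[j]
        for i in range(2):
            w1[i] -= lr * dh * x[i]
        b1 -= lr * dh
loss_after = loss()
print(loss_before, loss_after)  # reconstruction error should drop
```

The sparse, denoising, and stacked variants in the requirements list are refinements of this core loop: an L1 penalty on the hidden activations, corrupted inputs with clean targets, and layer-wise pre-training, respectively.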
[jira] [Commented] (SPARK-14584) Improve recognition of non-nullability in Dataset transformations
[ https://issues.apache.org/jira/browse/SPARK-14584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16003082#comment-16003082 ] Josh Rosen commented on SPARK-14584: [~hyukjin.kwon], yeah, this does appear to be fixed. My only question is whether we have a test to prevent this behavior from regressing. Do you remember where it was fixed / changed? > Improve recognition of non-nullability in Dataset transformations > - > > Key: SPARK-14584 > URL: https://issues.apache.org/jira/browse/SPARK-14584 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Josh Rosen > > There are many cases where we can statically know that a field will never be > null. For instance, a field in a case class with a primitive type will never > return null. However, there are currently several cases in the Dataset API > where we do not properly recognize this non-nullability. For instance: > {code} > case class MyCaseClass(foo: Int) > sc.parallelize(Seq(0)).toDS.map(MyCaseClass).printSchema > {code} > claims that the {{foo}} field is nullable even though this is impossible. > I believe that this is due to the way that we reason about nullability when > constructing serializer expressions in ExpressionEncoders. 
The following > assertion will currently fail if added to ExpressionEncoder: > {code} > require(schema.size == serializer.size) > schema.fields.zip(serializer).foreach { case (field, fieldSerializer) => > require(field.dataType == fieldSerializer.dataType, s"Field > ${field.name}'s data type is " + > s"${field.dataType} in the schema but ${fieldSerializer.dataType} in > its serializer") > require(field.nullable == fieldSerializer.nullable, s"Field > ${field.name}'s nullability is " + > s"${field.nullable} in the schema but ${fieldSerializer.nullable} in > its serializer") > } > {code} > Most often, the schema claims that a field is non-nullable while the encoder > allows for nullability, but occasionally we see a mismatch in the datatypes > due to disagreements over the nullability of nested structs' fields (or > fields of structs in arrays). > I think the problem is that when we're reasoning about nullability in a > struct's schema we consider its fields' nullability to be independent of the > nullability of the struct itself, whereas in the serializer expressions we > are considering those field extraction expressions to be nullable if the > input objects themselves can be nullable. > I'm not sure what's the simplest way to fix this. One proposal would be to > leave the serializers unchanged and have ObjectOperator derive its output > attributes from an explicitly-passed schema rather than using the > serializers' attributes. However, I worry that this might introduce bugs in > case the serializer and schema disagree.
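The schema-vs-serializer mismatch has a rough analogue in any typed-record system: a field's nullability should be derived from the field's own type, independently of whether the enclosing record can be null. A hedged Python sketch of that principle (the `nullable` helper is invented for illustration and is not Spark code):

```python
import typing
from dataclasses import dataclass

@dataclass
class MyCaseClass:
    foo: int                   # primitive-like: can never be None
    bar: typing.Optional[str]  # genuinely nullable

def nullable(tp) -> bool:
    """A field is nullable only if its own type admits None --
    not because the enclosing record might itself be null."""
    return (typing.get_origin(tp) is typing.Union
            and type(None) in typing.get_args(tp))

hints = typing.get_type_hints(MyCaseClass)
print({name: nullable(tp) for name, tp in hints.items()})
# {'foo': False, 'bar': True}
```

Spark's serializer expressions conflate the two levels: they mark a field extraction nullable whenever the input object can be null, which is why the schema (correctly reporting `foo` as non-nullable) and the serializer disagree.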
[jira] [Commented] (SPARK-20666) Flaky test - SparkListenerBus randomly failing java.lang.IllegalAccessError on Windows
[ https://issues.apache.org/jira/browse/SPARK-20666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16003055#comment-16003055 ] Marcelo Vanzin commented on SPARK-20666: [~cloud_fan] do you think your fix for SPARK-12837 could have caused this? I've only noticed this error recently, and that's the only recent change in accumulator code... A quick look at it didn't flag anything, though. > Flaky test - SparkListenerBus randomly failing java.lang.IllegalAccessError > on Windows > -- > > Key: SPARK-20666 > URL: https://issues.apache.org/jira/browse/SPARK-20666 > Project: Spark > Issue Type: Bug > Components: ML, Spark Core, SparkR >Affects Versions: 2.3.0 >Reporter: Felix Cheung > > seeing quite a bit of this on AppVeyor, aka Windows only, always only when > running ML tests, it seems > {code} > Exception in thread "SparkListenerBus" java.lang.IllegalAccessError: > Attempted to access garbage collected accumulator 159454 > at > org.apache.spark.util.AccumulatorContext$$anonfun$get$1.apply(AccumulatorV2.scala:265) > at > org.apache.spark.util.AccumulatorContext$$anonfun$get$1.apply(AccumulatorV2.scala:261) > at scala.Option.map(Option.scala:146) > at > org.apache.spark.util.AccumulatorContext$.get(AccumulatorV2.scala:261) > at org.apache.spark.util.AccumulatorV2.name(AccumulatorV2.scala:88) > at > org.apache.spark.sql.execution.metric.SQLMetric.toInfo(SQLMetrics.scala:67) > at > org.apache.spark.sql.execution.ui.SQLListener$$anonfun$onTaskEnd$1.apply(SQLListener.scala:216) > at > org.apache.spark.sql.execution.ui.SQLListener$$anonfun$onTaskEnd$1.apply(SQLListener.scala:216) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at 
scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.AbstractTraversable.map(Traversable.scala:104) > at > org.apache.spark.sql.execution.ui.SQLListener.onTaskEnd(SQLListener.scala:216) > at > org.apache.spark.scheduler.SparkListenerBus$class.doPostEvent(SparkListenerBus.scala:45) > at > org.apache.spark.scheduler.LiveListenerBus.doPostEvent(LiveListenerBus.scala:36) > at > org.apache.spark.scheduler.LiveListenerBus.doPostEvent(LiveListenerBus.scala:36) > at > org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:63) > at > org.apache.spark.scheduler.LiveListenerBus.postToAll(LiveListenerBus.scala:36) > at > org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(LiveListenerBus.scala:94) > at > org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:79) > at > org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:79) > at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58) > at > org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(LiveListenerBus.scala:78) > at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1268) > at > org.apache.spark.scheduler.LiveListenerBus$$anon$1.run(LiveListenerBus.scala:77) > 1 > MLlib recommendation algorithms: Spark package found in SPARK_HOME: > C:\projects\spark\bin\.. 
> {code} > {code} > java.lang.IllegalStateException: SparkContext has been shutdown > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2015) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2044) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2063) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2088) > at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:936) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) > at org.apache.spark.rdd.RDD.withScope(RDD.scala:362) > at org.apache.spark.rdd.RDD.collect(RDD.scala:935) > at > org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:275) > at > org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:2923) > at > org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2474) > at > org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2474) >
[jira] [Comment Edited] (SPARK-20666) Flaky test - SparkListenerBus randomly failing java.lang.IllegalAccessError on Windows
[ https://issues.apache.org/jira/browse/SPARK-20666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16002962#comment-16002962 ] Marcelo Vanzin edited comment on SPARK-20666 at 5/9/17 4:52 PM: This doesn't fail just on Windows. It's been failing in PR builders also (e.g. https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76605/testReport). In a private branch of mine I ran into this problem and have this workaround: https://github.com/vanzin/spark/pull/22/files#diff-edd374dbb96bc16363b65dab1e554793R114 But that's not cleanly "backportable" at the moment, and I don't know if it's the right solution either. (BTW my fix is in SQL, which also suffers from this, so it would not help these mllib tests...) was (Author: vanzin): This doesn't fail just on Windows. It's been failing in PR builders also (e.g. https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76605/testReport). In a private branch of mine I ran into this problem and have this workaround: https://github.com/vanzin/spark/pull/22/files#diff-edd374dbb96bc16363b65dab1e554793R114 But that's not cleanly "backportable" at the moment, and I don't know if it's the right solution either. (BTW my fix is in SQL, which also suffers from this, so it would help these mllib tests...) 
> Flaky test - SparkListenerBus randomly failing java.lang.IllegalAccessError > on Windows > -- > > Key: SPARK-20666 > URL: https://issues.apache.org/jira/browse/SPARK-20666 > Project: Spark > Issue Type: Bug > Components: ML, Spark Core, SparkR >Affects Versions: 2.3.0 >Reporter: Felix Cheung > > seeing quite a bit of this on AppVeyor, aka Windows only, always only when > running ML tests, it seems > {code} > Exception in thread "SparkListenerBus" java.lang.IllegalAccessError: > Attempted to access garbage collected accumulator 159454 > at > org.apache.spark.util.AccumulatorContext$$anonfun$get$1.apply(AccumulatorV2.scala:265) > at > org.apache.spark.util.AccumulatorContext$$anonfun$get$1.apply(AccumulatorV2.scala:261) > at scala.Option.map(Option.scala:146) > at > org.apache.spark.util.AccumulatorContext$.get(AccumulatorV2.scala:261) > at org.apache.spark.util.AccumulatorV2.name(AccumulatorV2.scala:88) > at > org.apache.spark.sql.execution.metric.SQLMetric.toInfo(SQLMetrics.scala:67) > at > org.apache.spark.sql.execution.ui.SQLListener$$anonfun$onTaskEnd$1.apply(SQLListener.scala:216) > at > org.apache.spark.sql.execution.ui.SQLListener$$anonfun$onTaskEnd$1.apply(SQLListener.scala:216) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.AbstractTraversable.map(Traversable.scala:104) > at > org.apache.spark.sql.execution.ui.SQLListener.onTaskEnd(SQLListener.scala:216) > at > org.apache.spark.scheduler.SparkListenerBus$class.doPostEvent(SparkListenerBus.scala:45) > at > org.apache.spark.scheduler.LiveListenerBus.doPostEvent(LiveListenerBus.scala:36) > at > 
org.apache.spark.scheduler.LiveListenerBus.doPostEvent(LiveListenerBus.scala:36) > at > org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:63) > at > org.apache.spark.scheduler.LiveListenerBus.postToAll(LiveListenerBus.scala:36) > at > org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(LiveListenerBus.scala:94) > at > org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:79) > at > org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:79) > at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58) > at > org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(LiveListenerBus.scala:78) > at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1268) > at > org.apache.spark.scheduler.LiveListenerBus$$anon$1.run(LiveListenerBus.scala:77) > 1 > MLlib recommendation algorithms: Spark package found in SPARK_HOME: > C:\projects\spark\bin\.. > {code} > {code} > java.lang.IllegalStateException: SparkContext has been shutdown > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2015) > at
[jira] [Resolved] (SPARK-20674) Support registering UserDefinedFunction as named UDF
[ https://issues.apache.org/jira/browse/SPARK-20674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-20674. - Resolution: Fixed Fix Version/s: 2.2.0 > Support registering UserDefinedFunction as named UDF > > > Key: SPARK-20674 > URL: https://issues.apache.org/jira/browse/SPARK-20674 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.1 >Reporter: Reynold Xin >Assignee: Reynold Xin > Fix For: 2.2.0 > > > For some reason we don't have an API to register UserDefinedFunction as named > UDF. It is a no-brainer to add one, in addition to the existing register > functions we have.
[jira] [Commented] (SPARK-20666) Flaky test - SparkListenerBus randomly failing java.lang.IllegalAccessError on Windows
[ https://issues.apache.org/jira/browse/SPARK-20666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16002962#comment-16002962 ] Marcelo Vanzin commented on SPARK-20666: This doesn't fail just on Windows. It's been failing in PR builders also (e.g. https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76605/testReport). In a private branch of mine I ran into this problem and have this workaround: https://github.com/vanzin/spark/pull/22/files#diff-edd374dbb96bc16363b65dab1e554793R114 But that's not cleanly "backportable" at the moment, and I don't know if it's the right solution either. > Flaky test - SparkListenerBus randomly failing java.lang.IllegalAccessError > on Windows > -- > > Key: SPARK-20666 > URL: https://issues.apache.org/jira/browse/SPARK-20666 > Project: Spark > Issue Type: Bug > Components: ML, Spark Core, SparkR >Affects Versions: 2.3.0 >Reporter: Felix Cheung > > seeing quite a bit of this on AppVeyor, aka Windows only, always only when > running ML tests, it seems > {code} > Exception in thread "SparkListenerBus" java.lang.IllegalAccessError: > Attempted to access garbage collected accumulator 159454 > at > org.apache.spark.util.AccumulatorContext$$anonfun$get$1.apply(AccumulatorV2.scala:265) > at > org.apache.spark.util.AccumulatorContext$$anonfun$get$1.apply(AccumulatorV2.scala:261) > at scala.Option.map(Option.scala:146) > at > org.apache.spark.util.AccumulatorContext$.get(AccumulatorV2.scala:261) > at org.apache.spark.util.AccumulatorV2.name(AccumulatorV2.scala:88) > at > org.apache.spark.sql.execution.metric.SQLMetric.toInfo(SQLMetrics.scala:67) > at > org.apache.spark.sql.execution.ui.SQLListener$$anonfun$onTaskEnd$1.apply(SQLListener.scala:216) > at > org.apache.spark.sql.execution.ui.SQLListener$$anonfun$onTaskEnd$1.apply(SQLListener.scala:216) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.AbstractTraversable.map(Traversable.scala:104) > at > org.apache.spark.sql.execution.ui.SQLListener.onTaskEnd(SQLListener.scala:216) > at > org.apache.spark.scheduler.SparkListenerBus$class.doPostEvent(SparkListenerBus.scala:45) > at > org.apache.spark.scheduler.LiveListenerBus.doPostEvent(LiveListenerBus.scala:36) > at > org.apache.spark.scheduler.LiveListenerBus.doPostEvent(LiveListenerBus.scala:36) > at > org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:63) > at > org.apache.spark.scheduler.LiveListenerBus.postToAll(LiveListenerBus.scala:36) > at > org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(LiveListenerBus.scala:94) > at > org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:79) > at > org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:79) > at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58) > at > org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(LiveListenerBus.scala:78) > at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1268) > at > org.apache.spark.scheduler.LiveListenerBus$$anon$1.run(LiveListenerBus.scala:77) > 1 > MLlib recommendation algorithms: Spark package found in SPARK_HOME: > C:\projects\spark\bin\.. 
> {code} > {code} > java.lang.IllegalStateException: SparkContext has been shutdown > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2015) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2044) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2063) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2088) > at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:936) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) > at org.apache.spark.rdd.RDD.withScope(RDD.scala:362) > at org.apache.spark.rdd.RDD.collect(RDD.scala:935) > at > org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:275) > at >
[jira] [Comment Edited] (SPARK-20666) Flaky test - SparkListenerBus randomly failing java.lang.IllegalAccessError on Windows
[ https://issues.apache.org/jira/browse/SPARK-20666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16002962#comment-16002962 ] Marcelo Vanzin edited comment on SPARK-20666 at 5/9/17 4:18 PM: This doesn't fail just on Windows. It's been failing in PR builders also (e.g. https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76605/testReport). In a private branch of mine I ran into this problem and have this workaround: https://github.com/vanzin/spark/pull/22/files#diff-edd374dbb96bc16363b65dab1e554793R114 But that's not cleanly "backportable" at the moment, and I don't know if it's the right solution either. (BTW my fix is in SQL, which also suffers from this, so it would help these mllib tests...) was (Author: vanzin): This doesn't fail just on Windows. It's been failing in PR builders also (e.g. https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76605/testReport). In a private branch of mine I ran into this problem and have this workaround: https://github.com/vanzin/spark/pull/22/files#diff-edd374dbb96bc16363b65dab1e554793R114 But that's not cleanly "backportable" at the moment, and I don't know if it's the right solution either. 
> Flaky test - SparkListenerBus randomly failing java.lang.IllegalAccessError > on Windows > -- > > Key: SPARK-20666 > URL: https://issues.apache.org/jira/browse/SPARK-20666 > Project: Spark > Issue Type: Bug > Components: ML, Spark Core, SparkR >Affects Versions: 2.3.0 >Reporter: Felix Cheung > > seeing quite a bit of this on AppVeyor, aka Windows only, always only when > running ML tests, it seems > {code} > Exception in thread "SparkListenerBus" java.lang.IllegalAccessError: > Attempted to access garbage collected accumulator 159454 > at > org.apache.spark.util.AccumulatorContext$$anonfun$get$1.apply(AccumulatorV2.scala:265) > at > org.apache.spark.util.AccumulatorContext$$anonfun$get$1.apply(AccumulatorV2.scala:261) > at scala.Option.map(Option.scala:146) > at > org.apache.spark.util.AccumulatorContext$.get(AccumulatorV2.scala:261) > at org.apache.spark.util.AccumulatorV2.name(AccumulatorV2.scala:88) > at > org.apache.spark.sql.execution.metric.SQLMetric.toInfo(SQLMetrics.scala:67) > at > org.apache.spark.sql.execution.ui.SQLListener$$anonfun$onTaskEnd$1.apply(SQLListener.scala:216) > at > org.apache.spark.sql.execution.ui.SQLListener$$anonfun$onTaskEnd$1.apply(SQLListener.scala:216) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.AbstractTraversable.map(Traversable.scala:104) > at > org.apache.spark.sql.execution.ui.SQLListener.onTaskEnd(SQLListener.scala:216) > at > org.apache.spark.scheduler.SparkListenerBus$class.doPostEvent(SparkListenerBus.scala:45) > at > org.apache.spark.scheduler.LiveListenerBus.doPostEvent(LiveListenerBus.scala:36) > at > 
org.apache.spark.scheduler.LiveListenerBus.doPostEvent(LiveListenerBus.scala:36) > at > org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:63) > at > org.apache.spark.scheduler.LiveListenerBus.postToAll(LiveListenerBus.scala:36) > at > org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(LiveListenerBus.scala:94) > at > org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:79) > at > org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:79) > at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58) > at > org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(LiveListenerBus.scala:78) > at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1268) > at > org.apache.spark.scheduler.LiveListenerBus$$anon$1.run(LiveListenerBus.scala:77) > 1 > MLlib recommendation algorithms: Spark package found in SPARK_HOME: > C:\projects\spark\bin\.. > {code} > {code} > java.lang.IllegalStateException: SparkContext has been shutdown > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2015) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2044) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2063) > at
[jira] [Resolved] (SPARK-20548) Flaky Test: ReplSuite.newProductSeqEncoder with REPL defined class
[ https://issues.apache.org/jira/browse/SPARK-20548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-20548. - Resolution: Fixed Fix Version/s: 2.3.0 2.2.1 Issue resolved by pull request 17844 [https://github.com/apache/spark/pull/17844] > Flaky Test: ReplSuite.newProductSeqEncoder with REPL defined class > --- > > Key: SPARK-20548 > URL: https://issues.apache.org/jira/browse/SPARK-20548 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Sameer Agarwal >Assignee: Sameer Agarwal > Fix For: 2.2.1, 2.3.0 > > > {{newProductSeqEncoder with REPL defined class}} in {{ReplSuite}} has been > failing non-deterministically: https://spark-tests.appspot.com/failed-tests > over the last few days. > https://spark.test.databricks.com/job/spark-master-test-sbt-hadoop-2.7/176/testReport/junit/org.apache.spark.repl/ReplSuite/newProductSeqEncoder_with_REPL_defined_class/history/
[jira] [Commented] (SPARK-20608) Standby namenodes should be allowed to be included in yarn.spark.access.namenodes to support HDFS HA
[ https://issues.apache.org/jira/browse/SPARK-20608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16002956#comment-16002956 ] Marcelo Vanzin commented on SPARK-20608: bq. I think it is unreasonable that the Spark application fails when one of yarn.spark.access.namenodes cannot be accessed. Yes, that's unreasonable. But they cannot be accessed because *you're providing the wrong configuration*. With HA, you should use a nameservice address, not a namenode address. I can't think of any way to make that clearer. > Standby namenodes should be allowed to be included in > yarn.spark.access.namenodes to support HDFS HA > - > > Key: SPARK-20608 > URL: https://issues.apache.org/jira/browse/SPARK-20608 > Project: Spark > Issue Type: Improvement > Components: Spark Submit, YARN >Affects Versions: 2.0.1, 2.1.0 >Reporter: Yuechen Chen >Priority: Minor > Original Estimate: 672h > Remaining Estimate: 672h > > If a Spark application needs to access remote namenodes, > yarn.spark.access.namenodes should only be configured in spark-submit > scripts, and the Spark client (on YARN) would fetch HDFS credentials periodically. > If a Hadoop cluster is configured for HA, there would be one active namenode > and at least one standby namenode. > However, if yarn.spark.access.namenodes includes both active and standby > namenodes, the Spark application will fail because the standby > namenode cannot be accessed by Spark, due to org.apache.hadoop.ipc.StandbyException. > I think it won't cause any bad effect to configure standby namenodes in > yarn.spark.access.namenodes, and my Spark application would be able to sustain > the failover of a Hadoop namenode. > HA Examples: > Spark-submit script: > yarn.spark.access.namenodes=hdfs://namenode01,hdfs://namenode02 > Spark Application Codes: > dataframe.write.parquet(getActiveNameNode(...) 
+ hdfsPath)
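A sketch of the configuration Vanzin is describing: with HA, the access list should name the logical nameservice defined in hdfs-site.xml rather than either physical namenode. The nameservice name `mycluster` and the property spelling below (`spark.yarn.access.namenodes`, as documented for Spark 2.x on YARN) are illustrative assumptions, not taken from the report:

```properties
# Hypothetical HA setup: hdfs-site.xml defines nameservice "mycluster"
# backed by namenode01 (active) and namenode02 (standby).
#
# Fragile: listing physical namenodes means the standby one raises
# org.apache.hadoop.ipc.StandbyException when credentials are fetched:
#   spark.yarn.access.namenodes=hdfs://namenode01:8020,hdfs://namenode02:8020
#
# Robust: name the logical nameservice; the HDFS client resolves whichever
# namenode is currently active and survives failover:
spark.yarn.access.namenodes=hdfs://mycluster
```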
[jira] [Created] (SPARK-20681) DataFrame.drop doesn't take effect, nor does it error
Xi Wang created SPARK-20681: --- Summary: DataFrame.drop doesn't take effect, nor does it error Key: SPARK-20681 URL: https://issues.apache.org/jira/browse/SPARK-20681 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.1.0 Reporter: Xi Wang Priority: Critical I am running the following code trying to drop nested columns, but it doesn't work, and it doesn't return an error either. *I read the DF from this json:* {'parent':{'child':{'grandchild':{'val':'1',' val_to_be_deleted':'0' scala> spark.read.format("json").load("c:/tmp/spark_issue.json") res0: org.apache.spark.sql.DataFrame = [father: struct<child: struct<grandchild: struct<val: bigint, val_to_be_deleted: bigint>>>] *read the df:* scala> res0.printSchema root |-- parent: struct (nullable = true) ||-- child: struct (nullable = true) |||-- grandchild: struct (nullable = true) ||||-- val: long (nullable = true) ||||-- val_to_be_deleted: long (nullable = true) *drop the column (I tried different ways, "quote", `back-tick`, col(object) ...) the column remains anyway:* scala> res0.drop(col("father.child.grandchild.val_to_be_deleted")).printSchema root |-- father: struct (nullable = true) ||-- child: struct (nullable = true) |||-- grandchild: struct (nullable = true) ||||-- val: long (nullable = true) ||||-- val_to_be_deleted: long (nullable = true) scala> res0.drop("father.child.grandchild.val_to_be_deleted").printSchema root |-- father: struct (nullable = true) ||-- child: struct (nullable = true) |||-- grandchild: struct (nullable = true) ||||-- val: long (nullable = true) ||||-- val_to_be_deleted: long (nullable = true) Any help is appreciated.
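A possible workaround (not from the report above): {{drop}} resolves only top-level column names, so a nested field has to be removed by rebuilding the enclosing struct with only the fields to keep. A sketch assuming a running SparkSession, using the {{parent}} naming from the JSON; {{df}} stands in for {{res0}}:

```scala
import org.apache.spark.sql.functions.{col, struct}

// drop() matches top-level columns only, so a dotted nested name silently
// matches nothing. Rebuild each level of the struct, listing every field
// except the one to remove:
val cleaned = df.withColumn(
  "parent",
  struct(
    struct(
      struct(
        col("parent.child.grandchild.val").as("val")
        // val_to_be_deleted intentionally omitted
      ).as("grandchild")
    ).as("child")
  )
)
cleaned.printSchema()  // val_to_be_deleted no longer appears
```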
[jira] [Commented] (SPARK-13747) Concurrent execution in SQL doesn't work with Scala ForkJoinPool
[ https://issues.apache.org/jira/browse/SPARK-13747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16002852#comment-16002852 ] Dmitry Naumenko commented on SPARK-13747: - [~revolucion09] It's a bit of an off-topic discussion. My two cents: from my understanding, the fork-join pool in Akka helps to keep all processors busy, so you can achieve high message throughput (as long as you don't use blocking operations). But in a Spark driver program it's not critical, because most of the time it will be waiting for the worker nodes anyway. > Concurrent execution in SQL doesn't work with Scala ForkJoinPool > > > Key: SPARK-13747 > URL: https://issues.apache.org/jira/browse/SPARK-13747 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.0.1 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > > Running the following code may fail > {code} > (1 to 100).par.foreach { _ => > println(sc.parallelize(1 to 5).map { i => (i, i) }.toDF("a", "b").count()) > } > java.lang.IllegalArgumentException: spark.sql.execution.id is already set > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:87) > > at > org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:1904) > at org.apache.spark.sql.DataFrame.collect(DataFrame.scala:1385) > {code} > This is because SparkContext.runJob can be suspended when using a > ForkJoinPool (e.g., scala.concurrent.ExecutionContext.Implicits.global) as it > calls Await.ready (introduced by https://github.com/apache/spark/pull/9264). > So when SparkContext.runJob is suspended, ForkJoinPool will run another task > in the same thread; however, the local properties have been polluted.
[jira] [Commented] (SPARK-13747) Concurrent execution in SQL doesn't work with Scala ForkJoinPool
[ https://issues.apache.org/jira/browse/SPARK-13747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16002814#comment-16002814 ] Saif Addin commented on SPARK-13747: [~dnaumenko] Nonetheless, if I am not mistaken, there is evidence that fork-join pools provide a significant performance boost in scalable environments, which is why Akka uses them by default. Fixed or cached thread pools are considered dangerous for production environments. > Concurrent execution in SQL doesn't work with Scala ForkJoinPool > > > Key: SPARK-13747 > URL: https://issues.apache.org/jira/browse/SPARK-13747 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.0.1 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > > Running the following code may fail > {code} > (1 to 100).par.foreach { _ => > println(sc.parallelize(1 to 5).map { i => (i, i) }.toDF("a", "b").count()) > } > java.lang.IllegalArgumentException: spark.sql.execution.id is already set > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:87) > > at > org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:1904) > at org.apache.spark.sql.DataFrame.collect(DataFrame.scala:1385) > {code} > This is because SparkContext.runJob can be suspended when using a > ForkJoinPool (e.g., scala.concurrent.ExecutionContext.Implicits.global) as it > calls Await.ready (introduced by https://github.com/apache/spark/pull/9264). > So when SparkContext.runJob is suspended, ForkJoinPool will run another task > in the same thread; however, the local properties have been polluted.
[jira] [Resolved] (SPARK-20355) Display Spark version on history page
[ https://issues.apache.org/jira/browse/SPARK-20355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves resolved SPARK-20355. --- Resolution: Fixed Assignee: Sanket Reddy Fix Version/s: 2.3.0 > Display Spark version on history page > - > > Key: SPARK-20355 > URL: https://issues.apache.org/jira/browse/SPARK-20355 > Project: Spark > Issue Type: Improvement > Components: Spark Core, Web UI >Affects Versions: 2.1.0 >Reporter: Sanket Reddy >Assignee: Sanket Reddy >Priority: Minor > Fix For: 2.3.0 > > > The Spark version for a specific application is not displayed on the history page > now. It would be nice to show the Spark version on the UI when we click on > the specific application. > Currently there seems to be a way, as SparkListenerLogStart records the > application version. So, it should be trivial to listen to this event and > provision this change on the UI. > {"Event":"SparkListenerLogStart","Spark > Version":"1.6.2.0_2.7.2.7.1604210306_161643"}
[jira] [Updated] (SPARK-20489) Different results in local mode and yarn mode when working with dates (silent corruption due to system timezone setting)
[ https://issues.apache.org/jira/browse/SPARK-20489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rick Moritz updated SPARK-20489: Summary: Different results in local mode and yarn mode when working with dates (silent corruption due to system timezone setting) (was: Different results in local mode and yarn mode when working with dates (race condition with SimpleDateFormat?)) > Different results in local mode and yarn mode when working with dates (silent > corruption due to system timezone setting) > > > Key: SPARK-20489 > URL: https://issues.apache.org/jira/browse/SPARK-20489 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.0.1, 2.0.2 > Environment: yarn-client mode in Zeppelin, Cloudera > Spark2-distribution >Reporter: Rick Moritz >Priority: Critical > > Running the following code (in Zeppelin, or spark-shell), I get different > results, depending on whether I am using local[*] -mode or yarn-client mode: > {code:title=test case|borderStyle=solid} > import org.apache.spark.sql.Row > import org.apache.spark.sql.types._ > import spark.implicits._ > val counter = 1 to 2 > val size = 1 to 3 > val sampleText = spark.createDataFrame( > sc.parallelize(size) > .map(Row(_)), > StructType(Array(StructField("id", IntegerType, nullable=false)) > ) > ) > .withColumn("loadDTS",lit("2017-04-25T10:45:02.2")) > > val rddList = counter.map( > count => sampleText > .withColumn("loadDTS2", > date_format(date_add(col("loadDTS"),count),"yyyy-MM-dd'T'HH:mm:ss.SSS")) > .drop(col("loadDTS")) > .withColumnRenamed("loadDTS2","loadDTS") > .coalesce(4) > .rdd > ) > val resultText = spark.createDataFrame( > spark.sparkContext.union(rddList), > sampleText.schema > ) > val testGrouped = resultText.groupBy("id") > val timestamps = testGrouped.agg( > max(unix_timestamp($"loadDTS", "yyyy-MM-dd'T'HH:mm:ss.SSS")) as > "timestamp" > ) > val loadDateResult = resultText.join(timestamps, "id") > val filteredresult = loadDateResult.filter($"timestamp" === > 
unix_timestamp($"loadDTS", "yyyy-MM-dd'T'HH:mm:ss.SSS")) > filteredresult.count > {code} > The expected result, *3*, is what I obtain in local mode, but as soon as I run > fully distributed, I get *0*. If I increase size to {{1 to 32000}}, I do get > some results (depending on the size of counter) - none of which makes any > sense. > Up to the application of the last filter, at first glance everything looks > okay, but then something goes wrong. Potentially this is due to lingering > re-use of SimpleDateFormats, but I can't get it to happen in a > non-distributed mode. The generated execution plan is the same in each case, > as expected.
[jira] [Commented] (SPARK-20489) Different results in local mode and yarn mode when working with dates (race condition with SimpleDateFormat?)
[ https://issues.apache.org/jira/browse/SPARK-20489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16002762#comment-16002762 ] Rick Moritz commented on SPARK-20489: - Okay, I had a look at timezones, and it appears my driver was in a different timezone than my executors. Having put all machines into UTC, I now get the expected result. It helped when I realized that unix timestamps aren't always in UTC, for some reason (thanks Jacek, https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-sql-Expression-UnixTimestamp.html). I wonder what could be done to avoid these quite bizarre error situations. Especially in big data, where in theory clusters could reasonably span timezones, and in particular drivers (in client mode, for example) could well be "outside" the actual cluster, relying on the local timezone settings appears to me to be quite fragile. As an initial measure, I wonder how much effort it would be to make sure that, when dates are used, any discrepancy in timezone settings results in an exception being thrown. I would consider that a reasonable assertion, with not too much overhead. > Different results in local mode and yarn mode when working with dates (race > condition with SimpleDateFormat?) 
> - > > Key: SPARK-20489 > URL: https://issues.apache.org/jira/browse/SPARK-20489 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.0.1, 2.0.2 > Environment: yarn-client mode in Zeppelin, Cloudera > Spark2-distribution >Reporter: Rick Moritz >Priority: Critical > > Running the following code (in Zeppelin, or spark-shell), I get different > results, depending on whether I am using local[*] -mode or yarn-client mode: > {code:title=test case|borderStyle=solid} > import org.apache.spark.sql.Row > import org.apache.spark.sql.types._ > import spark.implicits._ > val counter = 1 to 2 > val size = 1 to 3 > val sampleText = spark.createDataFrame( > sc.parallelize(size) > .map(Row(_)), > StructType(Array(StructField("id", IntegerType, nullable=false)) > ) > ) > .withColumn("loadDTS",lit("2017-04-25T10:45:02.2")) > > val rddList = counter.map( > count => sampleText > .withColumn("loadDTS2", > date_format(date_add(col("loadDTS"),count),"yyyy-MM-dd'T'HH:mm:ss.SSS")) > .drop(col("loadDTS")) > .withColumnRenamed("loadDTS2","loadDTS") > .coalesce(4) > .rdd > ) > val resultText = spark.createDataFrame( > spark.sparkContext.union(rddList), > sampleText.schema > ) > val testGrouped = resultText.groupBy("id") > val timestamps = testGrouped.agg( > max(unix_timestamp($"loadDTS", "yyyy-MM-dd'T'HH:mm:ss.SSS")) as > "timestamp" > ) > val loadDateResult = resultText.join(timestamps, "id") > val filteredresult = loadDateResult.filter($"timestamp" === > unix_timestamp($"loadDTS", "yyyy-MM-dd'T'HH:mm:ss.SSS")) > filteredresult.count > {code} > The expected result, *3*, is what I obtain in local mode, but as soon as I run > fully distributed, I get *0*. If I increase size to {{1 to 32000}}, I do get > some results (depending on the size of counter) - none of which makes any > sense. > Up to the application of the last filter, at first glance everything looks > okay, but then something goes wrong. 
Potentially this is due to lingering > re-use of SimpleDateFormats, but I can't get it to happen in a > non-distributed mode. The generated execution plan is the same in each case, > as expected.
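Rick's diagnosis can be reproduced without Spark at all: {{unix_timestamp}} ultimately parses through SimpleDateFormat, which interprets the string in the JVM's default time zone, so a driver and an executor with different OS time zone settings compute different epochs for the same string. A minimal sketch, not from the issue itself; the zone names are illustrative:

```scala
import java.text.SimpleDateFormat
import java.util.TimeZone

// Parse the same wall-clock string under two different default time zones,
// mimicking a driver and an executor whose OS time zones disagree.
def epochMillisIn(zone: String): Long = {
  val fmt = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSS")
  fmt.setTimeZone(TimeZone.getTimeZone(zone))
  fmt.parse("2017-04-26T10:45:02.200").getTime
}

val utc   = epochMillisIn("UTC")
val paris = epochMillisIn("Europe/Paris") // UTC+2 on this date (DST)

// The "same" timestamp differs by the zone offset, so an equality filter
// computed partly on one JVM and partly on another silently drops rows.
println(utc - paris) // 7200000 ms, i.e. two hours
```

This is why putting every machine into UTC made the count come out right: the offsets cancel rather than disappear.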
[jira] [Assigned] (SPARK-20591) Succeeded tasks num not equal in job page and job detail page on spark web ui when speculative task(s) exist
[ https://issues.apache.org/jira/browse/SPARK-20591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20591: Assignee: (was: Apache Spark) > Succeeded tasks num not equal in job page and job detail page on spark web ui > when speculative task(s) exist > > > Key: SPARK-20591 > URL: https://issues.apache.org/jira/browse/SPARK-20591 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.0.2 >Reporter: Jinhua Fu >Priority: Minor > Attachments: job detail page(stages).png, job page.png > > > When spark.speculation is enabled and there are speculative tasks, the > succeeded task count on the job page includes speculative tasks, but they are > not included on the job detail page (job stages page). > When the job page shows more succeeded tasks than total tasks, suggesting some > tasks ran slowly, and I want to know which tasks and why, I have to check > every stage to find the speculative tasks, because speculative tasks are not > included in each stage's succeeded task count. > Can it be improved? > Update: two screenshots attached; the succeeded task count is 557 on the job page > but 550 (by sum) on the job detail page (stages); the extra 7 tasks are > speculative tasks.
[jira] [Assigned] (SPARK-20591) Succeeded tasks num not equal in job page and job detail page on spark web ui when speculative task(s) exist
[ https://issues.apache.org/jira/browse/SPARK-20591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20591: Assignee: Apache Spark > Succeeded tasks num not equal in job page and job detail page on spark web ui > when speculative task(s) exist > > > Key: SPARK-20591 > URL: https://issues.apache.org/jira/browse/SPARK-20591 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.0.2 >Reporter: Jinhua Fu >Assignee: Apache Spark >Priority: Minor > Attachments: job detail page(stages).png, job page.png > > > When spark.speculation is enabled and there are speculative tasks, the > succeeded task count on the job page includes speculative tasks, but they are > not included on the job detail page (job stages page). > When the job page shows more succeeded tasks than total tasks, suggesting some > tasks ran slowly, and I want to know which tasks and why, I have to check > every stage to find the speculative tasks, because speculative tasks are not > included in each stage's succeeded task count. > Can it be improved? > Update: two screenshots attached; the succeeded task count is 557 on the job page > but 550 (by sum) on the job detail page (stages); the extra 7 tasks are > speculative tasks.
[jira] [Commented] (SPARK-20591) Succeeded tasks num not equal in job page and job detail page on spark web ui when speculative task(s) exist
[ https://issues.apache.org/jira/browse/SPARK-20591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16002686#comment-16002686 ] Apache Spark commented on SPARK-20591: -- User 'fjh100456' has created a pull request for this issue: https://github.com/apache/spark/pull/17923 > Succeeded tasks num not equal in job page and job detail page on spark web ui > when speculative task(s) exist > > > Key: SPARK-20591 > URL: https://issues.apache.org/jira/browse/SPARK-20591 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.0.2 >Reporter: Jinhua Fu >Priority: Minor > Attachments: job detail page(stages).png, job page.png > > > When spark.speculation is enabled and there are speculative tasks, the > succeeded task count on the job page includes speculative tasks, but they are > not included on the job detail page (job stages page). > When the job page shows more succeeded tasks than total tasks, suggesting some > tasks ran slowly, and I want to know which tasks and why, I have to check > every stage to find the speculative tasks, because speculative tasks are not > included in each stage's succeeded task count. > Can it be improved? > Update: two screenshots attached; the succeeded task count is 557 on the job page > but 550 (by sum) on the job detail page (stages); the extra 7 tasks are > speculative tasks.
[jira] [Commented] (SPARK-13747) Concurrent execution in SQL doesn't work with Scala ForkJoinPool
[ https://issues.apache.org/jira/browse/SPARK-13747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16002674#comment-16002674 ] Dmitry Naumenko commented on SPARK-13747: - [~zsxwing] I've tried to build Spark from the branch in the pull request. I didn't manage to make a complete build (I had some problems with R dependencies), so I replaced only spark-core.jar, and it seems the issue still occurs. Could you please provide a jar/dist for re-test? As a side note, the fixed-thread-pool solution works well for us. We will probably stick with it. > Concurrent execution in SQL doesn't work with Scala ForkJoinPool > > > Key: SPARK-13747 > URL: https://issues.apache.org/jira/browse/SPARK-13747 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.0.1 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > > Running the following code may fail > {code} > (1 to 100).par.foreach { _ => > println(sc.parallelize(1 to 5).map { i => (i, i) }.toDF("a", "b").count()) > } > java.lang.IllegalArgumentException: spark.sql.execution.id is already set > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:87) > > at > org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:1904) > at org.apache.spark.sql.DataFrame.collect(DataFrame.scala:1385) > {code} > This is because SparkContext.runJob can be suspended when using a > ForkJoinPool (e.g., scala.concurrent.ExecutionContext.Implicits.global) as it > calls Await.ready (introduced by https://github.com/apache/spark/pull/9264). > So when SparkContext.runJob is suspended, ForkJoinPool will run another task > in the same thread; however, the local properties have been polluted.
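The fixed-thread-pool workaround mentioned above can be sketched as follows: back the ExecutionContext with a dedicated fixed pool instead of the implicit global ForkJoinPool, so a thread blocked in Await.ready never has a second task (with its own thread-local state) work-stolen onto it. The pool size is illustrative, and plain computations stand in for the DataFrame actions:

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

// Unlike a work-stealing ForkJoinPool, a fixed pool does not run another
// queued task on a thread that is blocked inside Await.ready, so per-thread
// properties such as spark.sql.execution.id are not clobbered mid-job.
val pool = Executors.newFixedThreadPool(8)
implicit val ec: ExecutionContext = ExecutionContext.fromExecutor(pool)

// In Spark, each Future would wrap a DataFrame action such as df.count().
val jobs    = (1 to 100).map(i => Future(i * 2))
val results = Await.result(Future.sequence(jobs), 10.seconds)
println(results.sum) // 10100
pool.shutdown() // the pool's threads are non-daemon; shut it down explicitly
```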
[jira] [Commented] (SPARK-20542) Add an API into Bucketizer that can bin a lot of columns all at once
[ https://issues.apache.org/jira/browse/SPARK-20542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16002609#comment-16002609 ] Barry Becker commented on SPARK-20542: -- This is a great improvement, @viirya! According to your results, my case should go from about a minute down to 3 seconds. StringIndexer would probably benefit from a similar approach. > Add an API into Bucketizer that can bin a lot of columns all at once > > > Key: SPARK-20542 > URL: https://issues.apache.org/jira/browse/SPARK-20542 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 2.2.0 >Reporter: Liang-Chi Hsieh > > The current ML Bucketizer can only bin a single column of continuous > features. If a dataset has thousands of continuous columns that need binning, > we end up with thousands of ML stages, which is very inefficient for query > planning and execution. > We should have a type of bucketizer that can bin many columns all at once. It > would need to accept a list of arrays of split points corresponding to the > columns to bin, but it could make things much more efficient by replacing > thousands of stages with just one. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
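[Editor's note: the one-stage behaviour being proposed can be sketched outside Spark as plain binary search over per-column split points, applied to every column in a single pass. The function name and shapes below are illustrative only, not the eventual Bucketizer API:]

```python
from bisect import bisect_right

def bucketize_rows(rows, splits_per_column):
    """Bin every column of every row in one pass.

    splits_per_column[i] is the sorted list of split points for column i,
    e.g. [0.0, 10.0, 20.0] defines bucket 0 = [0, 10) and bucket 1 = [10, 20).
    """
    out = []
    for row in rows:
        out.append([
            bisect_right(splits, value) - 1   # index of the bucket holding value
            for value, splits in zip(row, splits_per_column)
        ])
    return out

rows = [[1.0, 15.0], [9.9, 3.0]]
splits = [[0.0, 5.0, 10.0], [0.0, 10.0, 20.0]]
buckets = bucketize_rows(rows, splits)  # one "stage" covers all columns
```

The point of the JIRA is exactly this shape: one transformer carrying a list of split arrays, instead of one ML pipeline stage per column.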
[jira] [Assigned] (SPARK-20311) SQL "range(N) as alias" or "range(N) alias" doesn't work
[ https://issues.apache.org/jira/browse/SPARK-20311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-20311: --- Assignee: Takeshi Yamamuro > SQL "range(N) as alias" or "range(N) alias" doesn't work > > > Key: SPARK-20311 > URL: https://issues.apache.org/jira/browse/SPARK-20311 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Juliusz Sompolski >Assignee: Takeshi Yamamuro >Priority: Minor > Fix For: 2.2.1, 2.3.0 > > > `select * from range(10) as A;` or `select * from range(10) A;` > does not work. > As a workaround, a subquery has to be used: > `select * from (select * from range(10)) as A;` -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20311) SQL "range(N) as alias" or "range(N) alias" doesn't work
[ https://issues.apache.org/jira/browse/SPARK-20311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-20311. - Resolution: Fixed Fix Version/s: 2.3.0 2.2.1 Issue resolved by pull request 17666 [https://github.com/apache/spark/pull/17666] > SQL "range(N) as alias" or "range(N) alias" doesn't work > > > Key: SPARK-20311 > URL: https://issues.apache.org/jira/browse/SPARK-20311 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Juliusz Sompolski >Priority: Minor > Fix For: 2.2.1, 2.3.0 > > > `select * from range(10) as A;` or `select * from range(10) A;` > does not work. > As a workaround, a subquery has to be used: > `select * from (select * from range(10)) as A;` -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19581) running NaiveBayes model with 0 features can crash the executor with a DGEMV error
[ https://issues.apache.org/jira/browse/SPARK-19581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16002582#comment-16002582 ] Barry Becker commented on SPARK-19581: -- I think it's just a matter of sending a feature vector of length 0 to the NaiveBayes model inducer. If you do that, you will see the error regardless of what data you use. > running NaiveBayes model with 0 features can crash the executor with a DGEMV > error > > > Key: SPARK-19581 > URL: https://issues.apache.org/jira/browse/SPARK-19581 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.1.0 > Environment: spark development or standalone mode on windows or linux. >Reporter: Barry Becker >Priority: Minor > > The severity of this bug is high (because nothing should cause Spark to crash > like this) but the priority may be low (because there is an easy workaround). > In our application, a user can select features and a target to run the > NaiveBayes inducer. If columns have too many values or all one value, they > are removed before we call the inducer to create the model. As a result, > there are some cases where all the features get removed. When this happens, > executors will crash and get restarted (if on a cluster) or Spark will crash > and need to be manually restarted (if in development mode). > It looks like NaiveBayes uses BLAS, and BLAS does not handle this case well > when it is encountered. It emits this vague error: > ** On entry to DGEMV parameter number 6 had an illegal value > and terminates. 
> My code looks like this: > {code} >val predictions = model.transform(testData) // Make predictions > // figure out how many were correctly predicted > val numCorrect = predictions.filter(new Column(actualTarget) === new > Column(PREDICTION_LABEL_COLUMN)).count() > val numIncorrect = testRowCount - numCorrect > {code} > The failure is at the line that does the count, but it is not the count that > causes the problem, it is the model.transform step (where the model contains > the NaiveBayes classifier). > Here is the stack trace (in development mode): > {code} > [2017-02-13 06:28:39,946] TRACE evidence.EvidenceVizModel$ [] > [akka://JobServer/user/context-supervisor/sql-context] - done making > predictions in 232 > ** On entry to DGEMV parameter number 6 had an illegal value > ** On entry to DGEMV parameter number 6 had an illegal value > ** On entry to DGEMV parameter number 6 had an illegal value > [2017-02-13 06:28:40,506] ERROR .scheduler.LiveListenerBus [] > [akka://JobServer/user/context-supervisor/sql-context] - SparkListenerBus has > already stopped! Dropping event SparkListenerSQLExecutionEnd(9,1486996120505) > [2017-02-13 06:28:40,506] ERROR .scheduler.LiveListenerBus [] > [akka://JobServer/user/context-supervisor/sql-context] - SparkListenerBus has > already stopped! Dropping event > SparkListenerStageCompleted(org.apache.spark.scheduler.StageInfo@1f6c4a29) > [2017-02-13 06:28:40,508] ERROR .scheduler.LiveListenerBus [] > [akka://JobServer/user/context-supervisor/sql-context] - SparkListenerBus has > already stopped! 
Dropping event > SparkListenerJobEnd(12,1486996120507,JobFailed(org.apache.spark.SparkException: > Job 12 cancelled because SparkContext was shut down)) > [2017-02-13 06:28:40,509] ERROR .jobserver.JobManagerActor [] > [akka://JobServer/user/context-supervisor/sql-context] - Got Throwable > org.apache.spark.SparkException: Job 12 cancelled because SparkContext was > shut down > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:808) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:806) > at scala.collection.mutable.HashSet.foreach(HashSet.scala:78) > at > org.apache.spark.scheduler.DAGScheduler.cleanUpAfterSchedulerStop(DAGScheduler.scala:806) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onStop(DAGScheduler.scala:1668) > at org.apache.spark.util.EventLoop.stop(EventLoop.scala:83) > at > org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1587) > at > org.apache.spark.SparkContext$$anonfun$stop$8.apply$mcV$sp(SparkContext.scala:1826) > at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1283) > at org.apache.spark.SparkContext.stop(SparkContext.scala:1825) > at > org.apache.spark.SparkContext$$anonfun$2.apply$mcV$sp(SparkContext.scala:581) > at > org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:216) > at >
[jira] [Resolved] (SPARK-20667) Cleanup the cataloged metadata after completing the package of sql/core and sql/hive
[ https://issues.apache.org/jira/browse/SPARK-20667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-20667. - Resolution: Fixed Fix Version/s: 2.3.0 2.2.1 Issue resolved by pull request 17908 [https://github.com/apache/spark/pull/17908] > Cleanup the cataloged metadata after completing the package of sql/core and > sql/hive > > > Key: SPARK-20667 > URL: https://issues.apache.org/jira/browse/SPARK-20667 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 2.2.0 >Reporter: Xiao Li >Assignee: Xiao Li > Fix For: 2.2.1, 2.3.0 > > > So far, we do not drop all the cataloged tables after each package. > Sometimes, we might hit strange test case errors because the previous test > suite did not drop the tables/functions/database. At least, we can first > clean up the environment when completing the package of sql/core and sql/hive. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20601) Python API Changes for Constrained Logistic Regression Params
[ https://issues.apache.org/jira/browse/SPARK-20601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20601: Assignee: Apache Spark > Python API Changes for Constrained Logistic Regression Params > - > > Key: SPARK-20601 > URL: https://issues.apache.org/jira/browse/SPARK-20601 > Project: Spark > Issue Type: New Feature > Components: ML, PySpark >Affects Versions: 2.2.0 >Reporter: Bryan Cutler >Assignee: Apache Spark > > With the addition of SPARK-20047 for constrained logistic regression, there > are 4 new params not in PySpark lr > * lowerBoundsOnCoefficients > * upperBoundsOnCoefficients > * lowerBoundsOnIntercepts > * upperBoundsOnIntercepts -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20601) Python API Changes for Constrained Logistic Regression Params
[ https://issues.apache.org/jira/browse/SPARK-20601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20601: Assignee: (was: Apache Spark) > Python API Changes for Constrained Logistic Regression Params > - > > Key: SPARK-20601 > URL: https://issues.apache.org/jira/browse/SPARK-20601 > Project: Spark > Issue Type: New Feature > Components: ML, PySpark >Affects Versions: 2.2.0 >Reporter: Bryan Cutler > > With the addition of SPARK-20047 for constrained logistic regression, there > are 4 new params not in PySpark lr > * lowerBoundsOnCoefficients > * upperBoundsOnCoefficients > * lowerBoundsOnIntercepts > * upperBoundsOnIntercepts -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20601) Python API Changes for Constrained Logistic Regression Params
[ https://issues.apache.org/jira/browse/SPARK-20601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16002493#comment-16002493 ] Apache Spark commented on SPARK-20601: -- User 'zero323' has created a pull request for this issue: https://github.com/apache/spark/pull/17922 > Python API Changes for Constrained Logistic Regression Params > - > > Key: SPARK-20601 > URL: https://issues.apache.org/jira/browse/SPARK-20601 > Project: Spark > Issue Type: New Feature > Components: ML, PySpark >Affects Versions: 2.2.0 >Reporter: Bryan Cutler > > With the addition of SPARK-20047 for constrained logistic regression, there > are 4 new params not in PySpark lr > * lowerBoundsOnCoefficients > * upperBoundsOnCoefficients > * lowerBoundsOnIntercepts > * upperBoundsOnIntercepts -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
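[Editor's note: the four params listed for SPARK-20601 are box constraints on the model's coefficients and intercepts. As a toy illustration of what bounded coefficients mean, here is projected gradient descent on 1-D logistic regression in pure Python; this is only an analogy, not the Spark implementation (which solves the constrained problem with a bound-aware optimizer):]

```python
import math

def fit_bounded_logistic(xs, ys, coef_bounds, intercept_bounds,
                         lr=0.1, steps=500):
    """1-D logistic regression whose coefficient and intercept are
    clipped into [lower, upper] after every gradient step."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        gw = gb = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))
            gw += (p - y) * x
            gb += (p - y)
        w -= lr * gw / len(xs)
        b -= lr * gb / len(xs)
        # The projection step plays the role of lowerBoundsOnCoefficients /
        # upperBoundsOnCoefficients and lowerBoundsOnIntercepts /
        # upperBoundsOnIntercepts.
        w = min(max(w, coef_bounds[0]), coef_bounds[1])
        b = min(max(b, intercept_bounds[0]), intercept_bounds[1])
    return w, b

xs = [-2.0, -1.0, 1.0, 2.0]
ys = [0, 0, 1, 1]
w, b = fit_bounded_logistic(xs, ys, coef_bounds=(0.0, 0.5),
                            intercept_bounds=(-0.1, 0.1))
```

On this separable data the unconstrained coefficient would grow without bound; the constraint keeps it inside the box, which is exactly the behaviour the PySpark params need to expose.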
[jira] [Commented] (SPARK-2060) Querying JSON Datasets with SQL and DSL in Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-2060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16002490#comment-16002490 ] Apache Spark commented on SPARK-2060: - User 'zero323' has created a pull request for this issue: https://github.com/apache/spark/pull/17922 > Querying JSON Datasets with SQL and DSL in Spark SQL > > > Key: SPARK-2060 > URL: https://issues.apache.org/jira/browse/SPARK-2060 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Yin Huai >Assignee: Yin Huai > Fix For: 1.0.1, 1.1.0 > > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20680) Spark-sql do not support for void column datatype of view
[ https://issues.apache.org/jira/browse/SPARK-20680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lantao Jin updated SPARK-20680: --- Summary: Spark-sql do not support for void column datatype of view (was: Spark-sql do not support for column datatype of void in a HIVE view) > Spark-sql do not support for void column datatype of view > - > > Key: SPARK-20680 > URL: https://issues.apache.org/jira/browse/SPARK-20680 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0, 2.1.1 >Reporter: Lantao Jin > > Create a HIVE view: > {quote} > hive> create table bad as select 1 x, null z from dual; > {quote} > Because there's no type, Hive gives it the VOID type: > {quote} > hive> describe bad; > OK > x int > z void > {quote} > In Spark2.0.x, the behaviour to read this view is normal: > {quote} > spark-sql> describe bad; > x int NULL > z voidNULL > Time taken: 4.431 seconds, Fetched 2 row(s) > {quote} > But in Spark2.1.x, it failed with SparkException: Cannot recognize hive type > string: void > {quote} > spark-sql> describe bad; > 17/05/09 03:12:08 INFO execution.SparkSqlParser: Parsing command: describe bad > 17/05/09 03:12:08 INFO parser.CatalystSqlParser: Parsing command: int > 17/05/09 03:12:08 INFO parser.CatalystSqlParser: Parsing command: void > 17/05/09 03:12:08 ERROR thriftserver.SparkSQLDriver: Failed in [describe bad] > org.apache.spark.SparkException: Cannot recognize hive type string: void > at > org.apache.spark.sql.hive.client.HiveClientImpl.org$apache$spark$sql$hive$client$HiveClientImpl$$fromHiveColumn(HiveClientImpl.scala:789) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11$$anonfun$7.apply(HiveClientImpl.scala:365) > > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11$$anonfun$7.apply(HiveClientImpl.scala:365) > > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at scala.collection.AbstractIterable.foreach(Iterable.scala:54) > at > scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.AbstractTraversable.map(Traversable.scala:104) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11.apply(HiveClientImpl.scala:365) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11.apply(HiveClientImpl.scala:361) > Caused by: org.apache.spark.sql.catalyst.parser.ParseException: > DataType void() is not supported.(line 1, pos 0) > == SQL == > void > ^^^ > ... 61 more > org.apache.spark.SparkException: Cannot recognize hive type string: void > {quote} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20680) Spark-sql do not support for column datatype of void in a HIVE view
Lantao Jin created SPARK-20680: -- Summary: Spark-sql do not support for column datatype of void in a HIVE view Key: SPARK-20680 URL: https://issues.apache.org/jira/browse/SPARK-20680 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.1.1, 2.1.0 Reporter: Lantao Jin Create a HIVE view: {quote} hive> create table bad as select 1 x, null z from dual; {quote} Because there's no type, Hive gives it the VOID type: {quote} hive> describe bad; OK x int z void {quote} In Spark2.0.x, the behaviour to read this view is normal: {quote} spark-sql> describe bad; x int NULL z voidNULL Time taken: 4.431 seconds, Fetched 2 row(s) {quote} But in Spark2.1.x, it failed with SparkException: Cannot recognize hive type string: void {quote} spark-sql> describe bad; 17/05/09 03:12:08 INFO execution.SparkSqlParser: Parsing command: describe bad 17/05/09 03:12:08 INFO parser.CatalystSqlParser: Parsing command: int 17/05/09 03:12:08 INFO parser.CatalystSqlParser: Parsing command: void 17/05/09 03:12:08 ERROR thriftserver.SparkSQLDriver: Failed in [describe bad] org.apache.spark.SparkException: Cannot recognize hive type string: void at org.apache.spark.sql.hive.client.HiveClientImpl.org$apache$spark$sql$hive$client$HiveClientImpl$$fromHiveColumn(HiveClientImpl.scala:789) at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11$$anonfun$7.apply(HiveClientImpl.scala:365) at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11$$anonfun$7.apply(HiveClientImpl.scala:365) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.Iterator$class.foreach(Iterator.scala:893) at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) at scala.collection.AbstractIterable.foreach(Iterable.scala:54) at 
scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at scala.collection.AbstractTraversable.map(Traversable.scala:104) at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11.apply(HiveClientImpl.scala:365) at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11.apply(HiveClientImpl.scala:361) Caused by: org.apache.spark.sql.catalyst.parser.ParseException: DataType void() is not supported.(line 1, pos 0) == SQL == void ^^^ ... 61 more org.apache.spark.SparkException: Cannot recognize hive type string: void {quote} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19364) Stream Blocks in Storage Persists Forever when Kinesis Checkpoints are enabled and an exception is thrown
[ https://issues.apache.org/jira/browse/SPARK-19364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-19364: - Priority: Major (was: Blocker) I am lowering the priority to {{Major}} as higher priorities over this are usually reserved for committers and I guess this was not set by a committer. > Stream Blocks in Storage Persists Forever when Kinesis Checkpoints are > enabled and an exception is thrown > -- > > Key: SPARK-19364 > URL: https://issues.apache.org/jira/browse/SPARK-19364 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.0.2 > Environment: ubuntu unix > spark 2.0.2 > application is java >Reporter: Andrew Milkowski > > -- update --- we found that below situation occurs when we encounter > "com.amazonaws.services.kinesis.clientlibrary.exceptions.ShutdownException: > Can't update checkpoint - instance doesn't hold the lease for this shard" > https://github.com/awslabs/amazon-kinesis-client/issues/108 > we use s3 directory (and dynamodb) to store checkpoints, but if such occurs > blocks should not get stuck but continue to be evicted gracefully from > memory, obviously kinesis library race condition is a problem onto itself... > -- exception leading to a block not being freed up -- > SLF4J: Class path contains multiple SLF4J bindings. > SLF4J: Found binding in > [jar:file:/mnt/yarn/usercache/hadoop/filecache/24/__spark_libs__7928020266533182031.zip/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class] > SLF4J: Found binding in > [jar:file:/usr/lib/hadoop/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class] > SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an > explanation. > SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory] > 17/01/26 13:52:00 ERROR KinesisRecordProcessor: ShutdownException: Caught > shutdown exception, skipping checkpoint. 
> com.amazonaws.services.kinesis.clientlibrary.exceptions.ShutdownException: > Can't update checkpoint - instance doesn't hold the lease for this shard > at > com.amazonaws.services.kinesis.clientlibrary.lib.worker.KinesisClientLibLeaseCoordinator.setCheckpoint(KinesisClientLibLeaseCoordinator.java:120) > at > com.amazonaws.services.kinesis.clientlibrary.lib.worker.RecordProcessorCheckpointer.advancePosition(RecordProcessorCheckpointer.java:216) > at > com.amazonaws.services.kinesis.clientlibrary.lib.worker.RecordProcessorCheckpointer.checkpoint(RecordProcessorCheckpointer.java:137) > at > com.amazonaws.services.kinesis.clientlibrary.lib.worker.RecordProcessorCheckpointer.checkpoint(RecordProcessorCheckpointer.java:103) > at > org.apache.spark.streaming.kinesis.KinesisCheckpointer$$anonfun$checkpoint$1$$anonfun$apply$1.apply$mcV$sp(KinesisCheckpointer.scala:81) > at > org.apache.spark.streaming.kinesis.KinesisCheckpointer$$anonfun$checkpoint$1$$anonfun$apply$1.apply(KinesisCheckpointer.scala:81) > at > org.apache.spark.streaming.kinesis.KinesisCheckpointer$$anonfun$checkpoint$1$$anonfun$apply$1.apply(KinesisCheckpointer.scala:81) > at scala.util.Try$.apply(Try.scala:192) > at > org.apache.spark.streaming.kinesis.KinesisRecordProcessor$.retryRandom(KinesisRecordProcessor.scala:144) > at > org.apache.spark.streaming.kinesis.KinesisCheckpointer$$anonfun$checkpoint$1.apply(KinesisCheckpointer.scala:81) > at > org.apache.spark.streaming.kinesis.KinesisCheckpointer$$anonfun$checkpoint$1.apply(KinesisCheckpointer.scala:75) > at scala.Option.foreach(Option.scala:257) > at > org.apache.spark.streaming.kinesis.KinesisCheckpointer.checkpoint(KinesisCheckpointer.scala:75) > at > org.apache.spark.streaming.kinesis.KinesisCheckpointer.org$apache$spark$streaming$kinesis$KinesisCheckpointer$$checkpointAll(KinesisCheckpointer.scala:103) > at > org.apache.spark.streaming.kinesis.KinesisCheckpointer$$anonfun$1.apply$mcVJ$sp(KinesisCheckpointer.scala:117) > at > 
org.apache.spark.streaming.util.RecurringTimer.triggerActionForNextInterval(RecurringTimer.scala:94) > at > org.apache.spark.streaming.util.RecurringTimer.org$apache$spark$streaming$util$RecurringTimer$$loop(RecurringTimer.scala:106) > at > org.apache.spark.streaming.util.RecurringTimer$$anon$1.run(RecurringTimer.scala:29) > Running standard Kinesis stream ingestion with a Java Spark app and creating > a DStream: after running for some time, some stream blocks persist forever > and are never cleaned up, which eventually leads to memory depletion on the > workers. > We even tried cleaning RDDs with the following: > cleaner = ssc.sparkContext().sc().cleaner().get(); > filtered.foreachRDD(new
[jira] [Updated] (SPARK-20415) SPARK job hangs while writing DataFrame to HDFS
[ https://issues.apache.org/jira/browse/SPARK-20415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-20415: - Priority: Major (was: Blocker) I am lowering the priority to {{Major}} as higher priorities over this are usually reserved for committers and I guess this was not set by a committer. > SPARK job hangs while writing DataFrame to HDFS > --- > > Key: SPARK-20415 > URL: https://issues.apache.org/jira/browse/SPARK-20415 > Project: Spark > Issue Type: Bug > Components: PySpark, YARN >Affects Versions: 2.1.0 > Environment: EMR 5.4.0 >Reporter: P K > > We are in POC phase with Spark. One of the Steps is reading compressed json > files that come from sources, "explode" them into tabular format and then > write them to HDFS. This worked for about three weeks until a few days ago, > for a particular dataset, the writer just hangs. I logged in to the worker > machines and see this stack trace: > "Executor task launch worker-0" #39 daemon prio=5 os_prio=0 > tid=0x7f6210352800 nid=0x4542 runnable [0x7f61f52b3000] >java.lang.Thread.State: RUNNABLE > at org.apache.spark.unsafe.Platform.copyMemory(Platform.java:210) > at > org.apache.spark.sql.catalyst.expressions.UnsafeArrayData.writeToMemory(UnsafeArrayData.java:311) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply6_2$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown > Source) > at > org.apache.spark.sql.execution.GenerateExec$$anonfun$doExecute$1$$anonfun$apply$9.apply(GenerateExec.scala:111) > at > org.apache.spark.sql.execution.GenerateExec$$anonfun$doExecute$1$$anonfun$apply$9.apply(GenerateExec.scala:109) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439) > at scala.collection.Iterator$JoinIterator.hasNext(Iterator.scala:211) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:243) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:190) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:188) > at > org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1341) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:193) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:129) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:128) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) > at org.apache.spark.scheduler.Task.run(Task.scala:99) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282) > at > 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > The last messages ever printed in stderr before the hang are: > 17/04/18 01:41:14 INFO DAGScheduler: Final stage: ResultStage 4 (save at > NativeMethodAccessorImpl.java:0) > 17/04/18 01:41:14 INFO DAGScheduler: Parents of final stage: List() > 17/04/18 01:41:14 INFO DAGScheduler: Missing parents: List() > 17/04/18 01:41:14 INFO DAGScheduler: Submitting ResultStage 4 > (MapPartitionsRDD[31] at save at NativeMethodAccessorImpl.java:0), which has > no missing parents > 17/04/18 01:41:14 INFO MemoryStore: Block broadcast_9 stored as values in >
[jira] [Commented] (SPARK-19876) Add OneTime trigger executor
[ https://issues.apache.org/jira/browse/SPARK-19876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16002393#comment-16002393 ] Apache Spark commented on SPARK-19876: -- User 'srowen' has created a pull request for this issue: https://github.com/apache/spark/pull/17921 > Add OneTime trigger executor > > > Key: SPARK-19876 > URL: https://issues.apache.org/jira/browse/SPARK-19876 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.1.0 >Reporter: Tyson Condie >Assignee: Tyson Condie > Fix For: 2.2.0 > > > The goal is to add a new trigger executor that will process a single trigger > then stop. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20554) Remove usage of scala.language.reflectiveCalls
[ https://issues.apache.org/jira/browse/SPARK-20554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16002378#comment-16002378 ] Sean Owen commented on SPARK-20554: --- [~umesh9...@gmail.com] are you working on this? > Remove usage of scala.language.reflectiveCalls > -- > > Key: SPARK-20554 > URL: https://issues.apache.org/jira/browse/SPARK-20554 > Project: Spark > Issue Type: Improvement > Components: ML, Spark Core, SQL, Structured Streaming >Affects Versions: 2.1.0 >Reporter: Sean Owen >Priority: Minor > > In several parts of the code we have imported > {{scala.language.reflectiveCalls}} to suppress a warning about, well, > reflective calls. I know from cleaning up build warnings in 2.2 that in > almost all cases of this are inadvertent and masking a type problem. > Example, in HiveDDLSuite: > {code} > val expectedTablePath = > if (dbPath.isEmpty) { > hiveContext.sessionState.catalog.defaultTablePath(tableIdentifier) > } else { > new Path(new Path(dbPath.get), tableIdentifier.table) > } > val filesystemPath = new Path(expectedTablePath.toString) > {code} > This shouldn't really work because one branch returns a URI and the other a > Path. In this case it only needs an object with a toString method and can > make this work with structural types and reflection. > Obviously, the intent was to add ".toURI" to the second branch though to make > both a URI! > I think we should probably clean this up by taking out all imports of > reflectiveCalls, and re-evaluating all of the warnings. There may be a few > legit usages. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20580) Allow RDD cache with unserializable objects
[ https://issues.apache.org/jira/browse/SPARK-20580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-20580. --- Resolution: Not A Problem Although map() wouldn't generally entail serializing anything, it could trigger other computations that do. Think of earlier stages which might entail shuffles, or caching, or checkpointing. This is generally an implementation detail. I don't think it would be possible to support unserializable objects in Spark, as most operations wouldn't work, and committing to make even a subset work is difficult and would probably still have confusing semantics for a user. In Java-land you can always implement your own serialization of complex third-party types anyway, if you must. > Allow RDD cache with unserializable objects > --- > > Key: SPARK-20580 > URL: https://issues.apache.org/jira/browse/SPARK-20580 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.3.0 >Reporter: Fernando Pereira >Priority: Minor > > In my current scenario we load complex Python objects in the worker nodes > that are not completely serializable. We then apply certain map operations to > the RDD, which at some point we collect. In this basic usage all works well. > However, if we cache() the RDD (which defaults to memory), it suddenly fails > to execute the transformations after the caching step. Apparently caching > serializes the RDD data and deserializes it whenever more transformations are > required. > It would be nice to avoid serialization of the objects if they are to be > cached in memory, and keep the original object. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
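[Editor's note: since the reporter's objects are Python objects, the Python equivalent of "implement your own serialization" is teaching pickle how to rebuild the object, e.g. via `__reduce__`. A minimal sketch with a hypothetical wrapper around an object carrying unpicklable state:]

```python
import pickle

class Handle:
    """Stand-in for a third-party object carrying unpicklable state."""
    def __init__(self, name):
        self.name = name
        # A lambda is not picklable, much like an open socket or a C pointer.
        self.resource = lambda: f"live-resource-for-{name}"

    def __reduce__(self):
        # Serialize only the constructor arguments; the live resource is
        # recreated on the deserializing side and never touches the wire.
        return (Handle, (self.name,))

roundtripped = pickle.loads(pickle.dumps(Handle("demo")))
```

Without `__reduce__`, `pickle.dumps` would fail on the lambda; with it, caching or shuffling such objects becomes possible at the cost of rebuilding the live resource on each deserialization.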
[jira] [Resolved] (SPARK-20606) ML 2.2 QA: Remove deprecated methods for ML
[ https://issues.apache.org/jira/browse/SPARK-20606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang resolved SPARK-20606. - Resolution: Fixed Fix Version/s: 2.2.0 > ML 2.2 QA: Remove deprecated methods for ML > --- > > Key: SPARK-20606 > URL: https://issues.apache.org/jira/browse/SPARK-20606 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 2.1.0 >Reporter: Yanbo Liang >Assignee: Yanbo Liang >Priority: Minor > Fix For: 2.2.0 > > > Remove ML methods we deprecated in 2.1. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org