spark git commit: [SPARK-15792][SQL] Allows operator to change the verbosity in explain output
Repository: spark
Updated Branches:
  refs/heads/master 0e0904a2f -> 5f731d685

[SPARK-15792][SQL] Allows operator to change the verbosity in explain output

## What changes were proposed in this pull request?

This PR allows customization of verbosity in explain output. After the change, `dataframe.explain()` and `dataframe.explain(true)` produce physical-plan output at different verbosity levels. Currently, this PR only enables the verbose string for the operators `HashAggregateExec` and `SortAggregateExec`. We will gradually enable the verbose string for more operators in the future.

**Less verbose mode:** `dataframe.explain(extended = false)`. `output=[count(a)#85L]` is **NOT** displayed for HashAggregate.

```
scala> Seq((1,2,3)).toDF("a", "b", "c").createTempView("df2")

scala> spark.sql("select count(a) from df2").explain()
== Physical Plan ==
*HashAggregate(key=[], functions=[count(1)])
+- Exchange SinglePartition
   +- *HashAggregate(key=[], functions=[partial_count(1)])
      +- LocalTableScan
```

**Verbose mode:** `dataframe.explain(extended = true)`. `output=[count(a)#85L]` is displayed for HashAggregate.

```
scala> spark.sql("select count(a) from df2").explain(true)
// "output=[count(a)#85L]" is added
...
== Physical Plan ==
*HashAggregate(key=[], functions=[count(1)], output=[count(a)#85L])
+- Exchange SinglePartition
   +- *HashAggregate(key=[], functions=[partial_count(1)], output=[count#87L])
      +- LocalTableScan
```

## How was this patch tested?

Manual test.

Author: Sean Zhong <seanzh...@databricks.com>

Closes #13535 from clockfly/verbose_breakdown_2.
Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/5f731d68 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/5f731d68 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/5f731d68 Branch: refs/heads/master Commit: 5f731d6859c4516941e5f90c99c966ef76268864 Parents: 0e0904a Author: Sean Zhong <seanzh...@databricks.com> Authored: Mon Jun 6 22:59:25 2016 -0700 Committer: Cheng Lian <l...@databricks.com> Committed: Mon Jun 6 22:59:25 2016 -0700 -- .../sql/catalyst/expressions/Expression.scala | 4 .../spark/sql/catalyst/plans/QueryPlan.scala| 2 ++ .../spark/sql/catalyst/trees/TreeNode.scala | 23 +++- .../spark/sql/execution/QueryExecution.scala| 14 +++- .../sql/execution/WholeStageCodegenExec.scala | 6 +++-- .../execution/aggregate/HashAggregateExec.scala | 12 -- .../execution/aggregate/SortAggregateExec.scala | 12 -- 7 files changed, 55 insertions(+), 18 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/5f731d68/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala index 2ec4621..efe592d 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala @@ -190,6 +190,10 @@ abstract class Expression extends TreeNode[Expression] { case single => single :: Nil } + // Marks this as final, Expression.verboseString should never be called, and thus shouldn't be + // overridden by concrete classes. 
+ final override def verboseString: String = simpleString + override def simpleString: String = toString override def toString: String = prettyName + flatArguments.mkString("(", ", ", ")") http://git-wip-us.apache.org/repos/asf/spark/blob/5f731d68/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala index 19a66cf..cf34f4b 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala @@ -257,6 +257,8 @@ abstract class QueryPlan[PlanType <: QueryPlan[PlanType]] extends TreeNode[PlanT override def simpleString: String = statePrefix + super.simpleString + override def verboseString: String = simpleString + /** * All the subqueries of current plan. */ http://git-wip-us.apache.org/repos/asf/spark/blob/5f731d68/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala --
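The shape of the change above can be sketched outside Spark: a tree node carries both a `simpleString` and a `verboseString` (defaulting to `simpleString`), and the tree printer selects between them based on a verbose flag. The classes below are a hypothetical minimal model for illustration, not Spark's actual `TreeNode` API:

```scala
// Minimal model of the verbosity switch: nodes expose a short and a verbose
// description, and tree printing picks one based on the `verbose` flag.
trait PlanNode {
  def nodeName: String
  def children: Seq[PlanNode]
  // Non-verbose description, shown by explain().
  def simpleString: String = nodeName
  // Verbose description, shown by explain(true); defaults to simpleString.
  def verboseString: String = simpleString

  // Recursively render the tree, choosing the description per the flag.
  def treeString(verbose: Boolean, indent: Int = 0): String = {
    val self = "  " * indent + (if (verbose) verboseString else simpleString)
    (self +: children.map(_.treeString(verbose, indent + 1))).mkString("\n")
  }
}

case class HashAggregate(output: Seq[String], child: PlanNode) extends PlanNode {
  def nodeName = "HashAggregate(functions=[count(1)])"
  def children = Seq(child)
  // Only the verbose mode appends the output attributes.
  override def verboseString: String = s"$nodeName, output=[${output.mkString(", ")}]"
}

case object Scan extends PlanNode {
  def nodeName = "LocalTableScan"
  def children = Nil
}

val plan = HashAggregate(Seq("count(a)#85L"), Scan)
```

Here `HashAggregate` only appends its `output` attributes in verbose mode, mirroring what the patch does for `HashAggregateExec` and `SortAggregateExec`.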
spark git commit: [SPARK-15632][SQL] Typed Filter should NOT change the Dataset schema
Repository: spark
Updated Branches:
  refs/heads/branch-2.0 62765cbeb -> a5bec5b81

[SPARK-15632][SQL] Typed Filter should NOT change the Dataset schema

## What changes were proposed in this pull request?

This PR makes sure the typed Filter doesn't change the Dataset schema.

**Before the change:**

```
scala> val df = spark.range(0,9)

scala> df.schema
res12: org.apache.spark.sql.types.StructType = StructType(StructField(id,LongType,false))

scala> val afterFilter = df.filter(_=>true)

scala> afterFilter.schema   // !!! schema is CHANGED!!! Column name is changed from id to value, nullable is changed from false to true.
res13: org.apache.spark.sql.types.StructType = StructType(StructField(value,LongType,true))
```

`SerializeFromObject` and `DeserializeToObject` are inserted to wrap the Filter, and these two can change the schema of the Dataset.

**After the change:**

```
scala> afterFilter.schema   // schema is NOT changed.
res47: org.apache.spark.sql.types.StructType = StructType(StructField(id,LongType,false))
```

## How was this patch tested?

Unit test.

Author: Sean Zhong <seanzh...@databricks.com>

Closes #13529 from clockfly/spark-15632.
(cherry picked from commit 0e0904a2fce3c4447c24f1752307b6d01ffbd0ad) Signed-off-by: Cheng Lian <l...@databricks.com> Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/a5bec5b8 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/a5bec5b8 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/a5bec5b8 Branch: refs/heads/branch-2.0 Commit: a5bec5b81d9e8ce17f1ce509731b030f0f3538e3 Parents: 62765cb Author: Sean Zhong <seanzh...@databricks.com> Authored: Mon Jun 6 22:40:21 2016 -0700 Committer: Cheng Lian <l...@databricks.com> Committed: Mon Jun 6 22:40:29 2016 -0700 -- .../optimizer/TypedFilterOptimizationSuite.scala| 4 +++- .../main/scala/org/apache/spark/sql/Dataset.scala | 16 .../test/org/apache/spark/sql/JavaDatasetSuite.java | 13 + .../scala/org/apache/spark/sql/DatasetSuite.scala | 6 ++ .../sql/execution/WholeStageCodegenSuite.scala | 2 +- 5 files changed, 31 insertions(+), 10 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/a5bec5b8/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/TypedFilterOptimizationSuite.scala -- diff --git a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/TypedFilterOptimizationSuite.scala b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/TypedFilterOptimizationSuite.scala index 289c16a..63d87bf 100644 --- a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/TypedFilterOptimizationSuite.scala +++ b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/TypedFilterOptimizationSuite.scala @@ -57,7 +57,9 @@ class TypedFilterOptimizationSuite extends PlanTest { comparePlans(optimized, expected) } - test("embed deserializer in filter condition if there is only one filter") { + // TODO: Remove this after we completely fix SPARK-15632 by adding optimization rules + // for typed filters. 
+ ignore("embed deserializer in typed filter condition if there is only one filter") { val input = LocalRelation('_1.int, '_2.int) val f = (i: (Int, Int)) => i._1 > 0 http://git-wip-us.apache.org/repos/asf/spark/blob/a5bec5b8/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala b/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala index 96c871d..6cbc27d 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala @@ -1944,11 +1944,11 @@ class Dataset[T] private[sql]( */ @Experimental def filter(func: T => Boolean): Dataset[T] = { -val deserialized = CatalystSerde.deserialize[T](logicalPlan) +val deserializer = UnresolvedDeserializer(encoderFor[T].deserializer) val function = Literal.create(func, ObjectType(classOf[T => Boolean])) -val condition = Invoke(function, "apply", BooleanType, deserialized.output) -val filter = Filter(condition, deserialized) -withTypedPlan(CatalystSerde.serialize[T](filter)) +val condition = Invoke(function, "apply", BooleanType, deserializer :: Nil) +val filter = Filter(condition, logicalPlan) +withTypedPlan(filter) } /** @@ -1961,11 +1961,11 @@ class Dataset[T] private[sql]( */ @Experimental def filter(func: FilterFunction[T]): Dataset[T] = { -val deserialized = Ca
spark git commit: [SPARK-15632][SQL] Typed Filter should NOT change the Dataset schema
Repository: spark
Updated Branches:
  refs/heads/master c409e23ab -> 0e0904a2f

[SPARK-15632][SQL] Typed Filter should NOT change the Dataset schema

## What changes were proposed in this pull request?

This PR makes sure the typed Filter doesn't change the Dataset schema.

**Before the change:**

```
scala> val df = spark.range(0,9)

scala> df.schema
res12: org.apache.spark.sql.types.StructType = StructType(StructField(id,LongType,false))

scala> val afterFilter = df.filter(_=>true)

scala> afterFilter.schema   // !!! schema is CHANGED!!! Column name is changed from id to value, nullable is changed from false to true.
res13: org.apache.spark.sql.types.StructType = StructType(StructField(value,LongType,true))
```

`SerializeFromObject` and `DeserializeToObject` are inserted to wrap the Filter, and these two can change the schema of the Dataset.

**After the change:**

```
scala> afterFilter.schema   // schema is NOT changed.
res47: org.apache.spark.sql.types.StructType = StructType(StructField(id,LongType,false))
```

## How was this patch tested?

Unit test.

Author: Sean Zhong <seanzh...@databricks.com>

Closes #13529 from clockfly/spark-15632.
Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/0e0904a2 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/0e0904a2 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/0e0904a2 Branch: refs/heads/master Commit: 0e0904a2fce3c4447c24f1752307b6d01ffbd0ad Parents: c409e23 Author: Sean Zhong <seanzh...@databricks.com> Authored: Mon Jun 6 22:40:21 2016 -0700 Committer: Cheng Lian <l...@databricks.com> Committed: Mon Jun 6 22:40:21 2016 -0700 -- .../optimizer/TypedFilterOptimizationSuite.scala| 4 +++- .../main/scala/org/apache/spark/sql/Dataset.scala | 16 .../test/org/apache/spark/sql/JavaDatasetSuite.java | 13 + .../scala/org/apache/spark/sql/DatasetSuite.scala | 6 ++ .../sql/execution/WholeStageCodegenSuite.scala | 2 +- 5 files changed, 31 insertions(+), 10 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/0e0904a2/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/TypedFilterOptimizationSuite.scala -- diff --git a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/TypedFilterOptimizationSuite.scala b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/TypedFilterOptimizationSuite.scala index 289c16a..63d87bf 100644 --- a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/TypedFilterOptimizationSuite.scala +++ b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/TypedFilterOptimizationSuite.scala @@ -57,7 +57,9 @@ class TypedFilterOptimizationSuite extends PlanTest { comparePlans(optimized, expected) } - test("embed deserializer in filter condition if there is only one filter") { + // TODO: Remove this after we completely fix SPARK-15632 by adding optimization rules + // for typed filters. 
+ ignore("embed deserializer in typed filter condition if there is only one filter") { val input = LocalRelation('_1.int, '_2.int) val f = (i: (Int, Int)) => i._1 > 0 http://git-wip-us.apache.org/repos/asf/spark/blob/0e0904a2/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala b/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala index 96c871d..6cbc27d 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala @@ -1944,11 +1944,11 @@ class Dataset[T] private[sql]( */ @Experimental def filter(func: T => Boolean): Dataset[T] = { -val deserialized = CatalystSerde.deserialize[T](logicalPlan) +val deserializer = UnresolvedDeserializer(encoderFor[T].deserializer) val function = Literal.create(func, ObjectType(classOf[T => Boolean])) -val condition = Invoke(function, "apply", BooleanType, deserialized.output) -val filter = Filter(condition, deserialized) -withTypedPlan(CatalystSerde.serialize[T](filter)) +val condition = Invoke(function, "apply", BooleanType, deserializer :: Nil) +val filter = Filter(condition, logicalPlan) +withTypedPlan(filter) } /** @@ -1961,11 +1961,11 @@ class Dataset[T] private[sql]( */ @Experimental def filter(func: FilterFunction[T]): Dataset[T] = { -val deserialized = CatalystSerde.deserialize[T](logicalPlan) +val deserializer = UnresolvedDeserializer(encoderFor[T].deserializer) val function
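The essence of the fix can be modeled without Spark: the old code wrapped the Filter in deserialize/serialize nodes, and re-serializing re-derived the schema from the encoder (renaming the column to `value` and making it nullable), while the new code applies the deserializer inside the filter condition and leaves the child plan, and hence its schema, untouched. The toy `Dataset` below is a hypothetical illustration of that difference, not Spark's classes:

```scala
// Toy model of a Dataset with an explicit schema, to contrast the two
// filtering strategies.
case class Field(name: String, nullable: Boolean)

case class ToyDataset[T](schema: Seq[Field], rows: Seq[T]) {
  // Old approach (before the fix): deserialize -> Filter -> serialize.
  // Re-serializing from the object re-derives the schema from the encoder;
  // here that is hardcoded to mimic the renamed, nullable "value" column.
  def filterViaSerde(f: T => Boolean): ToyDataset[T] =
    ToyDataset(Seq(Field("value", nullable = true)), rows.filter(f))

  // New approach (this patch): apply the deserializer inside the condition
  // and keep the original plan, so the schema is preserved.
  def filterInPlace(f: T => Boolean): ToyDataset[T] =
    copy(rows = rows.filter(f))
}

val ds = ToyDataset[Long](Seq(Field("id", nullable = false)), 0L until 9L)
```

With this model, `filterInPlace(_ => true)` keeps `Field("id", false)` while `filterViaSerde(_ => true)` would report `Field("value", true)`, which is exactly the schema change the PR eliminates.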
spark git commit: [SPARK-15657][SQL] RowEncoder should validate the data type of input object
Repository: spark
Updated Branches:
  refs/heads/branch-2.0 e6e1d8232 -> 38a626a54

[SPARK-15657][SQL] RowEncoder should validate the data type of input object

## What changes were proposed in this pull request?

This PR improves the error handling of `RowEncoder`. When we create a `RowEncoder` with a given schema, we should validate the data type of the input object; e.g. we should throw an exception when a field is a boolean but is declared as a string column.

This PR also removes the support for using `Product` as a valid external type of struct type. That support was added at https://github.com/apache/spark/pull/9712, but is incomplete: e.g. nested products and products in arrays both do not work. However, we never officially supported this feature, and I think it's OK to ban it.

## How was this patch tested?

New tests in `RowEncoderSuite`.

Author: Wenchen Fan <wenc...@databricks.com>

Closes #13401 from cloud-fan/bug.

(cherry picked from commit 30c4774f33fed63b7d400d220d710fb432f599a8)
Signed-off-by: Cheng Lian <l...@databricks.com>

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/38a626a5
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/38a626a5
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/38a626a5
Branch: refs/heads/branch-2.0
Commit: 38a626a54dd0fac0ca460e1ba534048de513bc29
Parents: e6e1d82
Author: Wenchen Fan <wenc...@databricks.com>
Authored: Sun Jun 5 15:59:52 2016 -0700
Committer: Cheng Lian <l...@databricks.com>
Committed: Sun Jun 5 16:00:00 2016 -0700

--
 .../main/scala/org/apache/spark/sql/Row.scala   | 10 +---
 .../sql/catalyst/encoders/RowEncoder.scala      | 17 --
 .../catalyst/expressions/objects/objects.scala  | 61 +---
 .../sql/catalyst/encoders/RowEncoderSuite.scala | 47 ++-
 4 files changed, 95 insertions(+), 40 deletions(-)
--

http://git-wip-us.apache.org/repos/asf/spark/blob/38a626a5/sql/catalyst/src/main/scala/org/apache/spark/sql/Row.scala
--
diff --git
a/sql/catalyst/src/main/scala/org/apache/spark/sql/Row.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/Row.scala index a257b83..391001d 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/Row.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/Row.scala @@ -304,15 +304,7 @@ trait Row extends Serializable { * * @throws ClassCastException when data type does not match. */ - def getStruct(i: Int): Row = { -// Product and Row both are recognized as StructType in a Row -val t = get(i) -if (t.isInstanceOf[Product]) { - Row.fromTuple(t.asInstanceOf[Product]) -} else { - t.asInstanceOf[Row] -} - } + def getStruct(i: Int): Row = getAs[Row](i) /** * Returns the value at position i. http://git-wip-us.apache.org/repos/asf/spark/blob/38a626a5/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/RowEncoder.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/RowEncoder.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/RowEncoder.scala index 6cd7b34..67fca15 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/RowEncoder.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/RowEncoder.scala @@ -51,7 +51,7 @@ import org.apache.spark.unsafe.types.UTF8String * BinaryType -> byte array * ArrayType -> scala.collection.Seq or Array * MapType -> scala.collection.Map - * StructType -> org.apache.spark.sql.Row or Product + * StructType -> org.apache.spark.sql.Row * }}} */ object RowEncoder { @@ -121,11 +121,15 @@ object RowEncoder { case t @ ArrayType(et, _) => et match { case BooleanType | ByteType | ShortType | IntegerType | LongType | FloatType | DoubleType => +// TODO: validate input type for primitive array. 
NewInstance( classOf[GenericArrayData], inputObject :: Nil, dataType = t) - case _ => MapObjects(serializerFor(_, et), inputObject, externalDataTypeForInput(et)) + case _ => MapObjects( +element => serializerFor(ValidateExternalType(element, et), et), +inputObject, +ObjectType(classOf[Object])) } case t @ MapType(kt, vt, valueNullable) => @@ -151,8 +155,9 @@ object RowEncoder { case StructType(fields) => val nonNullOutput = CreateNamedStruct(fields.zipWithIndex.flatMap { case (field, index) => val fieldValue = serializerFor( - GetExternalRowField( -inputObject, index, field.name, external
spark git commit: [SPARK-15657][SQL] RowEncoder should validate the data type of input object
Repository: spark
Updated Branches:
  refs/heads/master 8a9110510 -> 30c4774f3

[SPARK-15657][SQL] RowEncoder should validate the data type of input object

## What changes were proposed in this pull request?

This PR improves the error handling of `RowEncoder`. When we create a `RowEncoder` with a given schema, we should validate the data type of the input object; e.g. we should throw an exception when a field is a boolean but is declared as a string column.

This PR also removes the support for using `Product` as a valid external type of struct type. That support was added at https://github.com/apache/spark/pull/9712, but is incomplete: e.g. nested products and products in arrays both do not work. However, we never officially supported this feature, and I think it's OK to ban it.

## How was this patch tested?

New tests in `RowEncoderSuite`.

Author: Wenchen Fan <wenc...@databricks.com>

Closes #13401 from cloud-fan/bug.

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/30c4774f
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/30c4774f
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/30c4774f
Branch: refs/heads/master
Commit: 30c4774f33fed63b7d400d220d710fb432f599a8
Parents: 8a91105
Author: Wenchen Fan <wenc...@databricks.com>
Authored: Sun Jun 5 15:59:52 2016 -0700
Committer: Cheng Lian <l...@databricks.com>
Committed: Sun Jun 5 15:59:52 2016 -0700

--
 .../main/scala/org/apache/spark/sql/Row.scala   | 10 +---
 .../sql/catalyst/encoders/RowEncoder.scala      | 17 --
 .../catalyst/expressions/objects/objects.scala  | 61 +---
 .../sql/catalyst/encoders/RowEncoderSuite.scala | 47 ++-
 4 files changed, 95 insertions(+), 40 deletions(-)
--

http://git-wip-us.apache.org/repos/asf/spark/blob/30c4774f/sql/catalyst/src/main/scala/org/apache/spark/sql/Row.scala
--
diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/Row.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/Row.scala
index a257b83..391001d 100644
---
a/sql/catalyst/src/main/scala/org/apache/spark/sql/Row.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/Row.scala @@ -304,15 +304,7 @@ trait Row extends Serializable { * * @throws ClassCastException when data type does not match. */ - def getStruct(i: Int): Row = { -// Product and Row both are recognized as StructType in a Row -val t = get(i) -if (t.isInstanceOf[Product]) { - Row.fromTuple(t.asInstanceOf[Product]) -} else { - t.asInstanceOf[Row] -} - } + def getStruct(i: Int): Row = getAs[Row](i) /** * Returns the value at position i. http://git-wip-us.apache.org/repos/asf/spark/blob/30c4774f/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/RowEncoder.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/RowEncoder.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/RowEncoder.scala index 6cd7b34..67fca15 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/RowEncoder.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/RowEncoder.scala @@ -51,7 +51,7 @@ import org.apache.spark.unsafe.types.UTF8String * BinaryType -> byte array * ArrayType -> scala.collection.Seq or Array * MapType -> scala.collection.Map - * StructType -> org.apache.spark.sql.Row or Product + * StructType -> org.apache.spark.sql.Row * }}} */ object RowEncoder { @@ -121,11 +121,15 @@ object RowEncoder { case t @ ArrayType(et, _) => et match { case BooleanType | ByteType | ShortType | IntegerType | LongType | FloatType | DoubleType => +// TODO: validate input type for primitive array. 
NewInstance( classOf[GenericArrayData], inputObject :: Nil, dataType = t) - case _ => MapObjects(serializerFor(_, et), inputObject, externalDataTypeForInput(et)) + case _ => MapObjects( +element => serializerFor(ValidateExternalType(element, et), et), +inputObject, +ObjectType(classOf[Object])) } case t @ MapType(kt, vt, valueNullable) => @@ -151,8 +155,9 @@ object RowEncoder { case StructType(fields) => val nonNullOutput = CreateNamedStruct(fields.zipWithIndex.flatMap { case (field, index) => val fieldValue = serializerFor( - GetExternalRowField( -inputObject, index, field.name, externalDataTypeForInput(field.dataType)), + ValidateExternalType( +GetExternalRowField(inputObject, index, field.name), +
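The validation this patch introduces can be sketched in standalone Scala: before serializing a field, check the external object's runtime class against what the declared column type expects, and fail with a descriptive error up front rather than with a confusing `ClassCastException` later. This is a hypothetical model; in Spark the check is performed by a `ValidateExternalType` expression injected into the serializer:

```scala
// Toy data types standing in for Spark SQL's DataType hierarchy.
sealed trait DataType
case object StringType  extends DataType
case object BooleanType extends DataType
case object IntegerType extends DataType

// Which external (JVM) class each declared column type accepts.
def expectedClass(dt: DataType): Class[_] = dt match {
  case StringType  => classOf[String]
  case BooleanType => classOf[java.lang.Boolean]
  case IntegerType => classOf[java.lang.Integer]
}

// Validate one field's external value against its declared column type,
// returning the value unchanged when it matches.
def validateExternalType(value: Any, dt: DataType, field: String): Any = {
  val expected = expectedClass(dt)
  if (value != null && !expected.isInstance(value))
    throw new RuntimeException(
      s"$field: ${expected.getName} is not a valid external type for $dt, " +
      s"got ${value.getClass.getName}")
  value
}
```

So a boolean value supplied for a column declared as a string fails immediately with a message naming the field, the expected class, and the actual class, which is the improved error handling the PR describes.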
spark git commit: [SPARK-15140][SQL] make the semantics of null input object for encoder clear
Repository: spark
Updated Branches:
  refs/heads/branch-2.0 52376e067 -> 7315acf89

[SPARK-15140][SQL] make the semantics of null input object for encoder clear

## What changes were proposed in this pull request?

For an input object of non-flat type, we can't encode it to a row if it's null, as Spark SQL doesn't allow a row to be null; only its columns can be null. This PR explicitly adds this constraint and throws an exception if users break it.

## How was this patch tested?

Several new tests.

Author: Wenchen Fan <wenc...@databricks.com>

Closes #13469 from cloud-fan/null-object.

(cherry picked from commit 11c83f83d5172167cb64513d5311b4178797d40e)
Signed-off-by: Cheng Lian <l...@databricks.com>

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/7315acf8
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/7315acf8
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/7315acf8
Branch: refs/heads/branch-2.0
Commit: 7315acf896b2474a4b7513434f5ba2faf468abd9
Parents: 52376e0
Author: Wenchen Fan <wenc...@databricks.com>
Authored: Fri Jun 3 14:28:19 2016 -0700
Committer: Cheng Lian <l...@databricks.com>
Committed: Fri Jun 3 14:28:26 2016 -0700

--
 .../sql/catalyst/encoders/ExpressionEncoder.scala      | 13 ++---
 .../spark/sql/catalyst/encoders/RowEncoder.scala       |  7 +++
 .../sql/catalyst/expressions/objects/objects.scala     |  4 ++--
 .../spark/sql/catalyst/encoders/RowEncoderSuite.scala  |  8
 .../test/scala/org/apache/spark/sql/DatasetSuite.scala | 10 ++
 5 files changed, 33 insertions(+), 9 deletions(-)
--

http://git-wip-us.apache.org/repos/asf/spark/blob/7315acf8/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/ExpressionEncoder.scala
--
diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/ExpressionEncoder.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/ExpressionEncoder.scala
index cc59d06..688082d 100644
---
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/ExpressionEncoder.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/ExpressionEncoder.scala @@ -25,7 +25,7 @@ import org.apache.spark.sql.catalyst.{InternalRow, JavaTypeInference, ScalaRefle import org.apache.spark.sql.catalyst.analysis.{Analyzer, GetColumnByOrdinal, SimpleAnalyzer, UnresolvedAttribute, UnresolvedExtractValue} import org.apache.spark.sql.catalyst.expressions._ import org.apache.spark.sql.catalyst.expressions.codegen.{GenerateSafeProjection, GenerateUnsafeProjection} -import org.apache.spark.sql.catalyst.expressions.objects.{Invoke, NewInstance} +import org.apache.spark.sql.catalyst.expressions.objects.{AssertNotNull, Invoke, NewInstance} import org.apache.spark.sql.catalyst.optimizer.SimplifyCasts import org.apache.spark.sql.catalyst.plans.logical.{CatalystSerde, DeserializeToObject, LocalRelation} import org.apache.spark.sql.types.{ObjectType, StructField, StructType} @@ -50,8 +50,15 @@ object ExpressionEncoder { val cls = mirror.runtimeClass(tpe) val flat = !ScalaReflection.definedByConstructorParams(tpe) -val inputObject = BoundReference(0, ScalaReflection.dataTypeFor[T], nullable = false) -val serializer = ScalaReflection.serializerFor[T](inputObject) +val inputObject = BoundReference(0, ScalaReflection.dataTypeFor[T], nullable = true) +val nullSafeInput = if (flat) { + inputObject +} else { + // For input object of non-flat type, we can't encode it to row if it's null, as Spark SQL + // doesn't allow top-level row to be null, only its columns can be null. 
+ AssertNotNull(inputObject, Seq("top level non-flat input object")) +} +val serializer = ScalaReflection.serializerFor[T](nullSafeInput) val deserializer = ScalaReflection.deserializerFor[T] val schema = ScalaReflection.schemaFor[T] match { http://git-wip-us.apache.org/repos/asf/spark/blob/7315acf8/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/RowEncoder.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/RowEncoder.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/RowEncoder.scala index 3c6ae1c..6cd7b34 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/RowEncoder.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/RowEncoder.scala @@ -57,8 +57,8 @@ import org.apache.spark.unsafe.types.UTF8String object RowEncoder { def apply(schema: StructType): ExpressionEncoder[Row] = { val cls = classOf[Row] -
spark git commit: [SPARK-15140][SQL] make the semantics of null input object for encoder clear
Repository: spark
Updated Branches:
  refs/heads/master 28ad0f7b0 -> 11c83f83d

[SPARK-15140][SQL] make the semantics of null input object for encoder clear

## What changes were proposed in this pull request?

For an input object of non-flat type, we can't encode it to a row if it's null, as Spark SQL doesn't allow a row to be null; only its columns can be null. This PR explicitly adds this constraint and throws an exception if users break it.

## How was this patch tested?

Several new tests.

Author: Wenchen Fan <wenc...@databricks.com>

Closes #13469 from cloud-fan/null-object.

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/11c83f83
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/11c83f83
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/11c83f83
Branch: refs/heads/master
Commit: 11c83f83d5172167cb64513d5311b4178797d40e
Parents: 28ad0f7
Author: Wenchen Fan <wenc...@databricks.com>
Authored: Fri Jun 3 14:28:19 2016 -0700
Committer: Cheng Lian <l...@databricks.com>
Committed: Fri Jun 3 14:28:19 2016 -0700

--
 .../sql/catalyst/encoders/ExpressionEncoder.scala      | 13 ++---
 .../spark/sql/catalyst/encoders/RowEncoder.scala       |  7 +++
 .../sql/catalyst/expressions/objects/objects.scala     |  4 ++--
 .../spark/sql/catalyst/encoders/RowEncoderSuite.scala  |  8
 .../test/scala/org/apache/spark/sql/DatasetSuite.scala | 10 ++
 5 files changed, 33 insertions(+), 9 deletions(-)
--

http://git-wip-us.apache.org/repos/asf/spark/blob/11c83f83/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/ExpressionEncoder.scala
--
diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/ExpressionEncoder.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/ExpressionEncoder.scala
index cc59d06..688082d 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/ExpressionEncoder.scala
+++
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/ExpressionEncoder.scala @@ -25,7 +25,7 @@ import org.apache.spark.sql.catalyst.{InternalRow, JavaTypeInference, ScalaRefle import org.apache.spark.sql.catalyst.analysis.{Analyzer, GetColumnByOrdinal, SimpleAnalyzer, UnresolvedAttribute, UnresolvedExtractValue} import org.apache.spark.sql.catalyst.expressions._ import org.apache.spark.sql.catalyst.expressions.codegen.{GenerateSafeProjection, GenerateUnsafeProjection} -import org.apache.spark.sql.catalyst.expressions.objects.{Invoke, NewInstance} +import org.apache.spark.sql.catalyst.expressions.objects.{AssertNotNull, Invoke, NewInstance} import org.apache.spark.sql.catalyst.optimizer.SimplifyCasts import org.apache.spark.sql.catalyst.plans.logical.{CatalystSerde, DeserializeToObject, LocalRelation} import org.apache.spark.sql.types.{ObjectType, StructField, StructType} @@ -50,8 +50,15 @@ object ExpressionEncoder { val cls = mirror.runtimeClass(tpe) val flat = !ScalaReflection.definedByConstructorParams(tpe) -val inputObject = BoundReference(0, ScalaReflection.dataTypeFor[T], nullable = false) -val serializer = ScalaReflection.serializerFor[T](inputObject) +val inputObject = BoundReference(0, ScalaReflection.dataTypeFor[T], nullable = true) +val nullSafeInput = if (flat) { + inputObject +} else { + // For input object of non-flat type, we can't encode it to row if it's null, as Spark SQL + // doesn't allow top-level row to be null, only its columns can be null. 
+ AssertNotNull(inputObject, Seq("top level non-flat input object")) +} +val serializer = ScalaReflection.serializerFor[T](nullSafeInput) val deserializer = ScalaReflection.deserializerFor[T] val schema = ScalaReflection.schemaFor[T] match { http://git-wip-us.apache.org/repos/asf/spark/blob/11c83f83/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/RowEncoder.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/RowEncoder.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/RowEncoder.scala index 3c6ae1c..6cd7b34 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/RowEncoder.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/RowEncoder.scala @@ -57,8 +57,8 @@ import org.apache.spark.unsafe.types.UTF8String object RowEncoder { def apply(schema: StructType): ExpressionEncoder[Row] = { val cls = classOf[Row] -val inputObject = BoundReference(0, ObjectType(cls), nullable = false) -val serializer = serializerFor(inputObject, schema) +val
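The constraint described above, that a top-level non-flat object cannot be null even though its columns can be, can be illustrated with a small standalone sketch. This is a hypothetical model of the `AssertNotNull` guard, not Spark's actual expression:

```scala
// Fail eagerly, with a path for context, when a non-nullable object is null.
def assertNotNull[T](obj: T, walkedPath: Seq[String]): T = {
  if (obj == null)
    throw new NullPointerException(
      s"Null value appeared in non-nullable field: ${walkedPath.mkString(".")}. " +
      "Top-level rows cannot be null; only their columns can be.")
  obj
}

// A non-flat type: it is encoded as a multi-column row, not a single column.
case class Person(name: String, age: Int)

// Encoding a Person: the object itself must be non-null, but a column value
// inside it (name here) may still be null.
def encode(p: Person): Seq[Any] = {
  val safe = assertNotNull(p, Seq("top level non-flat input object"))
  Seq(safe.name, safe.age)
}
```

Encoding `Person(null, 1)` succeeds and yields a row with a null column, while encoding a null `Person` fails immediately, which is exactly the semantics the PR pins down.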
spark git commit: [SPARK-15547][SQL] nested case class in encoder can have different number of fields from the real schema
Repository: spark
Updated Branches:
  refs/heads/master eb10b481c -> 61b80d552

[SPARK-15547][SQL] nested case class in encoder can have different number of fields from the real schema

## What changes were proposed in this pull request?

There are two kinds of `GetStructField`:

1. resolved from `UnresolvedExtractValue`; it will have a `name` property.
2. created when we build the deserializer expression for a nested tuple; it has no `name` property.

When we want to validate the ordinals of a nested tuple, we should only catch `GetStructField` without the `name` property.

## How was this patch tested?

New test in `EncoderResolutionSuite`.

Author: Wenchen Fan <wenc...@databricks.com>

Closes #13474 from cloud-fan/ordinal-check.

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/61b80d55
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/61b80d55
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/61b80d55
Branch: refs/heads/master
Commit: 61b80d552aafb262b5f817f7bc9c0acd0328715b
Parents: eb10b48
Author: Wenchen Fan <wenc...@databricks.com>
Authored: Fri Jun 3 14:26:24 2016 -0700
Committer: Cheng Lian <l...@databricks.com>
Committed: Fri Jun 3 14:26:24 2016 -0700

--
 .../org/apache/spark/sql/catalyst/analysis/Analyzer.scala    | 7 ++-
 .../spark/sql/catalyst/encoders/EncoderResolutionSuite.scala | 6 ++
 2 files changed, 12 insertions(+), 1 deletion(-)
--

http://git-wip-us.apache.org/repos/asf/spark/blob/61b80d55/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
--
diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
index 4f6b483..0e68656 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
@@ -1964,7 +1964,12 @@ class Analyzer(
   */
private def validateNestedTupleFields(deserializer: Expression): Unit = { val structChildToOrdinals = deserializer -.collect { case g: GetStructField => g } +// There are 2 kinds of `GetStructField`: +// 1. resolved from `UnresolvedExtractValue`, and it will have a `name` property. +// 2. created when we build deserializer expression for nested tuple, no `name` property. +// Here we want to validate the ordinals of nested tuple, so we should only catch +// `GetStructField` without the name property. +.collect { case g: GetStructField if g.name.isEmpty => g } .groupBy(_.child) .mapValues(_.map(_.ordinal).distinct.sorted) http://git-wip-us.apache.org/repos/asf/spark/blob/61b80d55/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/encoders/EncoderResolutionSuite.scala -- diff --git a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/encoders/EncoderResolutionSuite.scala b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/encoders/EncoderResolutionSuite.scala index 7251202..802397d 100644 --- a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/encoders/EncoderResolutionSuite.scala +++ b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/encoders/EncoderResolutionSuite.scala @@ -115,6 +115,12 @@ class EncoderResolutionSuite extends PlanTest { } } + test("nested case class can have different number of fields from the real schema") { +val encoder = ExpressionEncoder[(String, StringIntClass)] +val attrs = Seq('a.string, 'b.struct('a.string, 'b.int, 'c.int)) +encoder.resolveAndBind(attrs) + } + test("throw exception if real type is not compatible with encoder schema") { val msg1 = intercept[AnalysisException] { ExpressionEncoder[StringIntClass].resolveAndBind(Seq('a.string, 'b.long)) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
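The new `EncoderResolutionSuite` test above can be read as a small usage sketch. The following is a hypothetical spark-shell style session (internal Catalyst APIs, not public; `StringIntClass` is assumed to be a simple two-field case class as in the suite) showing what the fix now permits: the real schema carries an extra nested field `'c` that the case class does not declare, and resolution succeeds because the ordinal check only looks at name-less `GetStructField` nodes.

```scala
import org.apache.spark.sql.catalyst.dsl.expressions._
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder

case class StringIntClass(a: String, b: Int)

// Schema has an extra nested field 'c that StringIntClass omits;
// before this fix the nested-tuple ordinal validation rejected this.
val encoder = ExpressionEncoder[(String, StringIntClass)]
val attrs = Seq('a.string, 'b.struct('a.string, 'b.int, 'c.int))
encoder.resolveAndBind(attrs)  // no longer throws
```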
spark git commit: [SPARK-15494][SQL] encoder code cleanup
Repository: spark Updated Branches: refs/heads/branch-2.0 3a9ee549c -> deda53a9e [SPARK-15494][SQL] encoder code cleanup ## What changes were proposed in this pull request? Our encoder framework has been evolved a lot, this PR tries to clean up the code to make it more readable and emphasise the concept that encoder should be used as a container of serde expressions. 1. move validation logic to analyzer instead of encoder 2. only have a `resolveAndBind` method in encoder instead of `resolve` and `bind`, as we don't have the encoder life cycle concept anymore. 3. `Dataset` don't need to keep a resolved encoder, as there is no such concept anymore. bound encoder is still needed to do serialization outside of query framework. 4. Using `BoundReference` to represent an unresolved field in deserializer expression is kind of weird, this PR adds a `GetColumnByOrdinal` for this purpose. (serializer expression still use `BoundReference`, we can replace it with `GetColumnByOrdinal` in follow-ups) ## How was this patch tested? existing test Author: Wenchen Fan <wenc...@databricks.com> Author: Cheng Lian <l...@databricks.com> Closes #13269 from cloud-fan/clean-encoder. 
(cherry picked from commit 190ff274fd71662023a804cf98400c71f9f7da4f) Signed-off-by: Cheng Lian <l...@databricks.com> Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/deda53a9 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/deda53a9 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/deda53a9 Branch: refs/heads/branch-2.0 Commit: deda53a9e43c8be39f9e5df88371a7904c1c0e9a Parents: 3a9ee54 Author: Wenchen Fan <wenc...@databricks.com> Authored: Fri Jun 3 00:43:02 2016 -0700 Committer: Cheng Lian <l...@databricks.com> Committed: Fri Jun 3 00:43:10 2016 -0700 -- .../linalg/UDTSerializationBenchmark.scala | 2 +- .../scala/org/apache/spark/sql/Encoders.scala | 3 +- .../spark/sql/catalyst/JavaTypeInference.scala | 6 +- .../spark/sql/catalyst/ScalaReflection.scala| 307 +-- .../spark/sql/catalyst/analysis/Analyzer.scala | 53 +++- .../sql/catalyst/analysis/unresolved.scala | 7 + .../catalyst/encoders/ExpressionEncoder.scala | 134 ++-- .../sql/catalyst/encoders/RowEncoder.scala | 8 +- .../sql/catalyst/plans/logical/object.scala | 19 +- .../encoders/EncoderResolutionSuite.scala | 42 +-- .../encoders/ExpressionEncoderSuite.scala | 11 +- .../sql/catalyst/encoders/RowEncoderSuite.scala | 14 +- .../scala/org/apache/spark/sql/Dataset.scala| 48 +-- .../spark/sql/KeyValueGroupedDataset.scala | 26 +- .../spark/sql/RelationalGroupedDataset.scala| 2 +- .../aggregate/TypedAggregateExpression.scala| 6 +- .../scala/org/apache/spark/sql/functions.scala | 2 +- .../org/apache/spark/sql/DatasetSuite.scala | 8 +- .../scala/org/apache/spark/sql/QueryTest.scala | 8 +- .../sql/execution/GroupedIteratorSuite.scala| 6 +- .../apache/spark/sql/streaming/StreamTest.scala | 4 +- 21 files changed, 324 insertions(+), 392 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/deda53a9/mllib/src/test/scala/org/apache/spark/mllib/linalg/UDTSerializationBenchmark.scala -- diff --git 
a/mllib/src/test/scala/org/apache/spark/mllib/linalg/UDTSerializationBenchmark.scala b/mllib/src/test/scala/org/apache/spark/mllib/linalg/UDTSerializationBenchmark.scala index be7110a..8b439e6 100644 --- a/mllib/src/test/scala/org/apache/spark/mllib/linalg/UDTSerializationBenchmark.scala +++ b/mllib/src/test/scala/org/apache/spark/mllib/linalg/UDTSerializationBenchmark.scala @@ -29,7 +29,7 @@ object UDTSerializationBenchmark { val iters = 1e2.toInt val numRows = 1e3.toInt -val encoder = ExpressionEncoder[Vector].defaultBinding +val encoder = ExpressionEncoder[Vector].resolveAndBind() val vectors = (1 to numRows).map { i => Vectors.dense(Array.fill(1e5.toInt)(1.0 * i)) http://git-wip-us.apache.org/repos/asf/spark/blob/deda53a9/sql/catalyst/src/main/scala/org/apache/spark/sql/Encoders.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/Encoders.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/Encoders.scala index fa96f82..673c587 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/Encoders.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/Encoders.scala @@ -23,6 +23,7 @@ import scala.reflect.{classTag, ClassTag} import scala.reflect.runtime.universe.TypeTag import org.apache.spark.annotation.Experimental +import org.apache.spark.sql.catalyst
spark git commit: [SPARK-15494][SQL] encoder code cleanup
Repository: spark Updated Branches: refs/heads/master b9fcfb3bd -> 190ff274f [SPARK-15494][SQL] encoder code cleanup ## What changes were proposed in this pull request? Our encoder framework has been evolved a lot, this PR tries to clean up the code to make it more readable and emphasise the concept that encoder should be used as a container of serde expressions. 1. move validation logic to analyzer instead of encoder 2. only have a `resolveAndBind` method in encoder instead of `resolve` and `bind`, as we don't have the encoder life cycle concept anymore. 3. `Dataset` don't need to keep a resolved encoder, as there is no such concept anymore. bound encoder is still needed to do serialization outside of query framework. 4. Using `BoundReference` to represent an unresolved field in deserializer expression is kind of weird, this PR adds a `GetColumnByOrdinal` for this purpose. (serializer expression still use `BoundReference`, we can replace it with `GetColumnByOrdinal` in follow-ups) ## How was this patch tested? existing test Author: Wenchen Fan <wenc...@databricks.com> Author: Cheng Lian <l...@databricks.com> Closes #13269 from cloud-fan/clean-encoder. 
Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/190ff274 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/190ff274 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/190ff274 Branch: refs/heads/master Commit: 190ff274fd71662023a804cf98400c71f9f7da4f Parents: b9fcfb3 Author: Wenchen Fan <wenc...@databricks.com> Authored: Fri Jun 3 00:43:02 2016 -0700 Committer: Cheng Lian <l...@databricks.com> Committed: Fri Jun 3 00:43:02 2016 -0700 -- .../linalg/UDTSerializationBenchmark.scala | 2 +- .../scala/org/apache/spark/sql/Encoders.scala | 3 +- .../spark/sql/catalyst/JavaTypeInference.scala | 6 +- .../spark/sql/catalyst/ScalaReflection.scala| 307 +-- .../spark/sql/catalyst/analysis/Analyzer.scala | 53 +++- .../sql/catalyst/analysis/unresolved.scala | 7 + .../catalyst/encoders/ExpressionEncoder.scala | 134 ++-- .../sql/catalyst/encoders/RowEncoder.scala | 8 +- .../sql/catalyst/plans/logical/object.scala | 19 +- .../encoders/EncoderResolutionSuite.scala | 42 +-- .../encoders/ExpressionEncoderSuite.scala | 11 +- .../sql/catalyst/encoders/RowEncoderSuite.scala | 14 +- .../scala/org/apache/spark/sql/Dataset.scala| 48 +-- .../spark/sql/KeyValueGroupedDataset.scala | 26 +- .../spark/sql/RelationalGroupedDataset.scala| 2 +- .../aggregate/TypedAggregateExpression.scala| 6 +- .../scala/org/apache/spark/sql/functions.scala | 2 +- .../org/apache/spark/sql/DatasetSuite.scala | 8 +- .../scala/org/apache/spark/sql/QueryTest.scala | 8 +- .../sql/execution/GroupedIteratorSuite.scala| 6 +- .../apache/spark/sql/streaming/StreamTest.scala | 4 +- 21 files changed, 324 insertions(+), 392 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/190ff274/mllib/src/test/scala/org/apache/spark/mllib/linalg/UDTSerializationBenchmark.scala -- diff --git a/mllib/src/test/scala/org/apache/spark/mllib/linalg/UDTSerializationBenchmark.scala 
b/mllib/src/test/scala/org/apache/spark/mllib/linalg/UDTSerializationBenchmark.scala index be7110a..8b439e6 100644 --- a/mllib/src/test/scala/org/apache/spark/mllib/linalg/UDTSerializationBenchmark.scala +++ b/mllib/src/test/scala/org/apache/spark/mllib/linalg/UDTSerializationBenchmark.scala @@ -29,7 +29,7 @@ object UDTSerializationBenchmark { val iters = 1e2.toInt val numRows = 1e3.toInt -val encoder = ExpressionEncoder[Vector].defaultBinding +val encoder = ExpressionEncoder[Vector].resolveAndBind() val vectors = (1 to numRows).map { i => Vectors.dense(Array.fill(1e5.toInt)(1.0 * i)) http://git-wip-us.apache.org/repos/asf/spark/blob/190ff274/sql/catalyst/src/main/scala/org/apache/spark/sql/Encoders.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/Encoders.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/Encoders.scala index fa96f82..673c587 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/Encoders.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/Encoders.scala @@ -23,6 +23,7 @@ import scala.reflect.{classTag, ClassTag} import scala.reflect.runtime.universe.TypeTag import org.apache.spark.annotation.Experimental +import org.apache.spark.sql.catalyst.analysis.GetColumnByOrdinal import org.apache.spark.sql.catalyst.encoders.{encoderFor, ExpressionEncoder} import org.apache.spark.sql.catalyst.expressions.
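The benchmark diff above shows the shape of the new API: the old `resolve`/`bind` pair (and `defaultBinding`) collapse into a single `resolveAndBind()` call. A hedged sketch of using a bound encoder for serialization outside the query framework, as item 3 of the commit message describes (internal API; method names taken from this codebase, not a stable public contract):

```scala
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder

// One call now yields an encoder ready for both directions of serde.
val enc = ExpressionEncoder[(Int, String)].resolveAndBind()

val internalRow = enc.toRow((1, "a"))      // object -> InternalRow
val roundTrip   = enc.fromRow(internalRow) // InternalRow -> object
```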
spark git commit: [SPARK-14959][SQL] handle partitioned table directories in distributed filesystem
Repository: spark Updated Branches: refs/heads/master 6dde27404 -> 76aa45d35 [SPARK-14959][SQL] handle partitioned table directories in distributed filesystem ## What changes were proposed in this pull request? # The root cause: When `DataSource.resolveRelation` is trying to build `ListingFileCatalog` object, `ListLeafFiles` is invoked where a list of `FileStatus` objects are retrieved from the provided path. These FileStatus objects include directories for the partitions (id=0 and id=2 in the jira). However, these directory `FileStatus` objects also try to invoke `getFileBlockLocations` where directory is not allowed for `DistributedFileSystem`, hence the exception happens. This PR is to remove the block of code that invokes `getFileBlockLocations` for every FileStatus object of the provided path. Instead, we call `HadoopFsRelation.listLeafFiles` directly because this utility method filters out the directories before calling `getFileBlockLocations` for generating `LocatedFileStatus` objects. ## How was this patch tested? Regtest is run. Manual test: ``` scala> spark.read.format("parquet").load("hdfs://bdavm009.svl.ibm.com:8020/user/spark/SPARK-14959_part").show +-+---+ | text| id| +-+---+ |hello| 0| |world| 0| |hello| 1| |there| 1| +-+---+ spark.read.format("orc").load("hdfs://bdavm009.svl.ibm.com:8020/user/spark/SPARK-14959_orc").show +-+---+ | text| id| +-+---+ |hello| 0| |world| 0| |hello| 1| |there| 1| +-+---+ ``` I also tried it with 2 level of partitioning. I have not found a way to add test case in the unit test bucket that can test a real hdfs file location. Any suggestions will be appreciated. Author: Xin Wu <xi...@us.ibm.com> Closes #13463 from xwu0226/SPARK-14959. 
Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/76aa45d3 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/76aa45d3 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/76aa45d3 Branch: refs/heads/master Commit: 76aa45d359d034e9ccaac64b36738d47e1e42f2c Parents: 6dde274 Author: Xin Wu <xi...@us.ibm.com> Authored: Thu Jun 2 22:49:17 2016 -0700 Committer: Cheng Lian <l...@databricks.com> Committed: Thu Jun 2 22:49:17 2016 -0700 -- .../datasources/ListingFileCatalog.scala| 36 ++-- .../datasources/fileSourceInterfaces.scala | 10 ++ .../datasources/FileSourceStrategySuite.scala | 1 + 3 files changed, 14 insertions(+), 33 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/76aa45d3/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/ListingFileCatalog.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/ListingFileCatalog.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/ListingFileCatalog.scala index 644e5d6..dd3c96a 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/ListingFileCatalog.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/ListingFileCatalog.scala @@ -83,40 +83,10 @@ class ListingFileCatalog( val statuses: Seq[FileStatus] = paths.flatMap { path => val fs = path.getFileSystem(hadoopConf) logInfo(s"Listing $path on driver") - -val statuses = { - val stats = Try(fs.listStatus(path)).getOrElse(Array.empty[FileStatus]) - if (pathFilter != null) stats.filter(f => pathFilter.accept(f.getPath)) else stats -} - -statuses.map { - case f: LocatedFileStatus => f - - // NOTE: - // - // - Although S3/S3A/S3N file system can be quite slow for remote file metadata - // operations, calling `getFileBlockLocations` does no harm here since these file system - // implementations don't actually issue RPC for this method. 
- // - // - Here we are calling `getFileBlockLocations` in a sequential manner, but it should a - // a big deal since we always use to `listLeafFilesInParallel` when the number of paths - // exceeds threshold. - case f => -HadoopFsRelation.createLocatedFileStatus(f, fs.getFileBlockLocations(f, 0, f.getLen)) -} - }.filterNot { status => -val name = status.getPath.getName -HadoopFsRelation.shouldFilterOut(name) - } - - val (dirs, files) = statuses.partition(_.isDirectory) - - // It uses [[LinkedHashSet]] since the order of files can affect the results. (SPARK-11500) - if (dirs.isEmpty) { -mutable.LinkedHashSet(files: _*) - } else { -
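The core idea of the fix — filter out directory entries before asking for block locations — can be sketched with plain Hadoop APIs (the path is hypothetical; `HadoopFsRelation.listLeafFiles`, which the patch actually calls, does this filtering internally):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, LocatedFileStatus, Path}

val path = new Path("hdfs:///user/spark/SPARK-14959_part") // hypothetical
val fs = path.getFileSystem(new Configuration())

// Partition directories (e.g. id=0, id=2) come back as directory
// FileStatus entries; DistributedFileSystem.getFileBlockLocations
// rejects directories, so they must be filtered out first.
val located = fs.listStatus(path)
  .filterNot(_.isDirectory)
  .map(f => new LocatedFileStatus(f, fs.getFileBlockLocations(f, 0, f.getLen)))
```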
spark git commit: [SPARK-14959][SQL] handle partitioned table directories in distributed filesystem
Repository: spark Updated Branches: refs/heads/branch-2.0 d5bd64a20 -> 1e13d09c5 [SPARK-14959][SQL] handle partitioned table directories in distributed filesystem ## What changes were proposed in this pull request? # The root cause: When `DataSource.resolveRelation` is trying to build `ListingFileCatalog` object, `ListLeafFiles` is invoked where a list of `FileStatus` objects are retrieved from the provided path. These FileStatus objects include directories for the partitions (id=0 and id=2 in the jira). However, these directory `FileStatus` objects also try to invoke `getFileBlockLocations` where directory is not allowed for `DistributedFileSystem`, hence the exception happens. This PR is to remove the block of code that invokes `getFileBlockLocations` for every FileStatus object of the provided path. Instead, we call `HadoopFsRelation.listLeafFiles` directly because this utility method filters out the directories before calling `getFileBlockLocations` for generating `LocatedFileStatus` objects. ## How was this patch tested? Regtest is run. Manual test: ``` scala> spark.read.format("parquet").load("hdfs://bdavm009.svl.ibm.com:8020/user/spark/SPARK-14959_part").show +-+---+ | text| id| +-+---+ |hello| 0| |world| 0| |hello| 1| |there| 1| +-+---+ spark.read.format("orc").load("hdfs://bdavm009.svl.ibm.com:8020/user/spark/SPARK-14959_orc").show +-+---+ | text| id| +-+---+ |hello| 0| |world| 0| |hello| 1| |there| 1| +-+---+ ``` I also tried it with 2 level of partitioning. I have not found a way to add test case in the unit test bucket that can test a real hdfs file location. Any suggestions will be appreciated. Author: Xin Wu <xi...@us.ibm.com> Closes #13463 from xwu0226/SPARK-14959. 
(cherry picked from commit 76aa45d359d034e9ccaac64b36738d47e1e42f2c) Signed-off-by: Cheng Lian <l...@databricks.com> Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/1e13d09c Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/1e13d09c Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/1e13d09c Branch: refs/heads/branch-2.0 Commit: 1e13d09c526ec37c344950b07d938751bbd6fd0a Parents: d5bd64a Author: Xin Wu <xi...@us.ibm.com> Authored: Thu Jun 2 22:49:17 2016 -0700 Committer: Cheng Lian <l...@databricks.com> Committed: Thu Jun 2 22:49:24 2016 -0700 -- .../datasources/ListingFileCatalog.scala| 36 ++-- .../datasources/fileSourceInterfaces.scala | 10 ++ .../datasources/FileSourceStrategySuite.scala | 1 + 3 files changed, 14 insertions(+), 33 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/1e13d09c/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/ListingFileCatalog.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/ListingFileCatalog.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/ListingFileCatalog.scala index 644e5d6..dd3c96a 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/ListingFileCatalog.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/ListingFileCatalog.scala @@ -83,40 +83,10 @@ class ListingFileCatalog( val statuses: Seq[FileStatus] = paths.flatMap { path => val fs = path.getFileSystem(hadoopConf) logInfo(s"Listing $path on driver") - -val statuses = { - val stats = Try(fs.listStatus(path)).getOrElse(Array.empty[FileStatus]) - if (pathFilter != null) stats.filter(f => pathFilter.accept(f.getPath)) else stats -} - -statuses.map { - case f: LocatedFileStatus => f - - // NOTE: - // - // - Although S3/S3A/S3N file system can be quite slow for remote file metadata - // operations, calling `getFileBlockLocations` does no harm here 
since these file system - // implementations don't actually issue RPC for this method. - // - // - Here we are calling `getFileBlockLocations` in a sequential manner, but it should a - // a big deal since we always use to `listLeafFilesInParallel` when the number of paths - // exceeds threshold. - case f => -HadoopFsRelation.createLocatedFileStatus(f, fs.getFileBlockLocations(f, 0, f.getLen)) -} - }.filterNot { status => -val name = status.getPath.getName -HadoopFsRelation.shouldFilterOut(name) - } - - val (dirs, files) = statuses.partition(_.isDirectory) - - // It uses [[LinkedHashSet]] since the order of files can aff
spark git commit: [SPARK-15733][SQL] Makes the explain output less verbose by hiding some verbose output like None, null, empty List, and etc.
Repository: spark Updated Branches: refs/heads/branch-2.0 8be6e8cc5 -> d5bd64a20 [SPARK-15733][SQL] Makes the explain output less verbose by hiding some verbose output like None, null, empty List, and etc. ## What changes were proposed in this pull request? This PR makes the explain output less verbose by hiding some verbose output like `None`, `null`, empty List `[]`, empty set `{}`, and etc. **Before change**: ``` == Physical Plan == ExecutedCommand : +- ShowTablesCommand None, None ``` **After change**: ``` == Physical Plan == ExecutedCommand : +- ShowTablesCommand ``` ## How was this patch tested? Manual test. Author: Sean Zhong <seanzh...@databricks.com> Closes #13470 from clockfly/verbose_breakdown_4. (cherry picked from commit 6dde27404cb3d921d75dd6afca4b383f9df5976a) Signed-off-by: Cheng Lian <l...@databricks.com> Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/d5bd64a2 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/d5bd64a2 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/d5bd64a2 Branch: refs/heads/branch-2.0 Commit: d5bd64a20db2377fa5efc053607de730411489cb Parents: 8be6e8c Author: Sean Zhong <seanzh...@databricks.com> Authored: Thu Jun 2 22:45:37 2016 -0700 Committer: Cheng Lian <l...@databricks.com> Committed: Thu Jun 2 22:45:46 2016 -0700 -- .../spark/sql/catalyst/trees/TreeNode.scala | 18 +- 1 file changed, 13 insertions(+), 5 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/d5bd64a2/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala index 3ebd815..50481cd 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala @@ -427,13 +427,21 
@@ abstract class TreeNode[BaseType <: TreeNode[BaseType]] extends Product { private lazy val allChildren: Set[TreeNode[_]] = (children ++ innerChildren).toSet[TreeNode[_]] /** Returns a string representing the arguments to this node, minus any children */ - def argString: String = productIterator.flatMap { + def argString: String = stringArgs.flatMap { case tn: TreeNode[_] if allChildren.contains(tn) => Nil case Some(tn: TreeNode[_]) if allChildren.contains(tn) => Nil -case tn: TreeNode[_] => s"${tn.simpleString}" :: Nil -case seq: Seq[BaseType] if seq.toSet.subsetOf(children.toSet) => Nil -case seq: Seq[_] => seq.mkString("[", ",", "]") :: Nil -case set: Set[_] => set.mkString("{", ",", "}") :: Nil +case Some(tn: TreeNode[_]) => tn.simpleString :: Nil +case tn: TreeNode[_] => tn.simpleString :: Nil +case seq: Seq[Any] if seq.toSet.subsetOf(allChildren.asInstanceOf[Set[Any]]) => Nil +case iter: Iterable[_] if iter.isEmpty => Nil +case seq: Seq[_] => seq.mkString("[", ", ", "]") :: Nil +case set: Set[_] => set.mkString("{", ", ", "}") :: Nil +case array: Array[_] if array.isEmpty => Nil +case array: Array[_] => array.mkString("[", ", ", "]") :: Nil +case null => Nil +case None => Nil +case Some(null) => Nil +case Some(any) => any :: Nil case other => other :: Nil }.mkString(", ")
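The filtering idea in the diff above can be shown as a simplified, standalone Scala version (names and cases reduced from the real `TreeNode.argString`): empty collections, `null`, and `None` are dropped before joining the arguments, which is why `ShowTablesCommand None, None` becomes just `ShowTablesCommand`.

```scala
// Simplified standalone sketch of the new argString filtering.
def argString(args: Iterator[Any]): String = args.flatMap {
  case iter: Iterable[_] if iter.isEmpty => Nil  // drop [] and {}
  case seq: Seq[_] => seq.mkString("[", ", ", "]") :: Nil
  case set: Set[_] => set.mkString("{", ", ", "}") :: Nil
  case null | None => Nil                        // drop null and None
  case Some(null) => Nil
  case Some(any) => any :: Nil
  case other => other :: Nil
}.mkString(", ")

argString(Iterator(None, None))           // "" (previously "None, None")
argString(Iterator("ShowTablesCommand"))  // "ShowTablesCommand"
```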
spark git commit: [SPARK-15732][SQL] better error message when use java reserved keyword as field name
Repository: spark Updated Branches: refs/heads/branch-2.0 d02f2926b -> 180dd53b7 [SPARK-15732][SQL] better error message when use java reserved keyword as field name ## What changes were proposed in this pull request? When users create a case class and use java reserved keyword as field name, spark sql will generate illegal java code and throw exception at runtime. This PR checks the field names when building the encoder, and if illegal field names are used, throw exception immediately with a good error message. ## How was this patch tested? new test in DatasetSuite Author: Wenchen Fan <wenc...@databricks.com> Closes #13485 from cloud-fan/java. (cherry picked from commit 6323e4bd763eafe23c619a89552c00e1244e4690) Signed-off-by: Cheng Lian <l...@databricks.com> Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/180dd53b Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/180dd53b Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/180dd53b Branch: refs/heads/branch-2.0 Commit: 180dd53b77a042235070e27b732c8f18165894e3 Parents: d02f292 Author: Wenchen Fan <wenc...@databricks.com> Authored: Thu Jun 2 18:13:04 2016 -0700 Committer: Cheng Lian <l...@databricks.com> Committed: Thu Jun 2 18:13:12 2016 -0700 -- .../org/apache/spark/sql/catalyst/ScalaReflection.scala | 11 +++ .../test/scala/org/apache/spark/sql/DatasetSuite.scala | 10 ++ 2 files changed, 21 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/180dd53b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala index 052cc48..4750861 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala @@ -582,6 +582,11 
@@ object ScalaReflection extends ScalaReflection { case t if definedByConstructorParams(t) => val params = getConstructorParameters(t) val nonNullOutput = CreateNamedStruct(params.flatMap { case (fieldName, fieldType) => +if (javaKeywords.contains(fieldName)) { + throw new UnsupportedOperationException(s"`$fieldName` is a reserved keyword and " + +"cannot be used as field name\n" + walkedTypePath.mkString("\n")) +} + val fieldValue = Invoke(inputObject, fieldName, dataTypeFor(fieldType)) val clsName = getClassNameFromType(fieldType) val newPath = s"""- field (class: "$clsName", name: "$fieldName")""" +: walkedTypePath @@ -720,6 +725,12 @@ object ScalaReflection extends ScalaReflection { tpe <:< localTypeOf[Product] || tpe <:< localTypeOf[DefinedByConstructorParams] } + private val javaKeywords = Set("abstract", "assert", "boolean", "break", "byte", "case", "catch", +"char", "class", "const", "continue", "default", "do", "double", "else", "extends", "false", +"final", "finally", "float", "for", "goto", "if", "implements", "import", "instanceof", "int", +"interface", "long", "native", "new", "null", "package", "private", "protected", "public", +"return", "short", "static", "strictfp", "super", "switch", "synchronized", "this", "throw", +"throws", "transient", "true", "try", "void", "volatile", "while") } /** http://git-wip-us.apache.org/repos/asf/spark/blob/180dd53b/sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala -- diff --git a/sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala index a3881ff..df8f4b0 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala @@ -786,6 +786,14 @@ class DatasetSuite extends QueryTest with SharedSQLContext { va
spark git commit: [SPARK-15732][SQL] better error message when use java reserved keyword as field name
Repository: spark Updated Branches: refs/heads/master d1c1fbc34 -> 6323e4bd7 [SPARK-15732][SQL] better error message when use java reserved keyword as field name ## What changes were proposed in this pull request? When users create a case class and use java reserved keyword as field name, spark sql will generate illegal java code and throw exception at runtime. This PR checks the field names when building the encoder, and if illegal field names are used, throw exception immediately with a good error message. ## How was this patch tested? new test in DatasetSuite Author: Wenchen Fan <wenc...@databricks.com> Closes #13485 from cloud-fan/java. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/6323e4bd Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/6323e4bd Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/6323e4bd Branch: refs/heads/master Commit: 6323e4bd763eafe23c619a89552c00e1244e4690 Parents: d1c1fbc Author: Wenchen Fan <wenc...@databricks.com> Authored: Thu Jun 2 18:13:04 2016 -0700 Committer: Cheng Lian <l...@databricks.com> Committed: Thu Jun 2 18:13:04 2016 -0700 -- .../org/apache/spark/sql/catalyst/ScalaReflection.scala | 11 +++ .../test/scala/org/apache/spark/sql/DatasetSuite.scala | 10 ++ 2 files changed, 21 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/6323e4bd/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala index 052cc48..4750861 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala @@ -582,6 +582,11 @@ object ScalaReflection extends ScalaReflection { case t if definedByConstructorParams(t) => val params = 
getConstructorParameters(t) val nonNullOutput = CreateNamedStruct(params.flatMap { case (fieldName, fieldType) => +if (javaKeywords.contains(fieldName)) { + throw new UnsupportedOperationException(s"`$fieldName` is a reserved keyword and " + +"cannot be used as field name\n" + walkedTypePath.mkString("\n")) +} + val fieldValue = Invoke(inputObject, fieldName, dataTypeFor(fieldType)) val clsName = getClassNameFromType(fieldType) val newPath = s"""- field (class: "$clsName", name: "$fieldName")""" +: walkedTypePath @@ -720,6 +725,12 @@ object ScalaReflection extends ScalaReflection { tpe <:< localTypeOf[Product] || tpe <:< localTypeOf[DefinedByConstructorParams] } + private val javaKeywords = Set("abstract", "assert", "boolean", "break", "byte", "case", "catch", +"char", "class", "const", "continue", "default", "do", "double", "else", "extends", "false", +"final", "finally", "float", "for", "goto", "if", "implements", "import", "instanceof", "int", +"interface", "long", "native", "new", "null", "package", "private", "protected", "public", +"return", "short", "static", "strictfp", "super", "switch", "synchronized", "this", "throw", +"throws", "transient", "true", "try", "void", "volatile", "while") } /** http://git-wip-us.apache.org/repos/asf/spark/blob/6323e4bd/sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala -- diff --git a/sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala index a3881ff..df8f4b0 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala @@ -786,6 +786,14 @@ class DatasetSuite extends QueryTest with SharedSQLContext { val result = joined.collect().toSet assert(result == Set(ClassData("a", 1) -> null, ClassData("b", 2) ->
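A hypothetical REPL session illustrating the new failure mode introduced above (Scala backticks let a field be literally named `new`, which is a Java reserved word and therefore illegal in the generated code):

```scala
// Field literally named "new" via backticks.
case class Bad(`new`: Int)

import spark.implicits._
Seq(Bad(1)).toDS()
// With this patch, fails fast while building the encoder with:
// UnsupportedOperationException: `new` is a reserved keyword and
// cannot be used as field name
```

Previously the same code compiled the encoder and only failed later, at runtime, with an obscure codegen error.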
spark git commit: [SPARK-15734][SQL] Avoids printing internal row in explain output
Repository: spark Updated Branches: refs/heads/branch-2.0 810472a92 -> 1153a4098 [SPARK-15734][SQL] Avoids printing internal row in explain output ## What changes were proposed in this pull request? This PR avoids printing internal rows in explain output for some operators. **Before change:** ``` scala> (1 to 10).toSeq.map(_ => (1,2,3)).toDF().createTempView("df3") scala> spark.sql("select * from df3 where 1=2").explain(true) ... == Analyzed Logical Plan == _1: int, _2: int, _3: int Project [_1#37,_2#38,_3#39] +- Filter (1 = 2) +- SubqueryAlias df3 +- LocalRelation [_1#37,_2#38,_3#39], [[0,1,2,3],[0,1,2,3],[0,1,2,3],[0,1,2,3],[0,1,2,3],[0,1,2,3],[0,1,2,3],[0,1,2,3],[0,1,2,3],[0,1,2,3]] ... == Physical Plan == LocalTableScan [_1#37,_2#38,_3#39] ``` **After change:** ``` scala> spark.sql("select * from df3 where 1=2").explain(true) ... == Analyzed Logical Plan == _1: int, _2: int, _3: int Project [_1#58,_2#59,_3#60] +- Filter (1 = 2) +- SubqueryAlias df3 +- LocalRelation [_1#58,_2#59,_3#60] ... == Physical Plan == LocalTableScan , [_1#58,_2#59,_3#60] ``` ## How was this patch tested? Manual test. Author: Sean Zhong <seanzh...@databricks.com> Closes #13471 from clockfly/verbose_breakdown_5. 
(cherry picked from commit 985d532812cf176d0e12b799c723f917282b6813) Signed-off-by: Cheng Lian <l...@databricks.com> Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/1153a409 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/1153a409 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/1153a409 Branch: refs/heads/branch-2.0 Commit: 1153a4098956dd4ba408ab6a1c0d4263de2ef0a7 Parents: 810472a Author: Sean Zhong <seanzh...@databricks.com> Authored: Thu Jun 2 16:21:33 2016 -0700 Committer: Cheng Lian <l...@databricks.com> Committed: Thu Jun 2 16:21:40 2016 -0700 -- .../spark/sql/catalyst/plans/logical/LocalRelation.scala | 8 +++- .../scala/org/apache/spark/sql/execution/ExistingRDD.scala | 2 ++ .../org/apache/spark/sql/execution/LocalTableScanExec.scala | 8 3 files changed, 17 insertions(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/1153a409/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LocalRelation.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LocalRelation.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LocalRelation.scala index 5813b74..87b8647 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LocalRelation.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LocalRelation.scala @@ -57,7 +57,13 @@ case class LocalRelation(output: Seq[Attribute], data: Seq[InternalRow] = Nil) LocalRelation(output.map(_.newInstance()), data).asInstanceOf[this.type] } - override protected def stringArgs = Iterator(output) + override protected def stringArgs: Iterator[Any] = { +if (data.isEmpty) { + Iterator("", output) +} else { + Iterator(output) +} + } override def sameResult(plan: LogicalPlan): Boolean = plan match { case LocalRelation(otherOutput, otherData) => 
http://git-wip-us.apache.org/repos/asf/spark/blob/1153a409/sql/core/src/main/scala/org/apache/spark/sql/execution/ExistingRDD.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/ExistingRDD.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/ExistingRDD.scala index fef3255..b8b3926 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/ExistingRDD.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/ExistingRDD.scala @@ -91,6 +91,8 @@ private[sql] case class LogicalRDD( case _ => false } + override protected def stringArgs: Iterator[Any] = Iterator(output) + override def producedAttributes: AttributeSet = outputSet @transient override lazy val statistics: Statistics = Statistics( http://git-wip-us.apache.org/repos/asf/spark/blob/1153a409/sql/core/src/main/scala/org/apache/spark/sql/execution/LocalTableScanExec.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/LocalTableScanExec.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/LocalTableScanExec.scala index c5e78b0..df2f238 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/LocalTableScanExec.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/LocalTableScanExec.scala @@
spark git commit: [SPARK-15734][SQL] Avoids printing internal row in explain output
Repository: spark Updated Branches: refs/heads/master 431542765 -> 985d53281 [SPARK-15734][SQL] Avoids printing internal row in explain output ## What changes were proposed in this pull request? This PR avoids printing internal rows in explain output for some operators. **Before change:** ``` scala> (1 to 10).toSeq.map(_ => (1,2,3)).toDF().createTempView("df3") scala> spark.sql("select * from df3 where 1=2").explain(true) ... == Analyzed Logical Plan == _1: int, _2: int, _3: int Project [_1#37,_2#38,_3#39] +- Filter (1 = 2) +- SubqueryAlias df3 +- LocalRelation [_1#37,_2#38,_3#39], [[0,1,2,3],[0,1,2,3],[0,1,2,3],[0,1,2,3],[0,1,2,3],[0,1,2,3],[0,1,2,3],[0,1,2,3],[0,1,2,3],[0,1,2,3]] ... == Physical Plan == LocalTableScan [_1#37,_2#38,_3#39] ``` **After change:** ``` scala> spark.sql("select * from df3 where 1=2").explain(true) ... == Analyzed Logical Plan == _1: int, _2: int, _3: int Project [_1#58,_2#59,_3#60] +- Filter (1 = 2) +- SubqueryAlias df3 +- LocalRelation [_1#58,_2#59,_3#60] ... == Physical Plan == LocalTableScan , [_1#58,_2#59,_3#60] ``` ## How was this patch tested? Manual test. Author: Sean Zhong <seanzh...@databricks.com> Closes #13471 from clockfly/verbose_breakdown_5. 
Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/985d5328 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/985d5328 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/985d5328 Branch: refs/heads/master Commit: 985d532812cf176d0e12b799c723f917282b6813 Parents: 4315427 Author: Sean Zhong <seanzh...@databricks.com> Authored: Thu Jun 2 16:21:33 2016 -0700 Committer: Cheng Lian <l...@databricks.com> Committed: Thu Jun 2 16:21:33 2016 -0700 -- .../spark/sql/catalyst/plans/logical/LocalRelation.scala | 8 +++- .../scala/org/apache/spark/sql/execution/ExistingRDD.scala | 2 ++ .../org/apache/spark/sql/execution/LocalTableScanExec.scala | 8 3 files changed, 17 insertions(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/985d5328/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LocalRelation.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LocalRelation.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LocalRelation.scala index 5813b74..87b8647 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LocalRelation.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LocalRelation.scala @@ -57,7 +57,13 @@ case class LocalRelation(output: Seq[Attribute], data: Seq[InternalRow] = Nil) LocalRelation(output.map(_.newInstance()), data).asInstanceOf[this.type] } - override protected def stringArgs = Iterator(output) + override protected def stringArgs: Iterator[Any] = { +if (data.isEmpty) { + Iterator("", output) +} else { + Iterator(output) +} + } override def sameResult(plan: LogicalPlan): Boolean = plan match { case LocalRelation(otherOutput, otherData) => http://git-wip-us.apache.org/repos/asf/spark/blob/985d5328/sql/core/src/main/scala/org/apache/spark/sql/execution/ExistingRDD.scala -- diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/ExistingRDD.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/ExistingRDD.scala index fef3255..b8b3926 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/ExistingRDD.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/ExistingRDD.scala @@ -91,6 +91,8 @@ private[sql] case class LogicalRDD( case _ => false } + override protected def stringArgs: Iterator[Any] = Iterator(output) + override def producedAttributes: AttributeSet = outputSet @transient override lazy val statistics: Statistics = Statistics( http://git-wip-us.apache.org/repos/asf/spark/blob/985d5328/sql/core/src/main/scala/org/apache/spark/sql/execution/LocalTableScanExec.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/LocalTableScanExec.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/LocalTableScanExec.scala index c5e78b0..df2f238 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/LocalTableScanExec.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/LocalTableScanExec.scala @@ -48,6 +48,14 @@ private[sql] case class LocalTableScanExec( } } + override protected def stringArgs: Iterator[Any] = { +
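The mechanism behind the fix above is that a plan node's string form is rendered from `stringArgs`, which defaults to all constructor arguments, so raw internal rows leak into explain output unless an operator overrides it. The sketch below is a simplified model, not Spark's actual `TreeNode`/`LocalRelation` classes; the rendering logic and class names are illustrative assumptions.

```scala
// Minimal model of the stringArgs pattern: the node's string rendering is
// driven by stringArgs, so an override can keep internal row data out of
// explain output. The empty-string placeholder mirrors the diff above.
abstract class NodeSketch extends Product {
  protected def stringArgs: Iterator[Any] = productIterator
  def simpleString: String =
    productPrefix + " " + stringArgs.map {
      case seq: Seq[_] => seq.mkString("[", ",", "]")
      case other       => other.toString
    }.mkString(", ")
}

case class LocalRelationSketch(output: Seq[String], data: Seq[Seq[Int]] = Nil)
    extends NodeSketch {
  // Mirrors the diff: never render the raw rows; when the relation is empty,
  // emit an empty placeholder in place of the data argument.
  override protected def stringArgs: Iterator[Any] =
    if (data.isEmpty) Iterator("", output) else Iterator(output)
}

val empty = LocalRelationSketch(Seq("_1#58", "_2#59", "_3#60"))
assert(empty.simpleString == "LocalRelationSketch , [_1#58,_2#59,_3#60]")
val nonEmpty = LocalRelationSketch(Seq("_1#37"), Seq(Seq(0, 1, 2, 3)))
assert(nonEmpty.simpleString == "LocalRelationSketch [_1#37]")
println("ok")
```

The leading comma in the first assertion matches the `LocalTableScan , [...]` shape visible in the "after change" explain output quoted in the commit message.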
spark git commit: [SPARK-15719][SQL] Disables writing Parquet summary files by default
Repository: spark Updated Branches: refs/heads/branch-2.0 847ccf793 -> 810472a92 [SPARK-15719][SQL] Disables writing Parquet summary files by default ## What changes were proposed in this pull request? This PR disables writing Parquet summary files by default (i.e., when Hadoop configuration "parquet.enable.summary-metadata" is not set). Please refer to [SPARK-15719][1] for more details. ## How was this patch tested? New test case added in `ParquetQuerySuite` to check no summary files are written by default. [1]: https://issues.apache.org/jira/browse/SPARK-15719 Author: Cheng Lian <l...@databricks.com> Closes #13455 from liancheng/spark-15719-disable-parquet-summary-files. (cherry picked from commit 431542765785304edb76a19885fbc5f9b8ae7d64) Signed-off-by: Cheng Lian <l...@databricks.com> Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/810472a9 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/810472a9 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/810472a9 Branch: refs/heads/branch-2.0 Commit: 810472a928811f078ef42825d1cf16cf711fcefb Parents: 847ccf7 Author: Cheng Lian <l...@databricks.com> Authored: Thu Jun 2 16:16:27 2016 -0700 Committer: Cheng Lian <l...@databricks.com> Committed: Thu Jun 2 16:16:37 2016 -0700 -- .../datasources/parquet/ParquetFileFormat.scala | 7 - .../datasources/parquet/ParquetIOSuite.scala| 20 +++-- .../datasources/parquet/ParquetQuerySuite.scala | 31 .../parquet/ParquetSchemaSuite.scala| 19 ++-- .../sources/ParquetHadoopFsRelationSuite.scala | 29 ++ 5 files changed, 62 insertions(+), 44 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/810472a9/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala index ff7962d..ada9cd4 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala @@ -124,6 +124,11 @@ private[sql] class ParquetFileFormat // Sets compression scheme conf.set(ParquetOutputFormat.COMPRESSION, parquetOptions.compressionCodec) +// SPARK-15719: Disables writing Parquet summary files by default. +if (conf.get(ParquetOutputFormat.ENABLE_JOB_SUMMARY) == null) { + conf.setBoolean(ParquetOutputFormat.ENABLE_JOB_SUMMARY, false) +} + new OutputWriterFactory { override def newInstance( path: String, @@ -786,7 +791,7 @@ private[sql] object ParquetFileFormat extends Logging { // // Parquet requires `FileStatus`es to read footers. Here we try to send cached `FileStatus`es // to executor side to avoid fetching them again. However, `FileStatus` is not `Serializable` -// but only `Writable`. What makes it worth, for some reason, `FileStatus` doesn't play well +// but only `Writable`. What makes it worse, for some reason, `FileStatus` doesn't play well // with `SerializableWritable[T]` and always causes a weird `IllegalStateException`. These // facts virtually prevents us to serialize `FileStatus`es. 
// http://git-wip-us.apache.org/repos/asf/spark/blob/810472a9/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetIOSuite.scala -- diff --git a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetIOSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetIOSuite.scala index d0107aa..92f2db3 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetIOSuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetIOSuite.scala @@ -515,17 +515,19 @@ class ParquetIOSuite extends QueryTest with ParquetTest with SharedSQLContext { val hadoopConf = spark.sessionState.newHadoopConfWithOptions(extraOptions) -withTempPath { dir => - val path = s"${dir.getCanonicalPath}/part-r-0.parquet" - spark.range(1 << 16).selectExpr("(id % 4) AS i") - .coalesce(1).write.options(extraOptions).mode("overwrite").parquet(path) +withSQLConf(ParquetOutputFormat.ENABLE_JOB_SUMMARY -> "true") { + withTempPath { dir => +val path
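The SPARK-15719 change above uses a "default only if unset" pattern: summary files are disabled only when the user has not already set `parquet.enable.summary-metadata`, so an explicit opt-in still wins. A minimal sketch, with a plain mutable map standing in for Hadoop's `Configuration`:

```scala
// Sketch of the conditional-default pattern from the diff above.
import scala.collection.mutable

object ConfDefaults {
  val EnableJobSummary = "parquet.enable.summary-metadata"

  def applyDefaults(conf: mutable.Map[String, String]): Unit = {
    // Only install the default when the key is genuinely unset,
    // so a user's explicit "true" is never overwritten.
    if (!conf.contains(EnableJobSummary)) {
      conf(EnableJobSummary) = "false"
    }
  }
}

val untouched = mutable.Map.empty[String, String]
ConfDefaults.applyDefaults(untouched)
assert(untouched(ConfDefaults.EnableJobSummary) == "false")

val optedIn = mutable.Map(ConfDefaults.EnableJobSummary -> "true")
ConfDefaults.applyDefaults(optedIn)
assert(optedIn(ConfDefaults.EnableJobSummary) == "true")
println("ok")
```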
spark git commit: [SPARK-15441][SQL] support null object in Dataset outer-join
Repository: spark Updated Branches: refs/heads/branch-2.0 8cdc0d4da -> 91812226f [SPARK-15441][SQL] support null object in Dataset outer-join ## What changes were proposed in this pull request? Currently we can't encode a top-level null object into an internal row, as Spark SQL doesn't allow a row to be null, only its columns can be null. This was not a problem before, as we assumed the input object was never null. However, for outer joins, we do need the semantics of a null object. This PR fixes this problem by making both join sides produce a single column, i.e. nesting the logical plan output (by `CreateStruct`), so that we have an extra level to represent a top-level null object. ## How was this patch tested? new test in `DatasetSuite` Author: Wenchen Fan <wenc...@databricks.com> Closes #13425 from cloud-fan/outer-join2. (cherry picked from commit 8640cdb836b4964e4af891d9959af64a2e1f304e) Signed-off-by: Cheng Lian <l...@databricks.com> Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/91812226 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/91812226 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/91812226 Branch: refs/heads/branch-2.0 Commit: 91812226fc18d476be3dcef1351ff5f5ef6ed86f Parents: 8cdc0d4 Author: Wenchen Fan <wenc...@databricks.com> Authored: Wed Jun 1 16:16:54 2016 -0700 Committer: Cheng Lian <l...@databricks.com> Committed: Wed Jun 1 16:18:23 2016 -0700 -- .../catalyst/encoders/ExpressionEncoder.scala | 3 +- .../catalyst/expressions/objects/objects.scala | 1 - .../scala/org/apache/spark/sql/Dataset.scala| 67 ++-- .../org/apache/spark/sql/DatasetSuite.scala | 23 +++ 4 files changed, 59 insertions(+), 35 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/91812226/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/ExpressionEncoder.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/ExpressionEncoder.scala
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/ExpressionEncoder.scala index f21a39a..2296946 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/ExpressionEncoder.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/ExpressionEncoder.scala @@ -125,12 +125,13 @@ object ExpressionEncoder { } } else { val input = BoundReference(index, enc.schema, nullable = true) -enc.deserializer.transformUp { +val deserialized = enc.deserializer.transformUp { case UnresolvedAttribute(nameParts) => assert(nameParts.length == 1) UnresolvedExtractValue(input, Literal(nameParts.head)) case BoundReference(ordinal, dt, _) => GetStructField(input, ordinal) } +If(IsNull(input), Literal.create(null, deserialized.dataType), deserialized) } } http://git-wip-us.apache.org/repos/asf/spark/blob/91812226/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala index 2f2323f..c2e3ab8 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala @@ -19,7 +19,6 @@ package org.apache.spark.sql.catalyst.expressions.objects import java.lang.reflect.Modifier -import scala.annotation.tailrec import scala.language.existentials import scala.reflect.ClassTag http://git-wip-us.apache.org/repos/asf/spark/blob/91812226/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala b/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala index 3a6ec45..369b772 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala +++ 
b/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala @@ -747,31 +747,62 @@ class Dataset[T] private[sql]( */ @Experimental def joinWith[U](other: Dataset[U], condition: Column, joinType: String): Dataset[(T, U)] = { -val left = this.logicalPlan -val right = other.logicalPlan - -val joined = sparkSession.sessionState.executePlan(Join(left, right, joinType = - JoinType(joinType), Some(condition.expr))) -val leftOutput = joined.analyzed.output.
spark git commit: [SPARK-15441][SQL] support null object in Dataset outer-join
Repository: spark Updated Branches: refs/heads/master 7bb64aae2 -> 8640cdb83 [SPARK-15441][SQL] support null object in Dataset outer-join ## What changes were proposed in this pull request? Currently we can't encode a top-level null object into an internal row, as Spark SQL doesn't allow a row to be null, only its columns can be null. This was not a problem before, as we assumed the input object was never null. However, for outer joins, we do need the semantics of a null object. This PR fixes this problem by making both join sides produce a single column, i.e. nesting the logical plan output (by `CreateStruct`), so that we have an extra level to represent a top-level null object. ## How was this patch tested? new test in `DatasetSuite` Author: Wenchen Fan <wenc...@databricks.com> Closes #13425 from cloud-fan/outer-join2. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/8640cdb8 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/8640cdb8 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/8640cdb8 Branch: refs/heads/master Commit: 8640cdb836b4964e4af891d9959af64a2e1f304e Parents: 7bb64aa Author: Wenchen Fan <wenc...@databricks.com> Authored: Wed Jun 1 16:16:54 2016 -0700 Committer: Cheng Lian <l...@databricks.com> Committed: Wed Jun 1 16:16:54 2016 -0700 -- .../catalyst/encoders/ExpressionEncoder.scala | 3 +- .../catalyst/expressions/objects/objects.scala | 1 - .../scala/org/apache/spark/sql/Dataset.scala| 67 ++-- .../org/apache/spark/sql/DatasetSuite.scala | 23 +++ 4 files changed, 59 insertions(+), 35 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/8640cdb8/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/ExpressionEncoder.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/ExpressionEncoder.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/ExpressionEncoder.scala index f21a39a..2296946 100644 ---
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/ExpressionEncoder.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/ExpressionEncoder.scala @@ -125,12 +125,13 @@ object ExpressionEncoder { } } else { val input = BoundReference(index, enc.schema, nullable = true) -enc.deserializer.transformUp { +val deserialized = enc.deserializer.transformUp { case UnresolvedAttribute(nameParts) => assert(nameParts.length == 1) UnresolvedExtractValue(input, Literal(nameParts.head)) case BoundReference(ordinal, dt, _) => GetStructField(input, ordinal) } +If(IsNull(input), Literal.create(null, deserialized.dataType), deserialized) } } http://git-wip-us.apache.org/repos/asf/spark/blob/8640cdb8/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala index 2f2323f..c2e3ab8 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala @@ -19,7 +19,6 @@ package org.apache.spark.sql.catalyst.expressions.objects import java.lang.reflect.Modifier -import scala.annotation.tailrec import scala.language.existentials import scala.reflect.ClassTag http://git-wip-us.apache.org/repos/asf/spark/blob/8640cdb8/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala b/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala index 3a6ec45..369b772 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala @@ -747,31 +747,62 @@ class Dataset[T] private[sql]( */ @Experimental def joinWith[U](other: Dataset[U], condition: Column, 
joinType: String): Dataset[(T, U)] = { -val left = this.logicalPlan -val right = other.logicalPlan - -val joined = sparkSession.sessionState.executePlan(Join(left, right, joinType = - JoinType(joinType), Some(condition.expr))) -val leftOutput = joined.analyzed.output.take(left.output.length) -val rightOutput = joined.analyzed.output.takeRight(right.output.length) +// Creates a Join node and re
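The core idea of the SPARK-15441 fix above is that a row itself can never be null, only its columns can, so each join side is nested into a single struct column and a null struct then stands for the missing object of an outer join. The sketch below models that idea with `Option` playing the role of the nullable struct column; it is an illustration of the semantics, not Spark's actual plan rewrite.

```scala
// Model of the null-object outer-join semantics: wrap each side in one
// nullable "column" so a missing right-side object is representable.
case class ClassData(a: String, b: Int)

val left  = Seq(ClassData("a", 1), ClassData("b", 2))
val right = Seq(ClassData("a", 10))

// Left outer join on field `a`; Option stands in for the nullable struct.
val joined: Seq[(Option[ClassData], Option[ClassData])] =
  left.map { l => (Some(l), right.find(_.a == l.a)) }

assert(joined == Seq(
  (Some(ClassData("a", 1)), Some(ClassData("a", 10))),
  (Some(ClassData("b", 2)), None))) // the "null object" case
println("ok")
```

Without the extra nesting level there would be no way to distinguish "no matching object" from "an object whose columns are all null".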
spark git commit: [SPARK-9876] [BRANCH-2.0] Revert "[SPARK-9876][SQL] Update Parquet to 1.8.1."
Repository: spark Updated Branches: refs/heads/branch-2.0 e033fd50f -> 8cdc0d4da [SPARK-9876] [BRANCH-2.0] Revert "[SPARK-9876][SQL] Update Parquet to 1.8.1." ## What changes were proposed in this pull request? Since we are pretty late in the 2.0 release cycle, it is not clear if this upgrade can be tested thoroughly and if we can resolve the regression issue that we observed before. This PR reverts #13280 from branch 2.0. ## How was this patch tested? Existing tests This reverts commit 776d183c82b424ef7c3cae30537d8afe9b9eee83. Author: Yin Huai <yh...@databricks.com> Closes #13450 from yhuai/revertParquet1.8.1-branch-2.0. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/8cdc0d4d Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/8cdc0d4d Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/8cdc0d4d Branch: refs/heads/branch-2.0 Commit: 8cdc0d4da6324f48e1a25496dc4bc7635d9472ef Parents: e033fd5 Author: Yin Huai <yh...@databricks.com> Authored: Wed Jun 1 16:05:13 2016 -0700 Committer: Cheng Lian <l...@databricks.com> Committed: Wed Jun 1 16:05:13 2016 -0700 -- dev/deps/spark-deps-hadoop-2.2 | 11 +-- dev/deps/spark-deps-hadoop-2.3 | 11 +-- dev/deps/spark-deps-hadoop-2.4 | 11 +-- dev/deps/spark-deps-hadoop-2.6 | 11 +-- dev/deps/spark-deps-hadoop-2.7 | 11 +-- pom.xml | 2 +- .../SpecificParquetRecordReaderBase.java| 20 ++--- .../parquet/CatalystReadSupport.scala | 12 +-- .../parquet/CatalystSchemaConverter.scala | 16 .../datasources/parquet/ParquetFilters.scala| 83 .../parquet/ParquetSchemaSuite.scala| 20 ++--- 11 files changed, 117 insertions(+), 91 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/8cdc0d4d/dev/deps/spark-deps-hadoop-2.2 -- diff --git a/dev/deps/spark-deps-hadoop-2.2 b/dev/deps/spark-deps-hadoop-2.2 index b5c38a6..96001ea 100644 --- a/dev/deps/spark-deps-hadoop-2.2 +++ b/dev/deps/spark-deps-hadoop-2.2 @@ -129,13 +129,14 @@ opencsv-2.3.jar oro-2.0.8.jar 
osgi-resource-locator-1.0.1.jar paranamer-2.3.jar -parquet-column-1.8.1.jar -parquet-common-1.8.1.jar -parquet-encoding-1.8.1.jar +parquet-column-1.7.0.jar +parquet-common-1.7.0.jar +parquet-encoding-1.7.0.jar parquet-format-2.3.0-incubating.jar -parquet-hadoop-1.8.1.jar +parquet-generator-1.7.0.jar +parquet-hadoop-1.7.0.jar parquet-hadoop-bundle-1.6.0.jar -parquet-jackson-1.8.1.jar +parquet-jackson-1.7.0.jar pmml-model-1.2.15.jar pmml-schema-1.2.15.jar protobuf-java-2.5.0.jar http://git-wip-us.apache.org/repos/asf/spark/blob/8cdc0d4d/dev/deps/spark-deps-hadoop-2.3 -- diff --git a/dev/deps/spark-deps-hadoop-2.3 b/dev/deps/spark-deps-hadoop-2.3 index 969df04..9f3d9ad 100644 --- a/dev/deps/spark-deps-hadoop-2.3 +++ b/dev/deps/spark-deps-hadoop-2.3 @@ -136,13 +136,14 @@ opencsv-2.3.jar oro-2.0.8.jar osgi-resource-locator-1.0.1.jar paranamer-2.3.jar -parquet-column-1.8.1.jar -parquet-common-1.8.1.jar -parquet-encoding-1.8.1.jar +parquet-column-1.7.0.jar +parquet-common-1.7.0.jar +parquet-encoding-1.7.0.jar parquet-format-2.3.0-incubating.jar -parquet-hadoop-1.8.1.jar +parquet-generator-1.7.0.jar +parquet-hadoop-1.7.0.jar parquet-hadoop-bundle-1.6.0.jar -parquet-jackson-1.8.1.jar +parquet-jackson-1.7.0.jar pmml-model-1.2.15.jar pmml-schema-1.2.15.jar protobuf-java-2.5.0.jar http://git-wip-us.apache.org/repos/asf/spark/blob/8cdc0d4d/dev/deps/spark-deps-hadoop-2.4 -- diff --git a/dev/deps/spark-deps-hadoop-2.4 b/dev/deps/spark-deps-hadoop-2.4 index f0491ec..77d5266 100644 --- a/dev/deps/spark-deps-hadoop-2.4 +++ b/dev/deps/spark-deps-hadoop-2.4 @@ -136,13 +136,14 @@ opencsv-2.3.jar oro-2.0.8.jar osgi-resource-locator-1.0.1.jar paranamer-2.3.jar -parquet-column-1.8.1.jar -parquet-common-1.8.1.jar -parquet-encoding-1.8.1.jar +parquet-column-1.7.0.jar +parquet-common-1.7.0.jar +parquet-encoding-1.7.0.jar parquet-format-2.3.0-incubating.jar -parquet-hadoop-1.8.1.jar +parquet-generator-1.7.0.jar +parquet-hadoop-1.7.0.jar parquet-hadoop-bundle-1.6.0.jar 
-parquet-jackson-1.8.1.jar +parquet-jackson-1.7.0.jar pmml-model-1.2.15.jar pmml-schema-1.2.15.jar protobuf-java-2.5.0.jar http://git-wip-us.apache.org/repos/asf/spark/blob/8cdc0d4d/dev/deps/spark-deps-hadoop-2.6 -- diff --git a/dev/deps/spark-deps-hadoop-2.6 b/dev/deps/spark-deps-hadoop-2.6 index b3dced6..9afe50f 100644 --- a/dev/deps/spark
spark git commit: [SPARK-15269][SQL] Removes unexpected empty table directories created while creating external Spark SQL data source tables.
Repository: spark Updated Branches: refs/heads/branch-2.0 44052a707 -> e033fd50f [SPARK-15269][SQL] Removes unexpected empty table directories created while creating external Spark SQL data source tables. This PR is an alternative to #13120 authored by xwu0226. ## What changes were proposed in this pull request? When creating an external Spark SQL data source table and persisting its metadata to Hive metastore, we don't use the standard Hive `Table.dataLocation` field because Hive only allows directory paths as data locations while Spark SQL also allows file paths. However, if we don't set `Table.dataLocation`, Hive always creates an unexpected empty table directory under the database location, but doesn't remove it while dropping the table (because the table is external). This PR works around this issue by explicitly setting `Table.dataLocation` and then manually removing the created directory after creating the external table. Please refer to [this JIRA comment][1] for more details about why we chose this approach as a workaround. [1]: https://issues.apache.org/jira/browse/SPARK-15269?focusedCommentId=15297408=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15297408 ## How was this patch tested? 1. A new test case is added in `HiveQuerySuite` for this case 2. Updated `ShowCreateTableSuite` to use the same table name in all test cases. (This is how I hit this issue in the first place.) Author: Cheng Lian <l...@databricks.com> Closes #13270 from liancheng/spark-15269-unpleasant-fix.
(cherry picked from commit 7bb64aae27f670531699f59d3f410e38866609b7) Signed-off-by: Cheng Lian <l...@databricks.com> Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/e033fd50 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/e033fd50 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/e033fd50 Branch: refs/heads/branch-2.0 Commit: e033fd50f0fcefb2a6cffb763ff7e026b0066c07 Parents: 44052a7 Author: Cheng Lian <l...@databricks.com> Authored: Wed Jun 1 16:02:27 2016 -0700 Committer: Cheng Lian <l...@databricks.com> Committed: Wed Jun 1 16:02:37 2016 -0700 -- .../apache/spark/sql/AnalysisException.scala| 5 ++- .../sql/catalyst/catalog/SessionCatalog.scala | 2 +- .../org/apache/spark/sql/SparkSession.scala | 6 +-- .../command/createDataSourceTables.scala| 4 +- .../apache/spark/sql/internal/HiveSerDe.scala | 1 - .../spark/sql/hive/HiveExternalCatalog.scala| 45 ++-- .../apache/spark/sql/hive/HiveSharedState.scala | 4 +- .../spark/sql/hive/client/HiveClientImpl.scala | 22 -- .../sql/hive/HiveExternalCatalogSuite.scala | 3 +- .../sql/hive/MetastoreDataSourcesSuite.scala| 13 ++ .../spark/sql/hive/ShowCreateTableSuite.scala | 36 11 files changed, 105 insertions(+), 36 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/e033fd50/sql/catalyst/src/main/scala/org/apache/spark/sql/AnalysisException.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/AnalysisException.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/AnalysisException.scala index d2003fd..6911843 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/AnalysisException.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/AnalysisException.scala @@ -32,8 +32,9 @@ class AnalysisException protected[sql] ( val message: String, val line: Option[Int] = None, val startPosition: Option[Int] = None, -val plan: Option[LogicalPlan] = None) - extends Exception with Serializable { +val plan: 
Option[LogicalPlan] = None, +val cause: Option[Throwable] = None) + extends Exception(message, cause.orNull) with Serializable { def withPosition(line: Option[Int], startPosition: Option[Int]): AnalysisException = { val newException = new AnalysisException(message, line, startPosition) http://git-wip-us.apache.org/repos/asf/spark/blob/e033fd50/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala index cf9286e..371c198 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala @@ -22,7 +22,7 @@ import javax.annotation.concurrent.GuardedBy import scala.collection.mutable import org.apache.hadoop.conf.Configuration -import org.ap
spark git commit: [SPARK-15269][SQL] Removes unexpected empty table directories created while creating external Spark SQL data source tables.
Repository: spark Updated Branches: refs/heads/master 9e2643b21 -> 7bb64aae2 [SPARK-15269][SQL] Removes unexpected empty table directories created while creating external Spark SQL data source tables. This PR is an alternative to #13120 authored by xwu0226. ## What changes were proposed in this pull request? When creating an external Spark SQL data source table and persisting its metadata to Hive metastore, we don't use the standard Hive `Table.dataLocation` field because Hive only allows directory paths as data locations while Spark SQL also allows file paths. However, if we don't set `Table.dataLocation`, Hive always creates an unexpected empty table directory under the database location, but doesn't remove it while dropping the table (because the table is external). This PR works around this issue by explicitly setting `Table.dataLocation` and then manually removing the created directory after creating the external table. Please refer to [this JIRA comment][1] for more details about why we chose this approach as a workaround. [1]: https://issues.apache.org/jira/browse/SPARK-15269?focusedCommentId=15297408=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15297408 ## How was this patch tested? 1. A new test case is added in `HiveQuerySuite` for this case 2. Updated `ShowCreateTableSuite` to use the same table name in all test cases. (This is how I hit this issue in the first place.) Author: Cheng Lian <l...@databricks.com> Closes #13270 from liancheng/spark-15269-unpleasant-fix.
Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/7bb64aae Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/7bb64aae Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/7bb64aae Branch: refs/heads/master Commit: 7bb64aae27f670531699f59d3f410e38866609b7 Parents: 9e2643b Author: Cheng Lian <l...@databricks.com> Authored: Wed Jun 1 16:02:27 2016 -0700 Committer: Cheng Lian <l...@databricks.com> Committed: Wed Jun 1 16:02:27 2016 -0700 -- .../apache/spark/sql/AnalysisException.scala| 5 ++- .../sql/catalyst/catalog/SessionCatalog.scala | 2 +- .../org/apache/spark/sql/SparkSession.scala | 6 +-- .../command/createDataSourceTables.scala| 4 +- .../apache/spark/sql/internal/HiveSerDe.scala | 1 - .../spark/sql/hive/HiveExternalCatalog.scala| 45 ++-- .../apache/spark/sql/hive/HiveSharedState.scala | 4 +- .../spark/sql/hive/client/HiveClientImpl.scala | 22 -- .../sql/hive/HiveExternalCatalogSuite.scala | 3 +- .../sql/hive/MetastoreDataSourcesSuite.scala| 13 ++ .../spark/sql/hive/ShowCreateTableSuite.scala | 36 11 files changed, 105 insertions(+), 36 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/7bb64aae/sql/catalyst/src/main/scala/org/apache/spark/sql/AnalysisException.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/AnalysisException.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/AnalysisException.scala index d2003fd..6911843 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/AnalysisException.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/AnalysisException.scala @@ -32,8 +32,9 @@ class AnalysisException protected[sql] ( val message: String, val line: Option[Int] = None, val startPosition: Option[Int] = None, -val plan: Option[LogicalPlan] = None) - extends Exception with Serializable { +val plan: Option[LogicalPlan] = None, +val cause: Option[Throwable] = None) + extends Exception(message, cause.orNull) with Serializable 
{ def withPosition(line: Option[Int], startPosition: Option[Int]): AnalysisException = { val newException = new AnalysisException(message, line, startPosition) http://git-wip-us.apache.org/repos/asf/spark/blob/7bb64aae/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala index cf9286e..371c198 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala @@ -22,7 +22,7 @@ import javax.annotation.concurrent.GuardedBy import scala.collection.mutable import org.apache.hadoop.conf.Configuration -import org.apache.hadoop.fs.Path +import org.apache.hadoop.fs.{FileSystem, Path} import org.apache.spark.int
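The `AnalysisException` change above threads an optional `cause` through to the base exception (`extends Exception(message, cause.orNull)`) so the original failure's stack trace is preserved when wrapped. A minimal language-neutral sketch of that idea — not Spark's actual Scala class, and `with_position` here keeps the cause, which is an assumption rather than something shown in the diff:

```python
class AnalysisException(Exception):
    """Sketch of the change above: an optional `cause` is passed through to
    the base exception (Scala's `Exception(message, cause.orNull)`) so the
    original failure's stack trace is chained rather than lost."""

    def __init__(self, message, line=None, start_position=None, cause=None):
        super().__init__(message)
        self.message = message
        self.line = line
        self.start_position = start_position
        if cause is not None:
            self.__cause__ = cause  # Python's analogue of exception chaining

    def with_position(self, line, start_position):
        # Re-wrap at a known source location, keeping the message
        # (and, in this sketch, the cause) intact.
        return AnalysisException(self.message, line, start_position,
                                 cause=self.__cause__)
```

With chaining in place, tracebacks show "The above exception was the direct cause of ..." instead of silently dropping the root error.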
spark git commit: [SPARK-14343][SQL] Proper column pruning for text data source
Repository: spark Updated Branches: refs/heads/branch-2.0 8fb125bdf -> cb254ecb1 [SPARK-14343][SQL] Proper column pruning for text data source ## What changes were proposed in this pull request? Text data source ignores requested schema, and may give wrong result when the only data column is not requested. This may happen when only partitioning column(s) are requested for a partitioned text table. ## How was this patch tested? New test case added in `TextSuite`. Author: Cheng Lian <l...@databricks.com> Closes #13431 from liancheng/spark-14343-partitioned-text-table. (cherry picked from commit 1f43562daf9454428796317203d9dcc9030a46eb) Signed-off-by: Cheng Lian <l...@databricks.com> Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/cb254ecb Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/cb254ecb Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/cb254ecb Branch: refs/heads/branch-2.0 Commit: cb254ecb1f351c00f7fe4c3c9cc41c46beda90b4 Parents: 8fb125b Author: Cheng Lian <l...@databricks.com> Authored: Wed Jun 1 07:30:55 2016 -0700 Committer: Cheng Lian <l...@databricks.com> Committed: Wed Jun 1 07:31:52 2016 -0700 -- .../datasources/text/TextFileFormat.scala | 31 +--- .../execution/datasources/text/TextSuite.scala | 17 +-- 2 files changed, 35 insertions(+), 13 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/cb254ecb/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/text/TextFileFormat.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/text/TextFileFormat.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/text/TextFileFormat.scala index 1e5bce4..9c03ab2 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/text/TextFileFormat.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/text/TextFileFormat.scala @@ -92,20 +92,31 @@ class TextFileFormat 
extends FileFormat with DataSourceRegister { filters: Seq[Filter], options: Map[String, String], hadoopConf: Configuration): PartitionedFile => Iterator[InternalRow] = { +assert( + requiredSchema.length <= 1, + "Text data source only produces a single data column named \"value\".") + val broadcastedHadoopConf = sparkSession.sparkContext.broadcast(new SerializableConfiguration(hadoopConf)) (file: PartitionedFile) => { - val unsafeRow = new UnsafeRow(1) - val bufferHolder = new BufferHolder(unsafeRow) - val unsafeRowWriter = new UnsafeRowWriter(bufferHolder, 1) - - new HadoopFileLinesReader(file, broadcastedHadoopConf.value.value).map { line => -// Writes to an UnsafeRow directly -bufferHolder.reset() -unsafeRowWriter.write(0, line.getBytes, 0, line.getLength) -unsafeRow.setTotalSize(bufferHolder.totalSize()) -unsafeRow + val reader = new HadoopFileLinesReader(file, broadcastedHadoopConf.value.value) + + if (requiredSchema.isEmpty) { +val emptyUnsafeRow = new UnsafeRow(0) +reader.map(_ => emptyUnsafeRow) + } else { +val unsafeRow = new UnsafeRow(1) +val bufferHolder = new BufferHolder(unsafeRow) +val unsafeRowWriter = new UnsafeRowWriter(bufferHolder, 1) + +reader.map { line => + // Writes to an UnsafeRow directly + bufferHolder.reset() + unsafeRowWriter.write(0, line.getBytes, 0, line.getLength) + unsafeRow.setTotalSize(bufferHolder.totalSize()) + unsafeRow +} } } } http://git-wip-us.apache.org/repos/asf/spark/blob/cb254ecb/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/text/TextSuite.scala -- diff --git a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/text/TextSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/text/TextSuite.scala index b5e51e9..7b6981f 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/text/TextSuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/text/TextSuite.scala @@ -19,9 +19,6 @@ package 
org.apache.spark.sql.execution.datasources.text import java.io.File -import scala.collection.JavaConverters._ - -import org.apache.hadoop.conf.Configuration import org.apache.hadoop.io.SequenceFile.CompressionType import org.apache.hadoop.io.compress.GzipCodec @@ -31,6 +28,7 @@ import org.apache.spark.sql.types.{StringType, StructType} import
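The control flow of the fix above can be sketched language-neutrally (assumed names, not Spark's actual API): when the required schema is empty — only partition columns were requested — the reader must still emit one row per input line so row counts and partition values stay correct, but it can reuse a single shared empty row instead of materializing each line, mirroring the `emptyUnsafeRow` fast path in `TextFileFormat`:

```python
def build_line_reader(required_schema):
    """Return a function mapping input lines to output rows.

    Empty schema: emit one shared empty row per line (counts survive
    column pruning, line contents are never materialized).
    One-field schema: emit a single-field row per line.
    """
    assert len(required_schema) <= 1, \
        'Text data source only produces a single data column named "value".'
    if not required_schema:
        empty_row = ()  # shared, immutable stand-in for UnsafeRow(0)
        return lambda lines: [empty_row for _ in lines]
    # Single "value" column: one single-field row per line.
    return lambda lines: [(line,) for line in lines]

# Pruned read: only partition columns requested, yet 3 rows come back.
pruned = build_line_reader([])
print(len(pruned(["a", "b", "c"])))  # 3
```

Before the fix, ignoring the requested schema meant a query touching only partition columns could see wrong data; the empty-schema branch is what makes such queries both correct and cheap.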
spark git commit: [SPARK-14343][SQL] Proper column pruning for text data source
Repository: spark Updated Branches: refs/heads/master 6563d72b1 -> 1f43562da [SPARK-14343][SQL] Proper column pruning for text data source ## What changes were proposed in this pull request? Text data source ignores requested schema, and may give wrong result when the only data column is not requested. This may happen when only partitioning column(s) are requested for a partitioned text table. ## How was this patch tested? New test case added in `TextSuite`. Author: Cheng Lian <l...@databricks.com> Closes #13431 from liancheng/spark-14343-partitioned-text-table. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/1f43562d Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/1f43562d Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/1f43562d Branch: refs/heads/master Commit: 1f43562daf9454428796317203d9dcc9030a46eb Parents: 6563d72 Author: Cheng Lian <l...@databricks.com> Authored: Wed Jun 1 07:30:55 2016 -0700 Committer: Cheng Lian <l...@databricks.com> Committed: Wed Jun 1 07:30:55 2016 -0700 -- .../datasources/text/TextFileFormat.scala | 31 +--- .../execution/datasources/text/TextSuite.scala | 17 +-- 2 files changed, 35 insertions(+), 13 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/1f43562d/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/text/TextFileFormat.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/text/TextFileFormat.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/text/TextFileFormat.scala index 1e5bce4..9c03ab2 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/text/TextFileFormat.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/text/TextFileFormat.scala @@ -92,20 +92,31 @@ class TextFileFormat extends FileFormat with DataSourceRegister { filters: Seq[Filter], options: Map[String, String], hadoopConf: Configuration): 
PartitionedFile => Iterator[InternalRow] = { +assert( + requiredSchema.length <= 1, + "Text data source only produces a single data column named \"value\".") + val broadcastedHadoopConf = sparkSession.sparkContext.broadcast(new SerializableConfiguration(hadoopConf)) (file: PartitionedFile) => { - val unsafeRow = new UnsafeRow(1) - val bufferHolder = new BufferHolder(unsafeRow) - val unsafeRowWriter = new UnsafeRowWriter(bufferHolder, 1) - - new HadoopFileLinesReader(file, broadcastedHadoopConf.value.value).map { line => -// Writes to an UnsafeRow directly -bufferHolder.reset() -unsafeRowWriter.write(0, line.getBytes, 0, line.getLength) -unsafeRow.setTotalSize(bufferHolder.totalSize()) -unsafeRow + val reader = new HadoopFileLinesReader(file, broadcastedHadoopConf.value.value) + + if (requiredSchema.isEmpty) { +val emptyUnsafeRow = new UnsafeRow(0) +reader.map(_ => emptyUnsafeRow) + } else { +val unsafeRow = new UnsafeRow(1) +val bufferHolder = new BufferHolder(unsafeRow) +val unsafeRowWriter = new UnsafeRowWriter(bufferHolder, 1) + +reader.map { line => + // Writes to an UnsafeRow directly + bufferHolder.reset() + unsafeRowWriter.write(0, line.getBytes, 0, line.getLength) + unsafeRow.setTotalSize(bufferHolder.totalSize()) + unsafeRow +} } } } http://git-wip-us.apache.org/repos/asf/spark/blob/1f43562d/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/text/TextSuite.scala -- diff --git a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/text/TextSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/text/TextSuite.scala index b5e51e9..7b6981f 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/text/TextSuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/text/TextSuite.scala @@ -19,9 +19,6 @@ package org.apache.spark.sql.execution.datasources.text import java.io.File -import scala.collection.JavaConverters._ - -import 
org.apache.hadoop.conf.Configuration import org.apache.hadoop.io.SequenceFile.CompressionType import org.apache.hadoop.io.compress.GzipCodec @@ -31,6 +28,7 @@ import org.apache.spark.sql.types.{StringType, StructType} import org.apache.spark.util.Utils class TextSuite extends QueryTest with SharedSQLContext { + import testImplicits._ test("reading text f
spark git commit: [SPARK-15112][SQL] Disables EmbedSerializerInFilter for plan fragments that change schema
Repository: spark Updated Branches: refs/heads/branch-2.0 be2d23dfd -> 6dae027a6 [SPARK-15112][SQL] Disables EmbedSerializerInFilter for plan fragments that change schema ## What changes were proposed in this pull request? `EmbedSerializerInFilter` implicitly assumes that the plan fragment being optimized doesn't change plan schema, which is reasonable because `Dataset.filter` should never change the schema. However, due to another issue involving `DeserializeToObject` and `SerializeFromObject`, typed filter *does* change plan schema (see [SPARK-15632][1]). This breaks `EmbedSerializerInFilter` and causes corrupted data. This PR disables `EmbedSerializerInFilter` when there's a schema change to avoid data corruption. The schema change issue should be addressed in follow-up PRs. ## How was this patch tested? New test case added in `DatasetSuite`. [1]: https://issues.apache.org/jira/browse/SPARK-15632 Author: Cheng Lian <l...@databricks.com> Closes #13362 from liancheng/spark-15112-corrupted-filter. 
(cherry picked from commit 1360a6d636dd812a27955fc85df8e0255db60dfa) Signed-off-by: Cheng Lian <l...@databricks.com> Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/6dae027a Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/6dae027a Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/6dae027a Branch: refs/heads/branch-2.0 Commit: 6dae027a6cdd7c862963f71e1ea08f7f1b4b3506 Parents: be2d23d Author: Cheng Lian <l...@databricks.com> Authored: Sun May 29 23:19:12 2016 -0700 Committer: Cheng Lian <l...@databricks.com> Committed: Sun May 29 23:19:29 2016 -0700 -- .../sql/catalyst/optimizer/Optimizer.scala | 21 +++- .../org/apache/spark/sql/DatasetSuite.scala | 16 ++- 2 files changed, 35 insertions(+), 2 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/6dae027a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala index 48d7009..688c77d 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala @@ -1597,7 +1597,19 @@ case class GetCurrentDatabase(sessionCatalog: SessionCatalog) extends Rule[Logic */ object EmbedSerializerInFilter extends Rule[LogicalPlan] { def apply(plan: LogicalPlan): LogicalPlan = plan transform { -case s @ SerializeFromObject(_, Filter(condition, d: DeserializeToObject)) => +case s @ SerializeFromObject(_, Filter(condition, d: DeserializeToObject)) + // SPARK-15632: Conceptually, filter operator should never introduce schema change. This + // optimization rule also relies on this assumption. However, Dataset typed filter operator + // does introduce schema changes in some cases. 
Thus, we only enable this optimization when + // + // 1. either input and output schemata are exactly the same, or + // 2. both input and output schemata are single-field schema and share the same type. + // + // The 2nd case is included because encoders for primitive types always have only a single + // field with hard-coded field name "value". + // TODO Cleans this up after fixing SPARK-15632. + if s.schema == d.child.schema || samePrimitiveType(s.schema, d.child.schema) => + val numObjects = condition.collect { case a: Attribute if a == d.output.head => a }.length @@ -1622,6 +1634,13 @@ object EmbedSerializerInFilter extends Rule[LogicalPlan] { Project(objAttrs, filter) } } + + def samePrimitiveType(lhs: StructType, rhs: StructType): Boolean = { +(lhs, rhs) match { + case (StructType(Array(f1)), StructType(Array(f2))) => f1.dataType == f2.dataType + case _ => false +} + } } /** http://git-wip-us.apache.org/repos/asf/spark/blob/6dae027a/sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala -- diff --git a/sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala index e395007..8fc4dc9 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala @@ -706,7 +706,7 @@ class DatasetSuite extends QueryTest with SharedSQLContext { val dataset = Seq(1, 2, 3).toDS() dataset.crea
spark git commit: [SPARK-15112][SQL] Disables EmbedSerializerInFilter for plan fragments that change schema
Repository: spark Updated Branches: refs/heads/master ce1572d16 -> 1360a6d63 [SPARK-15112][SQL] Disables EmbedSerializerInFilter for plan fragments that change schema ## What changes were proposed in this pull request? `EmbedSerializerInFilter` implicitly assumes that the plan fragment being optimized doesn't change plan schema, which is reasonable because `Dataset.filter` should never change the schema. However, due to another issue involving `DeserializeToObject` and `SerializeFromObject`, typed filter *does* change plan schema (see [SPARK-15632][1]). This breaks `EmbedSerializerInFilter` and causes corrupted data. This PR disables `EmbedSerializerInFilter` when there's a schema change to avoid data corruption. The schema change issue should be addressed in follow-up PRs. ## How was this patch tested? New test case added in `DatasetSuite`. [1]: https://issues.apache.org/jira/browse/SPARK-15632 Author: Cheng Lian <l...@databricks.com> Closes #13362 from liancheng/spark-15112-corrupted-filter. 
Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/1360a6d6 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/1360a6d6 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/1360a6d6 Branch: refs/heads/master Commit: 1360a6d636dd812a27955fc85df8e0255db60dfa Parents: ce1572d Author: Cheng Lian <l...@databricks.com> Authored: Sun May 29 23:19:12 2016 -0700 Committer: Cheng Lian <l...@databricks.com> Committed: Sun May 29 23:19:12 2016 -0700 -- .../sql/catalyst/optimizer/Optimizer.scala | 21 +++- .../org/apache/spark/sql/DatasetSuite.scala | 16 ++- 2 files changed, 35 insertions(+), 2 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/1360a6d6/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala index 48d7009..688c77d 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala @@ -1597,7 +1597,19 @@ case class GetCurrentDatabase(sessionCatalog: SessionCatalog) extends Rule[Logic */ object EmbedSerializerInFilter extends Rule[LogicalPlan] { def apply(plan: LogicalPlan): LogicalPlan = plan transform { -case s @ SerializeFromObject(_, Filter(condition, d: DeserializeToObject)) => +case s @ SerializeFromObject(_, Filter(condition, d: DeserializeToObject)) + // SPARK-15632: Conceptually, filter operator should never introduce schema change. This + // optimization rule also relies on this assumption. However, Dataset typed filter operator + // does introduce schema changes in some cases. Thus, we only enable this optimization when + // + // 1. either input and output schemata are exactly the same, or + // 2. 
both input and output schemata are single-field schema and share the same type. + // + // The 2nd case is included because encoders for primitive types always have only a single + // field with hard-coded field name "value". + // TODO Cleans this up after fixing SPARK-15632. + if s.schema == d.child.schema || samePrimitiveType(s.schema, d.child.schema) => + val numObjects = condition.collect { case a: Attribute if a == d.output.head => a }.length @@ -1622,6 +1634,13 @@ object EmbedSerializerInFilter extends Rule[LogicalPlan] { Project(objAttrs, filter) } } + + def samePrimitiveType(lhs: StructType, rhs: StructType): Boolean = { +(lhs, rhs) match { + case (StructType(Array(f1)), StructType(Array(f2))) => f1.dataType == f2.dataType + case _ => false +} + } } /** http://git-wip-us.apache.org/repos/asf/spark/blob/1360a6d6/sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala -- diff --git a/sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala index e395007..8fc4dc9 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala @@ -706,7 +706,7 @@ class DatasetSuite extends QueryTest with SharedSQLContext { val dataset = Seq(1, 2, 3).toDS() dataset.createOrReplaceTempView("tempView") -// Overrrides the existing temporary view with same name +// Overrides the existing t
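The guard added to `EmbedSerializerInFilter` above reduces to a small schema comparison. A minimal sketch of that predicate, modeling a schema as a list of `(field_name, data_type)` pairs (an assumption for illustration, not Catalyst's `StructType`):

```python
def same_primitive_type(lhs, rhs):
    """True iff both schemas have exactly one field and the fields share a
    data type. Field names may legitimately differ because encoders for
    primitive types hard-code the single field name "value"."""
    return len(lhs) == 1 and len(rhs) == 1 and lhs[0][1] == rhs[0][1]

def embed_serializer_applicable(serializer_schema, deserializer_child_schema):
    # The optimization only fires when the filter fragment provably
    # preserves the schema (case 1) or both sides are equivalent
    # single-primitive schemas (case 2); otherwise embedding the
    # serializer could corrupt data (SPARK-15112).
    return (serializer_schema == deserializer_child_schema
            or same_primitive_type(serializer_schema, deserializer_child_schema))

# Typed filter over a Dataset[Int]: names differ ("value" vs "a"), types match.
print(embed_serializer_applicable([("value", "int")], [("a", "int")]))  # True
```

A fragment that actually changes shape — say two fields collapsing to one — fails both cases, so the rule is skipped and the plan stays safe.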
spark git commit: [SPARK-9876][SQL] Update Parquet to 1.8.1.
Repository: spark Updated Branches: refs/heads/branch-2.0 9c137b2e3 -> dc6e94157 [SPARK-9876][SQL] Update Parquet to 1.8.1. ## What changes were proposed in this pull request? This includes minimal changes to get Spark using the current release of Parquet, 1.8.1. ## How was this patch tested? This uses the existing Parquet tests. Author: Ryan Blue <b...@apache.org> Closes #13280 from rdblue/SPARK-9876-update-parquet. (cherry picked from commit 776d183c82b424ef7c3cae30537d8afe9b9eee83) Signed-off-by: Cheng Lian <l...@databricks.com> Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/dc6e9415 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/dc6e9415 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/dc6e9415 Branch: refs/heads/branch-2.0 Commit: dc6e94157ce08df91aa1a31db8e5ec733a1ab0c5 Parents: 9c137b2 Author: Ryan Blue <b...@apache.org> Authored: Fri May 27 16:59:38 2016 -0700 Committer: Cheng Lian <l...@databricks.com> Committed: Fri May 27 16:59:50 2016 -0700 -- dev/deps/spark-deps-hadoop-2.2 | 11 ++- dev/deps/spark-deps-hadoop-2.3 | 11 ++- dev/deps/spark-deps-hadoop-2.4 | 11 ++- dev/deps/spark-deps-hadoop-2.6 | 11 ++- dev/deps/spark-deps-hadoop-2.7 | 11 ++- pom.xml | 2 +- .../SpecificParquetRecordReaderBase.java| 20 +++-- .../parquet/CatalystReadSupport.scala | 12 ++- .../parquet/CatalystSchemaConverter.scala | 16 .../datasources/parquet/ParquetFilters.scala| 83 .../parquet/ParquetSchemaSuite.scala| 20 +++-- 11 files changed, 91 insertions(+), 117 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/dc6e9415/dev/deps/spark-deps-hadoop-2.2 -- diff --git a/dev/deps/spark-deps-hadoop-2.2 b/dev/deps/spark-deps-hadoop-2.2 index 578691c..deec033 100644 --- a/dev/deps/spark-deps-hadoop-2.2 +++ b/dev/deps/spark-deps-hadoop-2.2 @@ -129,14 +129,13 @@ opencsv-2.3.jar oro-2.0.8.jar osgi-resource-locator-1.0.1.jar paranamer-2.8.jar -parquet-column-1.7.0.jar -parquet-common-1.7.0.jar 
-parquet-encoding-1.7.0.jar +parquet-column-1.8.1.jar +parquet-common-1.8.1.jar +parquet-encoding-1.8.1.jar parquet-format-2.3.0-incubating.jar -parquet-generator-1.7.0.jar -parquet-hadoop-1.7.0.jar +parquet-hadoop-1.8.1.jar parquet-hadoop-bundle-1.6.0.jar -parquet-jackson-1.7.0.jar +parquet-jackson-1.8.1.jar pmml-model-1.2.15.jar pmml-schema-1.2.15.jar protobuf-java-2.5.0.jar http://git-wip-us.apache.org/repos/asf/spark/blob/dc6e9415/dev/deps/spark-deps-hadoop-2.3 -- diff --git a/dev/deps/spark-deps-hadoop-2.3 b/dev/deps/spark-deps-hadoop-2.3 index fc6306f..43c7dd3 100644 --- a/dev/deps/spark-deps-hadoop-2.3 +++ b/dev/deps/spark-deps-hadoop-2.3 @@ -136,14 +136,13 @@ opencsv-2.3.jar oro-2.0.8.jar osgi-resource-locator-1.0.1.jar paranamer-2.8.jar -parquet-column-1.7.0.jar -parquet-common-1.7.0.jar -parquet-encoding-1.7.0.jar +parquet-column-1.8.1.jar +parquet-common-1.8.1.jar +parquet-encoding-1.8.1.jar parquet-format-2.3.0-incubating.jar -parquet-generator-1.7.0.jar -parquet-hadoop-1.7.0.jar +parquet-hadoop-1.8.1.jar parquet-hadoop-bundle-1.6.0.jar -parquet-jackson-1.7.0.jar +parquet-jackson-1.8.1.jar pmml-model-1.2.15.jar pmml-schema-1.2.15.jar protobuf-java-2.5.0.jar http://git-wip-us.apache.org/repos/asf/spark/blob/dc6e9415/dev/deps/spark-deps-hadoop-2.4 -- diff --git a/dev/deps/spark-deps-hadoop-2.4 b/dev/deps/spark-deps-hadoop-2.4 index dee1417..7186b30 100644 --- a/dev/deps/spark-deps-hadoop-2.4 +++ b/dev/deps/spark-deps-hadoop-2.4 @@ -136,14 +136,13 @@ opencsv-2.3.jar oro-2.0.8.jar osgi-resource-locator-1.0.1.jar paranamer-2.8.jar -parquet-column-1.7.0.jar -parquet-common-1.7.0.jar -parquet-encoding-1.7.0.jar +parquet-column-1.8.1.jar +parquet-common-1.8.1.jar +parquet-encoding-1.8.1.jar parquet-format-2.3.0-incubating.jar -parquet-generator-1.7.0.jar -parquet-hadoop-1.7.0.jar +parquet-hadoop-1.8.1.jar parquet-hadoop-bundle-1.6.0.jar -parquet-jackson-1.7.0.jar +parquet-jackson-1.8.1.jar pmml-model-1.2.15.jar pmml-schema-1.2.15.jar protobuf-java-2.5.0.jar 
http://git-wip-us.apache.org/repos/asf/spark/blob/dc6e9415/dev/deps/spark-deps-hadoop-2.6 -- diff --git a/dev/deps/spark-deps-hadoop-2.6 b/dev/deps/spark-deps-hadoop-2.6 index 9695661..3e4ed74 100644 --- a/dev/deps/spark-deps-hadoop-2.6 +++ b/dev/deps/spark-deps-hadoop-2.6 @@ -144,14 +144,13 @@ opencsv-2.3.jar o
spark git commit: [SPARK-9876][SQL] Update Parquet to 1.8.1.
Repository: spark Updated Branches: refs/heads/master 019afd9c7 -> 776d183c8 [SPARK-9876][SQL] Update Parquet to 1.8.1. ## What changes were proposed in this pull request? This includes minimal changes to get Spark using the current release of Parquet, 1.8.1. ## How was this patch tested? This uses the existing Parquet tests. Author: Ryan Blue <b...@apache.org> Closes #13280 from rdblue/SPARK-9876-update-parquet. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/776d183c Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/776d183c Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/776d183c Branch: refs/heads/master Commit: 776d183c82b424ef7c3cae30537d8afe9b9eee83 Parents: 019afd9 Author: Ryan Blue <b...@apache.org> Authored: Fri May 27 16:59:38 2016 -0700 Committer: Cheng Lian <l...@databricks.com> Committed: Fri May 27 16:59:38 2016 -0700 -- dev/deps/spark-deps-hadoop-2.2 | 11 ++- dev/deps/spark-deps-hadoop-2.3 | 11 ++- dev/deps/spark-deps-hadoop-2.4 | 11 ++- dev/deps/spark-deps-hadoop-2.6 | 11 ++- dev/deps/spark-deps-hadoop-2.7 | 11 ++- pom.xml | 2 +- .../SpecificParquetRecordReaderBase.java| 20 +++-- .../parquet/CatalystReadSupport.scala | 12 ++- .../parquet/CatalystSchemaConverter.scala | 16 .../datasources/parquet/ParquetFilters.scala| 83 .../parquet/ParquetSchemaSuite.scala| 20 +++-- 11 files changed, 91 insertions(+), 117 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/776d183c/dev/deps/spark-deps-hadoop-2.2 -- diff --git a/dev/deps/spark-deps-hadoop-2.2 b/dev/deps/spark-deps-hadoop-2.2 index 578691c..deec033 100644 --- a/dev/deps/spark-deps-hadoop-2.2 +++ b/dev/deps/spark-deps-hadoop-2.2 @@ -129,14 +129,13 @@ opencsv-2.3.jar oro-2.0.8.jar osgi-resource-locator-1.0.1.jar paranamer-2.8.jar -parquet-column-1.7.0.jar -parquet-common-1.7.0.jar -parquet-encoding-1.7.0.jar +parquet-column-1.8.1.jar +parquet-common-1.8.1.jar +parquet-encoding-1.8.1.jar 
parquet-format-2.3.0-incubating.jar -parquet-generator-1.7.0.jar -parquet-hadoop-1.7.0.jar +parquet-hadoop-1.8.1.jar parquet-hadoop-bundle-1.6.0.jar -parquet-jackson-1.7.0.jar +parquet-jackson-1.8.1.jar pmml-model-1.2.15.jar pmml-schema-1.2.15.jar protobuf-java-2.5.0.jar http://git-wip-us.apache.org/repos/asf/spark/blob/776d183c/dev/deps/spark-deps-hadoop-2.3 -- diff --git a/dev/deps/spark-deps-hadoop-2.3 b/dev/deps/spark-deps-hadoop-2.3 index fc6306f..43c7dd3 100644 --- a/dev/deps/spark-deps-hadoop-2.3 +++ b/dev/deps/spark-deps-hadoop-2.3 @@ -136,14 +136,13 @@ opencsv-2.3.jar oro-2.0.8.jar osgi-resource-locator-1.0.1.jar paranamer-2.8.jar -parquet-column-1.7.0.jar -parquet-common-1.7.0.jar -parquet-encoding-1.7.0.jar +parquet-column-1.8.1.jar +parquet-common-1.8.1.jar +parquet-encoding-1.8.1.jar parquet-format-2.3.0-incubating.jar -parquet-generator-1.7.0.jar -parquet-hadoop-1.7.0.jar +parquet-hadoop-1.8.1.jar parquet-hadoop-bundle-1.6.0.jar -parquet-jackson-1.7.0.jar +parquet-jackson-1.8.1.jar pmml-model-1.2.15.jar pmml-schema-1.2.15.jar protobuf-java-2.5.0.jar http://git-wip-us.apache.org/repos/asf/spark/blob/776d183c/dev/deps/spark-deps-hadoop-2.4 -- diff --git a/dev/deps/spark-deps-hadoop-2.4 b/dev/deps/spark-deps-hadoop-2.4 index dee1417..7186b30 100644 --- a/dev/deps/spark-deps-hadoop-2.4 +++ b/dev/deps/spark-deps-hadoop-2.4 @@ -136,14 +136,13 @@ opencsv-2.3.jar oro-2.0.8.jar osgi-resource-locator-1.0.1.jar paranamer-2.8.jar -parquet-column-1.7.0.jar -parquet-common-1.7.0.jar -parquet-encoding-1.7.0.jar +parquet-column-1.8.1.jar +parquet-common-1.8.1.jar +parquet-encoding-1.8.1.jar parquet-format-2.3.0-incubating.jar -parquet-generator-1.7.0.jar -parquet-hadoop-1.7.0.jar +parquet-hadoop-1.8.1.jar parquet-hadoop-bundle-1.6.0.jar -parquet-jackson-1.7.0.jar +parquet-jackson-1.8.1.jar pmml-model-1.2.15.jar pmml-schema-1.2.15.jar protobuf-java-2.5.0.jar http://git-wip-us.apache.org/repos/asf/spark/blob/776d183c/dev/deps/spark-deps-hadoop-2.6 -- diff --git 
a/dev/deps/spark-deps-hadoop-2.6 b/dev/deps/spark-deps-hadoop-2.6 index 9695661..3e4ed74 100644 --- a/dev/deps/spark-deps-hadoop-2.6 +++ b/dev/deps/spark-deps-hadoop-2.6 @@ -144,14 +144,13 @@ opencsv-2.3.jar oro-2.0.8.jar osgi-resource-locator-1.0.1.jar paranamer-2.8.jar -parquet-column-1.7.0.jar -parquet-common-1.7.0.jar -parquet-encodi
spark git commit: [SPARK-15550][SQL] Dataset.show() should show contents of nested products as rows
Repository: spark Updated Branches: refs/heads/branch-2.0 29681cca1 -> 4939c2a12 [SPARK-15550][SQL] Dataset.show() should show contents nested products as rows ## What changes were proposed in this pull request? This PR addresses two related issues: 1. `Dataset.showString()` should show case classes/Java beans at all levels as rows, while master code only handles top level ones. 2. `Dataset.showString()` should show full contents produced the underlying query plan Dataset is only a view of the underlying query plan. Columns not referred by the encoder are still reachable using methods like `Dataset.col`. So it probably makes more sense to show full contents of the query plan. ## How was this patch tested? Two new test cases are added in `DatasetSuite` to check `.showString()` output. Author: Cheng Lian <l...@databricks.com> Closes #13331 from liancheng/spark-15550-ds-show. (cherry picked from commit e7082caeb4a53c1ee172d136894eece1ac880f65) Signed-off-by: Cheng Lian <l...@databricks.com> Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/4939c2a1 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/4939c2a1 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/4939c2a1 Branch: refs/heads/branch-2.0 Commit: 4939c2a12c854a87a020a7e759e4f87810f16710 Parents: 29681cc Author: Cheng Lian <l...@databricks.com> Authored: Thu May 26 16:23:48 2016 -0700 Committer: Cheng Lian <l...@databricks.com> Committed: Thu May 26 16:23:56 2016 -0700 -- .../scala/org/apache/spark/sql/Dataset.scala| 10 +-- .../org/apache/spark/sql/DatasetSuite.scala | 68 ++-- 2 files changed, 52 insertions(+), 26 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/4939c2a1/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala b/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala index 961ae32..85f0cf8 100644 --- 
a/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala @@ -237,19 +237,13 @@ class Dataset[T] private[sql]( */ private[sql] def showString(_numRows: Int, truncate: Boolean = true): String = { val numRows = _numRows.max(0) -val takeResult = take(numRows + 1) +val takeResult = toDF().take(numRows + 1) val hasMoreData = takeResult.length > numRows val data = takeResult.take(numRows) // For array values, replace Seq and Array with square brackets // For cells that are beyond 20 characters, replace it with the first 17 and "..." -val rows: Seq[Seq[String]] = schema.fieldNames.toSeq +: data.map { - case r: Row => r - case tuple: Product => Row.fromTuple(tuple) - case definedByCtor: DefinedByConstructorParams => - Row.fromSeq(ScalaReflection.getConstructorParameterValues(definedByCtor)) - case o => Row(o) -}.map { row => +val rows: Seq[Seq[String]] = schema.fieldNames.toSeq +: data.map { row => row.toSeq.map { cell => val str = cell match { case null => "null" http://git-wip-us.apache.org/repos/asf/spark/blob/4939c2a1/sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala -- diff --git a/sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala index 05de79e..32320a6 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala @@ -22,9 +22,8 @@ import java.sql.{Date, Timestamp} import scala.language.postfixOps -import org.scalatest.words.MatcherWords.be - import org.apache.spark.sql.catalyst.encoders.{OuterScopes, RowEncoder} +import org.apache.spark.sql.catalyst.util.sideBySide import org.apache.spark.sql.execution.streaming.MemoryStream import org.apache.spark.sql.functions._ import org.apache.spark.sql.test.SharedSQLContext @@ -217,7 +216,7 @@ class DatasetSuite extends QueryTest with SharedSQLContext { val ds = Seq(("a", 1), ("b", 2), 
("c", 3)).toDS() checkDataset( ds.filter(_._1 == "b").select(expr("_1").as[String]), - ("b")) + "b") } test("foreach") { @@ -436,20 +435,6 @@ class DatasetSuite extends QueryTest with SharedSQLContext { assert(ds.toString == "[_1: int, _2: int]") } - test("showString: Kryo encoder") { -implicit val k
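The rendering logic that `showString` applies after this change can be sketched as follows. This is a simplified stand-in, not Spark's implementation: rows are assumed to arrive already flattened (the Dataset is converted to a DataFrame first, so nested case classes come back as row-like values), and the trailing "only showing..." footer is an assumption based on Spark's usual output rather than something visible in the diff above:

```python
def show_string(field_names, rows, num_rows=20, truncate=True):
    """Render a header row plus data rows; cells longer than 20 characters
    are cut to the first 17 plus "...", and a footer marks truncated
    output when more rows exist than were requested."""
    has_more = len(rows) > num_rows
    data = rows[:num_rows]

    def render(cell):
        s = "null" if cell is None else str(cell)
        if truncate and len(s) > 20:
            return s[:17] + "..."
        return s

    table = [list(field_names)] + [[render(c) for c in row] for row in data]
    out = "\n".join(" | ".join(r) for r in table)
    if has_more:
        out += "\nonly showing top %d rows" % num_rows
    return out
```

Taking `num_rows + 1` rows (as the real code does with `toDF().take(numRows + 1)`) is exactly what lets `has_more` be computed without counting the whole Dataset.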
spark git commit: [SPARK-15550][SQL] Dataset.show() should show contents of nested products as rows
Repository: spark Updated Branches: refs/heads/master fe6de16f7 -> e7082caeb [SPARK-15550][SQL] Dataset.show() should show contents of nested products as rows ## What changes were proposed in this pull request? This PR addresses two related issues: 1. `Dataset.showString()` should show case classes/Java beans at all levels as rows, while the master code only handles top-level ones. 2. `Dataset.showString()` should show the full contents produced by the underlying query plan. A Dataset is only a view of the underlying query plan. Columns not referred to by the encoder are still reachable using methods like `Dataset.col`. So it probably makes more sense to show the full contents of the query plan. ## How was this patch tested? Two new test cases are added in `DatasetSuite` to check `.showString()` output. Author: Cheng Lian <l...@databricks.com> Closes #13331 from liancheng/spark-15550-ds-show. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/e7082cae Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/e7082cae Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/e7082cae Branch: refs/heads/master Commit: e7082caeb4a53c1ee172d136894eece1ac880f65 Parents: fe6de16 Author: Cheng Lian <l...@databricks.com> Authored: Thu May 26 16:23:48 2016 -0700 Committer: Cheng Lian <l...@databricks.com> Committed: Thu May 26 16:23:48 2016 -0700 -- .../scala/org/apache/spark/sql/Dataset.scala| 10 +-- .../org/apache/spark/sql/DatasetSuite.scala | 68 ++-- 2 files changed, 52 insertions(+), 26 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/e7082cae/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala b/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala index 961ae32..85f0cf8 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala @@ -237,19 +237,13 @@
class Dataset[T] private[sql]( */ private[sql] def showString(_numRows: Int, truncate: Boolean = true): String = { val numRows = _numRows.max(0) -val takeResult = take(numRows + 1) +val takeResult = toDF().take(numRows + 1) val hasMoreData = takeResult.length > numRows val data = takeResult.take(numRows) // For array values, replace Seq and Array with square brackets // For cells that are beyond 20 characters, replace it with the first 17 and "..." -val rows: Seq[Seq[String]] = schema.fieldNames.toSeq +: data.map { - case r: Row => r - case tuple: Product => Row.fromTuple(tuple) - case definedByCtor: DefinedByConstructorParams => - Row.fromSeq(ScalaReflection.getConstructorParameterValues(definedByCtor)) - case o => Row(o) -}.map { row => +val rows: Seq[Seq[String]] = schema.fieldNames.toSeq +: data.map { row => row.toSeq.map { cell => val str = cell match { case null => "null" http://git-wip-us.apache.org/repos/asf/spark/blob/e7082cae/sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala -- diff --git a/sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala index 05de79e..32320a6 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala @@ -22,9 +22,8 @@ import java.sql.{Date, Timestamp} import scala.language.postfixOps -import org.scalatest.words.MatcherWords.be - import org.apache.spark.sql.catalyst.encoders.{OuterScopes, RowEncoder} +import org.apache.spark.sql.catalyst.util.sideBySide import org.apache.spark.sql.execution.streaming.MemoryStream import org.apache.spark.sql.functions._ import org.apache.spark.sql.test.SharedSQLContext @@ -217,7 +216,7 @@ class DatasetSuite extends QueryTest with SharedSQLContext { val ds = Seq(("a", 1), ("b", 2), ("c", 3)).toDS() checkDataset( ds.filter(_._1 == "b").select(expr("_1").as[String]), - ("b")) + "b") } test("foreach") { @@ -436,20 +435,6 @@ class 
DatasetSuite extends QueryTest with SharedSQLContext { assert(ds.toString == "[_1: int, _2: int]") } - test("showString: Kryo encoder") { -implicit val kryoEncoder = Encoders.kryo[KryoData] -val ds = Seq(KryoData(1), KryoData(2)).toDS() - -val expectedAnswer = """
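The cell-rendering rule mentioned in the diff's comment — null becomes "null", and cells beyond 20 characters are replaced by the first 17 plus "..." — can be sketched outside Spark. This is a hypothetical Python rendition of that rule (the function name is mine, not the actual `Dataset.showString` code):

```python
def format_cell(cell, truncate=True, max_len=20):
    """Render one table cell the way the showString comment describes:
    None becomes "null"; strings longer than max_len are cut to the
    first (max_len - 3) characters plus "..."."""
    s = "null" if cell is None else str(cell)
    if truncate and len(s) > max_len:
        s = s[: max_len - 3] + "..."
    return s
```

With the default limit, a truncated cell always comes back exactly 20 characters long, so the rendered table stays aligned.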
spark git commit: [SPARK-15498][TESTS] fix slow tests
Repository: spark Updated Branches: refs/heads/branch-2.0 e13cfd6d2 -> fb7b90f61 [SPARK-15498][TESTS] fix slow tests ## What changes were proposed in this pull request? This PR fixes 3 slow tests: 1. `ParquetQuerySuite.read/write wide table`: This is not a good unit test as it runs for more than 5 minutes. This PR removes it and adds a new regression test in `CodeGenerationSuite`, which is more "unit". 2. `ParquetQuerySuite.returning batch for wide table`: reduce the threshold and use smaller data size. 3. `DatasetSuite.SPARK-14554: Dataset.map may generate wrong java code for wide table`: Improving `CodeFormatter.format` (introduced in https://github.com/apache/spark/pull/12979) dramatically speeds it up. ## How was this patch tested? N/A Author: Wenchen Fan <wenc...@databricks.com> Closes #13273 from cloud-fan/test. (cherry picked from commit 50b660d725269dc0c11e0d350ddd7fc8b19539a0) Signed-off-by: Cheng Lian <l...@databricks.com> Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/fb7b90f6 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/fb7b90f6 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/fb7b90f6 Branch: refs/heads/branch-2.0 Commit: fb7b90f61a7390b5ddb182413f628530cc72bee0 Parents: e13cfd6 Author: Wenchen Fan <wenc...@databricks.com> Authored: Tue May 24 21:23:39 2016 -0700 Committer: Cheng Lian <l...@databricks.com> Committed: Tue May 24 21:23:49 2016 -0700 -- .../expressions/codegen/CodeFormatter.scala | 48 + .../expressions/codegen/CodeGenerator.scala | 22 ++-- .../expressions/CodeGenerationSuite.scala | 14 +++ .../codegen/CodeFormatterSuite.scala| 107 +-- .../datasources/parquet/ParquetQuerySuite.scala | 16 +-- 5 files changed, 128 insertions(+), 79 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/fb7b90f6/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeFormatter.scala -- diff --git
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeFormatter.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeFormatter.scala index 855ae64..05b7c96 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeFormatter.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeFormatter.scala @@ -17,7 +17,7 @@ package org.apache.spark.sql.catalyst.expressions.codegen -import org.apache.commons.lang3.StringUtils +import java.util.regex.Matcher /** * An utility class that indents a block of code based on the curly braces and parentheses. @@ -26,13 +26,17 @@ import org.apache.commons.lang3.StringUtils * Written by Matei Zaharia. */ object CodeFormatter { + val commentHolder = """\/\*(.+?)\*\/""".r + def format(code: CodeAndComment): String = { -new CodeFormatter().addLines( - StringUtils.replaceEach( -code.body, -code.comment.keys.toArray, -code.comment.values.toArray) -).result +val formatter = new CodeFormatter +code.body.split("\n").foreach { line => + val commentReplaced = commentHolder.replaceAllIn( +line.trim, +m => code.comment.get(m.group(1)).map(Matcher.quoteReplacement).getOrElse(m.group(0))) + formatter.addLine(commentReplaced) +} +formatter.result() } def stripExtraNewLines(input: String): String = { @@ -53,16 +57,28 @@ object CodeFormatter { def stripOverlappingComments(codeAndComment: CodeAndComment): CodeAndComment = { val code = new StringBuilder val map = codeAndComment.comment + +def getComment(line: String): Option[String] = { + if (line.startsWith("/*") && line.endsWith("*/")) { +map.get(line.substring(2, line.length - 2)) + } else { +None + } +} + var lastLine: String = "dummy" codeAndComment.body.split('\n').foreach { l => val line = l.trim() - val skip = lastLine.startsWith("/*") && lastLine.endsWith("*/") && -line.startsWith("/*") && line.endsWith("*/") && 
-map(lastLine).substring(3).contains(map(line).substring(3)) + + val skip = getComment(lastLine).zip(getComment(line)).exists { +case (lastComment, currentComment) => + lastComment.substring(3).contains(currentComment.substring(3)) + } + if (!skip) { -code.append(line) -code.append("\n") +code.append(line).appen
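The reworked `CodeFormatter.format` in the diff walks the generated code line by line and substitutes each `/*placeholder*/` back with its registered comment text, using `Matcher.quoteReplacement` so characters like `$` in the comment are treated literally. A hedged Python sketch of that per-line substitution (names are mine; Python's callable replacement already treats the returned string literally, so no quoting step is needed):

```python
import re

# Mirrors the Scala commentHolder pattern: a non-greedy /*...*/ capture.
COMMENT_HOLDER = re.compile(r"/\*(.+?)\*/")

def replace_comment_holders(line, comments):
    """Swap each /*id*/ placeholder for its registered comment text;
    placeholders with no registered comment are left untouched."""
    return COMMENT_HOLDER.sub(
        lambda m: comments.get(m.group(1), m.group(0)), line.strip())
```

Doing the lookup per line on already-trimmed text is what lets the formatter skip the big whole-body string replacement the old `StringUtils.replaceEach` call performed.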
spark git commit: [SPARK-15498][TESTS] fix slow tests
Repository: spark Updated Branches: refs/heads/master 4acababca -> 50b660d72 [SPARK-15498][TESTS] fix slow tests ## What changes were proposed in this pull request? This PR fixes 3 slow tests: 1. `ParquetQuerySuite.read/write wide table`: This is not a good unit test as it runs for more than 5 minutes. This PR removes it and adds a new regression test in `CodeGenerationSuite`, which is more "unit". 2. `ParquetQuerySuite.returning batch for wide table`: reduce the threshold and use smaller data size. 3. `DatasetSuite.SPARK-14554: Dataset.map may generate wrong java code for wide table`: Improving `CodeFormatter.format` (introduced in https://github.com/apache/spark/pull/12979) dramatically speeds it up. ## How was this patch tested? N/A Author: Wenchen Fan <wenc...@databricks.com> Closes #13273 from cloud-fan/test. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/50b660d7 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/50b660d7 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/50b660d7 Branch: refs/heads/master Commit: 50b660d725269dc0c11e0d350ddd7fc8b19539a0 Parents: 4acabab Author: Wenchen Fan <wenc...@databricks.com> Authored: Tue May 24 21:23:39 2016 -0700 Committer: Cheng Lian <l...@databricks.com> Committed: Tue May 24 21:23:39 2016 -0700 -- .../expressions/codegen/CodeFormatter.scala | 48 + .../expressions/codegen/CodeGenerator.scala | 22 ++-- .../expressions/CodeGenerationSuite.scala | 14 +++ .../codegen/CodeFormatterSuite.scala| 107 +-- .../datasources/parquet/ParquetQuerySuite.scala | 16 +-- 5 files changed, 128 insertions(+), 79 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/50b660d7/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeFormatter.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeFormatter.scala
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeFormatter.scala index 855ae64..05b7c96 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeFormatter.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeFormatter.scala @@ -17,7 +17,7 @@ package org.apache.spark.sql.catalyst.expressions.codegen -import org.apache.commons.lang3.StringUtils +import java.util.regex.Matcher /** * An utility class that indents a block of code based on the curly braces and parentheses. @@ -26,13 +26,17 @@ import org.apache.commons.lang3.StringUtils * Written by Matei Zaharia. */ object CodeFormatter { + val commentHolder = """\/\*(.+?)\*\/""".r + def format(code: CodeAndComment): String = { -new CodeFormatter().addLines( - StringUtils.replaceEach( -code.body, -code.comment.keys.toArray, -code.comment.values.toArray) -).result +val formatter = new CodeFormatter +code.body.split("\n").foreach { line => + val commentReplaced = commentHolder.replaceAllIn( +line.trim, +m => code.comment.get(m.group(1)).map(Matcher.quoteReplacement).getOrElse(m.group(0))) + formatter.addLine(commentReplaced) +} +formatter.result() } def stripExtraNewLines(input: String): String = { @@ -53,16 +57,28 @@ object CodeFormatter { def stripOverlappingComments(codeAndComment: CodeAndComment): CodeAndComment = { val code = new StringBuilder val map = codeAndComment.comment + +def getComment(line: String): Option[String] = { + if (line.startsWith("/*") && line.endsWith("*/")) { +map.get(line.substring(2, line.length - 2)) + } else { +None + } +} + var lastLine: String = "dummy" codeAndComment.body.split('\n').foreach { l => val line = l.trim() - val skip = lastLine.startsWith("/*") && lastLine.endsWith("*/") && -line.startsWith("/*") && line.endsWith("*/") && -map(lastLine).substring(3).contains(map(line).substring(3)) + + val skip = getComment(lastLine).zip(getComment(line)).exists { +case (lastComment, 
currentComment) => + lastComment.substring(3).contains(currentComment.substring(3)) + } + if (!skip) { -code.append(line) -code.append("\n") +code.append(line).append("\n") } + lastLine = line } new CodeAndComment(code.result().trim(), map) @@ -117,8 +133,9 @@ private class Co
spark git commit: [SPARK-15431][SQL] Support LIST FILE(s)|JAR(s) command natively
Repository: spark Updated Branches: refs/heads/branch-2.0 ca271c792 -> 3def56120 [SPARK-15431][SQL] Support LIST FILE(s)|JAR(s) command natively ## What changes were proposed in this pull request? Currently the command `ADD FILE|JAR` is supported natively in Spark SQL. However, when this command is run, the file/jar is added to the resources but cannot be looked up with a `LIST FILE(s)|JAR(s)` command, because the `LIST` command is passed to the Hive command processor in Spark SQL or is simply not supported in spark-shell. There is no way for users to find out what files/jars have been added to the Spark context. Refer to [Hive commands](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Cli) This PR adds support for the following commands: `LIST (FILE[s] [filepath ...] | JAR[s] [jarfile ...])` ### For example: # LIST FILE(s) ``` scala> spark.sql("add file hdfs://bdavm009.svl.ibm.com:8020/tmp/test.txt") res1: org.apache.spark.sql.DataFrame = [] scala> spark.sql("add file hdfs://bdavm009.svl.ibm.com:8020/tmp/test1.txt") res2: org.apache.spark.sql.DataFrame = [] scala> spark.sql("list file hdfs://bdavm009.svl.ibm.com:8020/tmp/test1.txt").show(false) +--+ |result| +--+ |hdfs://bdavm009.svl.ibm.com:8020/tmp/test1.txt| +--+ scala> spark.sql("list files").show(false) +--+ |result| +--+ |hdfs://bdavm009.svl.ibm.com:8020/tmp/test1.txt| |hdfs://bdavm009.svl.ibm.com:8020/tmp/test.txt | +--+ ``` # LIST JAR(s) ``` scala> spark.sql("add jar /Users/xinwu/spark/core/src/test/resources/TestUDTF.jar") res9: org.apache.spark.sql.DataFrame = [result: int] scala> spark.sql("list jar TestUDTF.jar").show(false) +-+ |result | +-+ |spark://192.168.1.234:50131/jars/TestUDTF.jar| +-+ scala> spark.sql("list jars").show(false) +-+ |result | +-+ |spark://192.168.1.234:50131/jars/TestUDTF.jar| +-+ ``` ## How was this patch tested? New test cases are added for the Spark-SQL, Spark-Shell and SparkContext API code paths.
Author: Xin Wu <xi...@us.ibm.com> Author: xin Wu <xi...@us.ibm.com> Closes #13212 from xwu0226/list_command. (cherry picked from commit 01659bc50cd3d53815d205d005c3678e714c08e0) Signed-off-by: Cheng Lian <l...@databricks.com> Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/3def5612 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/3def5612 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/3def5612 Branch: refs/heads/branch-2.0 Commit: 3def56120e0a57238c95f64e7bb84f973b22c065 Parents: ca271c7 Author: Xin Wu <xi...@us.ibm.com> Authored: Mon May 23 17:32:01 2016 -0700 Committer: Cheng Lian <l...@databricks.com> Committed: Mon May 23 17:32:15 2016 -0700 -- .../scala/org/apache/spark/SparkContext.scala | 10 core/src/test/resources/TestUDTF.jar| Bin 0 -> 1328 bytes .../org/apache/spark/SparkContextSuite.scala| 14 - .../apache/spark/sql/catalyst/parser/SqlBase.g4 | 5 +- .../spark/sql/execution/SparkSqlParser.scala| 39 +++-- .../spark/sql/execution/command/resources.scala | 57 ++- .../hive/thriftserver/SparkSQLCLIDriver.scala | 10 ++-- .../src/test/resources/TestUDTF.jar | Bin 0 -> 1328 bytes .../spark/sql/hive/thriftserver/CliSuite.scala | 19 +++ .../sql/hive/execution/HiveQuerySuite.scala | 12 10 files changed, 149 insertions(+), 17 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/3def5612/core/src/main/scala/org/apache/spark/SparkContext.scala -- diff --git a/core/src/main/scala/org/apache/spark/SparkContext.scala b/core/src/main/scala/org/apache/spark/SparkContext.scala index e6cdd0d..351024b 100644 --- a/core/src/main/scala/org/apache/spark/SparkContext.scala +++ b/core/src/main/scala/org/apache/spark/SparkContext.scala @@ -1387,6 +1387,11 @@ class SparkContext(config: SparkConf) extends Logging with ExecutorAllocationCli } /** + * Returns a list of file paths that are added to resources. + */
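The ADD/LIST behaviour shown in the shell transcript above — LIST with no arguments returns everything that has been added, while LIST with explicit paths filters to just those — can be modelled with a small registry. A Python sketch under my own names (this is not the SparkContext implementation, and the paths are illustrative):

```python
class ResourceRegistry:
    """Minimal sketch of the ADD FILE / LIST FILES semantics:
    adding records a path once; listing with no arguments returns all
    added paths, while listing with arguments filters to those paths."""

    def __init__(self):
        self._files = []

    def add_file(self, path):
        # Re-adding the same path is a no-op, preserving insertion order.
        if path not in self._files:
            self._files.append(path)

    def list_files(self, *paths):
        if not paths:
            return list(self._files)
        return [f for f in self._files if f in paths]
```

The filtering branch is what makes `list file <path>` return only the matching entry in the transcript, while `list files` dumps the whole registry.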
spark git commit: [SPARK-15431][SQL] Support LIST FILE(s)|JAR(s) command natively
Repository: spark Updated Branches: refs/heads/master a8e97d17b -> 01659bc50 [SPARK-15431][SQL] Support LIST FILE(s)|JAR(s) command natively ## What changes were proposed in this pull request? Currently the command `ADD FILE|JAR` is supported natively in Spark SQL. However, when this command is run, the file/jar is added to the resources but cannot be looked up with a `LIST FILE(s)|JAR(s)` command, because the `LIST` command is passed to the Hive command processor in Spark SQL or is simply not supported in spark-shell. There is no way for users to find out what files/jars have been added to the Spark context. Refer to [Hive commands](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Cli) This PR adds support for the following commands: `LIST (FILE[s] [filepath ...] | JAR[s] [jarfile ...])` ### For example: # LIST FILE(s) ``` scala> spark.sql("add file hdfs://bdavm009.svl.ibm.com:8020/tmp/test.txt") res1: org.apache.spark.sql.DataFrame = [] scala> spark.sql("add file hdfs://bdavm009.svl.ibm.com:8020/tmp/test1.txt") res2: org.apache.spark.sql.DataFrame = [] scala> spark.sql("list file hdfs://bdavm009.svl.ibm.com:8020/tmp/test1.txt").show(false) +--+ |result| +--+ |hdfs://bdavm009.svl.ibm.com:8020/tmp/test1.txt| +--+ scala> spark.sql("list files").show(false) +--+ |result| +--+ |hdfs://bdavm009.svl.ibm.com:8020/tmp/test1.txt| |hdfs://bdavm009.svl.ibm.com:8020/tmp/test.txt | +--+ ``` # LIST JAR(s) ``` scala> spark.sql("add jar /Users/xinwu/spark/core/src/test/resources/TestUDTF.jar") res9: org.apache.spark.sql.DataFrame = [result: int] scala> spark.sql("list jar TestUDTF.jar").show(false) +-+ |result | +-+ |spark://192.168.1.234:50131/jars/TestUDTF.jar| +-+ scala> spark.sql("list jars").show(false) +-+ |result | +-+ |spark://192.168.1.234:50131/jars/TestUDTF.jar| +-+ ``` ## How was this patch tested? New test cases are added for the Spark-SQL, Spark-Shell and SparkContext API code paths.
Author: Xin Wu <xi...@us.ibm.com> Author: xin Wu <xi...@us.ibm.com> Closes #13212 from xwu0226/list_command. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/01659bc5 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/01659bc5 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/01659bc5 Branch: refs/heads/master Commit: 01659bc50cd3d53815d205d005c3678e714c08e0 Parents: a8e97d1 Author: Xin Wu <xi...@us.ibm.com> Authored: Mon May 23 17:32:01 2016 -0700 Committer: Cheng Lian <l...@databricks.com> Committed: Mon May 23 17:32:01 2016 -0700 -- .../scala/org/apache/spark/SparkContext.scala | 10 core/src/test/resources/TestUDTF.jar| Bin 0 -> 1328 bytes .../org/apache/spark/SparkContextSuite.scala| 14 - .../apache/spark/sql/catalyst/parser/SqlBase.g4 | 5 +- .../spark/sql/execution/SparkSqlParser.scala| 39 +++-- .../spark/sql/execution/command/resources.scala | 57 ++- .../hive/thriftserver/SparkSQLCLIDriver.scala | 10 ++-- .../src/test/resources/TestUDTF.jar | Bin 0 -> 1328 bytes .../spark/sql/hive/thriftserver/CliSuite.scala | 19 +++ .../sql/hive/execution/HiveQuerySuite.scala | 12 10 files changed, 149 insertions(+), 17 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/01659bc5/core/src/main/scala/org/apache/spark/SparkContext.scala -- diff --git a/core/src/main/scala/org/apache/spark/SparkContext.scala b/core/src/main/scala/org/apache/spark/SparkContext.scala index e6cdd0d..351024b 100644 --- a/core/src/main/scala/org/apache/spark/SparkContext.scala +++ b/core/src/main/scala/org/apache/spark/SparkContext.scala @@ -1387,6 +1387,11 @@ class SparkContext(config: SparkConf) extends Logging with ExecutorAllocationCli } /** + * Returns a list of file paths that are added to resources. + */ + def listFiles(): Seq[String] = addedFiles.keySet.toSeq + + /** * Add a file to be downloaded with this Spark job on
spark git commit: [SPARK-14031][SQL] speedup CSV writer
Repository: spark Updated Branches: refs/heads/branch-2.0 4462da707 -> 6eb8ec6f4 [SPARK-14031][SQL] speedup CSV writer ## What changes were proposed in this pull request? Currently, we create a CSVWriter for every row, which is very expensive and memory hungry, taking about 15 seconds to write out 1 million rows (two columns). This PR writes the rows in batch mode, creating a CSVWriter for every 1k rows, which can write out 1 million rows in about 1 second (15X faster). ## How was this patch tested? Manually benchmarked it. Author: Davies Liu <dav...@databricks.com> Closes #13229 from davies/csv_writer. (cherry picked from commit 80091b8a6840b562cf76341926e5b828d4def7e2) Signed-off-by: Cheng Lian <l...@databricks.com> Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/6eb8ec6f Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/6eb8ec6f Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/6eb8ec6f Branch: refs/heads/branch-2.0 Commit: 6eb8ec6f41d7e460ba7f706590d2da62c045aead Parents: 4462da7 Author: Davies Liu <dav...@databricks.com> Authored: Mon May 23 10:48:25 2016 -0700 Committer: Cheng Lian <l...@databricks.com> Committed: Mon May 23 10:48:39 2016 -0700 -- .../execution/datasources/csv/CSVParser.scala | 19 +++- .../execution/datasources/csv/CSVRelation.scala | 23 +--- 2 files changed, 29 insertions(+), 13 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/6eb8ec6f/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVParser.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVParser.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVParser.scala index ae797a1..111995d 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVParser.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVParser.scala @@ -76,17 +76,26 @@ private[sql]
class LineCsvWriter(params: CSVOptions, headers: Seq[String]) exten writerSettings.setQuoteAllFields(false) writerSettings.setHeaders(headers: _*) - def writeRow(row: Seq[String], includeHeader: Boolean): String = { -val buffer = new ByteArrayOutputStream() -val outputWriter = new OutputStreamWriter(buffer, StandardCharsets.UTF_8) -val writer = new CsvWriter(outputWriter, writerSettings) + private var buffer = new ByteArrayOutputStream() + private var writer = new CsvWriter( +new OutputStreamWriter(buffer, StandardCharsets.UTF_8), +writerSettings) + def writeRow(row: Seq[String], includeHeader: Boolean): Unit = { if (includeHeader) { writer.writeHeaders() } writer.writeRow(row.toArray: _*) + } + + def flush(): String = { writer.close() -buffer.toString.stripLineEnd +val lines = buffer.toString.stripLineEnd +buffer = new ByteArrayOutputStream() +writer = new CsvWriter( + new OutputStreamWriter(buffer, StandardCharsets.UTF_8), + writerSettings) +lines } } http://git-wip-us.apache.org/repos/asf/spark/blob/6eb8ec6f/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVRelation.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVRelation.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVRelation.scala index 4f2d438..9849484 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVRelation.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVRelation.scala @@ -176,8 +176,8 @@ private[sql] class CsvOutputWriter( }.getRecordWriter(context) } - private var firstRow: Boolean = params.headerFlag - + private val FLUSH_BATCH_SIZE = 1024L + private var records: Long = 0L private val csvWriter = new LineCsvWriter(params, dataSchema.fieldNames.toSeq) private def rowToString(row: Seq[Any]): Seq[String] = row.map { field => @@ -191,16 +191,23 @@ private[sql] class CsvOutputWriter( override def write(row: Row): Unit = throw new 
UnsupportedOperationException("call writeInternal") override protected[sql] def writeInternal(row: InternalRow): Unit = { -// TODO: Instead of converting and writing every row, we should use the univocity buffer -val resultString = csvWriter.writeRow(rowToString(row.toSeq(dataSchema)), firstRow) -if (firstRow) { - firstRow = false +csvWriter.writeRow(rowToString(row.toSeq(dataSchema)), records == 0L && params.headerFlag) +records += 1 +if
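The batching idea in the diff — keep one writer alive, emit the header only before the first record, and flush every 1k rows instead of constructing a fresh `CsvWriter` per row — translates directly outside Spark. A hedged Python sketch of the same scheme (class and constant names are mine; Spark's version hands flushed text to a Hadoop record writer rather than keeping a list of chunks):

```python
import csv
import io

class BatchedCsvWriter:
    """Sketch of the batched CSV writing scheme: rows accumulate in an
    in-memory buffer that is flushed every flush_batch_size records,
    and the header is written only before the very first record."""

    def __init__(self, header=None, flush_batch_size=1024):
        self.header = header
        self.flush_batch_size = flush_batch_size
        self.records = 0
        self.chunks = []  # flushed CSV text; stands in for the output stream
        self._new_buffer()

    def _new_buffer(self):
        self._buf = io.StringIO()
        self._writer = csv.writer(self._buf)

    def write_row(self, row):
        if self.records == 0 and self.header:
            self._writer.writerow(self.header)
        self._writer.writerow(row)
        self.records += 1
        if self.records % self.flush_batch_size == 0:
            self.flush()

    def flush(self):
        # Drain the current buffer and start a fresh one, mirroring how
        # the patched LineCsvWriter rebuilds its stream after a flush.
        text = self._buf.getvalue()
        if text:
            self.chunks.append(text)
        self._new_buffer()
```

A final explicit `flush()` (Spark does this in `close()`) drains whatever partial batch remains.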
spark git commit: [SPARK-14031][SQL] speedup CSV writer
Repository: spark Updated Branches: refs/heads/master dafcb05c2 -> 80091b8a6 [SPARK-14031][SQL] speedup CSV writer ## What changes were proposed in this pull request? Currently, we create a CSVWriter for every row, which is very expensive and memory hungry, taking about 15 seconds to write out 1 million rows (two columns). This PR writes the rows in batch mode, creating a CSVWriter for every 1k rows, which can write out 1 million rows in about 1 second (15X faster). ## How was this patch tested? Manually benchmarked it. Author: Davies Liu <dav...@databricks.com> Closes #13229 from davies/csv_writer. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/80091b8a Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/80091b8a Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/80091b8a Branch: refs/heads/master Commit: 80091b8a6840b562cf76341926e5b828d4def7e2 Parents: dafcb05 Author: Davies Liu <dav...@databricks.com> Authored: Mon May 23 10:48:25 2016 -0700 Committer: Cheng Lian <l...@databricks.com> Committed: Mon May 23 10:48:25 2016 -0700 -- .../execution/datasources/csv/CSVParser.scala | 19 +++- .../execution/datasources/csv/CSVRelation.scala | 23 +--- 2 files changed, 29 insertions(+), 13 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/80091b8a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVParser.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVParser.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVParser.scala index ae797a1..111995d 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVParser.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVParser.scala @@ -76,17 +76,26 @@ private[sql] class LineCsvWriter(params: CSVOptions, headers: Seq[String]) exten writerSettings.setQuoteAllFields(false)
writerSettings.setHeaders(headers: _*) - def writeRow(row: Seq[String], includeHeader: Boolean): String = { -val buffer = new ByteArrayOutputStream() -val outputWriter = new OutputStreamWriter(buffer, StandardCharsets.UTF_8) -val writer = new CsvWriter(outputWriter, writerSettings) + private var buffer = new ByteArrayOutputStream() + private var writer = new CsvWriter( +new OutputStreamWriter(buffer, StandardCharsets.UTF_8), +writerSettings) + def writeRow(row: Seq[String], includeHeader: Boolean): Unit = { if (includeHeader) { writer.writeHeaders() } writer.writeRow(row.toArray: _*) + } + + def flush(): String = { writer.close() -buffer.toString.stripLineEnd +val lines = buffer.toString.stripLineEnd +buffer = new ByteArrayOutputStream() +writer = new CsvWriter( + new OutputStreamWriter(buffer, StandardCharsets.UTF_8), + writerSettings) +lines } } http://git-wip-us.apache.org/repos/asf/spark/blob/80091b8a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVRelation.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVRelation.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVRelation.scala index 4f2d438..9849484 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVRelation.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVRelation.scala @@ -176,8 +176,8 @@ private[sql] class CsvOutputWriter( }.getRecordWriter(context) } - private var firstRow: Boolean = params.headerFlag - + private val FLUSH_BATCH_SIZE = 1024L + private var records: Long = 0L private val csvWriter = new LineCsvWriter(params, dataSchema.fieldNames.toSeq) private def rowToString(row: Seq[Any]): Seq[String] = row.map { field => @@ -191,16 +191,23 @@ private[sql] class CsvOutputWriter( override def write(row: Row): Unit = throw new UnsupportedOperationException("call writeInternal") override protected[sql] def writeInternal(row: 
InternalRow): Unit = { -// TODO: Instead of converting and writing every row, we should use the univocity buffer -val resultString = csvWriter.writeRow(rowToString(row.toSeq(dataSchema)), firstRow) -if (firstRow) { - firstRow = false +csvWriter.writeRow(rowToString(row.toSeq(dataSchema)), records == 0L && params.headerFlag) +records += 1 +if (records % FLUSH_BATCH_SIZE == 0) { + flush() +} + } + + private def flush(): Unit = { +val lines = csvWriter.flush() +i
spark git commit: [SPARK-15334][SQL][HOTFIX] Fixes compilation error for Scala 2.10
Repository: spark Updated Branches: refs/heads/branch-2.0 14751cd0d -> a122a3e70 [SPARK-15334][SQL][HOTFIX] Fixes compilation error for Scala 2.10 ## What changes were proposed in this pull request? This PR fixes a Scala 2.10 compilation failure introduced in PR #13127. ## How was this patch tested? Jenkins build. Author: Cheng Lian <l...@databricks.com> Closes #13166 from liancheng/hotfix-for-scala-2.10. (cherry picked from commit c4a45fd855252f0f7130cdcc08486591fadab7df) Signed-off-by: Cheng Lian <l...@databricks.com> Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/a122a3e7 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/a122a3e7 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/a122a3e7 Branch: refs/heads/branch-2.0 Commit: a122a3e709479a6510d97dfa81f3d6219bb79de0 Parents: 14751cd Author: Cheng Lian <l...@databricks.com> Authored: Wed May 18 18:58:24 2016 +0800 Committer: Cheng Lian <l...@databricks.com> Committed: Wed May 18 18:58:54 2016 +0800 -- .../scala/org/apache/spark/sql/hive/client/VersionsSuite.scala | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/a122a3e7/sql/hive/src/test/scala/org/apache/spark/sql/hive/client/VersionsSuite.scala -- diff --git a/sql/hive/src/test/scala/org/apache/spark/sql/hive/client/VersionsSuite.scala b/sql/hive/src/test/scala/org/apache/spark/sql/hive/client/VersionsSuite.scala index 57e8db7..d46c4e7 100644 --- a/sql/hive/src/test/scala/org/apache/spark/sql/hive/client/VersionsSuite.scala +++ b/sql/hive/src/test/scala/org/apache/spark/sql/hive/client/VersionsSuite.scala @@ -137,7 +137,7 @@ class VersionsSuite extends SparkFunSuite with Logging { hiveMetastoreVersion = version, hadoopVersion = VersionInfo.getVersion, sparkConf = sparkConf, - hadoopConf, + hadoopConf = hadoopConf, config = buildConf(), ivyPath = ivyPath).createClient() } - To unsubscribe, e-mail: 
commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
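The one-line fix above names the `hadoopConf` argument because the Scala 2.10 compiler rejected a positional argument following named arguments in the same call. Python enforces a similar rule, which makes for a quick illustration (the function below is an illustrative stand-in, not Spark's `createClient` builder):

```python
def create_client(hive_version, hadoop_version, spark_conf=None,
                  hadoop_conf=None, config=None, ivy_path=None):
    """Hypothetical stand-in for the builder call in the diff."""
    return {"hive": hive_version, "hadoop": hadoop_version,
            "hadoop_conf": hadoop_conf}

# A positional argument after keyword arguments is a compile-time error,
# analogous to what the Scala 2.10 compiler reported for the original code:
bad = 'create_client("1.2.1", "2.6.0", spark_conf={}, {"fs": "hdfs"})'
try:
    compile(bad, "<example>", "eval")
    raised = False
except SyntaxError:
    raised = True

# Naming the argument, as the hotfix does, is accepted:
ok = create_client("1.2.1", "2.6.0", spark_conf={},
                   hadoop_conf={"fs": "hdfs"})
```

Naming every argument past the first keyword argument keeps the call unambiguous regardless of compiler version.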
spark git commit: [SPARK-15334][SQL][HOTFIX] Fixes compilation error for Scala 2.10
Repository: spark Updated Branches: refs/heads/master d2f81df1b -> c4a45fd85 [SPARK-15334][SQL][HOTFIX] Fixes compilation error for Scala 2.10 ## What changes were proposed in this pull request? This PR fixes a Scala 2.10 compilation failure introduced in PR #13127. ## How was this patch tested? Jenkins build. Author: Cheng Lian <l...@databricks.com> Closes #13166 from liancheng/hotfix-for-scala-2.10. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/c4a45fd8 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/c4a45fd8 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/c4a45fd8 Branch: refs/heads/master Commit: c4a45fd855252f0f7130cdcc08486591fadab7df Parents: d2f81df Author: Cheng Lian <l...@databricks.com> Authored: Wed May 18 18:58:24 2016 +0800 Committer: Cheng Lian <l...@databricks.com> Committed: Wed May 18 18:58:24 2016 +0800 -- .../scala/org/apache/spark/sql/hive/client/VersionsSuite.scala | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/c4a45fd8/sql/hive/src/test/scala/org/apache/spark/sql/hive/client/VersionsSuite.scala -- diff --git a/sql/hive/src/test/scala/org/apache/spark/sql/hive/client/VersionsSuite.scala b/sql/hive/src/test/scala/org/apache/spark/sql/hive/client/VersionsSuite.scala index 57e8db7..d46c4e7 100644 --- a/sql/hive/src/test/scala/org/apache/spark/sql/hive/client/VersionsSuite.scala +++ b/sql/hive/src/test/scala/org/apache/spark/sql/hive/client/VersionsSuite.scala @@ -137,7 +137,7 @@ class VersionsSuite extends SparkFunSuite with Logging { hiveMetastoreVersion = version, hadoopVersion = VersionInfo.getVersion, sparkConf = sparkConf, - hadoopConf, + hadoopConf = hadoopConf, config = buildConf(), ivyPath = ivyPath).createClient() }
spark git commit: [SPARK-15307][SQL] speed up listing files for data source
Repository: spark Updated Branches: refs/heads/branch-2.0 c66da74b1 -> 35c25beac [SPARK-15307][SQL] speed up listing files for data source ## What changes were proposed in this pull request? Currently, listing files is very slow when there are thousands of files, especially on a local file system, because: 1) FileStatus.getPermission() is very slow on a local file system, since it launches a subprocess and parses the stdout. 2) Creating a JobConf is very expensive (ClassUtil.findContainingJar() is slow). This PR improves this by: 1) Using another constructor of LocatedFileStatus to avoid calling FileStatus.getPermission(); the permissions are not used for data sources. 2) Creating the JobConf only once per task. ## How was this patch tested? Manual tests on a partitioned table with 1828 partitions decreased the time to load the table from 22 seconds to 1.6 seconds (most of the time is now spent merging schemas). Author: Davies Liu <dav...@databricks.com> Closes #13094 from davies/listing. (cherry picked from commit 33814f887aea339c99e14ce7f14ca6fcc6875015) Signed-off-by: Cheng Lian <l...@databricks.com> Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/35c25bea Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/35c25bea Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/35c25bea Branch: refs/heads/branch-2.0 Commit: 35c25beaca3e9070ccb6dcca33c415a26e6f1f1c Parents: c66da74 Author: Davies Liu <dav...@databricks.com> Authored: Wed May 18 18:46:57 2016 +0800 Committer: Cheng Lian <l...@databricks.com> Committed: Wed May 18 18:47:19 2016 +0800 -- .../datasources/ListingFileCatalog.scala| 9 ++--- .../datasources/fileSourceInterfaces.scala | 38 ++-- 2 files changed, 33 insertions(+), 14 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/35c25bea/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/ListingFileCatalog.scala -- diff --git
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/ListingFileCatalog.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/ListingFileCatalog.scala index 5cee2b9..644e5d6 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/ListingFileCatalog.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/ListingFileCatalog.scala @@ -77,12 +77,12 @@ class ListingFileCatalog( if (paths.length >= sparkSession.sessionState.conf.parallelPartitionDiscoveryThreshold) { HadoopFsRelation.listLeafFilesInParallel(paths, hadoopConf, sparkSession.sparkContext) } else { + // Dummy jobconf to get to the pathFilter defined in configuration + val jobConf = new JobConf(hadoopConf, this.getClass) + val pathFilter = FileInputFormat.getInputPathFilter(jobConf) val statuses: Seq[FileStatus] = paths.flatMap { path => val fs = path.getFileSystem(hadoopConf) logInfo(s"Listing $path on driver") -// Dummy jobconf to get to the pathFilter defined in configuration -val jobConf = new JobConf(hadoopConf, this.getClass) -val pathFilter = FileInputFormat.getInputPathFilter(jobConf) val statuses = { val stats = Try(fs.listStatus(path)).getOrElse(Array.empty[FileStatus]) @@ -101,7 +101,8 @@ class ListingFileCatalog( // - Here we are calling `getFileBlockLocations` in a sequential manner, but it should a // a big deal since we always use to `listLeafFilesInParallel` when the number of paths // exceeds threshold. 
- case f => new LocatedFileStatus(f, fs.getFileBlockLocations(f, 0, f.getLen)) + case f => +HadoopFsRelation.createLocatedFileStatus(f, fs.getFileBlockLocations(f, 0, f.getLen)) } }.filterNot { status => val name = status.getPath.getName http://git-wip-us.apache.org/repos/asf/spark/blob/35c25bea/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/fileSourceInterfaces.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/fileSourceInterfaces.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/fileSourceInterfaces.scala index b516297..8d332df 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/fileSourceInterfaces.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/fileSourceInterfaces.scala @@ -348,28 +348,40 @@ private[sql] object HadoopFsRelation extends Logging { pathName == "_SUCCESS" || pathName == "
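The core of the `ListingFileCatalog` change above is hoisting a loop-invariant construction out of the per-path loop. A hedged pure-Scala sketch of the pattern — `makeJobConf` is a stand-in for `new JobConf(hadoopConf, this.getClass)`, which is costly because `ClassUtil.findContainingJar()` scans the classpath:

```scala
// Hedged sketch of the loop-invariant hoisting applied in ListingFileCatalog.
// The counter stands in for the cost of constructing a JobConf; the returned
// string stands in for FileInputFormat.getInputPathFilter(jobConf).
var jobConfsBuilt = 0
def makeJobConf(): String = {
  jobConfsBuilt += 1
  "pathFilter"
}

val paths = Seq("/table/p=1", "/table/p=2", "/table/p=3")

// Before the patch the construction sat inside paths.flatMap { ... },
// running once per path. Hoisting it above the loop runs it exactly once:
val pathFilter = makeJobConf()
val listed = paths.map(p => s"listing $p with $pathFilter")
```

The same once-per-task idea applies regardless of the object being built; here it turns O(number of paths) JobConf constructions into one.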
spark git commit: [SPARK-15307][SQL] speed up listing files for data source
Repository: spark Updated Branches: refs/heads/master 6e02aec44 -> 33814f887 [SPARK-15307][SQL] speed up listing files for data source ## What changes were proposed in this pull request? Currently, listing files is very slow when there are thousands of files, especially on a local file system, because: 1) FileStatus.getPermission() is very slow on a local file system, since it launches a subprocess and parses the stdout. 2) Creating a JobConf is very expensive (ClassUtil.findContainingJar() is slow). This PR improves this by: 1) Using another constructor of LocatedFileStatus to avoid calling FileStatus.getPermission(); the permissions are not used for data sources. 2) Creating the JobConf only once per task. ## How was this patch tested? Manual tests on a partitioned table with 1828 partitions decreased the time to load the table from 22 seconds to 1.6 seconds (most of the time is now spent merging schemas). Author: Davies Liu <dav...@databricks.com> Closes #13094 from davies/listing. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/33814f88 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/33814f88 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/33814f88 Branch: refs/heads/master Commit: 33814f887aea339c99e14ce7f14ca6fcc6875015 Parents: 6e02aec Author: Davies Liu <dav...@databricks.com> Authored: Wed May 18 18:46:57 2016 +0800 Committer: Cheng Lian <l...@databricks.com> Committed: Wed May 18 18:46:57 2016 +0800 -- .../datasources/ListingFileCatalog.scala| 9 ++--- .../datasources/fileSourceInterfaces.scala | 38 ++-- 2 files changed, 33 insertions(+), 14 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/33814f88/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/ListingFileCatalog.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/ListingFileCatalog.scala
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/ListingFileCatalog.scala index 5cee2b9..644e5d6 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/ListingFileCatalog.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/ListingFileCatalog.scala @@ -77,12 +77,12 @@ class ListingFileCatalog( if (paths.length >= sparkSession.sessionState.conf.parallelPartitionDiscoveryThreshold) { HadoopFsRelation.listLeafFilesInParallel(paths, hadoopConf, sparkSession.sparkContext) } else { + // Dummy jobconf to get to the pathFilter defined in configuration + val jobConf = new JobConf(hadoopConf, this.getClass) + val pathFilter = FileInputFormat.getInputPathFilter(jobConf) val statuses: Seq[FileStatus] = paths.flatMap { path => val fs = path.getFileSystem(hadoopConf) logInfo(s"Listing $path on driver") -// Dummy jobconf to get to the pathFilter defined in configuration -val jobConf = new JobConf(hadoopConf, this.getClass) -val pathFilter = FileInputFormat.getInputPathFilter(jobConf) val statuses = { val stats = Try(fs.listStatus(path)).getOrElse(Array.empty[FileStatus]) @@ -101,7 +101,8 @@ class ListingFileCatalog( // - Here we are calling `getFileBlockLocations` in a sequential manner, but it should a // a big deal since we always use to `listLeafFilesInParallel` when the number of paths // exceeds threshold. 
- case f => new LocatedFileStatus(f, fs.getFileBlockLocations(f, 0, f.getLen)) + case f => +HadoopFsRelation.createLocatedFileStatus(f, fs.getFileBlockLocations(f, 0, f.getLen)) } }.filterNot { status => val name = status.getPath.getName http://git-wip-us.apache.org/repos/asf/spark/blob/33814f88/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/fileSourceInterfaces.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/fileSourceInterfaces.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/fileSourceInterfaces.scala index b516297..8d332df 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/fileSourceInterfaces.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/fileSourceInterfaces.scala @@ -348,28 +348,40 @@ private[sql] object HadoopFsRelation extends Logging { pathName == "_SUCCESS" || pathName == "_temporary" || pathName.startsWith(".") } + /** + * Create a LocatedFileStatus using FileStatus and block loca
[3/3] spark git commit: [SPARK-15171][SQL] Remove the references to deprecated method dataset.registerTempTable
[SPARK-15171][SQL] Remove the references to deprecated method dataset.registerTempTable ## What changes were proposed in this pull request? Update the unit test code, examples, and documents to remove calls to deprecated method `dataset.registerTempTable`. ## How was this patch tested? This PR only changes the unit test code, examples, and comments. It should be safe. This is a follow up of PR https://github.com/apache/spark/pull/12945 which was merged. Author: Sean Zhong <seanzh...@databricks.com> Closes #13098 from clockfly/spark-15171-remove-deprecation. (cherry picked from commit 25b315e6cad7c27b62dcaa2c194293c1115fdfb3) Signed-off-by: Cheng Lian <l...@databricks.com> Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/5f5270ea Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/5f5270ea Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/5f5270ea Branch: refs/heads/branch-2.0 Commit: 5f5270ead86d5294af6c871e36112e2a833e9d64 Parents: 1db3741 Author: Sean Zhong <seanzh...@databricks.com> Authored: Wed May 18 09:01:59 2016 +0800 Committer: Cheng Lian <l...@databricks.com> Committed: Wed May 18 09:05:34 2016 +0800 -- docs/sql-programming-guide.md | 48 - docs/streaming-programming-guide.md | 12 +-- .../apache/spark/examples/sql/JavaSparkSQL.java | 8 +- .../streaming/JavaSqlNetworkWordCount.java | 2 +- examples/src/main/python/sql.py | 2 +- .../python/streaming/sql_network_wordcount.py | 2 +- .../apache/spark/examples/sql/RDDRelation.scala | 6 +- .../spark/examples/sql/hive/HiveFromSpark.scala | 4 +- .../streaming/SqlNetworkWordCount.scala | 2 +- .../org/apache/spark/ml/JavaPipelineSuite.java | 2 +- .../JavaLogisticRegressionSuite.java| 10 +- .../regression/JavaLinearRegressionSuite.java | 4 +- python/pyspark/sql/context.py | 4 +- python/pyspark/sql/readwriter.py| 2 +- python/pyspark/sql/session.py | 2 +- python/pyspark/sql/tests.py | 25 ++--- .../scala/org/apache/spark/sql/SQLContext.scala | 
2 +- .../apache/spark/sql/JavaApplySchemaSuite.java | 8 +- .../spark/sql/sources/JavaSaveLoadSuite.java| 2 +- .../org/apache/spark/sql/CachedTableSuite.scala | 60 +-- .../spark/sql/ColumnExpressionSuite.scala | 2 +- .../spark/sql/DataFrameTimeWindowingSuite.scala | 2 +- .../apache/spark/sql/DataFrameWindowSuite.scala | 22 ++-- .../scala/org/apache/spark/sql/JoinSuite.scala | 4 +- .../org/apache/spark/sql/ListTablesSuite.scala | 4 +- .../org/apache/spark/sql/SQLContextSuite.scala | 2 +- .../org/apache/spark/sql/SQLQuerySuite.scala| 103 ++- .../sql/ScalaReflectionRelationSuite.scala | 10 +- .../org/apache/spark/sql/SubquerySuite.scala| 8 +- .../scala/org/apache/spark/sql/UDFSuite.scala | 12 +-- .../apache/spark/sql/UserDefinedTypeSuite.scala | 2 +- .../spark/sql/execution/PlannerSuite.scala | 10 +- .../benchmark/AggregateBenchmark.scala | 3 +- .../columnar/InMemoryColumnarQuerySuite.scala | 8 +- .../columnar/PartitionBatchPruningSuite.scala | 2 +- .../execution/datasources/json/JsonSuite.scala | 58 +-- .../ParquetPartitionDiscoverySuite.scala| 10 +- .../datasources/parquet/ParquetQuerySuite.scala | 4 +- .../parquet/ParquetReadBenchmark.scala | 20 ++-- .../datasources/parquet/TPCDSBenchmark.scala| 2 +- .../sql/execution/metric/SQLMetricsSuite.scala | 8 +- .../org/apache/spark/sql/jdbc/JDBCSuite.scala | 2 +- .../sql/sources/CreateTableAsSelectSuite.scala | 2 +- .../apache/spark/sql/sources/InsertSuite.scala | 6 +- .../spark/sql/sources/SaveLoadSuite.scala | 4 +- .../spark/sql/streaming/StreamSuite.scala | 2 +- .../org/apache/spark/sql/test/SQLTestData.scala | 46 - .../spark/sql/hive/JavaDataFrameSuite.java | 2 +- .../sql/hive/JavaMetastoreDataSourcesSuite.java | 2 +- .../spark/sql/hive/ErrorPositionSuite.scala | 4 +- .../spark/sql/hive/HiveParquetSuite.scala | 4 +- .../spark/sql/hive/HiveSparkSubmitSuite.scala | 8 +- .../sql/hive/InsertIntoHiveTableSuite.scala | 12 +-- .../sql/hive/MetastoreDataSourcesSuite.scala| 8 +- .../hive/ParquetHiveCompatibilitySuite.scala| 2 +- 
.../spark/sql/hive/QueryPartitionSuite.scala| 2 +- .../apache/spark/sql/hive/StatisticsSuite.scala | 2 +- .../org/apache/spark/sql/hive/UDFSuite.scala| 2 +- .../hive/execution/AggregationQuerySuite.scala | 8 +- .../sql/hive/execution/HiveExplainSuite.scala | 2 +- .../execution/HiveOperatorQueryableSuite.scala | 4 +- .../spark/sql/hive/execution/HivePlanTest.scala | 2 +- .../sql/hive/execut
[2/3] spark git commit: [SPARK-15171][SQL] Remove the references to deprecated method dataset.registerTempTable
http://git-wip-us.apache.org/repos/asf/spark/blob/5f5270ea/sql/core/src/test/scala/org/apache/spark/sql/execution/PlannerSuite.scala -- diff --git a/sql/core/src/test/scala/org/apache/spark/sql/execution/PlannerSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/execution/PlannerSuite.scala index d2e1ea1..2a5295d 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/execution/PlannerSuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/execution/PlannerSuite.scala @@ -78,7 +78,7 @@ class PlannerSuite extends SharedSQLContext { val schema = StructType(fields) val row = Row.fromSeq(Seq.fill(fields.size)(null)) val rowRDD = sparkContext.parallelize(row :: Nil) -spark.createDataFrame(rowRDD, schema).registerTempTable("testLimit") +spark.createDataFrame(rowRDD, schema).createOrReplaceTempView("testLimit") val planned = sql( """ @@ -132,7 +132,7 @@ class PlannerSuite extends SharedSQLContext { test("InMemoryRelation statistics propagation") { withSQLConf(SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "81920") { withTempTable("tiny") { -testData.limit(3).registerTempTable("tiny") +testData.limit(3).createOrReplaceTempView("tiny") sql("CACHE TABLE tiny") val a = testData.as("a") @@ -199,9 +199,9 @@ class PlannerSuite extends SharedSQLContext { test("PartitioningCollection") { withTempTable("normal", "small", "tiny") { - testData.registerTempTable("normal") - testData.limit(10).registerTempTable("small") - testData.limit(3).registerTempTable("tiny") + testData.createOrReplaceTempView("normal") + testData.limit(10).createOrReplaceTempView("small") + testData.limit(3).createOrReplaceTempView("tiny") // Disable broadcast join withSQLConf(SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "-1") { http://git-wip-us.apache.org/repos/asf/spark/blob/5f5270ea/sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/AggregateBenchmark.scala -- diff --git a/sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/AggregateBenchmark.scala 
b/sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/AggregateBenchmark.scala index b31338e..bf3a39c 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/AggregateBenchmark.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/AggregateBenchmark.scala @@ -134,7 +134,8 @@ class AggregateBenchmark extends BenchmarkBase { val N = 20 << 22 val benchmark = new Benchmark("Aggregate w keys", N) -sparkSession.range(N).selectExpr("id", "floor(rand() * 1) as k").registerTempTable("test") +sparkSession.range(N).selectExpr("id", "floor(rand() * 1) as k") + .createOrReplaceTempView("test") def f(): Unit = sparkSession.sql("select k, k, sum(id) from test group by k, k").collect() http://git-wip-us.apache.org/repos/asf/spark/blob/5f5270ea/sql/core/src/test/scala/org/apache/spark/sql/execution/columnar/InMemoryColumnarQuerySuite.scala -- diff --git a/sql/core/src/test/scala/org/apache/spark/sql/execution/columnar/InMemoryColumnarQuerySuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/execution/columnar/InMemoryColumnarQuerySuite.scala index 2099d4e..e2fb913 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/execution/columnar/InMemoryColumnarQuerySuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/execution/columnar/InMemoryColumnarQuerySuite.scala @@ -42,7 +42,7 @@ class InMemoryColumnarQuerySuite extends QueryTest with SharedSQLContext { test("default size avoids broadcast") { // TODO: Improve this test when we have better statistics sparkContext.parallelize(1 to 10).map(i => TestData(i, i.toString)) - .toDF().registerTempTable("sizeTst") + .toDF().createOrReplaceTempView("sizeTst") spark.catalog.cacheTable("sizeTst") assert( spark.table("sizeTst").queryExecution.analyzed.statistics.sizeInBytes > @@ -92,7 +92,7 @@ class InMemoryColumnarQuerySuite extends QueryTest with SharedSQLContext { test("SPARK-2729 regression: timestamp data type") { val timestamps = (0 to 3).map(i => Tuple1(new 
Timestamp(i))).toDF("time") -timestamps.registerTempTable("timestamps") +timestamps.createOrReplaceTempView("timestamps") checkAnswer( sql("SELECT time FROM timestamps"), @@ -133,7 +133,7 @@ class InMemoryColumnarQuerySuite extends QueryTest with SharedSQLContext { assert(df.schema.head.dataType === DecimalType(15, 10)) -df.cache().registerTempTable("test_fixed_decimal") +df.cache().createOrReplaceTempView("test_fixed_decimal") checkAnswer( sql("SELECT * FROM test_fixed_decimal"),
[2/3] spark git commit: [SPARK-15171][SQL] Remove the references to deprecated method dataset.registerTempTable
http://git-wip-us.apache.org/repos/asf/spark/blob/25b315e6/sql/core/src/test/scala/org/apache/spark/sql/execution/PlannerSuite.scala -- diff --git a/sql/core/src/test/scala/org/apache/spark/sql/execution/PlannerSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/execution/PlannerSuite.scala index d2e1ea1..2a5295d 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/execution/PlannerSuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/execution/PlannerSuite.scala @@ -78,7 +78,7 @@ class PlannerSuite extends SharedSQLContext { val schema = StructType(fields) val row = Row.fromSeq(Seq.fill(fields.size)(null)) val rowRDD = sparkContext.parallelize(row :: Nil) -spark.createDataFrame(rowRDD, schema).registerTempTable("testLimit") +spark.createDataFrame(rowRDD, schema).createOrReplaceTempView("testLimit") val planned = sql( """ @@ -132,7 +132,7 @@ class PlannerSuite extends SharedSQLContext { test("InMemoryRelation statistics propagation") { withSQLConf(SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "81920") { withTempTable("tiny") { -testData.limit(3).registerTempTable("tiny") +testData.limit(3).createOrReplaceTempView("tiny") sql("CACHE TABLE tiny") val a = testData.as("a") @@ -199,9 +199,9 @@ class PlannerSuite extends SharedSQLContext { test("PartitioningCollection") { withTempTable("normal", "small", "tiny") { - testData.registerTempTable("normal") - testData.limit(10).registerTempTable("small") - testData.limit(3).registerTempTable("tiny") + testData.createOrReplaceTempView("normal") + testData.limit(10).createOrReplaceTempView("small") + testData.limit(3).createOrReplaceTempView("tiny") // Disable broadcast join withSQLConf(SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "-1") { http://git-wip-us.apache.org/repos/asf/spark/blob/25b315e6/sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/AggregateBenchmark.scala -- diff --git a/sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/AggregateBenchmark.scala 
b/sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/AggregateBenchmark.scala index b31338e..bf3a39c 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/AggregateBenchmark.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/AggregateBenchmark.scala @@ -134,7 +134,8 @@ class AggregateBenchmark extends BenchmarkBase { val N = 20 << 22 val benchmark = new Benchmark("Aggregate w keys", N) -sparkSession.range(N).selectExpr("id", "floor(rand() * 1) as k").registerTempTable("test") +sparkSession.range(N).selectExpr("id", "floor(rand() * 1) as k") + .createOrReplaceTempView("test") def f(): Unit = sparkSession.sql("select k, k, sum(id) from test group by k, k").collect() http://git-wip-us.apache.org/repos/asf/spark/blob/25b315e6/sql/core/src/test/scala/org/apache/spark/sql/execution/columnar/InMemoryColumnarQuerySuite.scala -- diff --git a/sql/core/src/test/scala/org/apache/spark/sql/execution/columnar/InMemoryColumnarQuerySuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/execution/columnar/InMemoryColumnarQuerySuite.scala index 2099d4e..e2fb913 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/execution/columnar/InMemoryColumnarQuerySuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/execution/columnar/InMemoryColumnarQuerySuite.scala @@ -42,7 +42,7 @@ class InMemoryColumnarQuerySuite extends QueryTest with SharedSQLContext { test("default size avoids broadcast") { // TODO: Improve this test when we have better statistics sparkContext.parallelize(1 to 10).map(i => TestData(i, i.toString)) - .toDF().registerTempTable("sizeTst") + .toDF().createOrReplaceTempView("sizeTst") spark.catalog.cacheTable("sizeTst") assert( spark.table("sizeTst").queryExecution.analyzed.statistics.sizeInBytes > @@ -92,7 +92,7 @@ class InMemoryColumnarQuerySuite extends QueryTest with SharedSQLContext { test("SPARK-2729 regression: timestamp data type") { val timestamps = (0 to 3).map(i => Tuple1(new 
Timestamp(i))).toDF("time") -timestamps.registerTempTable("timestamps") +timestamps.createOrReplaceTempView("timestamps") checkAnswer( sql("SELECT time FROM timestamps"), @@ -133,7 +133,7 @@ class InMemoryColumnarQuerySuite extends QueryTest with SharedSQLContext { assert(df.schema.head.dataType === DecimalType(15, 10)) -df.cache().registerTempTable("test_fixed_decimal") +df.cache().createOrReplaceTempView("test_fixed_decimal") checkAnswer( sql("SELECT * FROM test_fixed_decimal"),
[3/3] spark git commit: [SPARK-15171][SQL] Remove the references to deprecated method dataset.registerTempTable
[SPARK-15171][SQL] Remove the references to deprecated method dataset.registerTempTable ## What changes were proposed in this pull request? Update the unit test code, examples, and documents to remove calls to deprecated method `dataset.registerTempTable`. ## How was this patch tested? This PR only changes the unit test code, examples, and comments. It should be safe. This is a follow up of PR https://github.com/apache/spark/pull/12945 which was merged. Author: Sean Zhong <seanzh...@databricks.com> Closes #13098 from clockfly/spark-15171-remove-deprecation. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/25b315e6 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/25b315e6 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/25b315e6 Branch: refs/heads/master Commit: 25b315e6cad7c27b62dcaa2c194293c1115fdfb3 Parents: b674e67 Author: Sean Zhong <seanzh...@databricks.com> Authored: Wed May 18 09:01:59 2016 +0800 Committer: Cheng Lian <l...@databricks.com> Committed: Wed May 18 09:01:59 2016 +0800 -- docs/sql-programming-guide.md | 48 - docs/streaming-programming-guide.md | 12 +-- .../apache/spark/examples/sql/JavaSparkSQL.java | 8 +- .../streaming/JavaSqlNetworkWordCount.java | 2 +- examples/src/main/python/sql.py | 2 +- .../python/streaming/sql_network_wordcount.py | 2 +- .../apache/spark/examples/sql/RDDRelation.scala | 6 +- .../spark/examples/sql/hive/HiveFromSpark.scala | 4 +- .../streaming/SqlNetworkWordCount.scala | 2 +- .../org/apache/spark/ml/JavaPipelineSuite.java | 2 +- .../JavaLogisticRegressionSuite.java| 10 +- .../regression/JavaLinearRegressionSuite.java | 4 +- python/pyspark/sql/context.py | 4 +- python/pyspark/sql/readwriter.py| 2 +- python/pyspark/sql/session.py | 2 +- python/pyspark/sql/tests.py | 25 ++--- .../scala/org/apache/spark/sql/SQLContext.scala | 2 +- .../apache/spark/sql/JavaApplySchemaSuite.java | 8 +- .../spark/sql/sources/JavaSaveLoadSuite.java| 2 +- 
.../org/apache/spark/sql/CachedTableSuite.scala | 60 +-- .../spark/sql/ColumnExpressionSuite.scala | 2 +- .../spark/sql/DataFrameTimeWindowingSuite.scala | 2 +- .../apache/spark/sql/DataFrameWindowSuite.scala | 22 ++-- .../scala/org/apache/spark/sql/JoinSuite.scala | 4 +- .../org/apache/spark/sql/ListTablesSuite.scala | 4 +- .../org/apache/spark/sql/SQLContextSuite.scala | 2 +- .../org/apache/spark/sql/SQLQuerySuite.scala| 103 ++- .../sql/ScalaReflectionRelationSuite.scala | 10 +- .../org/apache/spark/sql/SubquerySuite.scala| 8 +- .../scala/org/apache/spark/sql/UDFSuite.scala | 12 +-- .../apache/spark/sql/UserDefinedTypeSuite.scala | 2 +- .../spark/sql/execution/PlannerSuite.scala | 10 +- .../benchmark/AggregateBenchmark.scala | 3 +- .../columnar/InMemoryColumnarQuerySuite.scala | 8 +- .../columnar/PartitionBatchPruningSuite.scala | 2 +- .../execution/datasources/json/JsonSuite.scala | 58 +-- .../ParquetPartitionDiscoverySuite.scala| 10 +- .../datasources/parquet/ParquetQuerySuite.scala | 4 +- .../parquet/ParquetReadBenchmark.scala | 20 ++-- .../datasources/parquet/TPCDSBenchmark.scala| 2 +- .../sql/execution/metric/SQLMetricsSuite.scala | 8 +- .../org/apache/spark/sql/jdbc/JDBCSuite.scala | 2 +- .../sql/sources/CreateTableAsSelectSuite.scala | 2 +- .../apache/spark/sql/sources/InsertSuite.scala | 6 +- .../spark/sql/sources/SaveLoadSuite.scala | 4 +- .../spark/sql/streaming/StreamSuite.scala | 2 +- .../org/apache/spark/sql/test/SQLTestData.scala | 46 - .../spark/sql/hive/JavaDataFrameSuite.java | 2 +- .../sql/hive/JavaMetastoreDataSourcesSuite.java | 2 +- .../spark/sql/hive/ErrorPositionSuite.scala | 4 +- .../spark/sql/hive/HiveParquetSuite.scala | 4 +- .../spark/sql/hive/HiveSparkSubmitSuite.scala | 8 +- .../sql/hive/InsertIntoHiveTableSuite.scala | 12 +-- .../sql/hive/MetastoreDataSourcesSuite.scala| 8 +- .../hive/ParquetHiveCompatibilitySuite.scala| 2 +- .../spark/sql/hive/QueryPartitionSuite.scala| 2 +- .../apache/spark/sql/hive/StatisticsSuite.scala | 2 +- 
.../org/apache/spark/sql/hive/UDFSuite.scala| 2 +- .../hive/execution/AggregationQuerySuite.scala | 8 +- .../sql/hive/execution/HiveExplainSuite.scala | 2 +- .../execution/HiveOperatorQueryableSuite.scala | 4 +- .../spark/sql/hive/execution/HivePlanTest.scala | 2 +- .../sql/hive/execution/HiveQuerySuite.scala | 18 ++-- .../hive/execution/HiveResolutionSuite.scala| 10 +- .../sql/hive/execution/HiveTa
[1/3] spark git commit: [SPARK-15171][SQL] Remove the references to deprecated method dataset.registerTempTable
Repository: spark Updated Branches: refs/heads/master b674e67c2 -> 25b315e6c http://git-wip-us.apache.org/repos/asf/spark/blob/25b315e6/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala -- diff --git a/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala b/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala index ac9a393..81f3ea8 100644 --- a/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala +++ b/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala @@ -102,14 +102,14 @@ class SQLQuerySuite extends QueryTest with SQLTestUtils with TestHiveSingleton { test("SPARK-6835: udtf in lateral view") { val df = Seq((1, 1)).toDF("c1", "c2") -df.registerTempTable("table1") +df.createOrReplaceTempView("table1") val query = sql("SELECT c1, v FROM table1 LATERAL VIEW stack(3, 1, c1 + 1, c1 + 2) d AS v") checkAnswer(query, Row(1, 1) :: Row(1, 2) :: Row(1, 3) :: Nil) } test("SPARK-13651: generator outputs shouldn't be resolved from its child's output") { withTempTable("src") { - Seq(("id1", "value1")).toDF("key", "value").registerTempTable("src") + Seq(("id1", "value1")).toDF("key", "value").createOrReplaceTempView("src") val query = sql("SELECT genoutput.* FROM src " + "LATERAL VIEW explode(map('key1', 100, 'key2', 200)) genoutput AS key, value") @@ -135,8 +135,8 @@ class SQLQuerySuite extends QueryTest with SQLTestUtils with TestHiveSingleton { Order(1, "Atlas", "MTB", 434, "2015-01-07", "John D", "Pacifica", "CA", 20151), Order(11, "Swift", "YFlikr", 137, "2015-01-23", "John D", "Hayward", "CA", 20151)) -orders.toDF.registerTempTable("orders1") -orderUpdates.toDF.registerTempTable("orderupdates1") +orders.toDF.createOrReplaceTempView("orders1") +orderUpdates.toDF.createOrReplaceTempView("orderupdates1") sql( """CREATE TABLE orders( @@ -305,7 +305,7 @@ class SQLQuerySuite extends QueryTest with SQLTestUtils with TestHiveSingleton { 
test("SPARK-5371: union with null and sum") { val df = Seq((1, 1)).toDF("c1", "c2") -df.registerTempTable("table1") +df.createOrReplaceTempView("table1") val query = sql( """ @@ -329,7 +329,7 @@ class SQLQuerySuite extends QueryTest with SQLTestUtils with TestHiveSingleton { test("CTAS with WITH clause") { val df = Seq((1, 1)).toDF("c1", "c2") -df.registerTempTable("table1") +df.createOrReplaceTempView("table1") sql( """ @@ -346,7 +346,7 @@ class SQLQuerySuite extends QueryTest with SQLTestUtils with TestHiveSingleton { } test("explode nested Field") { -Seq(NestedArray1(NestedArray2(Seq(1, 2, 3.toDF.registerTempTable("nestedArray") +Seq(NestedArray1(NestedArray2(Seq(1, 2, 3.toDF.createOrReplaceTempView("nestedArray") checkAnswer( sql("SELECT ints FROM nestedArray LATERAL VIEW explode(a.b) a AS ints"), Row(1) :: Row(2) :: Row(3) :: Nil) @@ -543,7 +543,7 @@ class SQLQuerySuite extends QueryTest with SQLTestUtils with TestHiveSingleton { } test("specifying the column list for CTAS") { -Seq((1, "11"), (2, "22")).toDF("key", "value").registerTempTable("mytable1") +Seq((1, "11"), (2, "22")).toDF("key", "value").createOrReplaceTempView("mytable1") sql("create table gen__tmp(a int, b string) as select key, value from mytable1") checkAnswer( @@ -598,7 +598,7 @@ class SQLQuerySuite extends QueryTest with SQLTestUtils with TestHiveSingleton { test("double nested data") { sparkContext.parallelize(Nested1(Nested2(Nested3(1))) :: Nil) - .toDF().registerTempTable("nested") + .toDF().createOrReplaceTempView("nested") checkAnswer( sql("SELECT f1.f2.f3 FROM nested"), Row(1)) @@ -682,7 +682,7 @@ class SQLQuerySuite extends QueryTest with SQLTestUtils with TestHiveSingleton { test("SPARK-4963 DataFrame sample on mutable row return wrong result") { sql("SELECT * FROM src WHERE key % 2 = 0") .sample(withReplacement = false, fraction = 0.3) - .registerTempTable("sampled") + .createOrReplaceTempView("sampled") (1 to 10).foreach { i => checkAnswer( sql("SELECT * FROM sampled WHERE key % 2 
= 1"), @@ -707,7 +707,7 @@ class SQLQuerySuite extends QueryTest with SQLTestUtils with TestHiveSingleton { val rowRdd = sparkContext.parallelize(row :: Nil) -hiveContext.createDataFrame(rowRdd, schema).registerTempTable("testTable") +hiveContext.createDataFrame(rowRdd, schema).createOrReplaceTempView("testTable") sql( """CREATE TABLE nullValuesInInnerComplexTypes @@ -733,14 +733,14 @@ class SQLQuerySuite extends QueryTest with SQLTestUtils with TestHiveSingleton { test("SPARK-4296 Grouping field with Hive
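The migrations above swap `registerTempTable` for `createOrReplaceTempView` mechanically, because the two have the same overwrite semantics; only the sibling `createTempView` changes behavior by failing on a name collision. A hedged toy model of that distinction — this is an illustrative in-memory registry, not Spark's catalog implementation:

```scala
// Hedged toy model of temp-view semantics; Spark's real session catalog
// is far richer than this.
import scala.collection.mutable

class ToyViewCatalog {
  private val views = mutable.Map.empty[String, String]

  /** Fails if a view with this name already exists (the new strict variant). */
  def createTempView(name: String, plan: String): Unit = {
    require(!views.contains(name), s"Temp view $name already exists")
    views(name) = plan
  }

  /** Overwrites silently, matching the old registerTempTable behavior. */
  def createOrReplaceTempView(name: String, plan: String): Unit =
    views(name) = plan

  def lookup(name: String): Option[String] = views.get(name)
}
```

Because the replace variant never fails, test suites like `SQLQuerySuite` can be rewritten line-for-line without changing what each test exercises.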
spark git commit: [SPARK-15171][SQL] Deprecate registerTempTable and add dataset.createTempView
Repository: spark Updated Branches: refs/heads/branch-2.0 6b69b8c0c -> 9098b1a17 [SPARK-15171][SQL] Deprecate registerTempTable and add dataset.createTempView ## What changes were proposed in this pull request? Deprecates registerTempTable and adds dataset.createTempView and dataset.createOrReplaceTempView. ## How was this patch tested? Unit tests. Author: Sean Zhong <seanzh...@databricks.com> Closes #12945 from clockfly/spark-15171. (cherry picked from commit 33c6eb5218ce3c31cc9f632a67fd2c7057569683) Signed-off-by: Cheng Lian <l...@databricks.com> Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/9098b1a1 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/9098b1a1 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/9098b1a1 Branch: refs/heads/branch-2.0 Commit: 9098b1a1746d9affd894cb44ed169f3cf3566b14 Parents: 6b69b8c Author: Sean Zhong <seanzh...@databricks.com> Authored: Thu May 12 15:51:53 2016 +0800 Committer: Cheng Lian <l...@databricks.com> Committed: Thu May 12 15:54:33 2016 +0800 -- .../sbt_app_sql/src/main/scala/SqlApp.scala | 4 +- .../apache/spark/examples/sql/JavaSparkSQL.java | 8 +-- .../streaming/JavaSqlNetworkWordCount.java | 2 +- examples/src/main/python/sql.py | 2 +- .../python/streaming/sql_network_wordcount.py | 2 +- .../apache/spark/examples/sql/RDDRelation.scala | 4 +- .../streaming/SqlNetworkWordCount.scala | 2 +- .../spark/ml/feature/SQLTransformer.scala | 5 +- python/pyspark/sql/catalog.py | 26 +++--- python/pyspark/sql/context.py | 4 +- python/pyspark/sql/dataframe.py | 51 ++-- python/pyspark/sql/session.py | 6 +-- .../analysis/DistinctAggregationRewriter.scala | 2 +- .../sql/catalyst/catalog/SessionCatalog.scala | 2 +- .../sql/catalyst/analysis/AnalysisTest.scala| 2 +- .../analysis/DecimalPrecisionSuite.scala| 2 +- .../catalyst/catalog/SessionCatalogSuite.scala | 26 +- .../org/apache/spark/sql/DataFrameWriter.scala | 2 +- .../scala/org/apache/spark/sql/Dataset.scala| 
30 +++- .../scala/org/apache/spark/sql/SQLContext.scala | 4 +- .../org/apache/spark/sql/SparkSession.scala | 17 +++ .../org/apache/spark/sql/catalog/Catalog.scala | 8 +-- .../spark/sql/execution/SparkSqlParser.scala| 2 +- .../spark/sql/execution/command/cache.scala | 3 +- .../spark/sql/execution/command/views.scala | 2 +- .../spark/sql/execution/datasources/ddl.scala | 4 +- .../apache/spark/sql/internal/CatalogImpl.scala | 12 ++--- .../org/apache/spark/sql/CachedTableSuite.scala | 6 +-- .../org/apache/spark/sql/DataFrameSuite.scala | 19 .../spark/sql/DataFrameTimeWindowingSuite.scala | 2 +- .../org/apache/spark/sql/DatasetSuite.scala | 18 +++ .../org/apache/spark/sql/ListTablesSuite.scala | 2 +- .../org/apache/spark/sql/SQLQuerySuite.scala| 6 +-- .../scala/org/apache/spark/sql/UDFSuite.scala | 4 +- .../columnar/InMemoryColumnarQuerySuite.scala | 2 +- .../parquet/ParquetReadBenchmark.scala | 2 +- .../spark/sql/internal/CatalogSuite.scala | 2 +- .../apache/spark/sql/test/SQLTestUtils.scala| 2 +- .../spark/sql/hive/HiveMetastoreCatalog.scala | 2 +- .../spark/sql/hive/ErrorPositionSuite.scala | 6 +-- .../spark/sql/hive/HiveDDLCommandSuite.scala| 2 +- .../apache/spark/sql/hive/ListTablesSuite.scala | 2 +- .../hive/execution/AggregationQuerySuite.scala | 2 +- .../spark/sql/hive/execution/HiveUDFSuite.scala | 2 +- .../hive/execution/SQLWindowFunctionSuite.scala | 2 +- 45 files changed, 197 insertions(+), 120 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/9098b1a1/dev/audit-release/sbt_app_sql/src/main/scala/SqlApp.scala -- diff --git a/dev/audit-release/sbt_app_sql/src/main/scala/SqlApp.scala b/dev/audit-release/sbt_app_sql/src/main/scala/SqlApp.scala index 69c1154..1002631 100644 --- a/dev/audit-release/sbt_app_sql/src/main/scala/SqlApp.scala +++ b/dev/audit-release/sbt_app_sql/src/main/scala/SqlApp.scala @@ -41,7 +41,7 @@ object SparkSqlExample { import sqlContext._ val people = sc.makeRDD(1 to 100, 10).map(x => Person(s"Name$x", x)).toDF() 
-people.registerTempTable("people") +people.createOrReplaceTempView("people") val teenagers = sql("SELECT name FROM people WHERE age >= 13 AND age <= 19") val teenagerNames = teenagers.map(t => "Name: " + t(0)).collect() teenagerNames.foreach(println) @@ -52,7 +52,7 @@ object SparkSqlExample { System.exit(-1) }
spark git commit: [SPARK-15171][SQL] Deprecate registerTempTable and add dataset.createTempView
Repository: spark Updated Branches: refs/heads/master 5207a005c -> 33c6eb521 [SPARK-15171][SQL] Deprecate registerTempTable and add dataset.createTempView ## What changes were proposed in this pull request? Deprecates registerTempTable and add dataset.createTempView, dataset.createOrReplaceTempView. ## How was this patch tested? Unit tests. Author: Sean Zhong <seanzh...@databricks.com> Closes #12945 from clockfly/spark-15171. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/33c6eb52 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/33c6eb52 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/33c6eb52 Branch: refs/heads/master Commit: 33c6eb5218ce3c31cc9f632a67fd2c7057569683 Parents: 5207a00 Author: Sean Zhong <seanzh...@databricks.com> Authored: Thu May 12 15:51:53 2016 +0800 Committer: Cheng Lian <l...@databricks.com> Committed: Thu May 12 15:51:53 2016 +0800 -- .../sbt_app_sql/src/main/scala/SqlApp.scala | 4 +- .../apache/spark/examples/sql/JavaSparkSQL.java | 8 +-- .../streaming/JavaSqlNetworkWordCount.java | 2 +- examples/src/main/python/sql.py | 2 +- .../python/streaming/sql_network_wordcount.py | 2 +- .../apache/spark/examples/sql/RDDRelation.scala | 4 +- .../streaming/SqlNetworkWordCount.scala | 2 +- .../spark/ml/feature/SQLTransformer.scala | 5 +- python/pyspark/sql/catalog.py | 26 +++--- python/pyspark/sql/context.py | 4 +- python/pyspark/sql/dataframe.py | 51 ++-- python/pyspark/sql/session.py | 6 +-- .../analysis/DistinctAggregationRewriter.scala | 2 +- .../sql/catalyst/catalog/SessionCatalog.scala | 2 +- .../sql/catalyst/analysis/AnalysisTest.scala| 2 +- .../analysis/DecimalPrecisionSuite.scala| 2 +- .../catalyst/catalog/SessionCatalogSuite.scala | 26 +- .../org/apache/spark/sql/DataFrameWriter.scala | 2 +- .../scala/org/apache/spark/sql/Dataset.scala| 30 +++- .../scala/org/apache/spark/sql/SQLContext.scala | 4 +- .../org/apache/spark/sql/SparkSession.scala | 17 +++ 
.../org/apache/spark/sql/catalog/Catalog.scala | 8 +-- .../spark/sql/execution/SparkSqlParser.scala| 2 +- .../spark/sql/execution/command/cache.scala | 3 +- .../spark/sql/execution/command/views.scala | 2 +- .../spark/sql/execution/datasources/ddl.scala | 4 +- .../apache/spark/sql/internal/CatalogImpl.scala | 12 ++--- .../org/apache/spark/sql/CachedTableSuite.scala | 6 +-- .../org/apache/spark/sql/DataFrameSuite.scala | 19 .../spark/sql/DataFrameTimeWindowingSuite.scala | 2 +- .../org/apache/spark/sql/DatasetSuite.scala | 18 +++ .../org/apache/spark/sql/ListTablesSuite.scala | 2 +- .../org/apache/spark/sql/SQLQuerySuite.scala| 6 +-- .../scala/org/apache/spark/sql/UDFSuite.scala | 4 +- .../columnar/InMemoryColumnarQuerySuite.scala | 2 +- .../parquet/ParquetReadBenchmark.scala | 2 +- .../spark/sql/internal/CatalogSuite.scala | 2 +- .../apache/spark/sql/test/SQLTestUtils.scala| 2 +- .../spark/sql/hive/HiveMetastoreCatalog.scala | 2 +- .../spark/sql/hive/ErrorPositionSuite.scala | 6 +-- .../spark/sql/hive/HiveDDLCommandSuite.scala| 2 +- .../apache/spark/sql/hive/ListTablesSuite.scala | 2 +- .../hive/execution/AggregationQuerySuite.scala | 2 +- .../spark/sql/hive/execution/HiveUDFSuite.scala | 2 +- .../hive/execution/SQLWindowFunctionSuite.scala | 2 +- 45 files changed, 197 insertions(+), 120 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/33c6eb52/dev/audit-release/sbt_app_sql/src/main/scala/SqlApp.scala -- diff --git a/dev/audit-release/sbt_app_sql/src/main/scala/SqlApp.scala b/dev/audit-release/sbt_app_sql/src/main/scala/SqlApp.scala index 69c1154..1002631 100644 --- a/dev/audit-release/sbt_app_sql/src/main/scala/SqlApp.scala +++ b/dev/audit-release/sbt_app_sql/src/main/scala/SqlApp.scala @@ -41,7 +41,7 @@ object SparkSqlExample { import sqlContext._ val people = sc.makeRDD(1 to 100, 10).map(x => Person(s"Name$x", x)).toDF() -people.registerTempTable("people") +people.createOrReplaceTempView("people") val teenagers = sql("SELECT name FROM people 
WHERE age >= 13 AND age <= 19") val teenagerNames = teenagers.map(t => "Name: " + t(0)).collect() teenagerNames.foreach(println) @@ -52,7 +52,7 @@ object SparkSqlExample { System.exit(-1) } } - + test(teenagerNames.size == 7, "Unexpected number of selected elements: " + teenagerNames)
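The difference between the two new methods introduced by this commit is that `createTempView` fails when a view of the same name already exists, while `createOrReplaceTempView` silently overwrites it. The following plain-Scala sketch illustrates that semantic with a hypothetical in-memory registry; the class and messages are illustrative only, not Spark's actual session catalog implementation.

```scala
import scala.collection.mutable

// Hypothetical stand-in for a session's temp-view registry, mapping
// view names to a description of the underlying plan.
class TempViewRegistry {
  private val views = mutable.Map.empty[String, String]

  // createTempView: error out if the name is already taken.
  def createTempView(name: String, plan: String): Unit = {
    if (views.contains(name))
      throw new IllegalStateException(s"Temporary view '$name' already exists")
    views(name) = plan
  }

  // createOrReplaceTempView: silently overwrite any existing view.
  def createOrReplaceTempView(name: String, plan: String): Unit =
    views(name) = plan

  def lookup(name: String): Option[String] = views.get(name)
}

val registry = new TempViewRegistry
registry.createTempView("people", "scan of people DF")
registry.createOrReplaceTempView("people", "new scan")
println(registry.lookup("people")) // Some(new scan)
```

The deprecated `registerTempTable` behaved like `createOrReplaceTempView`, which is why callers in the diff above are migrated to the latter.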
spark git commit: [SPARK-14933][SQL] Failed to create view out of a parquet or orc table
Repository: spark Updated Branches: refs/heads/branch-2.0 3bd7a89bd -> 749c29bc0 [SPARK-14933][SQL] Failed to create view out of a parquet or orc table ## What changes were proposed in this pull request? Symptom: If a table is created as a Parquet or ORC table with Hive-syntax DDL, such as ```SQL create table t1 (c1 int, c2 string) stored as parquet ``` the following command will fail ```SQL create view v1 as select * from t1 ``` Root Cause: Currently, `HiveMetaStoreCatalog` converts Parquet/ORC tables to `LogicalRelation` without giving any `tableIdentifier`. `SQLBuilder` expects the `LogicalRelation` to have an associated `tableIdentifier`; however, the `LogicalRelation` created earlier does not have one, so `SQLBuilder.toSQL` cannot recognize this logical plan and throws an exception. This PR assigns a `TableIdentifier` to the `LogicalRelation` when resolving Parquet or ORC tables in `HiveMetaStoreCatalog`. ## How was this patch tested? Test cases were created and dev/run-tests was run. Author: xin Wu <xi...@us.ibm.com> Closes #12716 from xwu0226/SPARK_14933. 
(cherry picked from commit 427c20dd6d84cb9de1aac322183bc6e7b72ca25d) Signed-off-by: Cheng Lian <l...@databricks.com> Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/749c29bc Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/749c29bc Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/749c29bc Branch: refs/heads/branch-2.0 Commit: 749c29bc099c20aa6156b843cf7c9216315cd5a6 Parents: 3bd7a89 Author: xin Wu <xi...@us.ibm.com> Authored: Wed May 11 22:17:59 2016 +0800 Committer: Cheng Lian <l...@databricks.com> Committed: Wed May 11 22:21:40 2016 +0800 -- .../spark/sql/hive/HiveMetastoreCatalog.scala | 10 +-- .../sql/catalyst/LogicalPlanToSQLSuite.scala| 24 .../spark/sql/hive/execution/SQLViewSuite.scala | 30 3 files changed, 62 insertions(+), 2 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/749c29bc/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala -- diff --git a/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala b/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala index 7a799b6..607f0a1 100644 --- a/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala +++ b/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala @@ -293,7 +293,10 @@ private[hive] class HiveMetastoreCatalog(sparkSession: SparkSession) extends Log fileFormat = defaultSource, options = options) -val created = LogicalRelation(relation) +val created = LogicalRelation( + relation, + metastoreTableIdentifier = +Some(TableIdentifier(tableIdentifier.name, Some(tableIdentifier.database cachedDataSourceTables.put(tableIdentifier, created) created } @@ -317,7 +320,10 @@ private[hive] class HiveMetastoreCatalog(sparkSession: SparkSession) extends Log userSpecifiedSchema = Some(metastoreRelation.schema), bucketSpec = bucketSpec, options = options, - className = fileType).resolveRelation()) + 
className = fileType).resolveRelation(), + metastoreTableIdentifier = +Some(TableIdentifier(tableIdentifier.name, Some(tableIdentifier.database + cachedDataSourceTables.put(tableIdentifier, created) created http://git-wip-us.apache.org/repos/asf/spark/blob/749c29bc/sql/hive/src/test/scala/org/apache/spark/sql/catalyst/LogicalPlanToSQLSuite.scala -- diff --git a/sql/hive/src/test/scala/org/apache/spark/sql/catalyst/LogicalPlanToSQLSuite.scala b/sql/hive/src/test/scala/org/apache/spark/sql/catalyst/LogicalPlanToSQLSuite.scala index 9abefa5..4315197 100644 --- a/sql/hive/src/test/scala/org/apache/spark/sql/catalyst/LogicalPlanToSQLSuite.scala +++ b/sql/hive/src/test/scala/org/apache/spark/sql/catalyst/LogicalPlanToSQLSuite.scala @@ -741,4 +741,28 @@ class LogicalPlanToSQLSuite extends SQLBuilderTest with SQLTestUtils { test("filter after subquery") { checkHiveQl("SELECT a FROM (SELECT key + 1 AS a FROM parquet_t1) t WHERE a > 5") } + + test("SPARK-14933 - select parquet table") { +withTable("parquet_t") { + sql( +""" + |create table parquet_t (c1 int, c2 string) + |stored as parquet select 1, 'abc' +""".stripMargin) + + checkHiveQl
spark git commit: [SPARK-14933][SQL] Failed to create view out of a parquet or orc table
Repository: spark Updated Branches: refs/heads/master d88afabdf -> 427c20dd6 [SPARK-14933][SQL] Failed to create view out of a parquet or orc table ## What changes were proposed in this pull request? Symptom: If a table is created as a Parquet or ORC table with Hive-syntax DDL, such as ```SQL create table t1 (c1 int, c2 string) stored as parquet ``` the following command will fail ```SQL create view v1 as select * from t1 ``` Root Cause: Currently, `HiveMetaStoreCatalog` converts Parquet/ORC tables to `LogicalRelation` without giving any `tableIdentifier`. `SQLBuilder` expects the `LogicalRelation` to have an associated `tableIdentifier`; however, the `LogicalRelation` created earlier does not have one, so `SQLBuilder.toSQL` cannot recognize this logical plan and throws an exception. This PR assigns a `TableIdentifier` to the `LogicalRelation` when resolving Parquet or ORC tables in `HiveMetaStoreCatalog`. ## How was this patch tested? Test cases were created and dev/run-tests was run. Author: xin Wu <xi...@us.ibm.com> Closes #12716 from xwu0226/SPARK_14933. 
Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/427c20dd Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/427c20dd Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/427c20dd Branch: refs/heads/master Commit: 427c20dd6d84cb9de1aac322183bc6e7b72ca25d Parents: d88afab Author: xin Wu <xi...@us.ibm.com> Authored: Wed May 11 22:17:59 2016 +0800 Committer: Cheng Lian <l...@databricks.com> Committed: Wed May 11 22:17:59 2016 +0800 -- .../spark/sql/hive/HiveMetastoreCatalog.scala | 10 +-- .../sql/catalyst/LogicalPlanToSQLSuite.scala| 24 .../spark/sql/hive/execution/SQLViewSuite.scala | 30 3 files changed, 62 insertions(+), 2 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/427c20dd/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala -- diff --git a/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala b/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala index 7a799b6..607f0a1 100644 --- a/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala +++ b/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala @@ -293,7 +293,10 @@ private[hive] class HiveMetastoreCatalog(sparkSession: SparkSession) extends Log fileFormat = defaultSource, options = options) -val created = LogicalRelation(relation) +val created = LogicalRelation( + relation, + metastoreTableIdentifier = +Some(TableIdentifier(tableIdentifier.name, Some(tableIdentifier.database cachedDataSourceTables.put(tableIdentifier, created) created } @@ -317,7 +320,10 @@ private[hive] class HiveMetastoreCatalog(sparkSession: SparkSession) extends Log userSpecifiedSchema = Some(metastoreRelation.schema), bucketSpec = bucketSpec, options = options, - className = fileType).resolveRelation()) + className = fileType).resolveRelation(), + metastoreTableIdentifier = +Some(TableIdentifier(tableIdentifier.name, 
Some(tableIdentifier.database + cachedDataSourceTables.put(tableIdentifier, created) created http://git-wip-us.apache.org/repos/asf/spark/blob/427c20dd/sql/hive/src/test/scala/org/apache/spark/sql/catalyst/LogicalPlanToSQLSuite.scala -- diff --git a/sql/hive/src/test/scala/org/apache/spark/sql/catalyst/LogicalPlanToSQLSuite.scala b/sql/hive/src/test/scala/org/apache/spark/sql/catalyst/LogicalPlanToSQLSuite.scala index 9abefa5..4315197 100644 --- a/sql/hive/src/test/scala/org/apache/spark/sql/catalyst/LogicalPlanToSQLSuite.scala +++ b/sql/hive/src/test/scala/org/apache/spark/sql/catalyst/LogicalPlanToSQLSuite.scala @@ -741,4 +741,28 @@ class LogicalPlanToSQLSuite extends SQLBuilderTest with SQLTestUtils { test("filter after subquery") { checkHiveQl("SELECT a FROM (SELECT key + 1 AS a FROM parquet_t1) t WHERE a > 5") } + + test("SPARK-14933 - select parquet table") { +withTable("parquet_t") { + sql( +""" + |create table parquet_t (c1 int, c2 string) + |stored as parquet select 1, 'abc' +""".stripMargin) + + checkHiveQl("select * from parquet_t") +} + } + + test("SPARK-14933 - select orc table") { +withTable("orc_t&q
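The root cause can be reduced to a small sketch: SQL generation needs a table name to render a `FROM` clause, so a relation carrying `None` for its identifier cannot be converted back to SQL. The classes and the `toSQL` function below are simplified illustrations, not Spark's real `LogicalRelation` or `SQLBuilder`.

```scala
// Simplified models of the relevant pieces (illustrative only).
case class TableIdentifier(name: String, database: Option[String]) {
  def quoted: String = (database.toSeq :+ name).mkString(".")
}
case class LogicalRelation(source: String, metastoreTableIdentifier: Option[TableIdentifier])

// Sketch of why toSQL fails without an identifier: there is no table
// name to put in the FROM clause.
def toSQL(rel: LogicalRelation): String = rel.metastoreTableIdentifier match {
  case Some(id) => s"SELECT * FROM ${id.quoted}"
  case None => throw new UnsupportedOperationException(
    s"Cannot convert relation over ${rel.source} to SQL: no table identifier")
}

// Before the fix the identifier was missing; the fix attaches it.
val before = LogicalRelation("parquet", None)
val after = LogicalRelation("parquet", Some(TableIdentifier("t1", Some("default"))))
println(toSQL(after)) // SELECT * FROM default.t1
```

This mirrors the change in the diff, where `HiveMetastoreCatalog` now passes `metastoreTableIdentifier = Some(TableIdentifier(...))` when constructing the `LogicalRelation`.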
spark git commit: [SPARK-14459][SQL] Detect relation partitioning and adjust the logical plan
Repository: spark Updated Branches: refs/heads/branch-2.0 cbb4fa1b1 -> 8caaaed95 [SPARK-14459][SQL] Detect relation partitioning and adjust the logical plan ## What changes were proposed in this pull request? This detects a relation's partitioning and adds checks to the analyzer. If an InsertIntoTable node has no partitioning, it is replaced by the relation's partition scheme and input columns are correctly adjusted, placing the partition columns at the end in partition order. If an InsertIntoTable node has partitioning, it is checked against the table's reported partitions. These changes required adding a PartitionedRelation trait to the catalog interface because Hive's MetastoreRelation doesn't extend CatalogRelation. This commit also includes a fix to InsertIntoTable's resolved logic, which now detects that all expected columns are present, including dynamic partition columns. Previously, the number of expected columns was not checked and resolved was true if there were missing columns. ## How was this patch tested? This adds new tests to the InsertIntoTableSuite that are fixed by this PR. Author: Ryan Blue <b...@apache.org> Closes #12239 from rdblue/SPARK-14459-detect-hive-partitioning. 
(cherry picked from commit 652bbb1bf62722b08a062c7a2bf72019f85e179e) Signed-off-by: Cheng Lian <l...@databricks.com> Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/8caaaed9 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/8caaaed9 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/8caaaed9 Branch: refs/heads/branch-2.0 Commit: 8caaaed952cef533c36cf3c0020245f1f3eb62c8 Parents: cbb4fa1 Author: Ryan Blue <b...@apache.org> Authored: Mon May 9 17:01:23 2016 +0800 Committer: Cheng Lian <l...@databricks.com> Committed: Mon May 9 17:03:21 2016 +0800 -- .../spark/sql/catalyst/analysis/Analyzer.scala | 41 +- .../plans/logical/basicLogicalOperators.scala | 19 - .../hive/execution/InsertIntoHiveTable.scala| 10 ++- .../sql/hive/InsertIntoHiveTableSuite.scala | 83 +++- .../sql/hive/execution/HiveQuerySuite.scala | 2 +- 5 files changed, 143 insertions(+), 12 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/8caaaed9/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala index 9e9a856..b7884f9 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala @@ -22,7 +22,7 @@ import scala.collection.mutable.ArrayBuffer import org.apache.spark.sql.AnalysisException import org.apache.spark.sql.catalyst.{CatalystConf, ScalaReflection, SimpleCatalystConf} -import org.apache.spark.sql.catalyst.catalog.{InMemoryCatalog, SessionCatalog} +import org.apache.spark.sql.catalyst.catalog.{CatalogRelation, InMemoryCatalog, SessionCatalog} import org.apache.spark.sql.catalyst.encoders.OuterScopes import org.apache.spark.sql.catalyst.expressions._ import 
org.apache.spark.sql.catalyst.expressions.aggregate._ @@ -444,8 +444,43 @@ class Analyzer( } def apply(plan: LogicalPlan): LogicalPlan = plan resolveOperators { - case i @ InsertIntoTable(u: UnresolvedRelation, _, _, _, _) => -i.copy(table = EliminateSubqueryAliases(lookupTableFromCatalog(u))) + case i @ InsertIntoTable(u: UnresolvedRelation, parts, child, _, _) if child.resolved => +val table = lookupTableFromCatalog(u) +// adding the table's partitions or validate the query's partition info +table match { + case relation: CatalogRelation if relation.catalogTable.partitionColumns.nonEmpty => +val tablePartitionNames = relation.catalogTable.partitionColumns.map(_.name) +if (parts.keys.nonEmpty) { + // the query's partitioning must match the table's partitioning + // this is set for queries like: insert into ... partition (one = "a", two = ) + // TODO: add better checking to pre-inserts to avoid needing this here + if (tablePartitionNames.size != parts.keySet.size) { +throw new AnalysisException( + s"""Requested partitioning does not match the ${u.tableIdentifier} table: + |Requested partitions: ${parts.keys.mkString(",")} + |Table partitions: ${tablePartitionNames.mkString(",")}""".stripMar
spark git commit: [SPARK-14459][SQL] Detect relation partitioning and adjust the logical plan
Repository: spark Updated Branches: refs/heads/master 16a503cf0 -> 652bbb1bf [SPARK-14459][SQL] Detect relation partitioning and adjust the logical plan ## What changes were proposed in this pull request? This detects a relation's partitioning and adds checks to the analyzer. If an InsertIntoTable node has no partitioning, it is replaced by the relation's partition scheme and input columns are correctly adjusted, placing the partition columns at the end in partition order. If an InsertIntoTable node has partitioning, it is checked against the table's reported partitions. These changes required adding a PartitionedRelation trait to the catalog interface because Hive's MetastoreRelation doesn't extend CatalogRelation. This commit also includes a fix to InsertIntoTable's resolved logic, which now detects that all expected columns are present, including dynamic partition columns. Previously, the number of expected columns was not checked and resolved was true if there were missing columns. ## How was this patch tested? This adds new tests to the InsertIntoTableSuite that are fixed by this PR. Author: Ryan Blue <b...@apache.org> Closes #12239 from rdblue/SPARK-14459-detect-hive-partitioning. 
Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/652bbb1b Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/652bbb1b Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/652bbb1b Branch: refs/heads/master Commit: 652bbb1bf62722b08a062c7a2bf72019f85e179e Parents: 16a503c Author: Ryan Blue <b...@apache.org> Authored: Mon May 9 17:01:23 2016 +0800 Committer: Cheng Lian <l...@databricks.com> Committed: Mon May 9 17:01:23 2016 +0800 -- .../spark/sql/catalyst/analysis/Analyzer.scala | 41 +- .../plans/logical/basicLogicalOperators.scala | 19 - .../hive/execution/InsertIntoHiveTable.scala| 10 ++- .../sql/hive/InsertIntoHiveTableSuite.scala | 83 +++- .../sql/hive/execution/HiveQuerySuite.scala | 2 +- 5 files changed, 143 insertions(+), 12 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/652bbb1b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala index 9e9a856..b7884f9 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala @@ -22,7 +22,7 @@ import scala.collection.mutable.ArrayBuffer import org.apache.spark.sql.AnalysisException import org.apache.spark.sql.catalyst.{CatalystConf, ScalaReflection, SimpleCatalystConf} -import org.apache.spark.sql.catalyst.catalog.{InMemoryCatalog, SessionCatalog} +import org.apache.spark.sql.catalyst.catalog.{CatalogRelation, InMemoryCatalog, SessionCatalog} import org.apache.spark.sql.catalyst.encoders.OuterScopes import org.apache.spark.sql.catalyst.expressions._ import org.apache.spark.sql.catalyst.expressions.aggregate._ @@ -444,8 +444,43 @@ class Analyzer( } def apply(plan: LogicalPlan): LogicalPlan = plan 
resolveOperators { - case i @ InsertIntoTable(u: UnresolvedRelation, _, _, _, _) => -i.copy(table = EliminateSubqueryAliases(lookupTableFromCatalog(u))) + case i @ InsertIntoTable(u: UnresolvedRelation, parts, child, _, _) if child.resolved => +val table = lookupTableFromCatalog(u) +// adding the table's partitions or validate the query's partition info +table match { + case relation: CatalogRelation if relation.catalogTable.partitionColumns.nonEmpty => +val tablePartitionNames = relation.catalogTable.partitionColumns.map(_.name) +if (parts.keys.nonEmpty) { + // the query's partitioning must match the table's partitioning + // this is set for queries like: insert into ... partition (one = "a", two = ) + // TODO: add better checking to pre-inserts to avoid needing this here + if (tablePartitionNames.size != parts.keySet.size) { +throw new AnalysisException( + s"""Requested partitioning does not match the ${u.tableIdentifier} table: + |Requested partitions: ${parts.keys.mkString(",")} + |Table partitions: ${tablePartitionNames.mkString(",")}""".stripMargin) + } + // Assume partition columns are correctly placed at the end of the child's
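The size check the analyzer performs on an explicit partition spec can be sketched in isolation: if the INSERT names any static partitions, the number of requested partition columns must equal the number the table declares. The function below is an illustrative reduction of that check, using plain collections rather than Spark's catalog types; the name and exception are hypothetical.

```scala
// Illustrative sketch of the partition-spec validation described above.
// requestedPartitions maps partition column -> optional static value,
// e.g. "insert into ... partition (one = \"a\", two)" gives
// Map("one" -> Some("a"), "two" -> None).
def checkPartitionSpec(
    tablePartitionNames: Seq[String],
    requestedPartitions: Map[String, Option[String]]): Unit = {
  if (requestedPartitions.nonEmpty &&
      tablePartitionNames.size != requestedPartitions.keySet.size) {
    throw new IllegalArgumentException(
      s"""Requested partitioning does not match the table:
         |Requested partitions: ${requestedPartitions.keys.mkString(",")}
         |Table partitions: ${tablePartitionNames.mkString(",")}""".stripMargin)
  }
}

// A full spec passes; an empty spec also passes because the analyzer
// then substitutes the table's own partition scheme.
checkPartitionSpec(Seq("one", "two"), Map("one" -> Some("a"), "two" -> None))
checkPartitionSpec(Seq("one", "two"), Map.empty)
```

A partial spec such as `Map("one" -> Some("a"))` against a two-column partitioning is rejected, which is exactly the new `AnalysisException` path in the diff.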
spark git commit: [SPARK-15211][SQL] Select features column from LibSVMRelation causes failure
Repository: spark Updated Branches: refs/heads/branch-2.0 c0c5c264f -> 238b7b416 [SPARK-15211][SQL] Select features column from LibSVMRelation causes failure ## What changes were proposed in this pull request? We need to use `requiredSchema` in `LibSVMRelation` to project only the required columns when loading data from this data source. Otherwise, when users try to select the `features` column, the query fails. ## How was this patch tested? `LibSVMRelationSuite`. Author: Liang-Chi Hsieh <sim...@tw.ibm.com> Closes #12986 from viirya/fix-libsvmrelation. (cherry picked from commit 635ef407e11dec41ae9bc428935fb8fdaa482f7e) Signed-off-by: Cheng Lian <l...@databricks.com> Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/238b7b41 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/238b7b41 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/238b7b41 Branch: refs/heads/branch-2.0 Commit: 238b7b4162191fc6346ab8c7af365f83f14e8946 Parents: c0c5c26 Author: Liang-Chi Hsieh <sim...@tw.ibm.com> Authored: Mon May 9 15:05:06 2016 +0800 Committer: Cheng Lian <l...@databricks.com> Committed: Mon May 9 15:05:26 2016 +0800 -- .../apache/spark/ml/source/libsvm/LibSVMRelation.scala| 10 +- .../spark/ml/source/libsvm/LibSVMRelationSuite.scala | 1 + 2 files changed, 10 insertions(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/238b7b41/mllib/src/main/scala/org/apache/spark/ml/source/libsvm/LibSVMRelation.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/source/libsvm/LibSVMRelation.scala b/mllib/src/main/scala/org/apache/spark/ml/source/libsvm/LibSVMRelation.scala index 5f78fab..68a855c 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/source/libsvm/LibSVMRelation.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/source/libsvm/LibSVMRelation.scala @@ -203,10 +203,18 @@ class DefaultSource extends FileFormat with DataSourceRegister { } val converter = 
RowEncoder(dataSchema) + val fullOutput = dataSchema.map { f => +AttributeReference(f.name, f.dataType, f.nullable, f.metadata)() + } + val requiredOutput = fullOutput.filter { a => +requiredSchema.fieldNames.contains(a.name) + } + + val requiredColumns = GenerateUnsafeProjection.generate(requiredOutput, fullOutput) points.map { pt => val features = if (sparse) pt.features.toSparse else pt.features.toDense -converter.toRow(Row(pt.label, features)) +requiredColumns(converter.toRow(Row(pt.label, features))) } } } http://git-wip-us.apache.org/repos/asf/spark/blob/238b7b41/mllib/src/test/scala/org/apache/spark/ml/source/libsvm/LibSVMRelationSuite.scala -- diff --git a/mllib/src/test/scala/org/apache/spark/ml/source/libsvm/LibSVMRelationSuite.scala b/mllib/src/test/scala/org/apache/spark/ml/source/libsvm/LibSVMRelationSuite.scala index e52fbd7..1d7144f 100644 --- a/mllib/src/test/scala/org/apache/spark/ml/source/libsvm/LibSVMRelationSuite.scala +++ b/mllib/src/test/scala/org/apache/spark/ml/source/libsvm/LibSVMRelationSuite.scala @@ -108,5 +108,6 @@ class LibSVMRelationSuite extends SparkFunSuite with MLlibTestSparkContext { test("select features from libsvm relation") { val df = sqlContext.read.format("libsvm").load(path) df.select("features").rdd.map { case Row(d: Vector) => d }.first +df.select("features").collect } } - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-15211][SQL] Select features column from LibSVMRelation causes failure
Repository: spark Updated Branches: refs/heads/master a59ab594c -> 635ef407e [SPARK-15211][SQL] Select features column from LibSVMRelation causes failure ## What changes were proposed in this pull request? We need to use `requiredSchema` in `LibSVMRelation` to project only the required columns when loading data from this data source. Otherwise, when users try to select the `features` column, the query fails. ## How was this patch tested? `LibSVMRelationSuite`. Author: Liang-Chi Hsieh <sim...@tw.ibm.com> Closes #12986 from viirya/fix-libsvmrelation. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/635ef407 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/635ef407 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/635ef407 Branch: refs/heads/master Commit: 635ef407e11dec41ae9bc428935fb8fdaa482f7e Parents: a59ab59 Author: Liang-Chi Hsieh <sim...@tw.ibm.com> Authored: Mon May 9 15:05:06 2016 +0800 Committer: Cheng Lian <l...@databricks.com> Committed: Mon May 9 15:05:06 2016 +0800 -- .../apache/spark/ml/source/libsvm/LibSVMRelation.scala| 10 +- .../spark/ml/source/libsvm/LibSVMRelationSuite.scala | 1 + 2 files changed, 10 insertions(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/635ef407/mllib/src/main/scala/org/apache/spark/ml/source/libsvm/LibSVMRelation.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/source/libsvm/LibSVMRelation.scala b/mllib/src/main/scala/org/apache/spark/ml/source/libsvm/LibSVMRelation.scala index 5f78fab..68a855c 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/source/libsvm/LibSVMRelation.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/source/libsvm/LibSVMRelation.scala @@ -203,10 +203,18 @@ class DefaultSource extends FileFormat with DataSourceRegister { } val converter = RowEncoder(dataSchema) + val fullOutput = dataSchema.map { f => +AttributeReference(f.name, f.dataType, f.nullable, f.metadata)() + } + val 
requiredOutput = fullOutput.filter { a => +requiredSchema.fieldNames.contains(a.name) + } + + val requiredColumns = GenerateUnsafeProjection.generate(requiredOutput, fullOutput) points.map { pt => val features = if (sparse) pt.features.toSparse else pt.features.toDense -converter.toRow(Row(pt.label, features)) +requiredColumns(converter.toRow(Row(pt.label, features))) } } } http://git-wip-us.apache.org/repos/asf/spark/blob/635ef407/mllib/src/test/scala/org/apache/spark/ml/source/libsvm/LibSVMRelationSuite.scala -- diff --git a/mllib/src/test/scala/org/apache/spark/ml/source/libsvm/LibSVMRelationSuite.scala b/mllib/src/test/scala/org/apache/spark/ml/source/libsvm/LibSVMRelationSuite.scala index e52fbd7..1d7144f 100644 --- a/mllib/src/test/scala/org/apache/spark/ml/source/libsvm/LibSVMRelationSuite.scala +++ b/mllib/src/test/scala/org/apache/spark/ml/source/libsvm/LibSVMRelationSuite.scala @@ -108,5 +108,6 @@ class LibSVMRelationSuite extends SparkFunSuite with MLlibTestSparkContext { test("select features from libsvm relation") { val df = sqlContext.read.format("libsvm").load(path) df.select("features").rdd.map { case Row(d: Vector) => d }.first +df.select("features").collect } }
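The core of the fix is a name-based pruning step: build attribute references for the full data schema, then keep only those whose names appear in the required schema before projecting each row. The sketch below reproduces that pruning with a plain case class instead of Spark's `AttributeReference`, so it stands alone; the types are illustrative.

```scala
// Illustrative stand-in for Spark's AttributeReference.
case class AttributeRef(name: String, dataType: String)

// Keep only the attributes of the full output that the query requires,
// preserving their original order -- the same filtering the patch does
// before generating the unsafe projection.
def prune(fullOutput: Seq[AttributeRef], requiredFieldNames: Set[String]): Seq[AttributeRef] =
  fullOutput.filter(a => requiredFieldNames.contains(a.name))

val fullOutput = Seq(AttributeRef("label", "double"), AttributeRef("features", "vector"))
println(prune(fullOutput, Set("features"))) // List(AttributeRef(features,vector))
```

Without this step the relation always emitted both `label` and `features`, so a query selecting only `features` received rows whose layout did not match its expected schema, causing the reported failure.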
spark git commit: [SPARK-14962][SQL] Do not push down isnotnull/isnull on unsupported types in ORC
Repository: spark Updated Branches: refs/heads/branch-2.0 3f6a13c8a -> d7c755561 [SPARK-14962][SQL] Do not push down isnotnull/isnull on unsupported types in ORC ## What changes were proposed in this pull request? https://issues.apache.org/jira/browse/SPARK-14962 ORC filters were being pushed down for all types for both `IsNull` and `IsNotNull`. This appears to work because neither `IsNull` nor `IsNotNull` takes a type as an argument when building filters (`SearchArgument`) on the Spark side (Hive 1.2.x), but the filters do not evaluate correctly, because stored statistics always produce `null` for unsupported types (e.g. `ArrayType`) on the ORC side. As a result, `IsNull` always evaluates to `true`, which makes `IsNotNull` always `false`. (Please see [RecordReaderImpl.java#L296-L318](https://github.com/apache/hive/blob/branch-1.2/ql/src/java/org/apache/hadoop/hive/ql/io/orc/RecordReaderImpl.java#L296-L318) and [RecordReaderImpl.java#L359-L365](https://github.com/apache/hive/blob/branch-1.2/ql/src/java/org/apache/hadoop/hive/ql/io/orc/RecordReaderImpl.java#L359-L365) in Hive 1.2.) Hive 1.3.x and later prevent this by requiring a type ([`PredicateLeaf.Type`](https://github.com/apache/hive/blob/e085b7e9bd059d91aaf013df0db4d71dca90ec6f/storage-api/src/java/org/apache/hadoop/hive/ql/io/sarg/PredicateLeaf.java#L50-L56)) when building a filter ([`SearchArgument`](https://github.com/apache/hive/blob/26b5c7b56a4f28ce3eabc0207566cce46b29b558/storage-api/src/java/org/apache/hadoop/hive/ql/io/sarg/SearchArgument.java#L260)), but Hive 1.2.x does not. This PR prevents ORC filter creation for `IsNull` and `IsNotNull` on unsupported types. `OrcFilters` now resembles `ParquetFilters`. ## How was this patch tested? Unit tests in `OrcQuerySuite` and `OrcFilterSuite`, plus `sbt scalastyle`. Author: hyukjinkwon <gurwls...@gmail.com> Author: Hyukjin Kwon <gurwls...@gmail.com> Closes #12777 from HyukjinKwon/SPARK-14962. 
(cherry picked from commit fa928ff9a3c1de5d5aff9d14e6bc1bd03fcca087) Signed-off-by: Cheng Lian <l...@databricks.com> Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/d7c75556 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/d7c75556 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/d7c75556 Branch: refs/heads/branch-2.0 Commit: d7c755561270ee8ec1c44df2e10a8bcb4985c3de Parents: 3f6a13c Author: hyukjinkwon <gurwls...@gmail.com> Authored: Sat May 7 01:46:45 2016 +0800 Committer: Cheng Lian <l...@databricks.com> Committed: Sat May 7 01:53:08 2016 +0800 -- .../apache/spark/sql/test/SQLTestUtils.scala| 2 +- .../apache/spark/sql/hive/orc/OrcFilters.scala | 63 .../apache/spark/sql/hive/orc/OrcRelation.scala | 19 ++--- .../spark/sql/hive/orc/OrcFilterSuite.scala | 75 .../spark/sql/hive/orc/OrcQuerySuite.scala | 14 .../spark/sql/hive/orc/OrcSourceSuite.scala | 9 ++- 6 files changed, 126 insertions(+), 56 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/d7c75556/sql/core/src/test/scala/org/apache/spark/sql/test/SQLTestUtils.scala -- diff --git a/sql/core/src/test/scala/org/apache/spark/sql/test/SQLTestUtils.scala b/sql/core/src/test/scala/org/apache/spark/sql/test/SQLTestUtils.scala index ffb206a..6d2b95e 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/test/SQLTestUtils.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/test/SQLTestUtils.scala @@ -213,7 +213,7 @@ private[sql] trait SQLTestUtils */ protected def stripSparkFilter(df: DataFrame): DataFrame = { val schema = df.schema -val withoutFilters = df.queryExecution.sparkPlan transform { +val withoutFilters = df.queryExecution.sparkPlan.transform { case FilterExec(_, child) => child } http://git-wip-us.apache.org/repos/asf/spark/blob/d7c75556/sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFilters.scala -- diff --git a/sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFilters.scala 
b/sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFilters.scala index c025c12..c463bc8 100644 --- a/sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFilters.scala +++ b/sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFilters.scala @@ -17,13 +17,12 @@ package org.apache.spark.sql.hive.orc -import org.apache.hadoop.hive.common.`type`.{HiveChar, HiveDecimal, HiveVarchar} import org.apache.hadoop.hive.ql.io.sarg.{SearchArgument, SearchArgumentFactory} import org.apache.hadoop.hive.ql.io.sarg.SearchArgument.Builder -import org.apache.hadoop.hive.serde2.io.DateWritable import or
spark git commit: [SPARK-14962][SQL] Do not push down isnotnull/isnull on unsupported types in ORC
Repository: spark Updated Branches: refs/heads/master a03c5e68a -> fa928ff9a [SPARK-14962][SQL] Do not push down isnotnull/isnull on unsupported types in ORC ## What changes were proposed in this pull request? https://issues.apache.org/jira/browse/SPARK-14962 ORC filters were being pushed down for all types for both `IsNull` and `IsNotNull`. This appears fine on the Spark side, because in Hive 1.2.x neither `IsNull` nor `IsNotNull` takes a type as an argument when building filters (`SearchArgument`), but the filters do not evaluate correctly on the ORC side: the stored statistics always produce `null` for unsupported types (e.g. `ArrayType`). As a result, `IsNull` always evaluates to `true`, which in turn makes `IsNotNull` always evaluate to `false`. (Please see [RecordReaderImpl.java#L296-L318](https://github.com/apache/hive/blob/branch-1.2/ql/src/java/org/apache/hadoop/hive/ql/io/orc/RecordReaderImpl.java#L296-L318) and [RecordReaderImpl.java#L359-L365](https://github.com/apache/hive/blob/branch-1.2/ql/src/java/org/apache/hadoop/hive/ql/io/orc/RecordReaderImpl.java#L359-L365) in Hive 1.2.) Hive 1.3.x and later seem to prevent this by requiring a type ([`PredicateLeaf.Type`](https://github.com/apache/hive/blob/e085b7e9bd059d91aaf013df0db4d71dca90ec6f/storage-api/src/java/org/apache/hadoop/hive/ql/io/sarg/PredicateLeaf.java#L50-L56)) when building a filter ([`SearchArgument`](https://github.com/apache/hive/blob/26b5c7b56a4f28ce3eabc0207566cce46b29b558/storage-api/src/java/org/apache/hadoop/hive/ql/io/sarg/SearchArgument.java#L260)), but Hive 1.2.x does not appear to do this. This PR prevents ORC filter creation for `IsNull` and `IsNotNull` on unsupported types, and makes `OrcFilters` resemble `ParquetFilters`. ## How was this patch tested? Unit tests in `OrcQuerySuite` and `OrcFilterSuite`, plus `sbt scalastyle`. Author: hyukjinkwon <gurwls...@gmail.com> Author: Hyukjin Kwon <gurwls...@gmail.com> Closes #12777 from HyukjinKwon/SPARK-14962. 
Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/fa928ff9 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/fa928ff9 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/fa928ff9 Branch: refs/heads/master Commit: fa928ff9a3c1de5d5aff9d14e6bc1bd03fcca087 Parents: a03c5e6 Author: hyukjinkwon <gurwls...@gmail.com> Authored: Sat May 7 01:46:45 2016 +0800 Committer: Cheng Lian <l...@databricks.com> Committed: Sat May 7 01:46:45 2016 +0800 -- .../apache/spark/sql/test/SQLTestUtils.scala| 2 +- .../apache/spark/sql/hive/orc/OrcFilters.scala | 63 .../apache/spark/sql/hive/orc/OrcRelation.scala | 19 ++--- .../spark/sql/hive/orc/OrcFilterSuite.scala | 75 .../spark/sql/hive/orc/OrcQuerySuite.scala | 14 .../spark/sql/hive/orc/OrcSourceSuite.scala | 9 ++- 6 files changed, 126 insertions(+), 56 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/fa928ff9/sql/core/src/test/scala/org/apache/spark/sql/test/SQLTestUtils.scala -- diff --git a/sql/core/src/test/scala/org/apache/spark/sql/test/SQLTestUtils.scala b/sql/core/src/test/scala/org/apache/spark/sql/test/SQLTestUtils.scala index ffb206a..6d2b95e 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/test/SQLTestUtils.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/test/SQLTestUtils.scala @@ -213,7 +213,7 @@ private[sql] trait SQLTestUtils */ protected def stripSparkFilter(df: DataFrame): DataFrame = { val schema = df.schema -val withoutFilters = df.queryExecution.sparkPlan transform { +val withoutFilters = df.queryExecution.sparkPlan.transform { case FilterExec(_, child) => child } http://git-wip-us.apache.org/repos/asf/spark/blob/fa928ff9/sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFilters.scala -- diff --git a/sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFilters.scala b/sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFilters.scala index c025c12..c463bc8 100644 --- 
a/sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFilters.scala +++ b/sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFilters.scala @@ -17,13 +17,12 @@ package org.apache.spark.sql.hive.orc -import org.apache.hadoop.hive.common.`type`.{HiveChar, HiveDecimal, HiveVarchar} import org.apache.hadoop.hive.ql.io.sarg.{SearchArgument, SearchArgumentFactory} import org.apache.hadoop.hive.ql.io.sarg.SearchArgument.Builder -import org.apache.hadoop.hive.serde2.io.DateWritable import org.apache.spark.internal.Logging import org.apache.spark.sql.sources._ +import org.apache.spark.sql.types._ /** * Helper object f
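The guard this patch describes — refusing to build `IsNull`/`IsNotNull` leaves for types whose ORC statistics are unreliable — can be sketched as a toy model in plain Scala. Everything below is illustrative: `TinyOrcFilters`, its type hierarchy, and the string result are stand-ins, not the real `OrcFilters`/`SearchArgument` API.

```scala
// Toy model: only push IsNull/IsNotNull down for types whose ORC row-group
// statistics are meaningful in Hive 1.2.x; everything else stays in Spark.
sealed trait DataType
case object IntegerType extends DataType
case object StringType extends DataType
case class ArrayType(elementType: DataType) extends DataType

sealed trait Filter
case class IsNull(attribute: String) extends Filter
case class IsNotNull(attribute: String) extends Filter

object TinyOrcFilters {
  // ORC statistics exist only for atomic types here, so a pushed-down
  // IsNull on e.g. ArrayType would evaluate incorrectly (always true).
  private def isSearchable(dt: DataType): Boolean = dt match {
    case IntegerType | StringType => true
    case _: ArrayType             => false
  }

  // Some(leaf description) when the filter may be pushed down,
  // None when it must be evaluated by Spark after the scan.
  def createFilter(schema: Map[String, DataType], filter: Filter): Option[String] =
    filter match {
      case IsNull(a) if schema.get(a).exists(isSearchable)    => Some(s"leaf: $a IS NULL")
      case IsNotNull(a) if schema.get(a).exists(isSearchable) => Some(s"leaf: $a IS NOT NULL")
      case _                                                  => None
    }
}
```

A filter that comes back as `None` is simply not handed to ORC, which is exactly the safe fallback behavior the PR aims for.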
spark git commit: [SPARK-14139][SQL] RowEncoder should preserve schema nullability
Repository: spark Updated Branches: refs/heads/master 77361a433 -> 55cc1c991 [SPARK-14139][SQL] RowEncoder should preserve schema nullability ## What changes were proposed in this pull request? The problem: in `RowEncoder`, we use `Invoke` to get the field of an external row, which loses the nullability information. This PR creates a `GetExternalRowField` expression, so that we can preserve the nullability info. TODO (follow-up PR): simplify the null handling logic in `RowEncoder` to remove the many `if` branches. ## How was this patch tested? New tests in `RowEncoderSuite`. Note that this PR takes over https://github.com/apache/spark/pull/11980 with a little simplification, so all credit should go to koertkuipers. Author: Wenchen Fan <wenc...@databricks.com> Author: Koert Kuipers <ko...@tresata.com> Closes #12364 from cloud-fan/nullable. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/55cc1c99 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/55cc1c99 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/55cc1c99 Branch: refs/heads/master Commit: 55cc1c991a9e39efb14177a948b09b7909e53e25 Parents: 77361a4 Author: Wenchen Fan <wenc...@databricks.com> Authored: Fri May 6 01:08:04 2016 +0800 Committer: Cheng Lian <l...@databricks.com> Committed: Fri May 6 01:08:04 2016 +0800 -- .../sql/catalyst/encoders/RowEncoder.scala | 36 ++--- .../sql/catalyst/expressions/objects.scala | 42 .../sql/catalyst/encoders/RowEncoderSuite.scala | 8 .../org/apache/spark/sql/DatasetSuite.scala | 18 - 4 files changed, 88 insertions(+), 16 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/55cc1c99/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/RowEncoder.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/RowEncoder.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/RowEncoder.scala index 44e135c..cfde3bf 100644 ---
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/RowEncoder.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/RowEncoder.scala @@ -35,9 +35,8 @@ import org.apache.spark.unsafe.types.UTF8String object RowEncoder { def apply(schema: StructType): ExpressionEncoder[Row] = { val cls = classOf[Row] -val inputObject = BoundReference(0, ObjectType(cls), nullable = true) -// We use an If expression to wrap extractorsFor result of StructType -val serializer = serializerFor(inputObject, schema).asInstanceOf[If].falseValue +val inputObject = BoundReference(0, ObjectType(cls), nullable = false) +val serializer = serializerFor(inputObject, schema) val deserializer = deserializerFor(schema) new ExpressionEncoder[Row]( schema, @@ -130,21 +129,28 @@ object RowEncoder { case StructType(fields) => val convertedFields = fields.zipWithIndex.map { case (f, i) => -val method = if (f.dataType.isInstanceOf[StructType]) { - "getStruct" +val fieldValue = serializerFor( + GetExternalRowField(inputObject, i, externalDataTypeForInput(f.dataType)), + f.dataType +) +if (f.nullable) { + If( +Invoke(inputObject, "isNullAt", BooleanType, Literal(i) :: Nil), +Literal.create(null, f.dataType), +fieldValue + ) } else { - "get" + fieldValue } -If( - Invoke(inputObject, "isNullAt", BooleanType, Literal(i) :: Nil), - Literal.create(null, f.dataType), - serializerFor( -Invoke(inputObject, method, externalDataTypeForInput(f.dataType), Literal(i) :: Nil), -f.dataType)) } - If(IsNull(inputObject), -Literal.create(null, inputType), -CreateStruct(convertedFields)) + + if (inputObject.nullable) { +If(IsNull(inputObject), + Literal.create(null, inputType), + CreateStruct(convertedFields)) + } else { +CreateStruct(convertedFields) + } } /** http://git-wip-us.apache.org/repos/asf/spark/blob/55cc1c99/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects.scala -- diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects.scala index 523eed8..dbaff16 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expr
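The nullability-aware serialization in the diff above — wrap a field access in a null check only when the schema declares that field nullable — can be sketched as a toy model in plain Scala. This is not the real expression-tree code; `Field` and `serializeRow` are illustrative stand-ins for `If(isNullAt, null, value)` generation.

```scala
// Toy model of the patch's idea: nullable fields get a guarded access that may
// yield null, while non-nullable fields are read directly with no guard emitted,
// so declared schema nullability survives serialization.
case class Field(name: String, nullable: Boolean)

def serializeRow(fields: Seq[Field], row: Map[String, Any]): Seq[Any] =
  fields.map { f =>
    if (f.nullable) row.getOrElse(f.name, null) // guarded: absent value becomes null
    else row(f.name)                            // trusted non-null: throws if truly absent
  }
```

The key point mirrored here is that the guard is driven by the schema's `nullable` flag rather than applied unconditionally, which is what the old `Invoke`-based path lost.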
spark git commit: [SPARK-14127][SQL] Native "DESC [EXTENDED | FORMATTED] <table>" DDL command
Repository: spark Updated Branches: refs/heads/branch-2.0 b99f715e8 -> 1fba7595f [SPARK-14127][SQL] Native "DESC [EXTENDED | FORMATTED] <table>" DDL command ## What changes were proposed in this pull request? This PR implements the native `DESC [EXTENDED | FORMATTED] <table>` DDL command. Sample output:
```
scala> spark.sql("desc extended src").show(100, truncate = false)
+----------------------------+---------------------------------+-------+
|col_name                    |data_type                        |comment|
+----------------------------+---------------------------------+-------+
|key                         |int                              |       |
|value                       |string                           |       |
|                            |                                 |       |
|# Detailed Table Information|CatalogTable(`default`.`src`, ...|       |
+----------------------------+---------------------------------+-------+

scala> spark.sql("desc formatted src").show(100, truncate = false)
+----------------------------+----------------------------------------------------------+-------+
|col_name                    |data_type                                                 |comment|
+----------------------------+----------------------------------------------------------+-------+
|key                         |int                                                       |       |
|value                       |string                                                    |       |
|                            |                                                          |       |
|# Detailed Table Information|                                                          |       |
|Database:                   |default                                                   |       |
|Owner:                      |lian                                                      |       |
|Create Time:                |Mon Jan 04 17:06:00 CST 2016                              |       |
|Last Access Time:           |Thu Jan 01 08:00:00 CST 1970                              |       |
|Location:                   |hdfs://localhost:9000/user/hive/warehouse_hive121/src     |       |
|Table Type:                 |MANAGED                                                   |       |
|Table Parameters:           |                                                          |       |
|  transient_lastDdlTime     |1451898360                                                |       |
|                            |                                                          |       |
|# Storage Information       |                                                          |       |
|SerDe Library:              |org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe        |       |
|InputFormat:                |org.apache.hadoop.mapred.TextInputFormat                  |       |
|OutputFormat:               |org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat|       |
|Num Buckets:                |-1                                                        |       |
|Bucket Columns:             |[]                                                        |       |
|Sort Columns:               |[]                                                        |       |
|Storage Desc Parameters:    |                                                          |       |
|  serialization.format      |1                                                         |       |
+----------------------------+----------------------------------------------------------+-------+
```
## How was this patch tested? A test case is added to `HiveDDLSuite` to check command output. Author: Cheng Lian <l...@databricks.com> Closes #12844 from liancheng/spark-14127-desc-table. 
(cherry picked from commit f152fae306dc75565cb4648ee1211416d7c0bb23) Signed-off-by: Cheng Lian <l...@databricks.com> Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/1fba7595 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/1fba7595 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/1fba7595 Branch: refs/heads/branch-2.0 Commit: 1fba7595f2e9cac984692238c6698bd5cfb0a453 Parents: b99f715 Author: Cheng Lian <l...@databricks.com> Authored: Wed May 4 16:44:09 2016 +0800 Committer: Cheng Lian <l...@databricks.com> Committed: Wed May 4 16:44:39 2016 +0800 -- .../spark/sql/catalyst/catalog/interface.scala | 5 +- .../catalyst/catalog/ExternalCatalogSuite.scala | 1 + .../spark/sql/execution/SparkSqlParser.scala| 14 ++- .../command/createDataSourceTables.scala| 2 + .../spark/sql/execution/command/tables.scala| 104 +++ .../spark/sql/execution/command/DDLSuite.scala | 6 +- .../spark/sql/hive/client/HiveClientImpl.scala | 4 +- .../hive/exec
spark git commit: [SPARK-14127][SQL] Native "DESC [EXTENDED | FORMATTED] <table>" DDL command
Repository: spark Updated Branches: refs/heads/master 6c12e801e -> f152fae30 [SPARK-14127][SQL] Native "DESC [EXTENDED | FORMATTED] <table>" DDL command ## What changes were proposed in this pull request? This PR implements the native `DESC [EXTENDED | FORMATTED] <table>` DDL command. Sample output:
```
scala> spark.sql("desc extended src").show(100, truncate = false)
+----------------------------+---------------------------------+-------+
|col_name                    |data_type                        |comment|
+----------------------------+---------------------------------+-------+
|key                         |int                              |       |
|value                       |string                           |       |
|                            |                                 |       |
|# Detailed Table Information|CatalogTable(`default`.`src`, ...|       |
+----------------------------+---------------------------------+-------+

scala> spark.sql("desc formatted src").show(100, truncate = false)
+----------------------------+----------------------------------------------------------+-------+
|col_name                    |data_type                                                 |comment|
+----------------------------+----------------------------------------------------------+-------+
|key                         |int                                                       |       |
|value                       |string                                                    |       |
|                            |                                                          |       |
|# Detailed Table Information|                                                          |       |
|Database:                   |default                                                   |       |
|Owner:                      |lian                                                      |       |
|Create Time:                |Mon Jan 04 17:06:00 CST 2016                              |       |
|Last Access Time:           |Thu Jan 01 08:00:00 CST 1970                              |       |
|Location:                   |hdfs://localhost:9000/user/hive/warehouse_hive121/src     |       |
|Table Type:                 |MANAGED                                                   |       |
|Table Parameters:           |                                                          |       |
|  transient_lastDdlTime     |1451898360                                                |       |
|                            |                                                          |       |
|# Storage Information       |                                                          |       |
|SerDe Library:              |org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe        |       |
|InputFormat:                |org.apache.hadoop.mapred.TextInputFormat                  |       |
|OutputFormat:               |org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat|       |
|Num Buckets:                |-1                                                        |       |
|Bucket Columns:             |[]                                                        |       |
|Sort Columns:               |[]                                                        |       |
|Storage Desc Parameters:    |                                                          |       |
|  serialization.format      |1                                                         |       |
+----------------------------+----------------------------------------------------------+-------+
```
## How was this patch tested? A test case is added to `HiveDDLSuite` to check command output. Author: Cheng Lian <l...@databricks.com> Closes #12844 from liancheng/spark-14127-desc-table. 
Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/f152fae3 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/f152fae3 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/f152fae3 Branch: refs/heads/master Commit: f152fae306dc75565cb4648ee1211416d7c0bb23 Parents: 6c12e80 Author: Cheng Lian <l...@databricks.com> Authored: Wed May 4 16:44:09 2016 +0800 Committer: Cheng Lian <l...@databricks.com> Committed: Wed May 4 16:44:09 2016 +0800 -- .../spark/sql/catalyst/catalog/interface.scala | 5 +- .../catalyst/catalog/ExternalCatalogSuite.scala | 1 + .../spark/sql/execution/SparkSqlParser.scala| 14 ++- .../command/createDataSourceTables.scala| 2 + .../spark/sql/execution/command/tables.scala| 104 +++ .../spark/sql/execution/command/DDLSuite.scala | 6 +- .../spark/sql/hive/client/HiveClientImpl.scala | 4 +- .../hive/execution/CreateTableAsSelect.scala| 3 +- .../spark/sql/hive/HiveDDLCommandSuite.scala| 2 +- .../sql/hive/MetastoreDat
spark git commit: [SPARK-14237][SQL] De-duplicate partition value appending logic in various buildReader() implementations
Repository: spark Updated Branches: refs/heads/branch-2.0 2ff667606 -> 64ad9ba27 [SPARK-14237][SQL] De-duplicate partition value appending logic in various buildReader() implementations ## What changes were proposed in this pull request? Currently, various `FileFormat` data sources share approximately the same code for partition value appending. This PR tries to eliminate this duplication. A new method `buildReaderWithPartitionValues()` is added to `FileFormat` with a default implementation that appends partition values to the `InternalRow`s produced by the reader function returned by `buildReader()`. Special data sources override `buildReaderWithPartitionValues()` and simply delegate to `buildReader()`: Parquet, which implements partition value appending inside `buildReader()` because of the vectorized reader, and the Text data source, which doesn't support partitioning. This PR brings two benefits: 1. It de-duplicates the partition value appending logic. 2. The reader function returned by `buildReader()` now only needs to produce `InternalRow`s rather than `UnsafeRow`s when the data source doesn't override `buildReaderWithPartitionValues()`, because the safe-to-unsafe conversion is also performed while appending partition values. This makes third-party data sources (e.g. spark-avro) easier to implement since they no longer need to access private APIs involving `UnsafeRow`. ## How was this patch tested? Existing tests should cover this change. Author: Cheng Lian <l...@databricks.com> Closes #12866 from liancheng/spark-14237-simplify-partition-values-appending. 
(cherry picked from commit bc3760d405cc8c3ffcd957b188afa8b7e3b1f824) Signed-off-by: Cheng Lian <l...@databricks.com> Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/64ad9ba2 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/64ad9ba2 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/64ad9ba2 Branch: refs/heads/branch-2.0 Commit: 64ad9ba27aafa333a6897861fe2aeb4bb83506fc Parents: 2ff6676 Author: Cheng Lian <l...@databricks.com> Authored: Wed May 4 14:16:57 2016 +0800 Committer: Cheng Lian <l...@databricks.com> Committed: Wed May 4 14:17:26 2016 +0800 -- .../spark/ml/source/libsvm/LibSVMRelation.scala | 17 + .../datasources/FileSourceStrategy.scala| 2 +- .../datasources/csv/DefaultSource.scala | 17 ++--- .../datasources/fileSourceInterfaces.scala | 40 .../datasources/json/JSONRelation.scala | 10 + .../datasources/parquet/ParquetRelation.scala | 14 +++ .../datasources/text/DefaultSource.scala| 13 +++ .../execution/datasources/csv/CSVSuite.scala| 3 -- .../apache/spark/sql/hive/orc/OrcRelation.scala | 11 +- 9 files changed, 74 insertions(+), 53 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/64ad9ba2/mllib/src/main/scala/org/apache/spark/ml/source/libsvm/LibSVMRelation.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/source/libsvm/LibSVMRelation.scala b/mllib/src/main/scala/org/apache/spark/ml/source/libsvm/LibSVMRelation.scala index ba2e1e2..5f78fab 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/source/libsvm/LibSVMRelation.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/source/libsvm/LibSVMRelation.scala @@ -204,25 +204,10 @@ class DefaultSource extends FileFormat with DataSourceRegister { val converter = RowEncoder(dataSchema) - val unsafeRowIterator = points.map { pt => + points.map { pt => val features = if (sparse) pt.features.toSparse else pt.features.toDense converter.toRow(Row(pt.label, features)) } - - def toAttribute(f: StructField): 
AttributeReference = -AttributeReference(f.name, f.dataType, f.nullable, f.metadata)() - - // Appends partition values - val fullOutput = (dataSchema ++ partitionSchema).map(toAttribute) - val requiredOutput = fullOutput.filter { a => -requiredSchema.fieldNames.contains(a.name) || partitionSchema.fieldNames.contains(a.name) - } - val joinedRow = new JoinedRow() - val appendPartitionColumns = GenerateUnsafeProjection.generate(requiredOutput, fullOutput) - - unsafeRowIterator.map { dataRow => -appendPartitionColumns(joinedRow(dataRow, file.partitionValues)) - } } } } http://git-wip-us.apache.org/repos/asf/spark/blob/64ad9ba2/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStr
spark git commit: [SPARK-14237][SQL] De-duplicate partition value appending logic in various buildReader() implementations
Repository: spark Updated Branches: refs/heads/master 695f0e919 -> bc3760d40 [SPARK-14237][SQL] De-duplicate partition value appending logic in various buildReader() implementations ## What changes were proposed in this pull request? Currently, various `FileFormat` data sources share approximately the same code for partition value appending. This PR tries to eliminate this duplication. A new method `buildReaderWithPartitionValues()` is added to `FileFormat` with a default implementation that appends partition values to the `InternalRow`s produced by the reader function returned by `buildReader()`. Special data sources override `buildReaderWithPartitionValues()` and simply delegate to `buildReader()`: Parquet, which implements partition value appending inside `buildReader()` because of the vectorized reader, and the Text data source, which doesn't support partitioning. This PR brings two benefits: 1. It de-duplicates the partition value appending logic. 2. The reader function returned by `buildReader()` now only needs to produce `InternalRow`s rather than `UnsafeRow`s when the data source doesn't override `buildReaderWithPartitionValues()`, because the safe-to-unsafe conversion is also performed while appending partition values. This makes third-party data sources (e.g. spark-avro) easier to implement since they no longer need to access private APIs involving `UnsafeRow`. ## How was this patch tested? Existing tests should cover this change. Author: Cheng Lian <l...@databricks.com> Closes #12866 from liancheng/spark-14237-simplify-partition-values-appending. 
Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/bc3760d4 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/bc3760d4 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/bc3760d4 Branch: refs/heads/master Commit: bc3760d405cc8c3ffcd957b188afa8b7e3b1f824 Parents: 695f0e9 Author: Cheng Lian <l...@databricks.com> Authored: Wed May 4 14:16:57 2016 +0800 Committer: Cheng Lian <l...@databricks.com> Committed: Wed May 4 14:16:57 2016 +0800 -- .../spark/ml/source/libsvm/LibSVMRelation.scala | 17 + .../datasources/FileSourceStrategy.scala| 2 +- .../datasources/csv/DefaultSource.scala | 17 ++--- .../datasources/fileSourceInterfaces.scala | 40 .../datasources/json/JSONRelation.scala | 10 + .../datasources/parquet/ParquetRelation.scala | 14 +++ .../datasources/text/DefaultSource.scala| 13 +++ .../execution/datasources/csv/CSVSuite.scala| 3 -- .../apache/spark/sql/hive/orc/OrcRelation.scala | 11 +- 9 files changed, 74 insertions(+), 53 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/bc3760d4/mllib/src/main/scala/org/apache/spark/ml/source/libsvm/LibSVMRelation.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/source/libsvm/LibSVMRelation.scala b/mllib/src/main/scala/org/apache/spark/ml/source/libsvm/LibSVMRelation.scala index ba2e1e2..5f78fab 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/source/libsvm/LibSVMRelation.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/source/libsvm/LibSVMRelation.scala @@ -204,25 +204,10 @@ class DefaultSource extends FileFormat with DataSourceRegister { val converter = RowEncoder(dataSchema) - val unsafeRowIterator = points.map { pt => + points.map { pt => val features = if (sparse) pt.features.toSparse else pt.features.toDense converter.toRow(Row(pt.label, features)) } - - def toAttribute(f: StructField): AttributeReference = -AttributeReference(f.name, f.dataType, f.nullable, f.metadata)() - - // Appends partition values - 
val fullOutput = (dataSchema ++ partitionSchema).map(toAttribute) - val requiredOutput = fullOutput.filter { a => -requiredSchema.fieldNames.contains(a.name) || partitionSchema.fieldNames.contains(a.name) - } - val joinedRow = new JoinedRow() - val appendPartitionColumns = GenerateUnsafeProjection.generate(requiredOutput, fullOutput) - - unsafeRowIterator.map { dataRow => -appendPartitionColumns(joinedRow(dataRow, file.partitionValues)) - } } } } http://git-wip-us.apache.org/repos/asf/spark/blob/bc3760d4/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala index 615906a..8a93c6f 1006
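The default `buildReaderWithPartitionValues()` pattern described in this commit — wrap the data-only reader and append the file's partition values to every row it emits — can be sketched in plain Scala. This is a simplified model, not Spark code: here a "row" is just `Seq[Any]` and a "reader" is a function from a file name to rows, standing in for the real `PartitionedFile => Iterator[InternalRow]`.

```scala
// Toy model of the wrapping the new default method performs: the wrapped reader
// needs to know nothing about partitioning; partition values are appended outside it.
type Row = Seq[Any]

def withPartitionValues(
    readData: String => Iterator[Row],  // reader that produces data columns only
    partitionValues: Row                // constant per file, taken from the file's path
): String => Iterator[Row] =
  file => readData(file).map(dataRow => dataRow ++ partitionValues)
```

In the real implementation this wrapping step is also where the safe-to-unsafe projection happens, which is why plain `InternalRow`s from `buildReader()` suffice.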
spark git commit: [SPARK-14875][SQL] Makes OutputWriterFactory.newInstance public
Repository: spark Updated Branches: refs/heads/master cef77d1f6 -> e66afd5c6 [SPARK-14875][SQL] Makes OutputWriterFactory.newInstance public ## What changes were proposed in this pull request? This method was accidentally made `private[sql]` in Spark 2.0. This PR makes it public again, since 3rd party data sources like spark-avro depend on it. ## How was this patch tested? N/A Author: Cheng Lian <l...@databricks.com> Closes #12652 from liancheng/spark-14875. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/e66afd5c Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/e66afd5c Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/e66afd5c Branch: refs/heads/master Commit: e66afd5c66a6d8f33c90ce08c35c3823735bde83 Parents: cef77d1 Author: Cheng Lian <l...@databricks.com> Authored: Mon Apr 25 20:42:49 2016 +0800 Committer: Cheng Lian <l...@databricks.com> Committed: Mon Apr 25 20:42:49 2016 +0800 -- .../spark/sql/execution/datasources/fileSourceInterfaces.scala | 2 +- .../scala/org/apache/spark/sql/sources/SimpleTextRelation.scala| 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/e66afd5c/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/fileSourceInterfaces.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/fileSourceInterfaces.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/fileSourceInterfaces.scala index ed24bdd..4063c6e 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/fileSourceInterfaces.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/fileSourceInterfaces.scala @@ -59,7 +59,7 @@ abstract class OutputWriterFactory extends Serializable { * @param context The Hadoop MapReduce task context. 
* @since 1.4.0 */ - private[sql] def newInstance( + def newInstance( path: String, bucketId: Option[Int], // TODO: This doesn't belong here... dataSchema: StructType, http://git-wip-us.apache.org/repos/asf/spark/blob/e66afd5c/sql/hive/src/test/scala/org/apache/spark/sql/sources/SimpleTextRelation.scala -- diff --git a/sql/hive/src/test/scala/org/apache/spark/sql/sources/SimpleTextRelation.scala b/sql/hive/src/test/scala/org/apache/spark/sql/sources/SimpleTextRelation.scala index 16996ae..dad4f87 100644 --- a/sql/hive/src/test/scala/org/apache/spark/sql/sources/SimpleTextRelation.scala +++ b/sql/hive/src/test/scala/org/apache/spark/sql/sources/SimpleTextRelation.scala @@ -48,7 +48,7 @@ class SimpleTextSource extends FileFormat with DataSourceRegister { job: Job, options: Map[String, String], dataSchema: StructType): OutputWriterFactory = new OutputWriterFactory { -override private[sql] def newInstance( +override def newInstance( path: String, bucketId: Option[Int], dataSchema: StructType,
spark git commit: [SPARK-14843][ML] Fix encoding error in LibSVMRelation
Repository: spark Updated Branches: refs/heads/master c089c6f4e -> 8098f1585 [SPARK-14843][ML] Fix encoding error in LibSVMRelation ## What changes were proposed in this pull request? We use `RowEncoder` in the libsvm data source to serialize the label and features read from libsvm files. However, the schema passed to this encoder is not correct. As a result, we can't correctly select the `features` column from the DataFrame. We should use the full data schema instead of `requiredSchema` to serialize the data read in, and then do a projection to select the required columns. ## How was this patch tested? `LibSVMRelationSuite`. Author: Liang-Chi Hsieh <sim...@tw.ibm.com> Closes #12611 from viirya/fix-libsvm. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/8098f158 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/8098f158 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/8098f158 Branch: refs/heads/master Commit: 8098f158576b07343f74e2061d217b106c71b62d Parents: c089c6f Author: Liang-Chi Hsieh <sim...@tw.ibm.com> Authored: Sat Apr 23 01:11:36 2016 +0800 Committer: Cheng Lian <l...@databricks.com> Committed: Sat Apr 23 01:11:36 2016 +0800 -- .../org/apache/spark/ml/source/libsvm/LibSVMRelation.scala | 9 ++--- .../apache/spark/ml/source/libsvm/LibSVMRelationSuite.scala | 9 +++-- 2 files changed, 13 insertions(+), 5 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/8098f158/mllib/src/main/scala/org/apache/spark/ml/source/libsvm/LibSVMRelation.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/source/libsvm/LibSVMRelation.scala b/mllib/src/main/scala/org/apache/spark/ml/source/libsvm/LibSVMRelation.scala index e8b0dd6..dc2a6f5 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/source/libsvm/LibSVMRelation.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/source/libsvm/LibSVMRelation.scala @@ -202,7 +202,7 @@ class DefaultSource extends FileFormat with DataSourceRegister 
{ LabeledPoint(label, Vectors.sparse(numFeatures, indices, values)) } - val converter = RowEncoder(requiredSchema) + val converter = RowEncoder(dataSchema) val unsafeRowIterator = points.map { pt => val features = if (sparse) pt.features.toSparse else pt.features.toDense @@ -213,9 +213,12 @@ class DefaultSource extends FileFormat with DataSourceRegister { AttributeReference(f.name, f.dataType, f.nullable, f.metadata)() // Appends partition values - val fullOutput = (requiredSchema ++ partitionSchema).map(toAttribute) + val fullOutput = (dataSchema ++ partitionSchema).map(toAttribute) + val requiredOutput = fullOutput.filter { a => +requiredSchema.fieldNames.contains(a.name) || partitionSchema.fieldNames.contains(a.name) + } val joinedRow = new JoinedRow() - val appendPartitionColumns = GenerateUnsafeProjection.generate(fullOutput, fullOutput) + val appendPartitionColumns = GenerateUnsafeProjection.generate(requiredOutput, fullOutput) unsafeRowIterator.map { dataRow => appendPartitionColumns(joinedRow(dataRow, file.partitionValues)) http://git-wip-us.apache.org/repos/asf/spark/blob/8098f158/mllib/src/test/scala/org/apache/spark/ml/source/libsvm/LibSVMRelationSuite.scala -- diff --git a/mllib/src/test/scala/org/apache/spark/ml/source/libsvm/LibSVMRelationSuite.scala b/mllib/src/test/scala/org/apache/spark/ml/source/libsvm/LibSVMRelationSuite.scala index 0bd1497..e52fbd7 100644 --- a/mllib/src/test/scala/org/apache/spark/ml/source/libsvm/LibSVMRelationSuite.scala +++ b/mllib/src/test/scala/org/apache/spark/ml/source/libsvm/LibSVMRelationSuite.scala @@ -23,9 +23,9 @@ import java.nio.charset.StandardCharsets import com.google.common.io.Files import org.apache.spark.{SparkException, SparkFunSuite} -import org.apache.spark.mllib.linalg.{DenseVector, SparseVector, Vectors} +import org.apache.spark.mllib.linalg.{DenseVector, SparseVector, Vector, Vectors} import org.apache.spark.mllib.util.MLlibTestSparkContext -import org.apache.spark.sql.SaveMode +import 
org.apache.spark.sql.{Row, SaveMode} import org.apache.spark.util.Utils @@ -104,4 +104,9 @@ class LibSVMRelationSuite extends SparkFunSuite with MLlibTestSparkContext { df.write.format("libsvm").save(path + "_2") } } + + test("select features from libsvm relation") { +val df = sqlContext.read.format("libsvm").load(path) +df.select("features").rdd.map { case Row(d: Vector) => d }.first + } } ---
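The fix above follows a general pattern: serialize each row against the full data schema first, then project down to the required columns. A minimal plain-Scala sketch of that pattern (the schema and row representations here are illustrative stand-ins, not Spark's internal types):

```scala
// Toy illustration (plain Scala, not Spark internals) of the pattern this fix
// uses: rows carry one value per field of the FULL schema, and a projection
// built from the full schema keeps only the required columns.
object ProjectionSketch {
  val dataSchema = Seq("label", "features")   // stand-in for dataSchema
  val requiredSchema = Seq("features")        // stand-in for requiredSchema

  // A row serialized with the full schema: one value per dataSchema field.
  val fullRow: Map[String, Any] = Map("label" -> 1.0, "features" -> Seq(0.5, 0.25))

  // Keep only required fields, mirroring the role of
  // GenerateUnsafeProjection.generate(requiredOutput, fullOutput) in the patch.
  def project(row: Map[String, Any]): Seq[Any] =
    dataSchema.filter(requiredSchema.contains).map(row)

  def main(args: Array[String]): Unit =
    assert(project(fullRow) == Seq(Seq(0.5, 0.25)))
}
```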
spark git commit: [SPARK-14488][SPARK-14493][SQL] "CREATE TEMPORARY TABLE ... USING ... AS SELECT" shouldn't create persisted table
Repository: spark Updated Branches: refs/heads/master b0f5497e9 -> 124cbfb68 [SPARK-14488][SPARK-14493][SQL] "CREATE TEMPORARY TABLE ... USING ... AS SELECT" shouldn't create persisted table ## What changes were proposed in this pull request? When planning the logical plan node `CreateTableUsingAsSelect`, we neglected its `temporary` field and always generated a `CreateMetastoreDataSourceAsSelect`. This PR fixes the issue by generating a `CreateTempTableUsingAsSelect` when `temporary` is true. This PR also fixes SPARK-14493, since the root cause of SPARK-14493 is that `CreateMetastoreDataSourceAsSelect` uses the default Hive warehouse location when the `PATH` data source option is absent. ## How was this patch tested? Added a test case to create a temporary table using the target syntax and check whether it's indeed a temporary table. Author: Cheng Lian <l...@databricks.com> Closes #12303 from liancheng/spark-14488-fix-ctas-using. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/124cbfb6 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/124cbfb6 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/124cbfb6 Branch: refs/heads/master Commit: 124cbfb683a5e959e1b5181d4d0cc56956b50385 Parents: b0f5497 Author: Cheng Lian <l...@databricks.com> Authored: Tue Apr 12 22:28:57 2016 +0800 Committer: Cheng Lian <l...@databricks.com> Committed: Tue Apr 12 22:28:57 2016 +0800 -- .../apache/spark/sql/hive/HiveStrategies.scala | 10 ++-- .../sql/hive/execution/SQLQuerySuite.scala | 49 ++-- 2 files changed, 53 insertions(+), 6 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/124cbfb6/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveStrategies.scala -- diff --git a/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveStrategies.scala b/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveStrategies.scala index f44937e..010361a 100644 --- 
a/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveStrategies.scala +++ b/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveStrategies.scala @@ -23,9 +23,8 @@ import org.apache.spark.sql.catalyst.planning._ import org.apache.spark.sql.catalyst.plans._ import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan import org.apache.spark.sql.execution._ -import org.apache.spark.sql.execution.command.{DescribeCommand => RunnableDescribeCommand, _} -import org.apache.spark.sql.execution.datasources.{CreateTableUsing, CreateTableUsingAsSelect, - DescribeCommand} +import org.apache.spark.sql.execution.command.{DescribeCommand => _, _} +import org.apache.spark.sql.execution.datasources.{CreateTableUsing, CreateTableUsingAsSelect, CreateTempTableUsingAsSelect, DescribeCommand} import org.apache.spark.sql.hive.execution._ private[hive] trait HiveStrategies { @@ -90,6 +89,11 @@ private[hive] trait HiveStrategies { tableIdent, userSpecifiedSchema, provider, opts, allowExisting, managedIfNoPath) ExecutedCommand(cmd) :: Nil + case c: CreateTableUsingAsSelect if c.temporary => +val cmd = CreateTempTableUsingAsSelect( + c.tableIdent, c.provider, c.partitionColumns, c.mode, c.options, c.child) +ExecutedCommand(cmd) :: Nil + case c: CreateTableUsingAsSelect => val cmd = CreateMetastoreDataSourceAsSelect(c.tableIdent, c.provider, c.partitionColumns, c.bucketSpec, c.mode, c.options, c.child) http://git-wip-us.apache.org/repos/asf/spark/blob/124cbfb6/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala -- diff --git a/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala b/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala index b4886eb..7eaf19d 100644 --- a/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala +++ b/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala @@ -19,12 +19,9 @@ package org.apache.spark.sql.hive.execution import 
java.sql.{Date, Timestamp} -import scala.collection.JavaConverters._ - import org.apache.spark.sql._ import org.apache.spark.sql.catalyst.TableIdentifier import org.apache.spark.sql.catalyst.analysis.{EliminateSubqueryAliases, FunctionRegistry} -import org.apache.spark.sql.catalyst.parser.ParseException import org.apache.spark.sql.execution.datasources.LogicalRelation import org.apache.spark.sql.functions._ import org.apache.spark.sql.hive.{HiveContext, MetastoreRelation} @@ -1852,4 +1849,50 @@ class SQLQuerySuite extends QueryTest with SQLTestUtils with TestHiv
spark git commit: [SPARK-14295][MLLIB][HOTFIX] Fixes Scala 2.10 compilation failure
Repository: spark Updated Branches: refs/heads/master 22249afb4 -> 3715ecdf4 [SPARK-14295][MLLIB][HOTFIX] Fixes Scala 2.10 compilation failure ## What changes were proposed in this pull request? Fixes a compilation failure introduced in PR #12088 under Scala 2.10. ## How was this patch tested? Compilation. Author: Cheng Lian <l...@databricks.com> Closes #12107 from liancheng/spark-14295-hotfix. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/3715ecdf Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/3715ecdf Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/3715ecdf Branch: refs/heads/master Commit: 3715ecdf417b47423ff07145a5623d8d817c45ef Parents: 22249af Author: Cheng Lian <l...@databricks.com> Authored: Fri Apr 1 17:02:48 2016 +0800 Committer: Cheng Lian <l...@databricks.com> Committed: Fri Apr 1 17:02:48 2016 +0800 -- mllib/src/main/scala/org/apache/spark/mllib/util/MLUtils.scala | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/3715ecdf/mllib/src/main/scala/org/apache/spark/mllib/util/MLUtils.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/mllib/util/MLUtils.scala b/mllib/src/main/scala/org/apache/spark/mllib/util/MLUtils.scala index 4b9d779..774170f 100644 --- a/mllib/src/main/scala/org/apache/spark/mllib/util/MLUtils.scala +++ b/mllib/src/main/scala/org/apache/spark/mllib/util/MLUtils.scala @@ -120,7 +120,7 @@ object MLUtils { i += 1 } -(label, indices, values) +(label, indices.toArray, values.toArray) } /** - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
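The hotfix materializes the mutable array builders before placing them in the result tuple. A Spark-free sketch of that conversion (the builder-based parsing below is an assumption about the surrounding `MLUtils` code, not a copy of it):

```scala
import scala.collection.mutable.ArrayBuilder

// Sketch of the materialization the hotfix performs: a mutable ArrayBuilder
// is not an Array, so it must be converted (via `.result()` here, `.toArray`
// in the patch) before being exposed as part of the parsed record.
object ToArraySketch {
  def parse(tokens: Seq[(Int, Double)]): (Array[Int], Array[Double]) = {
    val indices = ArrayBuilder.make[Int]
    val values = ArrayBuilder.make[Double]
    tokens.foreach { case (i, v) => indices += i; values += v }
    (indices.result(), values.result())
  }

  def main(args: Array[String]): Unit = {
    val (idx, vals) = parse(Seq(0 -> 1.0, 3 -> 2.5))
    assert(idx.toSeq == Seq(0, 3) && vals.toSeq == Seq(1.0, 2.5))
  }
}
```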
spark git commit: [SPARK-14114][SQL] implement buildReader for text data source
Repository: spark Updated Branches: refs/heads/master 7320f9bd1 -> 816f359cf [SPARK-14114][SQL] implement buildReader for text data source ## What changes were proposed in this pull request? This PR implements buildReader for text data source and enable it in the new data source code path. ## How was this patch tested? Existing tests. Author: Wenchen Fan <wenc...@databricks.com> Closes #11934 from cloud-fan/text. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/816f359c Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/816f359c Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/816f359c Branch: refs/heads/master Commit: 816f359cf043ef719a0bc7df0506a3a830fff70d Parents: 7320f9b Author: Wenchen Fan <wenc...@databricks.com> Authored: Wed Mar 30 17:32:53 2016 +0800 Committer: Cheng Lian <l...@databricks.com> Committed: Wed Mar 30 17:32:53 2016 +0800 -- .../datasources/FileSourceStrategy.scala| 3 ++- .../datasources/text/DefaultSource.scala| 28 +++- 2 files changed, 29 insertions(+), 2 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/816f359c/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala index 20fda95..4448796 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala @@ -59,7 +59,8 @@ private[sql] object FileSourceStrategy extends Strategy with Logging { if (files.fileFormat.toString == "TestFileFormat" || files.fileFormat.isInstanceOf[parquet.DefaultSource] || files.fileFormat.toString == "ORC" || - files.fileFormat.isInstanceOf[json.DefaultSource]) && + 
files.fileFormat.isInstanceOf[json.DefaultSource] || + files.fileFormat.isInstanceOf[text.DefaultSource]) && files.sqlContext.conf.useFileScan => // Filters on this relation fall into four categories based on where we can use them to avoid // reading unneeded data: http://git-wip-us.apache.org/repos/asf/spark/blob/816f359c/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/text/DefaultSource.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/text/DefaultSource.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/text/DefaultSource.scala index 5cfc9e9..d6ab5fc 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/text/DefaultSource.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/text/DefaultSource.scala @@ -17,6 +17,7 @@ package org.apache.spark.sql.execution.datasources.text +import org.apache.hadoop.conf.Configuration import org.apache.hadoop.fs.{FileStatus, Path} import org.apache.hadoop.io.{LongWritable, NullWritable, Text} import org.apache.hadoop.mapred.{JobConf, TextInputFormat} @@ -30,7 +31,7 @@ import org.apache.spark.sql.{AnalysisException, Row, SQLContext} import org.apache.spark.sql.catalyst.InternalRow import org.apache.spark.sql.catalyst.expressions.UnsafeRow import org.apache.spark.sql.catalyst.expressions.codegen.{BufferHolder, UnsafeRowWriter} -import org.apache.spark.sql.execution.datasources.CompressionCodecs +import org.apache.spark.sql.execution.datasources.{CompressionCodecs, HadoopFileLinesReader, PartitionedFile} import org.apache.spark.sql.sources._ import org.apache.spark.sql.types.{StringType, StructType} import org.apache.spark.util.SerializableConfiguration @@ -125,6 +126,31 @@ class DefaultSource extends FileFormat with DataSourceRegister { } } } + + override def buildReader( + sqlContext: SQLContext, + partitionSchema: StructType, + dataSchema: StructType, + filters: Seq[Filter], + options: Map[String, 
String]): PartitionedFile => Iterator[InternalRow] = { +val conf = new Configuration(sqlContext.sparkContext.hadoopConfiguration) +val broadcastedConf = + sqlContext.sparkContext.broadcast(new SerializableConfiguration(conf)) + +file => { + val unsafeRow = new UnsafeRow(1) + val bufferHolder = new BufferHolder(unsafeRow) + val unsafeRowWriter = new UnsafeRowWriter(bufferHolder, 1) + + new HadoopFileLinesReader(file, broadcastedConf.value
spark git commit: [SPARK-14208][SQL] Renames spark.sql.parquet.fileScan
Repository: spark Updated Branches: refs/heads/master 425bcf6d6 -> a632bb56f [SPARK-14208][SQL] Renames spark.sql.parquet.fileScan ## What changes were proposed in this pull request? Renames SQL option `spark.sql.parquet.fileScan` since now all `HadoopFsRelation` based data sources are being migrated to `FileScanRDD` code path. ## How was this patch tested? None. Author: Cheng Lian <l...@databricks.com> Closes #12003 from liancheng/spark-14208-option-renaming. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/a632bb56 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/a632bb56 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/a632bb56 Branch: refs/heads/master Commit: a632bb56f8867df39a78d7f01fb870f548b09815 Parents: 425bcf6 Author: Cheng Lian <l...@databricks.com> Authored: Tue Mar 29 20:56:01 2016 +0800 Committer: Cheng Lian <l...@databricks.com> Committed: Tue Mar 29 20:56:01 2016 +0800 -- .../spark/sql/execution/datasources/FileSourceStrategy.scala | 2 +- .../main/scala/org/apache/spark/sql/internal/SQLConf.scala | 8 2 files changed, 5 insertions(+), 5 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/a632bb56/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala index 76a724e..20fda95 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala @@ -60,7 +60,7 @@ private[sql] object FileSourceStrategy extends Strategy with Logging { files.fileFormat.isInstanceOf[parquet.DefaultSource] || files.fileFormat.toString == "ORC" || files.fileFormat.isInstanceOf[json.DefaultSource]) && - 
files.sqlContext.conf.parquetFileScan => + files.sqlContext.conf.useFileScan => // Filters on this relation fall into four categories based on where we can use them to avoid // reading unneeded data: // - partition keys only - used to prune directories to read http://git-wip-us.apache.org/repos/asf/spark/blob/a632bb56/sql/core/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala b/sql/core/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala index 77af0e0..ca6ba4c 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala @@ -288,9 +288,9 @@ object SQLConf { defaultValue = Some(true), doc = "Whether the query analyzer should be case sensitive or not.") - val PARQUET_FILE_SCAN = booleanConf("spark.sql.parquet.fileScan", + val USE_FILE_SCAN = booleanConf("spark.sql.sources.fileScan", defaultValue = Some(true), -doc = "Use the new FileScanRDD path for reading parquet data.", +doc = "Use the new FileScanRDD path for reading HDSF based data sources.", isPublic = false) val PARQUET_SCHEMA_MERGING_ENABLED = booleanConf("spark.sql.parquet.mergeSchema", @@ -583,9 +583,9 @@ class SQLConf extends Serializable with CatalystConf with ParserConf with Loggin def useCompression: Boolean = getConf(COMPRESS_CACHED) - def parquetCompressionCodec: String = getConf(PARQUET_COMPRESSION) + def useFileScan: Boolean = getConf(USE_FILE_SCAN) - def parquetFileScan: Boolean = getConf(PARQUET_FILE_SCAN) + def parquetCompressionCodec: String = getConf(PARQUET_COMPRESSION) def parquetCacheMetadata: Boolean = getConf(PARQUET_CACHE_METADATA)
spark git commit: [SPARK-13456][SQL][FOLLOW-UP] lazily generate the outer pointer for case class defined in REPL
Repository: spark Updated Branches: refs/heads/master 70a6f0bb5 -> e9b6e7d85 [SPARK-13456][SQL][FOLLOW-UP] lazily generate the outer pointer for case class defined in REPL ## What changes were proposed in this pull request? In https://github.com/apache/spark/pull/11410, we missed a corner case: defining the inner class and using it in a `Dataset` at the same time via paste mode. In this case, the inner class and the `Dataset` are inside the same line object; when we build the `Dataset`, we try to get the outer pointer from the line object, and this fails because the line object is not initialized yet. https://issues.apache.org/jira/browse/SPARK-13456?focusedCommentId=15209174=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15209174 gives an example of this corner case. This PR makes the process of getting the outer pointer from the line object lazy, so that we can successfully build the `Dataset` and finish initializing the line object. ## How was this patch tested? New test in the REPL suite. Author: Wenchen Fan <wenc...@databricks.com> Closes #11931 from cloud-fan/repl. 
Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/e9b6e7d8 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/e9b6e7d8 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/e9b6e7d8 Branch: refs/heads/master Commit: e9b6e7d8577cd721a433130f29e8b112d98768b9 Parents: 70a6f0b Author: Wenchen Fan <wenc...@databricks.com> Authored: Fri Mar 25 20:19:04 2016 +0800 Committer: Cheng Lian <l...@databricks.com> Committed: Fri Mar 25 20:19:04 2016 +0800 -- .../scala/org/apache/spark/repl/ReplSuite.scala | 15 .../spark/sql/catalyst/analysis/Analyzer.scala | 2 +- .../sql/catalyst/encoders/OuterScopes.scala | 39 +--- .../sql/catalyst/expressions/objects.scala | 12 +++--- 4 files changed, 48 insertions(+), 20 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/e9b6e7d8/repl/scala-2.11/src/test/scala/org/apache/spark/repl/ReplSuite.scala -- diff --git a/repl/scala-2.11/src/test/scala/org/apache/spark/repl/ReplSuite.scala b/repl/scala-2.11/src/test/scala/org/apache/spark/repl/ReplSuite.scala index f148a6d..dbfacba 100644 --- a/repl/scala-2.11/src/test/scala/org/apache/spark/repl/ReplSuite.scala +++ b/repl/scala-2.11/src/test/scala/org/apache/spark/repl/ReplSuite.scala @@ -59,6 +59,10 @@ class ReplSuite extends SparkFunSuite { return out.toString } + // Simulate the paste mode in Scala REPL. 
+ def runInterpreterInPasteMode(master: String, input: String): String = +runInterpreter(master, ":paste\n" + input + 4.toChar) // 4 is the ascii code of CTRL + D + def assertContains(message: String, output: String) { val isContain = output.contains(message) assert(isContain, @@ -381,4 +385,15 @@ class ReplSuite extends SparkFunSuite { assertDoesNotContain("error:", output) assertDoesNotContain("Exception", output) } + + test("define case class and create Dataset together with paste mode") { +val output = runInterpreterInPasteMode("local-cluster[1,1,1024]", + """ +|import sqlContext.implicits._ +|case class TestClass(value: Int) +|Seq(TestClass(1)).toDS() + """.stripMargin) +assertDoesNotContain("error:", output) +assertDoesNotContain("Exception", output) + } } http://git-wip-us.apache.org/repos/asf/spark/blob/e9b6e7d8/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala index b344e04..89b18af 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala @@ -605,7 +605,7 @@ class Analyzer( "access to the scope that this class was defined in.\n" + "Try moving this class out of its parent class.") } - n.copy(outerPointer = Some(Literal.fromObject(outer))) + n.copy(outerPointer = Some(outer)) } } http://git-wip-us.apache.org/repos/asf/spark/blob/e9b6e7d8/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/OuterScopes.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/OuterScopes.scala b/sq
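The essence of the fix is deferring a self-referential lookup until initialization has completed. A minimal plain-Scala sketch of the initialization-order problem (no Spark; `Line` is a hypothetical stand-in for the REPL's line object):

```scala
// Minimal sketch of why deferring the outer-pointer lookup fixes the
// paste-mode corner case: the line object is referenced from within its own
// body, which only works if the lookup is lazy.
object LazyOuterSketch {
  class Line {
    // An eager `val outer = LazyOuterSketch.line` would capture null here,
    // because `line` is not yet assigned while this constructor is running.
    lazy val outer: Line = LazyOuterSketch.line // deferred until first access
  }
  val line = new Line

  def main(args: Array[String]): Unit = {
    // By the time `outer` is forced, the line object is fully initialized.
    assert(line.outer eq line)
  }
}
```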
spark git commit: [SPARK-13549][SQL] Refactor the Optimizer Rule CollapseProject
Repository: spark Updated Branches: refs/heads/master cde086cb2 -> 6ce008ba4 [SPARK-13549][SQL] Refactor the Optimizer Rule CollapseProject What changes were proposed in this pull request? The PR https://github.com/apache/spark/pull/10541 changed the rule `CollapseProject` by enabling collapsing `Project` into `Aggregate`. It leaves a to-do item to remove the duplicate code. This PR is to finish this to-do item. Also added a test case for covering this change. How was this patch tested? Added a new test case. liancheng Could you check if the code refactoring is fine? Thanks! Author: gatorsmile <gatorsm...@gmail.com> Closes #11427 from gatorsmile/collapseProjectRefactor. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/6ce008ba Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/6ce008ba Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/6ce008ba Branch: refs/heads/master Commit: 6ce008ba46aa1fc8a5c222ce0f25a6d81f53588e Parents: cde086c Author: gatorsmile <gatorsm...@gmail.com> Authored: Thu Mar 24 00:51:31 2016 +0800 Committer: Cheng Lian <l...@databricks.com> Committed: Thu Mar 24 00:51:31 2016 +0800 -- .../sql/catalyst/optimizer/Optimizer.scala | 101 +-- .../optimizer/CollapseProjectSuite.scala| 26 - 2 files changed, 70 insertions(+), 57 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/6ce008ba/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala index 0840d46..4cfdcf9 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala @@ -417,68 +417,57 @@ object ColumnPruning extends Rule[LogicalPlan] { object CollapseProject extends 
Rule[LogicalPlan] { def apply(plan: LogicalPlan): LogicalPlan = plan transformUp { -case p @ Project(projectList1, Project(projectList2, child)) => - // Create a map of Aliases to their values from the child projection. - // e.g., 'SELECT ... FROM (SELECT a + b AS c, d ...)' produces Map(c -> Alias(a + b, c)). - val aliasMap = AttributeMap(projectList2.collect { -case a: Alias => (a.toAttribute, a) - }) - - // We only collapse these two Projects if their overlapped expressions are all - // deterministic. - val hasNondeterministic = projectList1.exists(_.collect { -case a: Attribute if aliasMap.contains(a) => aliasMap(a).child - }.exists(!_.deterministic)) - - if (hasNondeterministic) { +case p1 @ Project(_, p2: Project) => + if (haveCommonNonDeterministicOutput(p1.projectList, p2.projectList)) { +p1 + } else { +p2.copy(projectList = buildCleanedProjectList(p1.projectList, p2.projectList)) + } +case p @ Project(_, agg: Aggregate) => + if (haveCommonNonDeterministicOutput(p.projectList, agg.aggregateExpressions)) { p } else { -// Substitute any attributes that are produced by the child projection, so that we safely -// eliminate it. -// e.g., 'SELECT c + 1 FROM (SELECT a + b AS C ...' produces 'SELECT a + b + 1 ...' -// TODO: Fix TransformBase to avoid the cast below. -val substitutedProjection = projectList1.map(_.transform { - case a: Attribute => aliasMap.getOrElse(a, a) -}).asInstanceOf[Seq[NamedExpression]] -// collapse 2 projects may introduce unnecessary Aliases, trim them here. -val cleanedProjection = substitutedProjection.map(p => - CleanupAliases.trimNonTopLevelAliases(p).asInstanceOf[NamedExpression] -) -Project(cleanedProjection, child) +agg.copy(aggregateExpressions = buildCleanedProjectList( + p.projectList, agg.aggregateExpressions)) } + } -// TODO Eliminate duplicate code -// This clause is identical to the one above except that the inner operator is an `Aggregate` -// rather than a `Project`. 
-case p @ Project(projectList1, agg @ Aggregate(_, projectList2, child)) => - // Create a map of Aliases to their values from the child projection. - // e.g., 'SELECT ... FROM (SELECT a + b AS c, d ...)' produces Map(c -> Alias(a + b, c)). - val aliasMap = AttributeMap(projectList2.collect { -case a: Alias => (a.toAttribute, a) - }) + private def collectAliases(projectList: Seq[NamedExpression]): Attribu
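The rule being refactored rewrites nested projections by inlining the inner projection's aliases into the outer one. A toy sketch of that substitution with a made-up expression ADT (not Spark's `Expression` hierarchy):

```scala
// Toy sketch of the substitution CollapseProject performs:
// SELECT c + 1 FROM (SELECT a + b AS c ...) becomes SELECT a + b + 1 ...
// by inlining the inner projection's alias map.
object CollapseSketch {
  sealed trait Expr
  case class Ref(name: String) extends Expr
  case class Add(l: Expr, r: Expr) extends Expr
  case class Lit(v: Int) extends Expr

  // Replace references that the inner projection defines, mirroring the
  // rule's alias-map substitution.
  def inline(e: Expr, aliases: Map[String, Expr]): Expr = e match {
    case Ref(n)    => aliases.getOrElse(n, e)
    case Add(l, r) => Add(inline(l, aliases), inline(r, aliases))
    case other     => other
  }

  def main(args: Array[String]): Unit = {
    val inner = Map("c" -> (Add(Ref("a"), Ref("b")): Expr)) // SELECT a + b AS c
    val outer = Add(Ref("c"), Lit(1))                       // SELECT c + 1
    assert(inline(outer, inner) == Add(Add(Ref("a"), Ref("b")), Lit(1)))
  }
}
```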
spark git commit: [SPARK-13817][SQL][MINOR] Renames Dataset.newDataFrame to Dataset.ofRows
Repository: spark Updated Branches: refs/heads/master 7d1175011 -> cde086cb2 [SPARK-13817][SQL][MINOR] Renames Dataset.newDataFrame to Dataset.ofRows ## What changes were proposed in this pull request? This PR does the renaming as suggested by marmbrus in [this comment][1]. ## How was this patch tested? Existing tests. [1]: https://github.com/apache/spark/commit/6d37e1eb90054cdb6323b75fb202f78ece604b15#commitcomment-16654694 Author: Cheng Lian <l...@databricks.com> Closes #11889 from liancheng/spark-13817-follow-up. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/cde086cb Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/cde086cb Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/cde086cb Branch: refs/heads/master Commit: cde086cb2a9a85406fc18d8e63e46425f614c15f Parents: 7d11750 Author: Cheng Lian <l...@databricks.com> Authored: Thu Mar 24 00:42:13 2016 +0800 Committer: Cheng Lian <l...@databricks.com> Committed: Thu Mar 24 00:42:13 2016 +0800 -- .../org/apache/spark/sql/DataFrameReader.scala | 8 +++--- .../scala/org/apache/spark/sql/Dataset.scala| 4 +-- .../spark/sql/KeyValueGroupedDataset.scala | 2 +- .../spark/sql/RelationalGroupedDataset.scala| 8 +++--- .../scala/org/apache/spark/sql/SQLContext.scala | 26 ++-- .../spark/sql/execution/command/commands.scala | 2 +- .../sql/execution/datasources/DataSource.scala | 2 +- .../datasources/InsertIntoDataSource.scala | 2 +- .../InsertIntoHadoopFsRelation.scala| 2 +- .../spark/sql/execution/datasources/ddl.scala | 8 +++--- .../sql/execution/stat/FrequentItems.scala | 2 +- .../sql/execution/stat/StatFunctions.scala | 2 +- .../execution/streaming/StreamExecution.scala | 2 +- .../spark/sql/execution/streaming/memory.scala | 2 +- .../scala/org/apache/spark/sql/functions.scala | 2 +- .../org/apache/spark/sql/DataFrameSuite.scala | 2 +- .../scala/org/apache/spark/sql/StreamTest.scala | 2 +- .../datasources/FileSourceStrategySuite.scala | 2 +- 
.../apache/spark/sql/test/SQLTestUtils.scala| 2 +- .../spark/sql/hive/execution/commands.scala | 2 +- .../apache/spark/sql/hive/SQLBuilderTest.scala | 2 +- .../hive/execution/AggregationQuerySuite.scala | 2 +- 22 files changed, 44 insertions(+), 44 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/cde086cb/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala b/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala index 1d4693f..704535a 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala @@ -129,7 +129,7 @@ class DataFrameReader private[sql](sqlContext: SQLContext) extends Logging { userSpecifiedSchema = userSpecifiedSchema, className = source, options = extraOptions.toMap) -Dataset.newDataFrame(sqlContext, LogicalRelation(dataSource.resolveRelation())) +Dataset.ofRows(sqlContext, LogicalRelation(dataSource.resolveRelation())) } /** @@ -176,7 +176,7 @@ class DataFrameReader private[sql](sqlContext: SQLContext) extends Logging { userSpecifiedSchema = userSpecifiedSchema, className = source, options = extraOptions.toMap) -Dataset.newDataFrame(sqlContext, StreamingRelation(dataSource.createSource())) +Dataset.ofRows(sqlContext, StreamingRelation(dataSource.createSource())) } /** @@ -376,7 +376,7 @@ class DataFrameReader private[sql](sqlContext: SQLContext) extends Logging { parsedOptions) } -Dataset.newDataFrame( +Dataset.ofRows( sqlContext, LogicalRDD( schema.toAttributes, @@ -424,7 +424,7 @@ class DataFrameReader private[sql](sqlContext: SQLContext) extends Logging { * @since 1.4.0 */ def table(tableName: String): DataFrame = { -Dataset.newDataFrame(sqlContext, +Dataset.ofRows(sqlContext, sqlContext.sessionState.catalog.lookupRelation( sqlContext.sessionState.sqlParser.parseTableIdentifier(tableName))) } 
http://git-wip-us.apache.org/repos/asf/spark/blob/cde086cb/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala b/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala index 31864d6..ec0b3c7 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala @@ -53,7
spark git commit: [SPARK-13772][SQL] Fix data type mismatch for decimal
Repository: spark Updated Branches: refs/heads/branch-1.6 f9221ad79 -> 3243d46ab [SPARK-13772][SQL] Fix data type mismatch for decimal Fix data type mismatch for decimal, patch for branch-1.6. Author: cenyuhai <cenyu...@didichuxing.com> Closes #11605 from cenyuhai/SPARK-13772. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/3243d46a Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/3243d46a Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/3243d46a Branch: refs/heads/branch-1.6 Commit: 3243d46ab8c3f76c1a7ee2cc5588f08ec7b51cfe Parents: f9221ad Author: cenyuhai <cenyu...@didichuxing.com> Authored: Tue Mar 22 20:53:11 2016 +0800 Committer: Cheng Lian <l...@databricks.com> Committed: Tue Mar 22 20:53:18 2016 +0800 -- .../sql/catalyst/analysis/HiveTypeCoercion.scala | 2 +- .../sql/catalyst/analysis/HiveTypeCoercionSuite.scala | 14 ++ 2 files changed, 15 insertions(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/3243d46a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/HiveTypeCoercion.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/HiveTypeCoercion.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/HiveTypeCoercion.scala index dbcbd68..470af35 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/HiveTypeCoercion.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/HiveTypeCoercion.scala @@ -677,7 +677,7 @@ object HiveTypeCoercion { case e if !e.childrenResolved => e // Find tightest common type for If, if the true value and false value have different types. 
case i @ If(pred, left, right) if left.dataType != right.dataType => -findTightestCommonTypeToString(left.dataType, right.dataType).map { widestType => +findWiderTypeForTwo(left.dataType, right.dataType).map { widestType => val newLeft = if (left.dataType == widestType) left else Cast(left, widestType) val newRight = if (right.dataType == widestType) right else Cast(right, widestType) If(pred, newLeft, newRight) http://git-wip-us.apache.org/repos/asf/spark/blob/3243d46a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/HiveTypeCoercionSuite.scala -- diff --git a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/HiveTypeCoercionSuite.scala b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/HiveTypeCoercionSuite.scala index 1429150..3b5bd77 100644 --- a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/HiveTypeCoercionSuite.scala +++ b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/HiveTypeCoercionSuite.scala @@ -299,6 +299,20 @@ class HiveTypeCoercionSuite extends PlanTest { ) } + test("test for SPARK-13772") { +val rule = HiveTypeCoercion.IfCoercion +ruleTest(rule, + If(Literal(true), Literal(1.0), Cast(Literal(1.0), DecimalType(19, 0))), + If(Literal(true), Literal(1.0), Cast(Cast(Literal(1.0), DecimalType(19, 0)), DoubleType)) +) + +ruleTest(rule, + If(Literal(true), Literal(Decimal(1)), Cast(Literal(1.0), DecimalType(19, 9))), + If(Literal(true), Cast(Literal(Decimal(1)), DecimalType(19, 9)), +Cast(Literal(1.0), DecimalType(19, 9))) +) + } + test("type coercion for CaseKeyWhen") { ruleTest(HiveTypeCoercion.CaseWhenCoercion, CaseKeyWhen(Literal(1.toShort), Seq(Literal(1), Literal("a"))),
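The one-line change swaps a "tightest common type" lookup for a wider-type lookup, so that a `Decimal`/`Double` branch pair can be coerced at all. A toy sketch of the widening decision with a made-up type model (not Spark's `DataType` hierarchy):

```scala
// Toy sketch of the widening decision behind this fix: a strict
// "tightest common type" lookup finds nothing for Decimal vs Double, while
// a wider-type lookup, as used by the patched IfCoercion rule, widens the
// pair to Double so both branches can be cast to a common type.
object WidenSketch {
  sealed trait DType
  case object DoubleT extends DType
  case class DecimalT(precision: Int, scale: Int) extends DType

  def findWiderTypeForTwo(a: DType, b: DType): Option[DType] = (a, b) match {
    case (x, y) if x == y          => Some(x)
    case (DecimalT(_, _), DoubleT) => Some(DoubleT) // decimal widens to double
    case (DoubleT, DecimalT(_, _)) => Some(DoubleT)
    case _                         => None
  }

  def main(args: Array[String]): Unit =
    assert(findWiderTypeForTwo(DoubleT, DecimalT(19, 0)).contains(DoubleT))
}
```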
spark git commit: [SPARK-13774][SQL] - Improve error message for non-existent paths and add tests
Repository: spark Updated Branches: refs/heads/master 4e09a0d5e -> 0ce01635c [SPARK-13774][SQL] - Improve error message for non-existent paths and add tests SPARK-13774: IllegalArgumentException: Can not create a Path from an empty string for incorrect file path **Overview:** - If a non-existent path is given in this call `` scala> sqlContext.read.format("csv").load("file-path-is-incorrect.csv") `` it throws the following error: `java.lang.IllegalArgumentException: Can not create a Path from an empty string`. It is called from the inferSchema call in `org.apache.spark.sql.execution.datasources.DataSource.resolveRelation`. - The purpose of this JIRA is to throw a better error message. - With the fix, you will now get a _Path does not exist_ error message. ``` scala> sqlContext.read.format("csv").load("file-path-is-incorrect.csv") org.apache.spark.sql.AnalysisException: Path does not exist: file:/Users/ksunitha/trunk/spark/file-path-is-incorrect.csv; at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$12.apply(DataSource.scala:215) at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$12.apply(DataSource.scala:204) ... at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:204) at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:131) at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:141) ... 49 elided ``` **Details** _Changes include:_ - Check whether the path exists in resolveRelation in DataSource, and throw an AnalysisException with a message like "Path does not exist: $path". - AnalysisException is thrown similar to the other exceptions thrown in resolveRelation. - The glob path and the non-glob path are checked with minimal calls to path exists. If the globPath is empty, then it is a nonexistent glob pattern and an error will be thrown. If it is not a globPath, it is only necessary to check whether the first element in the Seq is valid.
_Test modifications:_ - Changes went in for 3 tests to account for this error checking. - SQLQuerySuite: test("run sql directly on files") - the error message needed to be updated. - 2 tests failed in MetastoreDataSourcesSuite because they used a dummy path; they are modified to use a temp dir so they can move past the new check and continue to exercise the code path they were meant to test. _New Tests:_ 2 new tests are added to DataFrameSuite to validate that both glob and non-glob paths will throw the new error message. _Testing:_ Unit tests were run with the fix. **Notes/Questions to reviewers:** - There is some code duplication in DataSource.scala, in the resolveRelation method and also in createSource, with respect to getting the paths. I have not made any changes to the createSource code path. Should we make the change there as well? - From other JIRAs, I know there is restructuring and change going on in this area; I am not sure how that will affect these changes, but since this seemed like a starter issue, I looked into it. If we prefer not to add the overhead of the checks, or if there is a better place to do so, let me know. I would appreciate your review. Thanks for your time and comments. Author: Sunitha Kambhampati <skam...@us.ibm.com> Closes #11775 from skambha/improve_errmsg.
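The glob/non-glob existence check described above can be sketched as follows. This is a hypothetical Python model, not the DataSource implementation: `qualified_paths` and its error type are invented names, and real Spark resolves paths against Hadoop's FileSystem rather than the local one.

```python
import glob as globmod
from pathlib import Path

def qualified_paths(paths):
    """Model of the existence check: a glob pattern matching nothing is a
    nonexistent path, while a literal path only needs one exists() call."""
    resolved = []
    for p in paths:
        if any(c in p for c in "*?[]"):
            matches = globmod.glob(p)
            if not matches:
                # Empty glob result => nonexistent glob pattern.
                raise ValueError(f"Path does not exist: {p}")
            resolved.extend(matches)
        else:
            if not Path(p).exists():
                raise ValueError(f"Path does not exist: {p}")
            resolved.append(p)
    return resolved
```

For example, `qualified_paths(["file-path-is-incorrect.csv"])` raises the "Path does not exist" error instead of failing later with an obscure empty-string exception.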
Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/0ce01635 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/0ce01635 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/0ce01635 Branch: refs/heads/master Commit: 0ce01635cc66ca5f9d8962235054335b16f7507e Parents: 4e09a0d Author: Sunitha Kambhampati <skam...@us.ibm.com> Authored: Tue Mar 22 20:47:57 2016 +0800 Committer: Cheng Lian <l...@databricks.com> Committed: Tue Mar 22 20:47:57 2016 +0800 -- .../sql/execution/datasources/DataSource.scala | 11 ++- .../org/apache/spark/sql/DataFrameSuite.scala | 16 .../org/apache/spark/sql/SQLQuerySuite.scala| 2 +- .../sql/hive/MetastoreDataSourcesSuite.scala| 97 ++-- 4 files changed, 78 insertions(+), 48 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/0ce01635/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala index e2d5f42..e2a14ed 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala +++ b/sql/core/src/main/scala/org/apache/spa
spark git commit: [SPARK-13473][SQL] Simplifies PushPredicateThroughProject
Repository: spark Updated Branches: refs/heads/master 14464cadb -> f2e855fba [SPARK-13473][SQL] Simplifies PushPredicateThroughProject ## What changes were proposed in this pull request? This is a follow-up of PR #11348. After PR #11348, a predicate is never pushed through a project as long as the project contains any non-deterministic fields. Thus, it's impossible that the candidate filter condition can reference any non-deterministic projected fields, and related logic can be safely cleaned up. To be more specific, the following optimization is allowed: ```scala // From: df.select('a, 'b).filter('c > rand(42)) // To: df.filter('c > rand(42)).select('a, 'b) ``` while this isn't: ```scala // From: df.select('a, rand('b) as 'rb, 'c).filter('c > 'rb) // To: df.filter('c > rand('b)).select('a, rand('b) as 'rb, 'c) ``` ## How was this patch tested? Existing test cases should do the work. Author: Cheng Lian <l...@databricks.com> Closes #11864 from liancheng/spark-13473-cleanup. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/f2e855fb Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/f2e855fb Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/f2e855fb Branch: refs/heads/master Commit: f2e855fba8eb73475cf312cdf880c1297d4323bb Parents: 14464ca Author: Cheng Lian <l...@databricks.com> Authored: Tue Mar 22 19:20:56 2016 +0800 Committer: Cheng Lian <l...@databricks.com> Committed: Tue Mar 22 19:20:56 2016 +0800 -- .../sql/catalyst/optimizer/Optimizer.scala | 24 +--- 1 file changed, 1 insertion(+), 23 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/f2e855fb/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala index 41e8dc0..0840d46 100644 --- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala @@ -891,29 +891,7 @@ object PushPredicateThroughProject extends Rule[LogicalPlan] with PredicateHelpe case a: Alias => (a.toAttribute, a.child) }) - // Split the condition into small conditions by `And`, so that we can push down part of this - // condition without nondeterministic expressions. - val andConditions = splitConjunctivePredicates(condition) - - val (deterministic, nondeterministic) = andConditions.partition(_.collect { -case a: Attribute if aliasMap.contains(a) => aliasMap(a) - }.forall(_.deterministic)) - - // If there is no nondeterministic conditions, push down the whole condition. - if (nondeterministic.isEmpty) { -project.copy(child = Filter(replaceAlias(condition, aliasMap), grandChild)) - } else { -// If they are all nondeterministic conditions, leave it un-changed. -if (deterministic.isEmpty) { - filter -} else { - // Push down the small conditions without nondeterministic expressions. - val pushedCondition = -deterministic.map(replaceAlias(_, aliasMap)).reduce(And) - Filter(nondeterministic.reduce(And), -project.copy(child = Filter(pushedCondition, grandChild))) -} - } + project.copy(child = Filter(replaceAlias(condition, aliasMap), grandChild)) } }
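The invariant behind the cleanup above can be modeled in a few lines. This is a hypothetical Python mini-IR, not Catalyst's classes: because PR #11348 guarantees a predicate is only considered for pushdown when every aliased expression in the project is deterministic, the deterministic/nondeterministic partition of the conjuncts was dead code.

```python
class Expr:
    """Toy expression carrying only the property the rule cares about."""
    def __init__(self, deterministic=True):
        self.deterministic = deterministic

def can_push_through_project(alias_map):
    # Guard applied before the rule fires: every projected alias must be
    # deterministic, so the filter condition can never reference a
    # nondeterministic projected field.
    return all(expr.deterministic for expr in alias_map.values())

# df.select('a, 'b).filter(...): all aliases deterministic => whole
# condition is pushed in one step, no conjunct splitting needed.
assert can_push_through_project({"a": Expr(), "b": Expr()})
# df.select('a, rand('b) as 'rb, 'c): the rule never fires at all.
assert not can_push_through_project({"rb": Expr(deterministic=False)})
```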
spark git commit: [SPARK-14000][SQL] case class with a tuple field can't work in Dataset
Repository: spark Updated Branches: refs/heads/master 2c5b18fb0 -> 17a3f0067 [SPARK-14000][SQL] case class with a tuple field can't work in Dataset ## What changes were proposed in this pull request? When we validate an encoder, we may call `dataType` on unresolved expressions. This PR fixes the validation so that attributes are resolved first. ## How was this patch tested? A new test in `DatasetSuite`. Author: Wenchen Fan <wenc...@databricks.com> Closes #11816 from cloud-fan/encoder. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/17a3f006 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/17a3f006 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/17a3f006 Branch: refs/heads/master Commit: 17a3f00676ca0217f6ee55a1565e96893792 Parents: 2c5b18f Author: Wenchen Fan <wenc...@databricks.com> Authored: Mon Mar 21 22:22:15 2016 +0800 Committer: Cheng Lian <l...@databricks.com> Committed: Mon Mar 21 22:22:15 2016 +0800 -- .../apache/spark/sql/catalyst/analysis/Analyzer.scala | 5 - .../sql/catalyst/encoders/ExpressionEncoder.scala | 10 -- .../catalyst/expressions/complexTypeExtractors.scala | 7 ++- .../test/scala/org/apache/spark/sql/DatasetSuite.scala | 13 +++-- 4 files changed, 29 insertions(+), 6 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/17a3f006/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala index 333a54e..ccc65b4 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala @@ -600,7 +600,10 @@ class Analyzer( exprs.exists(_.collect { case _: Star => true }.nonEmpty) } - private def resolveExpression(expr: Expression, plan:
LogicalPlan, throws: Boolean = false) = { + protected[sql] def resolveExpression( + expr: Expression, + plan: LogicalPlan, + throws: Boolean = false) = { // Resolve expression in one round. // If throws == false or the desired attribute doesn't exist // (like try to resolve `a.b` but `a` doesn't exist), fail and return the origin one. http://git-wip-us.apache.org/repos/asf/spark/blob/17a3f006/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/ExpressionEncoder.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/ExpressionEncoder.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/ExpressionEncoder.scala index 58f6d0e..918233d 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/ExpressionEncoder.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/ExpressionEncoder.scala @@ -282,8 +282,14 @@ case class ExpressionEncoder[T]( // If we have nested tuple, the `fromRowExpression` will contains `GetStructField` instead of // `UnresolvedExtractValue`, so we need to check if their ordinals are all valid. // Note that, `BoundReference` contains the expected type, but here we need the actual type, so -// we unbound it by the given `schema` and propagate the actual type to `GetStructField`. -val unbound = fromRowExpression transform { +// we unbound it by the given `schema` and propagate the actual type to `GetStructField`, after +// we resolve the `fromRowExpression`. 
+val resolved = SimpleAnalyzer.resolveExpression( + fromRowExpression, + LocalRelation(schema), + throws = true) + +val unbound = resolved transform { case b: BoundReference => schema(b.ordinal) } http://git-wip-us.apache.org/repos/asf/spark/blob/17a3f006/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeExtractors.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeExtractors.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeExtractors.scala index 44cdc8d..c06dcc9 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeExtractors.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeExtractors.scala @@ -110,7 +110,12 @@ case class GetStructField(child: Expression, ordinal: Int, name: Option[String] override def dataT
spark git commit: [SPARK-14004][SQL][MINOR] AttributeReference and Alias should only use the first qualifier to generate SQL strings
Repository: spark Updated Branches: refs/heads/master 0acb32a3f -> 14c7236dc [SPARK-14004][SQL][MINOR] AttributeReference and Alias should only use the first qualifier to generate SQL strings ## What changes were proposed in this pull request? Current implementations of `AttributeReference.sql` and `Alias.sql` joins all available qualifiers, which is logically wrong. But this implementation mistake doesn't cause any real SQL generation bugs though, since there is always at most one qualifier for any given `AttributeReference` or `Alias`. This PR fixes this issue by only picking the first qualifiers. ## How was this patch tested? (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests) Existing tests should be enough. Author: Cheng Lian <l...@databricks.com> Closes #11820 from liancheng/spark-14004-single-qualifier. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/14c7236d Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/14c7236d Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/14c7236d Branch: refs/heads/master Commit: 14c7236dc63fe362f052175886e9ad700419bc63 Parents: 0acb32a Author: Cheng Lian <l...@databricks.com> Authored: Sat Mar 19 00:22:17 2016 +0800 Committer: Cheng Lian <l...@databricks.com> Committed: Sat Mar 19 00:22:17 2016 +0800 -- .../spark/sql/catalyst/expressions/namedExpressions.scala | 6 ++ 1 file changed, 2 insertions(+), 4 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/14c7236d/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/namedExpressions.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/namedExpressions.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/namedExpressions.scala index 271ef33..8c63cce 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/namedExpressions.scala +++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/namedExpressions.scala @@ -183,8 +183,7 @@ case class Alias(child: Expression, name: String)( } override def sql: String = { -val qualifiersString = - if (qualifiers.isEmpty) "" else qualifiers.map(quoteIdentifier).mkString("", ".", ".") +val qualifiersString = if (qualifiers.isEmpty) "" else qualifiers.head + "." s"${child.sql} AS $qualifiersString${quoteIdentifier(name)}" } } @@ -299,8 +298,7 @@ case class AttributeReference( override def simpleString: String = s"$name#${exprId.id}: ${dataType.simpleString}" override def sql: String = { -val qualifiersString = - if (qualifiers.isEmpty) "" else qualifiers.map(quoteIdentifier).mkString("", ".", ".") +val qualifiersString = if (qualifiers.isEmpty) "" else qualifiers.head + "." s"$qualifiersString${quoteIdentifier(name)}" } }
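The qualifier fix can be modeled in a few lines. The sketch below is a hedged Python analogue of the corrected `Alias.sql` / `AttributeReference.sql` logic (names are invented; the real code is Scala): only the first qualifier is prepended, instead of joining all available qualifiers with dots.

```python
def sql_name(name, qualifiers):
    """Generate a SQL string for a named expression: quote the name with
    backquotes and prepend at most one qualifier, as in the fix."""
    def quote(ident):
        return f"`{ident}`"
    # Pick only the first qualifier; joining all of them was the bug.
    prefix = "" if not qualifiers else qualifiers[0] + "."
    return prefix + quote(name)

assert sql_name("id", []) == "`id`"
assert sql_name("id", ["t0"]) == "t0.`id`"
```

Since there is always at most one qualifier in practice, the old `mkString`-over-all-qualifiers behavior produced the same output, which is why this was only a latent logical mistake rather than a visible SQL-generation bug.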
spark git commit: [SPARK-14001][SQL] support multi-children Union in SQLBuilder
Repository: spark Updated Branches: refs/heads/master 7783b6f38 -> 0f1015ffd [SPARK-14001][SQL] support multi-children Union in SQLBuilder ## What changes were proposed in this pull request? The fix is simple: use the existing `CombineUnions` rule to combine adjacent Unions before building the SQL string. ## How was this patch tested? The re-enabled test. Author: Wenchen Fan <wenc...@databricks.com> Closes #11818 from cloud-fan/bug-fix. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/0f1015ff Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/0f1015ff Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/0f1015ff Branch: refs/heads/master Commit: 0f1015ffdd40cd8647f6acdd5cdd717b883e4875 Parents: 7783b6f Author: Wenchen Fan <wenc...@databricks.com> Authored: Fri Mar 18 19:42:33 2016 +0800 Committer: Cheng Lian <l...@databricks.com> Committed: Fri Mar 18 19:42:33 2016 +0800 -- .../scala/org/apache/spark/sql/hive/SQLBuilder.scala | 14 ++ .../apache/spark/sql/hive/LogicalPlanToSQLSuite.scala | 7 +-- 2 files changed, 11 insertions(+), 10 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/0f1015ff/sql/hive/src/main/scala/org/apache/spark/sql/hive/SQLBuilder.scala -- diff --git a/sql/hive/src/main/scala/org/apache/spark/sql/hive/SQLBuilder.scala b/sql/hive/src/main/scala/org/apache/spark/sql/hive/SQLBuilder.scala index 249a685..5175bd4 100644 --- a/sql/hive/src/main/scala/org/apache/spark/sql/hive/SQLBuilder.scala +++ b/sql/hive/src/main/scala/org/apache/spark/sql/hive/SQLBuilder.scala @@ -24,9 +24,8 @@ import scala.util.control.NonFatal import org.apache.spark.internal.Logging import org.apache.spark.sql.{DataFrame, SQLContext} import org.apache.spark.sql.catalyst.TableIdentifier -import org.apache.spark.sql.catalyst.analysis.EliminateSubqueryAliases import org.apache.spark.sql.catalyst.expressions._ -import org.apache.spark.sql.catalyst.optimizer.CollapseProject +import
org.apache.spark.sql.catalyst.optimizer.{CollapseProject, CombineUnions} import org.apache.spark.sql.catalyst.plans.logical._ import org.apache.spark.sql.catalyst.rules.{Rule, RuleExecutor} import org.apache.spark.sql.catalyst.util.quoteIdentifier @@ -384,11 +383,18 @@ class SQLBuilder(logicalPlan: LogicalPlan, sqlContext: SQLContext) extends Loggi object Canonicalizer extends RuleExecutor[LogicalPlan] { override protected def batches: Seq[Batch] = Seq( - Batch("Collapse Project", FixedPoint(100), + Batch("Prepare", FixedPoint(100), // The `WidenSetOperationTypes` analysis rule may introduce extra `Project`s over // `Aggregate`s to perform type casting. This rule merges these `Project`s into // `Aggregate`s. -CollapseProject), +CollapseProject, +// Parser is unable to parse the following query: +// SELECT `u_1`.`id` +// FROM (((SELECT `t0`.`id` FROM `default`.`t0`) +// UNION ALL (SELECT `t0`.`id` FROM `default`.`t0`)) +// UNION ALL (SELECT `t0`.`id` FROM `default`.`t0`)) AS u_1 +// This rule combine adjacent Unions together so we can generate flat UNION ALL SQL string. +CombineUnions), Batch("Recover Scoping Info", Once, // A logical plan is allowed to have same-name outputs with different qualifiers(e.g. the // `Join` operator). 
However, this kind of plan can't be put under a sub query as we will http://git-wip-us.apache.org/repos/asf/spark/blob/0f1015ff/sql/hive/src/test/scala/org/apache/spark/sql/hive/LogicalPlanToSQLSuite.scala -- diff --git a/sql/hive/src/test/scala/org/apache/spark/sql/hive/LogicalPlanToSQLSuite.scala b/sql/hive/src/test/scala/org/apache/spark/sql/hive/LogicalPlanToSQLSuite.scala index f86eba6..f6b9072 100644 --- a/sql/hive/src/test/scala/org/apache/spark/sql/hive/LogicalPlanToSQLSuite.scala +++ b/sql/hive/src/test/scala/org/apache/spark/sql/hive/LogicalPlanToSQLSuite.scala @@ -141,12 +141,7 @@ class LogicalPlanToSQLSuite extends SQLBuilderTest with SQLTestUtils { checkHiveQl("SELECT * FROM t0 UNION SELECT * FROM t0") } - // Parser is unable to parse the following query: - // SELECT `u_1`.`id` - // FROM (((SELECT `t0`.`id` FROM `default`.`t0`) - // UNION ALL (SELECT `t0`.`id` FROM `default`.`t0`)) - // UNION ALL (SELECT `t0`.`id` FROM `default`.`t0`)) AS u_1 - ignore("three-child union") { + test("three-child union") { checkHiveQl( """ |SELECT id FROM parquet_t0 --
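The effect of running `CombineUnions` before SQL generation can be sketched as a tree flattening. The sketch below is a hypothetical Python model on a tuple-based plan IR, not Catalyst's classes: nested binary Unions collapse into a single n-ary Union, so the SQL builder emits one flat `UNION ALL` string the parser can round-trip.

```python
def combine_unions(plan):
    """Flatten nested binary Unions into one n-ary Union node.

    A plan is either a leaf (a string, e.g. a table name) or a tuple
    ("Union", child1, child2, ...).
    """
    if isinstance(plan, tuple) and plan[0] == "Union":
        children = []
        for child in plan[1:]:
            flat = combine_unions(child)
            if isinstance(flat, tuple) and flat[0] == "Union":
                # Adjacent Union: splice its children into this node.
                children.extend(flat[1:])
            else:
                children.append(flat)
        return ("Union", *children)
    return plan

# ((t0 UNION ALL t0) UNION ALL t0) becomes one three-child Union.
nested = ("Union", ("Union", "t0", "t0"), "t0")
assert combine_unions(nested) == ("Union", "t0", "t0", "t0")
```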
spark git commit: [SPARK-13972][SQL] hive tests should fail if SQL generation failed
Repository: spark Updated Branches: refs/heads/master 53f32a22d -> 0acb32a3f [SPARK-13972][SQL] hive tests should fail if SQL generation failed ## What changes were proposed in this pull request? Now we should be able to convert all logical plans to SQL string, if they are parsed from a Hive query. This PR changes the error handling to throw exceptions instead of just logging. We will send new PRs for spotted bugs, and merge this one after all bugs are fixed. ## How was this patch tested? Existing tests. Author: Wenchen Fan <wenc...@databricks.com> Closes #11782 from cloud-fan/test. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/0acb32a3 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/0acb32a3 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/0acb32a3 Branch: refs/heads/master Commit: 0acb32a3f19484d3ea3b60fccef596025c8a8f83 Parents: 53f32a2 Author: Wenchen Fan <wenc...@databricks.com> Authored: Fri Mar 18 23:16:14 2016 +0800 Committer: Cheng Lian <l...@databricks.com> Committed: Fri Mar 18 23:16:14 2016 +0800 -- .../sql/hive/execution/HiveComparisonTest.scala | 71 1 file changed, 28 insertions(+), 43 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/0acb32a3/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveComparisonTest.scala -- diff --git a/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveComparisonTest.scala b/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveComparisonTest.scala index d21bb57..019d752 100644 --- a/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveComparisonTest.scala +++ b/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveComparisonTest.scala @@ -132,25 +132,7 @@ abstract class HiveComparisonTest new java.math.BigInteger(1, digest.digest).toString(16) } - /** Used for testing [[SQLBuilder]] */ - private var numConvertibleQueries: Int = 0 - private var numTotalQueries: Int = 0 -
override protected def afterAll(): Unit = { -logInfo({ - val percentage = if (numTotalQueries > 0) { -numConvertibleQueries.toDouble / numTotalQueries * 100 - } else { -0D - } - - s"""SQLBuilder statistics: - |- Total query number:$numTotalQueries - |- Number of convertible queries: $numConvertibleQueries - |- Percentage of convertible queries: $percentage% - """.stripMargin -}) - try { TestHive.reset() } finally { @@ -412,32 +394,35 @@ abstract class HiveComparisonTest if (containsCommands) { originalQuery } else { -numTotalQueries += 1 +val convertedSQL = try { + new SQLBuilder(originalQuery.analyzed, TestHive).toSQL +} catch { + case NonFatal(e) => fail( +s"""Cannot convert the following HiveQL query plan back to SQL query string: +| +|# Original HiveQL query string: +|$queryString +| +|# Resolved query plan: +|${originalQuery.analyzed.treeString} + """.stripMargin, e) +} + try { - val sql = new SQLBuilder(originalQuery.analyzed, TestHive).toSQL - numConvertibleQueries += 1 - logInfo( -s""" - |### Running SQL generation round-trip test {{{ - |${originalQuery.analyzed.treeString} - |Original SQL: - |$queryString - | - |Generated SQL: - |$sql - |}}} - """.stripMargin.trim) - new TestHive.QueryExecution(sql) -} catch { case NonFatal(e) => - logInfo( -s""" - |### Cannot convert the following logical plan back to SQL {{{ - |${originalQuery.analyzed.treeString} - |Original SQL: - |$queryString - |}}} - """.stripMargin.trim) - originalQuery + new TestHive.QueryExecution(convertedSQL) +} catch { + case NonFatal(e) => fail( +s"""
spark git commit: [MINOR][DOCS] Replace `DataFrame` with `Dataset` in Javadoc.
Repository: spark Updated Branches: refs/heads/master c079420d7 -> db88d0204 [MINOR][DOCS] Replace `DataFrame` with `Dataset` in Javadoc. ## What changes were proposed in this pull request? SPARK-13817 (PR #11656) replaces `DataFrame` with `Dataset` from Java. This PR fixes the remaining broken links and sample Java code in `package-info.java`. As a result, it will update the following Javadoc. * http://spark.apache.org/docs/latest/api/java/org/apache/spark/ml/attribute/package-summary.html * http://spark.apache.org/docs/latest/api/java/org/apache/spark/ml/feature/package-summary.html ## How was this patch tested? Manual. Author: Dongjoon Hyun <dongj...@apache.org> Closes #11675 from dongjoon-hyun/replace_dataframe_with_dataset_in_javadoc. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/db88d020 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/db88d020 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/db88d020 Branch: refs/heads/master Commit: db88d0204e3a9a05dbe6e67e1abb942639c50a06 Parents: c079420 Author: Dongjoon Hyun <dongj...@apache.org> Authored: Sun Mar 13 12:11:18 2016 +0800 Committer: Cheng Lian <l...@databricks.com> Committed: Sun Mar 13 12:11:18 2016 +0800 -- .../org/apache/spark/ml/attribute/package-info.java | 2 +- .../scala/org/apache/spark/ml/feature/package-info.java | 12 ++-- 2 files changed, 7 insertions(+), 7 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/db88d020/mllib/src/main/scala/org/apache/spark/ml/attribute/package-info.java -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/attribute/package-info.java b/mllib/src/main/scala/org/apache/spark/ml/attribute/package-info.java index e3474f3..464ed12 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/attribute/package-info.java +++ b/mllib/src/main/scala/org/apache/spark/ml/attribute/package-info.java @@ -20,7 +20,7 @@ /** * ML attributes * - * The ML pipeline API uses {@link 
org.apache.spark.sql.DataFrame}s as ML datasets. + * The ML pipeline API uses {@link org.apache.spark.sql.Dataset}s as ML datasets. * Each dataset consists of typed columns, e.g., string, double, vector, etc. * However, knowing only the column type may not be sufficient to handle the data properly. * For instance, a double column with values 0.0, 1.0, 2.0, ... may represent some label indices, http://git-wip-us.apache.org/repos/asf/spark/blob/db88d020/mllib/src/main/scala/org/apache/spark/ml/feature/package-info.java -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/feature/package-info.java b/mllib/src/main/scala/org/apache/spark/ml/feature/package-info.java index 7a35f2d..dcff424 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/feature/package-info.java +++ b/mllib/src/main/scala/org/apache/spark/ml/feature/package-info.java @@ -22,7 +22,7 @@ * The `ml.feature` package provides common feature transformers that help convert raw data or * features into more suitable forms for model fitting. * Most feature transformers are implemented as {@link org.apache.spark.ml.Transformer}s, which - * transforms one {@link org.apache.spark.sql.DataFrame} into another, e.g., + * transforms one {@link org.apache.spark.sql.Dataset} into another, e.g., * {@link org.apache.spark.ml.feature.HashingTF}. * Some feature transformers are implemented as {@link org.apache.spark.ml.Estimator}}s, because the * transformation requires some aggregated information of the dataset, e.g., document @@ -31,7 +31,7 @@ * obtain the model first, e.g., {@link org.apache.spark.ml.feature.IDFModel}, in order to apply * transformation. * The transformation is usually done by appending new columns to the input - * {@link org.apache.spark.sql.DataFrame}, so all input columns are carried over. + * {@link org.apache.spark.sql.Dataset}, so all input columns are carried over. * * We try to make each transformer minimal, so it becomes flexible to assemble feature * transformation pipelines. 
@@ -46,7 +46,7 @@ * import org.apache.spark.api.java.JavaRDD; * import static org.apache.spark.sql.types.DataTypes.*; * import org.apache.spark.sql.types.StructType; - * import org.apache.spark.sql.DataFrame; + * import org.apache.spark.sql.Dataset; * import org.apache.spark.sql.RowFactory; * import org.apache.spark.sql.Row; * @@ -66,7 +66,7 @@ * RowFactory.create(0, "Hi I heard about Spark", 3.0), * RowFactory.create(1, "I wish Java could use case classes", 4.0), * RowFactory.create(2, "Logistic regression models are neat", 4.0))); - * DataFrame df =
spark git commit: [SPARK-13817][BUILD][SQL] Re-enable MiMA and removes object DataFrame
Repository: spark Updated Branches: refs/heads/master 07f1c5447 -> 6d37e1eb9 [SPARK-13817][BUILD][SQL] Re-enable MiMA and removes object DataFrame ## What changes were proposed in this pull request? PR #11443 temporarily disabled MiMA check, this PR re-enables it. One extra change is that `object DataFrame` is also removed. The only purpose of introducing `object DataFrame` was to use it as an internal factory for creating `Dataset[Row]`. By replacing this internal factory with `Dataset.newDataFrame`, both `DataFrame` and `DataFrame$` are entirely removed from the API, so that we can simply put a `MissingClassProblem` filter in `MimaExcludes.scala` for most DataFrame API changes. ## How was this patch tested? Tested by MiMA check triggered by Jenkins. Author: Cheng Lian <l...@databricks.com> Closes #11656 from liancheng/re-enable-mima. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/6d37e1eb Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/6d37e1eb Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/6d37e1eb Branch: refs/heads/master Commit: 6d37e1eb90054cdb6323b75fb202f78ece604b15 Parents: 07f1c54 Author: Cheng Lian <l...@databricks.com> Authored: Fri Mar 11 22:17:50 2016 +0800 Committer: Cheng Lian <l...@databricks.com> Committed: Fri Mar 11 22:17:50 2016 +0800 -- dev/run-tests.py| 3 +-- .../ml/JavaAFTSurvivalRegressionExample.java| 1 - .../spark/examples/ml/JavaALSExample.java | 1 - .../spark/examples/ml/JavaBinarizerExample.java | 1 - .../examples/ml/JavaCrossValidatorExample.java | 1 - .../examples/ml/JavaDeveloperApiExample.java| 1 - .../JavaEstimatorTransformerParamExample.java | 1 - .../spark/examples/ml/JavaKMeansExample.java| 1 - .../spark/examples/ml/JavaLDAExample.java | 1 - .../spark/examples/ml/JavaOneVsRestExample.java | 1 - .../examples/ml/JavaSQLTransformerExample.java | 1 - project/MimaExcludes.scala | 22 +++ .../scala/org/apache/spark/sql/DataFrame.scala | 16 +-- 
.../org/apache/spark/sql/DataFrameReader.scala | 8 +++--- .../org/apache/spark/sql/GroupedData.scala | 8 +++--- .../org/apache/spark/sql/GroupedDataset.scala | 2 +- .../scala/org/apache/spark/sql/SQLContext.scala | 28 ++-- .../spark/sql/execution/command/commands.scala | 4 +-- .../sql/execution/datasources/DataSource.scala | 4 +-- .../datasources/InsertIntoDataSource.scala | 2 +- .../InsertIntoHadoopFsRelation.scala| 2 +- .../spark/sql/execution/datasources/ddl.scala | 10 +++ .../sql/execution/stat/FrequentItems.scala | 4 +-- .../sql/execution/stat/StatFunctions.scala | 4 +-- .../execution/streaming/StreamExecution.scala | 2 +- .../spark/sql/execution/streaming/memory.scala | 2 +- .../scala/org/apache/spark/sql/functions.scala | 2 +- .../org/apache/spark/sql/DataFrameSuite.scala | 2 +- .../scala/org/apache/spark/sql/StreamTest.scala | 2 +- .../apache/spark/sql/test/SQLTestUtils.scala| 2 +- .../spark/sql/hive/execution/commands.scala | 2 +- .../apache/spark/sql/hive/SQLBuilderTest.scala | 4 +-- .../hive/execution/AggregationQuerySuite.scala | 2 +- 33 files changed, 78 insertions(+), 69 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/6d37e1eb/dev/run-tests.py -- diff --git a/dev/run-tests.py b/dev/run-tests.py index ebeede5..a1e6f1b 100755 --- a/dev/run-tests.py +++ b/dev/run-tests.py @@ -573,8 +573,7 @@ def main(): # backwards compatibility checks if build_tool == "sbt": # Note: compatibility tests only supported in sbt for now -# TODO Temporarily disable MiMA check for DF-to-DS migration prototyping -# detect_binary_inop_with_mima() +detect_binary_inop_with_mima() # Since we did not build assembly/assembly before running dev/mima, we need to # do it here because the tests still rely on it; see SPARK-13294 for details. 
build_spark_assembly_sbt(hadoop_version) http://git-wip-us.apache.org/repos/asf/spark/blob/6d37e1eb/examples/src/main/java/org/apache/spark/examples/ml/JavaAFTSurvivalRegressionExample.java -- diff --git a/examples/src/main/java/org/apache/spark/examples/ml/JavaAFTSurvivalRegressionExample.java b/examples/src/main/java/org/apache/spark/examples/ml/JavaAFTSurvivalRegressionExample.java index 3905310..22b93a3 100644 --- a/examples/src/main/java/org/apache/spark/examples/ml/JavaAFTSurvivalRegressionExample.java +++ b/examples/src/main/java/org/apache/spark/examples/ml/JavaAFTSurvival
spark git commit: Revert "[SPARK-13616][SQL] Let SQLBuilder convert logical plan without a project on top of it"
Repository: spark Updated Branches: refs/heads/master 8290004d9 -> 8ff88094d Revert "[SPARK-13616][SQL] Let SQLBuilder convert logical plan without a project on top of it" This reverts commit f87ce0504ea0697969ac3e67690c78697b76e94a. According to discussion in #11466, let's revert PR #11466 for safety. Author: Cheng Lian <l...@databricks.com> Closes #11539 from liancheng/revert-pr-11466. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/8ff88094 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/8ff88094 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/8ff88094 Branch: refs/heads/master Commit: 8ff88094daa4945e7d718baa7b20703fd8087ab0 Parents: 8290004 Author: Cheng Lian <l...@databricks.com> Authored: Sun Mar 6 12:54:04 2016 +0800 Committer: Cheng Lian <l...@databricks.com> Committed: Sun Mar 6 12:54:04 2016 +0800 -- .../org/apache/spark/sql/hive/SQLBuilder.scala | 23 +-- .../spark/sql/hive/LogicalPlanToSQLSuite.scala | 41 2 files changed, 1 insertion(+), 63 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/8ff88094/sql/hive/src/main/scala/org/apache/spark/sql/hive/SQLBuilder.scala -- diff --git a/sql/hive/src/main/scala/org/apache/spark/sql/hive/SQLBuilder.scala b/sql/hive/src/main/scala/org/apache/spark/sql/hive/SQLBuilder.scala index 8d411a9..683f738 100644
--- a/sql/hive/src/main/scala/org/apache/spark/sql/hive/SQLBuilder.scala
+++ b/sql/hive/src/main/scala/org/apache/spark/sql/hive/SQLBuilder.scala
@@ -65,7 +65,7 @@ class SQLBuilder(logicalPlan: LogicalPlan, sqlContext: SQLContext) extends Loggi
       case e => e
     }

-    val generatedSQL = toSQL(replaced, true)
+    val generatedSQL = toSQL(replaced)

     logDebug(
       s"""Built SQL query string successfully from given logical plan:
          |
@@ -90,27 +90,6 @@ class SQLBuilder(logicalPlan: LogicalPlan, sqlContext: SQLContext) extends Loggi
     }
   }

-  private def toSQL(node: LogicalPlan, topNode: Boolean): String = {
-    if (topNode) {
-      node match {
-        case d: Distinct => toSQL(node)
-        case p: Project => toSQL(node)
-        case a: Aggregate => toSQL(node)
-        case s: Sort => toSQL(node)
-        case r: RepartitionByExpression => toSQL(node)
-        case _ =>
-          build(
-            "SELECT",
-            node.output.map(_.sql).mkString(", "),
-            "FROM",
-            toSQL(node)
-          )
-      }
-    } else {
-      toSQL(node)
-    }
-  }
-
   private def toSQL(node: LogicalPlan): String = node match {
     case Distinct(p: Project) => projectToSQL(p, isDistinct = true)

http://git-wip-us.apache.org/repos/asf/spark/blob/8ff88094/sql/hive/src/test/scala/org/apache/spark/sql/hive/LogicalPlanToSQLSuite.scala -- diff --git a/sql/hive/src/test/scala/org/apache/spark/sql/hive/LogicalPlanToSQLSuite.scala b/sql/hive/src/test/scala/org/apache/spark/sql/hive/LogicalPlanToSQLSuite.scala index f457d43..ed85856 100644
--- a/sql/hive/src/test/scala/org/apache/spark/sql/hive/LogicalPlanToSQLSuite.scala
+++ b/sql/hive/src/test/scala/org/apache/spark/sql/hive/LogicalPlanToSQLSuite.scala
@@ -19,8 +19,6 @@ package org.apache.spark.sql.hive

 import scala.util.control.NonFatal

-import org.apache.spark.sql.{DataFrame, SQLContext}
-import org.apache.spark.sql.catalyst.plans.logical._
 import org.apache.spark.sql.functions._
 import org.apache.spark.sql.test.SQLTestUtils

@@ -56,33 +54,6 @@ class LogicalPlanToSQLSuite extends SQLBuilderTest with SQLTestUtils {
     sql("DROP TABLE IF EXISTS t0")
   }

-  private def checkPlan(plan: LogicalPlan, sqlContext: SQLContext, expected: String): Unit = {
-    val convertedSQL = try new SQLBuilder(plan, sqlContext).toSQL catch {
-      case NonFatal(e) =>
-        fail(
-          s"""Cannot convert the following logical query plan back to SQL query string:
-             |
-             |# Original logical query plan:
-             |${plan.treeString}
-           """.stripMargin, e)
-    }
-
-    try {
-      checkAnswer(sql(convertedSQL), DataFrame(sqlContext, plan))
-    } catch { case cause: Throwable =>
-      fail(
-        s"""Failed to execute converted SQL string or got wrong answer:
-           |
-           |# Converted SQL query string:
-           |$convertedSQL
-           |
-           |# Original logical query plan:
-           |${plan.treeString}
-         """.stripMargin,
-        cause)
-    }
-  }
-
-  private def checkHiveQ
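The reverted `toSQL(node, topNode)` wrapper boils down to one string-building step: a top-level plan that is not already a projection-like node (Distinct, Project, Aggregate, Sort, RepartitionByExpression) gets wrapped in an explicit `SELECT` over its output columns. A minimal Java sketch of that wrapping step; class and method names are illustrative, not Spark API:

```java
import java.util.Arrays;
import java.util.List;

public class TopLevelSelect {
    // Mirrors build("SELECT", node.output.map(_.sql).mkString(", "), "FROM", toSQL(node)):
    // joins the plan's output columns and prepends an explicit projection
    // over the SQL generated for the child plan.
    static String wrapAsTopLevelSelect(List<String> outputColumns, String subquerySql) {
        return "SELECT " + String.join(", ", outputColumns) + " FROM " + subquerySql;
    }

    public static void main(String[] args) {
        System.out.println(
            wrapAsTopLevelSelect(Arrays.asList("`id`", "`name`"), "(SELECT * FROM t0) AS sq"));
    }
}
```

The revert removes exactly this special case, so every plan goes through the single `toSQL(node: LogicalPlan)` dispatch again.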
spark git commit: [SPARK-12720][SQL] SQL Generation Support for Cube, Rollup, and Grouping Sets
Repository: spark Updated Branches: refs/heads/master f19228eed -> adce5ee72 [SPARK-12720][SQL] SQL Generation Support for Cube, Rollup, and Grouping Sets ## What changes were proposed in this pull request? This PR adds SQL generation support for cube, rollup, and grouping sets. For example, a query using rollup:
```SQL
SELECT count(*) as cnt, key % 5, grouping_id() FROM t1 GROUP BY key % 5 WITH ROLLUP
```
Original logical plan:
```
Aggregate [(key#17L % cast(5 as bigint))#47L,grouping__id#46], [(count(1),mode=Complete,isDistinct=false) AS cnt#43L, (key#17L % cast(5 as bigint))#47L AS _c1#45L, grouping__id#46 AS _c2#44]
+- Expand [List(key#17L, value#18, (key#17L % cast(5 as bigint))#47L, 0), List(key#17L, value#18, null, 1)], [key#17L,value#18,(key#17L % cast(5 as bigint))#47L,grouping__id#46]
   +- Project [key#17L, value#18, (key#17L % cast(5 as bigint)) AS (key#17L % cast(5 as bigint))#47L]
      +- Subquery t1
         +- Relation[key#17L,value#18] ParquetRelation
```
Converted SQL:
```SQL
SELECT count(1) AS `cnt`, (`t1`.`key` % CAST(5 AS BIGINT)), grouping_id() AS `_c2`
FROM `default`.`t1`
GROUP BY (`t1`.`key` % CAST(5 AS BIGINT))
GROUPING SETS (((`t1`.`key` % CAST(5 AS BIGINT))), ())
```
## How was this patch tested? Added eight test cases in `LogicalPlanToSQLSuite`. Author: gatorsmile <gatorsm...@gmail.com> Author: xiaoli <lixiao1...@gmail.com> Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local> Closes #11283 from gatorsmile/groupingSetsToSQL.
Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/adce5ee7 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/adce5ee7 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/adce5ee7 Branch: refs/heads/master Commit: adce5ee721c6a844ff21dfcd8515859458fe611d Parents: f19228e Author: gatorsmile <gatorsm...@gmail.com> Authored: Sat Mar 5 19:25:03 2016 +0800 Committer: Cheng Lian <l...@databricks.com> Committed: Sat Mar 5 19:25:03 2016 +0800 -- python/pyspark/sql/functions.py | 14 +- .../sql/catalyst/expressions/grouping.scala | 1 + .../org/apache/spark/sql/hive/SQLBuilder.scala | 76 +- .../spark/sql/hive/LogicalPlanToSQLSuite.scala | 143 +++ 4 files changed, 226 insertions(+), 8 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/adce5ee7/python/pyspark/sql/functions.py -- diff --git a/python/pyspark/sql/functions.py b/python/pyspark/sql/functions.py index 92e724f..88924e2 100644 --- a/python/pyspark/sql/functions.py +++ b/python/pyspark/sql/functions.py @@ -348,13 +348,13 @@ def grouping_id(*cols): grouping columns). 
>>> df.cube("name").agg(grouping_id(), sum("age")).orderBy("name").show()
-+-----+------------+--------+
-| name|groupingid()|sum(age)|
-+-----+------------+--------+
-| null|           1|       7|
-|Alice|           0|       2|
-|  Bob|           0|       5|
-+-----+------------+--------+
++-----+-------------+--------+
+| name|grouping_id()|sum(age)|
++-----+-------------+--------+
+| null|            1|       7|
+|Alice|            0|       2|
+|  Bob|            0|       5|
++-----+-------------+--------+
 """
 sc = SparkContext._active_spark_context
 jc = sc._jvm.functions.grouping_id(_to_seq(sc, cols, _to_java_column))

http://git-wip-us.apache.org/repos/asf/spark/blob/adce5ee7/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/grouping.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/grouping.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/grouping.scala index a204060..437e417 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/grouping.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/grouping.scala
@@ -63,4 +63,5 @@ case class GroupingID(groupByExprs: Seq[Expression]) extends Expression with Une
   override def children: Seq[Expression] = groupByExprs
   override def dataType: DataType = IntegerType
   override def nullable: Boolean = false
+  override def prettyName: String = "grouping_id"
 }

http://git-wip-us.apache.org/repos/asf/spark/blob/adce5ee7/sql/hive/src/main/scala/org/apache/spark/sql/hive/SQLBuilder.scala -- diff --git a/sql/hive/src/main/scala/org/apache/spar
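The ROLLUP case in this commit relies on a standard equivalence: `WITH ROLLUP` over a key list is the same as `GROUPING SETS` of every prefix of that list, down to the empty grand-total set, which is why the generated SQL ends in `GROUPING SETS (((...)), ())`. A small sketch of that expansion; the helper is hypothetical, not part of Spark:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class RollupExpansion {
    // ROLLUP(k1, ..., kn) expands to the grouping sets
    // (k1, ..., kn), (k1, ..., kn-1), ..., (k1), () —
    // every prefix of the key list, from the full set down to the
    // empty (grand-total) set.
    static List<List<String>> rollupGroupingSets(List<String> keys) {
        List<List<String>> sets = new ArrayList<>();
        for (int n = keys.size(); n >= 0; n--) {
            sets.add(new ArrayList<>(keys.subList(0, n)));
        }
        return sets;
    }

    public static void main(String[] args) {
        // ROLLUP(a, b) -> [[a, b], [a], []]
        System.out.println(rollupGroupingSets(Arrays.asList("a", "b")));
    }
}
```

CUBE differs only in that it expands to all subsets of the key list rather than all prefixes.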
spark git commit: [SPARK-13540][SQL] Supports using nested classes within Scala objects as Dataset element type
Repository: spark Updated Branches: refs/heads/master ac5c63528 -> 916fc34f9 [SPARK-13540][SQL] Supports using nested classes within Scala objects as Dataset element type ## What changes were proposed in this pull request? Nested classes defined within Scala objects are translated into Java static nested classes. Unlike inner classes, they don't need outer scopes. But the analyzer still thinks that an outer scope is required. This PR fixes this issue simply by checking whether a nested class is static before looking up its outer scope. ## How was this patch tested? A test case is added to `DatasetSuite`. It checks contents of a Dataset whose element type is a nested class declared in a Scala object. Author: Cheng Lian <l...@databricks.com> Closes #11421 from liancheng/spark-13540-object-as-outer-scope. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/916fc34f Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/916fc34f Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/916fc34f Branch: refs/heads/master Commit: 916fc34f98dd731f607d9b3ed657bad6cc30df2c Parents: ac5c635 Author: Cheng Lian <l...@databricks.com> Authored: Tue Mar 1 01:07:45 2016 +0800 Committer: Cheng Lian <l...@databricks.com> Committed: Tue Mar 1 01:07:45 2016 +0800 -- .../org/apache/spark/sql/catalyst/analysis/Analyzer.scala | 10 +- .../test/scala/org/apache/spark/sql/DatasetSuite.scala| 10 ++ 2 files changed, 19 insertions(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/916fc34f/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala index 23e4709..876aa0e 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala +++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala @@ -17,6 +17,8 @@ package org.apache.spark.sql.catalyst.analysis +import java.lang.reflect.Modifier + import scala.collection.mutable.ArrayBuffer import org.apache.spark.sql.AnalysisException @@ -559,7 +561,13 @@ class Analyzer( } resolveExpression(unbound, LocalRelation(attributes), throws = true) transform { -case n: NewInstance if n.outerPointer.isEmpty && n.cls.isMemberClass => +case n: NewInstance + // If this is an inner class of another class, register the outer object in `OuterScopes`. + // Note that static inner classes (e.g., inner classes within Scala objects) don't need + // outer pointer registration. + if n.outerPointer.isEmpty && + n.cls.isMemberClass && + !Modifier.isStatic(n.cls.getModifiers) => val outer = OuterScopes.outerScopes.get(n.cls.getDeclaringClass.getName) if (outer == null) { throw new AnalysisException( http://git-wip-us.apache.org/repos/asf/spark/blob/916fc34f/sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala -- diff --git a/sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala index 14fc37b..33df637 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala @@ -621,12 +621,22 @@ class DatasetSuite extends QueryTest with SharedSQLContext { ds.filter(_ => true), Some(1), Some(2), Some(3)) } + + test("SPARK-13540 Dataset of nested class defined in Scala object") { +checkAnswer( + Seq(OuterObject.InnerClass("foo")).toDS(), + OuterObject.InnerClass("foo")) + } } class OuterClass extends Serializable { case class InnerClass(a: String) } +object OuterObject { + case class InnerClass(a: String) +} + case class ClassData(a: String, b: Int) case class ClassData2(c: String, d: Int) case class ClassNullableData(a: String, b: Integer) - To unsubscribe, e-mail: 
commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
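The fix above hinges on a single JVM reflection check: a class nested in a Scala object compiles to a *static* nested class, which is still a member class (`isMemberClass` is true) but carries no outer pointer, so `Modifier.isStatic` distinguishes it from a true inner class. A minimal Java illustration; the class names are made up for the demo:

```java
import java.lang.reflect.Modifier;

public class OuterScopeDemo {
    // Non-static member class: every instance carries a reference to an
    // enclosing OuterScopeDemo, so an outer scope must be registered.
    class Inner {}

    // Static nested class — what a class nested in a Scala object compiles
    // to. Still a member class, but no outer pointer is needed.
    static class Nested {}

    public static void main(String[] args) {
        System.out.println(Inner.class.isMemberClass());                    // true
        System.out.println(Nested.class.isMemberClass());                   // true
        System.out.println(Modifier.isStatic(Inner.class.getModifiers()));  // false
        System.out.println(Modifier.isStatic(Nested.class.getModifiers())); // true
    }
}
```

This is exactly why the analyzer's guard adds `!Modifier.isStatic(n.cls.getModifiers)` on top of the existing `n.cls.isMemberClass` check.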
spark git commit: [SPARK-13457][SQL] Removes DataFrame RDD operations
Repository: spark Updated Branches: refs/heads/master 5c3912e5c -> 99dfcedbf [SPARK-13457][SQL] Removes DataFrame RDD operations ## What changes were proposed in this pull request? This is another try of PR #11323. This PR removes DataFrame RDD operations except for `foreach` and `foreachPartitions` (they are actions rather than transformations). Original calls are now replaced by calls to methods of `DataFrame.rdd`. PR #11323 was reverted because it introduced a regression: both `DataFrame.foreach` and `DataFrame.foreachPartitions` wrap underlying RDD operations with `withNewExecutionId` to track Spark jobs. But they are removed in #11323. ## How was this patch tested? No extra tests are added. Existing tests should do the work. Author: Cheng Lian <l...@databricks.com> Closes #11388 from liancheng/remove-df-rdd-ops. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/99dfcedb Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/99dfcedb Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/99dfcedb Branch: refs/heads/master Commit: 99dfcedbfd4c83c7b6a343456f03e8c6e29968c5 Parents: 5c3912e Author: Cheng Lian <l...@databricks.com> Authored: Sat Feb 27 00:28:30 2016 +0800 Committer: Cheng Lian <l...@databricks.com> Committed: Sat Feb 27 00:28:30 2016 +0800 -- .../spark/examples/ml/DataFrameExample.scala| 2 +- .../spark/examples/ml/DecisionTreeExample.scala | 8 +++ .../spark/examples/ml/OneVsRestExample.scala| 2 +- .../spark/examples/mllib/LDAExample.scala | 1 + .../apache/spark/examples/sql/RDDRelation.scala | 2 +- .../spark/examples/sql/hive/HiveFromSpark.scala | 2 +- .../scala/org/apache/spark/ml/Predictor.scala | 6 +++-- .../ml/classification/LogisticRegression.scala | 13 +++ .../spark/ml/clustering/BisectingKMeans.scala | 4 ++-- .../org/apache/spark/ml/clustering/KMeans.scala | 6 ++--- .../org/apache/spark/ml/clustering/LDA.scala| 1 + .../BinaryClassificationEvaluator.scala | 9
.../MulticlassClassificationEvaluator.scala | 6 ++--- .../ml/evaluation/RegressionEvaluator.scala | 3 ++- .../apache/spark/ml/feature/ChiSqSelector.scala | 2 +- .../spark/ml/feature/CountVectorizer.scala | 2 +- .../scala/org/apache/spark/ml/feature/IDF.scala | 2 +- .../apache/spark/ml/feature/MaxAbsScaler.scala | 2 +- .../apache/spark/ml/feature/MinMaxScaler.scala | 2 +- .../apache/spark/ml/feature/OneHotEncoder.scala | 2 +- .../scala/org/apache/spark/ml/feature/PCA.scala | 2 +- .../spark/ml/feature/StandardScaler.scala | 2 +- .../apache/spark/ml/feature/StringIndexer.scala | 1 + .../apache/spark/ml/feature/VectorIndexer.scala | 2 +- .../org/apache/spark/ml/feature/Word2Vec.scala | 2 +- .../apache/spark/ml/recommendation/ALS.scala| 1 + .../ml/regression/AFTSurvivalRegression.scala | 2 +- .../ml/regression/IsotonicRegression.scala | 6 ++--- .../spark/ml/regression/LinearRegression.scala | 16 - .../spark/mllib/api/python/PythonMLLibAPI.scala | 8 +++ .../spark/mllib/clustering/KMeansModel.scala| 2 +- .../spark/mllib/clustering/LDAModel.scala | 4 ++-- .../clustering/PowerIterationClustering.scala | 2 +- .../BinaryClassificationMetrics.scala | 2 +- .../mllib/evaluation/MulticlassMetrics.scala| 2 +- .../mllib/evaluation/MultilabelMetrics.scala| 4 +++- .../mllib/evaluation/RegressionMetrics.scala| 2 +- .../spark/mllib/feature/ChiSqSelector.scala | 2 +- .../org/apache/spark/mllib/fpm/FPGrowth.scala | 2 +- .../MatrixFactorizationModel.scala | 12 +- .../mllib/tree/model/DecisionTreeModel.scala| 2 +- .../mllib/tree/model/treeEnsembleModels.scala | 2 +- .../LogisticRegressionSuite.scala | 2 +- .../MultilayerPerceptronClassifierSuite.scala | 5 ++-- .../ml/classification/OneVsRestSuite.scala | 6 ++--- .../ml/clustering/BisectingKMeansSuite.scala| 3 ++- .../spark/ml/clustering/KMeansSuite.scala | 3 ++- .../apache/spark/ml/clustering/LDASuite.scala | 2 +- .../spark/ml/feature/OneHotEncoderSuite.scala | 4 ++-- .../spark/ml/feature/StringIndexerSuite.scala | 6 ++--- 
.../spark/ml/feature/VectorIndexerSuite.scala | 5 ++-- .../apache/spark/ml/feature/Word2VecSuite.scala | 8 +++ .../spark/ml/recommendation/ALSSuite.scala | 7 +++--- .../spark/ml/regression/GBTRegressorSuite.scala | 2 +- .../ml/regression/IsotonicRegressionSuite.scala | 6 ++--- .../ml/regression/LinearRegressionSuite.scala | 17 +++--- .../scala/org/apache/spark/sql/DataFrame.scala | 24 .../org/apache/spark/sql/GroupedData.scala | 1 + .../org/apache/spark/sql/api/r/SQLUtils.scala | 2 +- .../spark/sql/DataFrameAggregateSuite.scala
spark git commit: [SPARK-13457][SQL] Removes DataFrame RDD operations
Repository: spark Updated Branches: refs/heads/master 4460113d4 -> 157fe64f3 [SPARK-13457][SQL] Removes DataFrame RDD operations ## What changes were proposed in this pull request? This PR removes DataFrame RDD operations. Original calls are now replaced by calls to methods of `DataFrame.rdd`. ## How was this patch tested? No extra tests are added. Existing tests should do the work. Author: Cheng Lian <l...@databricks.com> Closes #11323 from liancheng/remove-df-rdd-ops. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/157fe64f Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/157fe64f Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/157fe64f Branch: refs/heads/master Commit: 157fe64f3ecbd13b7286560286e50235eecfe30e Parents: 4460113 Author: Cheng Lian <l...@databricks.com> Authored: Thu Feb 25 23:07:59 2016 +0800 Committer: Cheng Lian <l...@databricks.com> Committed: Thu Feb 25 23:07:59 2016 +0800 -- .../spark/examples/ml/DataFrameExample.scala| 2 +- .../spark/examples/ml/DecisionTreeExample.scala | 8 ++-- .../spark/examples/ml/OneVsRestExample.scala| 2 +- .../spark/examples/mllib/LDAExample.scala | 1 + .../apache/spark/examples/sql/RDDRelation.scala | 2 +- .../spark/examples/sql/hive/HiveFromSpark.scala | 2 +- .../scala/org/apache/spark/ml/Predictor.scala | 6 ++- .../ml/classification/LogisticRegression.scala | 13 +++--- .../spark/ml/clustering/BisectingKMeans.scala | 4 +- .../org/apache/spark/ml/clustering/KMeans.scala | 6 +-- .../org/apache/spark/ml/clustering/LDA.scala| 1 + .../BinaryClassificationEvaluator.scala | 9 ++--- .../MulticlassClassificationEvaluator.scala | 6 +-- .../ml/evaluation/RegressionEvaluator.scala | 3 +- .../apache/spark/ml/feature/ChiSqSelector.scala | 2 +- .../spark/ml/feature/CountVectorizer.scala | 2 +- .../scala/org/apache/spark/ml/feature/IDF.scala | 2 +- .../apache/spark/ml/feature/MinMaxScaler.scala | 2 +-
.../apache/spark/ml/feature/OneHotEncoder.scala | 2 +- .../scala/org/apache/spark/ml/feature/PCA.scala | 2 +- .../spark/ml/feature/StandardScaler.scala | 2 +- .../apache/spark/ml/feature/StringIndexer.scala | 1 + .../apache/spark/ml/feature/VectorIndexer.scala | 2 +- .../org/apache/spark/ml/feature/Word2Vec.scala | 2 +- .../apache/spark/ml/recommendation/ALS.scala| 1 + .../ml/regression/AFTSurvivalRegression.scala | 2 +- .../ml/regression/IsotonicRegression.scala | 6 +-- .../spark/ml/regression/LinearRegression.scala | 16 +--- .../spark/mllib/api/python/PythonMLLibAPI.scala | 8 ++-- .../spark/mllib/clustering/KMeansModel.scala| 2 +- .../spark/mllib/clustering/LDAModel.scala | 4 +- .../clustering/PowerIterationClustering.scala | 2 +- .../BinaryClassificationMetrics.scala | 2 +- .../mllib/evaluation/MulticlassMetrics.scala| 2 +- .../mllib/evaluation/MultilabelMetrics.scala| 4 +- .../mllib/evaluation/RegressionMetrics.scala| 2 +- .../spark/mllib/feature/ChiSqSelector.scala | 2 +- .../org/apache/spark/mllib/fpm/FPGrowth.scala | 2 +- .../MatrixFactorizationModel.scala | 12 +++--- .../mllib/tree/model/DecisionTreeModel.scala| 2 +- .../mllib/tree/model/treeEnsembleModels.scala | 2 +- .../LogisticRegressionSuite.scala | 2 +- .../MultilayerPerceptronClassifierSuite.scala | 5 ++- .../ml/classification/OneVsRestSuite.scala | 6 +-- .../ml/clustering/BisectingKMeansSuite.scala| 3 +- .../spark/ml/clustering/KMeansSuite.scala | 3 +- .../apache/spark/ml/clustering/LDASuite.scala | 2 +- .../spark/ml/feature/OneHotEncoderSuite.scala | 4 +- .../spark/ml/feature/StringIndexerSuite.scala | 6 +-- .../spark/ml/feature/VectorIndexerSuite.scala | 5 ++- .../apache/spark/ml/feature/Word2VecSuite.scala | 8 ++-- .../spark/ml/recommendation/ALSSuite.scala | 7 ++-- .../spark/ml/regression/GBTRegressorSuite.scala | 2 +- .../ml/regression/IsotonicRegressionSuite.scala | 6 +-- .../ml/regression/LinearRegressionSuite.scala | 17 .../scala/org/apache/spark/sql/DataFrame.scala | 42 
.../org/apache/spark/sql/GroupedData.scala | 1 + .../org/apache/spark/sql/api/r/SQLUtils.scala | 2 +- .../execution/datasources/jdbc/JdbcUtils.scala | 2 +- .../spark/sql/DataFrameAggregateSuite.scala | 2 +- .../parquet/ParquetFilterSuite.scala| 2 +- .../datasources/parquet/ParquetIOSuite.scala| 2 +- .../sql/execution/ui/SQLListenerSuite.scala | 4 +- .../spark/sql/hive/HiveSparkSubmitSuite.scala | 2 +- .../sql/hive/execution/HiveQuerySuite.scala | 2 + .../spark/sql/hive/orc/OrcQuerySuite.scala | 5 ++- .../apache/spark/sql/hive/parquetSuites.scala | 2 +- 67 files changed, 1
[3/3] spark git commit: [SPARK-13486][SQL] Move SQLConf into an internal package
[SPARK-13486][SQL] Move SQLConf into an internal package ## What changes were proposed in this pull request? This patch moves SQLConf into org.apache.spark.sql.internal package to make it very explicit that it is internal. Soon I will also submit more API work that creates implementations of interfaces in this internal package. ## How was this patch tested? If it compiles, then the refactoring should work. Author: Reynold Xin <r...@databricks.com> Closes #11363 from rxin/SPARK-13486. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/2b2c8c33 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/2b2c8c33 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/2b2c8c33 Branch: refs/heads/master Commit: 2b2c8c33236677c916541f956f7b94bba014a9ce Parents: 07f92ef Author: Reynold Xin <r...@databricks.com> Authored: Thu Feb 25 17:49:50 2016 +0800 Committer: Cheng Lian <l...@databricks.com> Committed: Thu Feb 25 17:49:50 2016 +0800 -- project/MimaExcludes.scala | 6 + .../scala/org/apache/spark/sql/DataFrame.scala | 1 + .../org/apache/spark/sql/GroupedData.scala | 1 + .../scala/org/apache/spark/sql/SQLConf.scala| 730 --- .../scala/org/apache/spark/sql/SQLContext.scala | 3 +- .../spark/sql/execution/ExistingRDD.scala | 3 +- .../spark/sql/execution/SparkStrategies.scala | 3 +- .../apache/spark/sql/execution/commands.scala | 3 +- .../InsertIntoHadoopFsRelation.scala| 1 + .../execution/datasources/SqlNewHadoopRDD.scala | 3 +- .../execution/datasources/WriterContainer.scala | 1 + .../parquet/CatalystSchemaConverter.scala | 4 +- .../parquet/CatalystWriteSupport.scala | 2 +- .../datasources/parquet/ParquetRelation.scala | 1 + .../spark/sql/execution/debug/package.scala | 1 + .../execution/local/BinaryHashJoinNode.scala| 2 +- .../execution/local/BroadcastHashJoinNode.scala | 2 +- .../sql/execution/local/ConvertToSafeNode.scala | 2 +- .../execution/local/ConvertToUnsafeNode.scala | 2 +- 
.../spark/sql/execution/local/ExpandNode.scala | 2 +- .../spark/sql/execution/local/FilterNode.scala | 2 +- .../sql/execution/local/IntersectNode.scala | 4 +- .../spark/sql/execution/local/LimitNode.scala | 2 +- .../spark/sql/execution/local/LocalNode.scala | 3 +- .../execution/local/NestedLoopJoinNode.scala| 2 +- .../spark/sql/execution/local/ProjectNode.scala | 2 +- .../spark/sql/execution/local/SampleNode.scala | 2 +- .../spark/sql/execution/local/SeqScanNode.scala | 2 +- .../local/TakeOrderedAndProjectNode.scala | 2 +- .../spark/sql/execution/local/UnionNode.scala | 2 +- .../org/apache/spark/sql/internal/SQLConf.scala | 730 +++ .../apache/spark/sql/internal/package-info.java | 22 + .../org/apache/spark/sql/internal/package.scala | 24 + .../spark/sql/DataFrameAggregateSuite.scala | 1 + .../apache/spark/sql/DataFramePivotSuite.scala | 1 + .../org/apache/spark/sql/DataFrameSuite.scala | 1 + .../scala/org/apache/spark/sql/JoinSuite.scala | 1 + .../spark/sql/MultiSQLContextsSuite.scala | 1 + .../scala/org/apache/spark/sql/QueryTest.scala | 1 + .../apache/spark/sql/SQLConfEntrySuite.scala| 150 .../org/apache/spark/sql/SQLConfSuite.scala | 132 .../org/apache/spark/sql/SQLContextSuite.scala | 1 + .../org/apache/spark/sql/SQLQuerySuite.scala| 1 + .../execution/ExchangeCoordinatorSuite.scala| 1 + .../spark/sql/execution/PlannerSuite.scala | 3 +- .../columnar/PartitionBatchPruningSuite.scala | 1 + .../execution/datasources/json/JsonSuite.scala | 1 + .../parquet/ParquetFilterSuite.scala| 1 + .../datasources/parquet/ParquetIOSuite.scala| 1 + .../ParquetPartitionDiscoverySuite.scala| 1 + .../datasources/parquet/ParquetQuerySuite.scala | 1 + .../parquet/ParquetReadBenchmark.scala | 3 +- .../datasources/parquet/ParquetTest.scala | 3 +- .../sql/execution/joins/InnerJoinSuite.scala| 3 +- .../sql/execution/joins/OuterJoinSuite.scala| 3 +- .../sql/execution/joins/SemiJoinSuite.scala | 3 +- .../spark/sql/execution/local/DummyNode.scala | 2 +- 
.../sql/execution/local/HashJoinNodeSuite.scala | 2 +- .../sql/execution/local/LocalNodeTest.scala | 2 +- .../local/NestedLoopJoinNodeSuite.scala | 2 +- .../sql/execution/metric/SQLMetricsSuite.scala | 1 + .../spark/sql/internal/SQLConfEntrySuite.scala | 150 .../spark/sql/internal/SQLConfSuite.scala | 133 .../spark/sql/sources/DataSourceTest.scala | 1 + .../spark/sql/sources/FilteredScanSuite.scala | 1 + .../spark/sql/sources/PrunedScanSuite.scala | 1 + .../spark/sql/sources/SaveLoadSuite
[1/3] spark git commit: [SPARK-13486][SQL] Move SQLConf into an internal package
Repository: spark Updated Branches: refs/heads/master 07f92ef1f -> 2b2c8c332 http://git-wip-us.apache.org/repos/asf/spark/blob/2b2c8c33/sql/core/src/test/scala/org/apache/spark/sql/test/TestSQLContext.scala -- diff --git a/sql/core/src/test/scala/org/apache/spark/sql/test/TestSQLContext.scala b/sql/core/src/test/scala/org/apache/spark/sql/test/TestSQLContext.scala index c89a151..28ad7ae 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/test/TestSQLContext.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/test/TestSQLContext.scala @@ -18,8 +18,8 @@ package org.apache.spark.sql.test import org.apache.spark.{SparkConf, SparkContext} -import org.apache.spark.sql.{SQLConf, SQLContext} - +import org.apache.spark.sql.SQLContext +import org.apache.spark.sql.internal.SQLConf /** * A special [[SQLContext]] prepared for testing. @@ -39,7 +39,7 @@ private[sql] class TestSQLContext(sc: SparkContext) extends SQLContext(sc) { sel super.clear() // Make sure we start with the default test configs even after clear - TestSQLContext.overrideConfs.map { + TestSQLContext.overrideConfs.foreach { case (key, value) => setConfString(key, value) } } http://git-wip-us.apache.org/repos/asf/spark/blob/2b2c8c33/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/HiveThriftServer2.scala -- diff --git a/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/HiveThriftServer2.scala b/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/HiveThriftServer2.scala index 66eaa3e..f32ba5f 100644 --- a/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/HiveThriftServer2.scala +++ b/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/HiveThriftServer2.scala @@ -32,10 +32,10 @@ import org.apache.hive.service.server.{HiveServer2, HiveServerServerOptionsProce import org.apache.spark.{Logging, SparkContext} import org.apache.spark.annotation.DeveloperApi import 
org.apache.spark.scheduler.{SparkListener, SparkListenerApplicationEnd, SparkListenerJobStart} -import org.apache.spark.sql.SQLConf import org.apache.spark.sql.hive.HiveContext import org.apache.spark.sql.hive.thriftserver.ReflectionUtils._ import org.apache.spark.sql.hive.thriftserver.ui.ThriftServerTab +import org.apache.spark.sql.internal.SQLConf import org.apache.spark.util.{ShutdownHookManager, Utils} /** http://git-wip-us.apache.org/repos/asf/spark/blob/2b2c8c33/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkExecuteStatementOperation.scala -- diff --git a/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkExecuteStatementOperation.scala b/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkExecuteStatementOperation.scala index 8fef22c..458d4f2 100644 --- a/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkExecuteStatementOperation.scala +++ b/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkExecuteStatementOperation.scala @@ -33,9 +33,10 @@ import org.apache.hive.service.cli.operation.ExecuteStatementOperation import org.apache.hive.service.cli.session.HiveSession import org.apache.spark.Logging -import org.apache.spark.sql.{DataFrame, Row => SparkRow, SQLConf} +import org.apache.spark.sql.{DataFrame, Row => SparkRow} import org.apache.spark.sql.execution.SetCommand import org.apache.spark.sql.hive.{HiveContext, HiveMetastoreTypes} +import org.apache.spark.sql.internal.SQLConf import org.apache.spark.sql.types._ import org.apache.spark.util.{Utils => SparkUtils} http://git-wip-us.apache.org/repos/asf/spark/blob/2b2c8c33/sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/HiveThriftServer2Suites.scala -- diff --git a/sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/HiveThriftServer2Suites.scala 
b/sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/HiveThriftServer2Suites.scala index 5f9952a..c05527b 100644 --- a/sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/HiveThriftServer2Suites.scala +++ b/sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/HiveThriftServer2Suites.scala @@ -202,7 +202,7 @@ class HiveThriftBinaryServerSuite extends HiveThriftJdbcTest { } test("test multiple session") { -import org.apache.spark.sql.SQLConf +import org.apache.spark.sql.internal.SQLConf var defaultV1: String = null var defaultV2: String = null var data: ArrayBuffer[Int] = null
spark git commit: [SPARK-13263][SQL] SQL Generation Support for Tablesample
Repository: spark Updated Branches: refs/heads/master 5cd3e6f60 -> 87250580f [SPARK-13263][SQL] SQL Generation Support for Tablesample In the parser, the tableSample clause is part of tableSource.
```
tableSource
init { gParent.pushMsg("table source", state); }
after { gParent.popMsg(state); }
    : tabname=tableName
    ((tableProperties) => props=tableProperties)?
    ((tableSample) => ts=tableSample)?
    ((KW_AS) => (KW_AS alias=Identifier) | (Identifier) => (alias=Identifier))?
    -> ^(TOK_TABREF $tabname $props? $ts? $alias?)
    ;
```
Two typical query samples using TABLESAMPLE are:
```
"SELECT s.id FROM t0 TABLESAMPLE(10 PERCENT) s"
"SELECT * FROM t0 TABLESAMPLE(0.1 PERCENT)"
```
FYI, the logical plan of a TABLESAMPLE query:
```
sql("SELECT * FROM t0 TABLESAMPLE(0.1 PERCENT)").explain(true)

== Analyzed Logical Plan ==
id: bigint
Project [id#16L]
+- Sample 0.0, 0.001, false, 381
   +- Subquery t0
      +- Relation[id#16L] ParquetRelation
```
Thanks! cc liancheng Author: gatorsmile <gatorsm...@gmail.com> Author: xiaoli <lixiao1...@gmail.com> Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local> This patch had conflicts when merged, resolved by Committer: Cheng Lian <l...@databricks.com> Closes #11148 from gatorsmile/tablesplitsample.
Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/87250580 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/87250580 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/87250580 Branch: refs/heads/master Commit: 87250580f214cb7c4dff01c5a3498ea6cb79a27e Parents: 5cd3e6f Author: gatorsmile <gatorsm...@gmail.com> Authored: Tue Feb 23 16:13:09 2016 +0800 Committer: Cheng Lian <l...@databricks.com> Committed: Tue Feb 23 16:13:09 2016 +0800 -- .../apache/spark/sql/catalyst/CatalystQl.scala | 6 +- .../sql/catalyst/optimizer/Optimizer.scala | 2 +- .../catalyst/plans/logical/basicOperators.scala | 5 +- .../optimizer/FilterPushdownSuite.scala | 4 +- .../scala/org/apache/spark/sql/DataFrame.scala | 4 +- .../scala/org/apache/spark/sql/Dataset.scala| 2 +- .../org/apache/spark/sql/hive/SQLBuilder.scala | 18 .../spark/sql/hive/LogicalPlanToSQLSuite.scala | 98 ++-- 8 files changed, 103 insertions(+), 36 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/87250580/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/CatalystQl.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/CatalystQl.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/CatalystQl.scala index 069c665..a0a56d7 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/CatalystQl.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/CatalystQl.scala @@ -499,12 +499,14 @@ https://cwiki.apache.org/confluence/display/Hive/Enhanced+Aggregation%2C+Cube%2C s"Sampling fraction ($fraction) must be on interval [0, 100]") Sample(0.0, fraction.toDouble / 100, withReplacement = false, (math.random * 1000).toInt, - relation) + relation)( + isTableSample = true) case Token("TOK_TABLEBUCKETSAMPLE", Token(numerator, Nil) :: Token(denominator, Nil) :: Nil) => val fraction = numerator.toDouble / denominator.toDouble -Sample(0.0, fraction, withReplacement = false, 
(math.random * 1000).toInt, relation) +Sample(0.0, fraction, withReplacement = false, (math.random * 1000).toInt, relation)( + isTableSample = true) case a => noParseRule("Sampling", a) }.getOrElse(relation) http://git-wip-us.apache.org/repos/asf/spark/blob/87250580/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala index 1554382..1f05f20 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala @@ -123,7 +123,7 @@ object SamplePushDown extends Rule[LogicalPlan] { // Push down projection into sample case Project(projectList, s @ Sample(lb, up, replace, seed, child)) => Sample(lb, up, replace, seed, -Project(projectList, child)) +Project(projectL
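The hunk above passes the parsed TABLESAMPLE percentage to `Sample` as a fraction on [0, 1], after the range check quoted in the error message. A minimal sketch of that conversion in plain Java (class and method names are illustrative, not Spark's actual code):

```java
public class TableSampleFraction {
    // Converts TABLESAMPLE(p PERCENT) into the [0, 1] fraction Sample expects.
    public static double percentToFraction(double percent) {
        if (percent < 0.0 || percent > 100.0) {
            throw new IllegalArgumentException(
                "Sampling fraction (" + percent + ") must be on interval [0, 100]");
        }
        return percent / 100.0;
    }

    public static void main(String[] args) {
        System.out.println(percentToFraction(10.0)); // 0.1
        System.out.println(percentToFraction(50.0)); // 0.5
    }
}
```

For `TABLESAMPLE(0.1 PERCENT)` this yields the `0.001` fraction visible in the analyzed plan quoted above.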
[1/2] spark git commit: [SPARK-12799] Simplify various string output for expressions
Repository: spark Updated Branches: refs/heads/master d806ed343 -> d9efe63ec http://git-wip-us.apache.org/repos/asf/spark/blob/d9efe63e/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/CatalystSchemaConverter.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/CatalystSchemaConverter.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/CatalystSchemaConverter.scala index 1c0d53f..54dda0c 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/CatalystSchemaConverter.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/CatalystSchemaConverter.scala @@ -544,7 +544,7 @@ private[parquet] object CatalystSchemaConverter { !name.matches(".*[ ,;{}()\n\t=].*"), s"""Attribute name "$name" contains invalid character(s) among " ,;{}()\\n\\t=". |Please use alias to rename it. - """.stripMargin.split("\n").mkString(" ")) + """.stripMargin.split("\n").mkString(" ").trim) } def checkFieldNames(schema: StructType): StructType = { http://git-wip-us.apache.org/repos/asf/spark/blob/d9efe63e/sql/core/src/main/scala/org/apache/spark/sql/execution/python/PythonUDF.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/python/PythonUDF.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/python/PythonUDF.scala index 0e53a0c..9aff0be 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/python/PythonUDF.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/python/PythonUDF.scala @@ -20,7 +20,7 @@ package org.apache.spark.sql.execution.python import org.apache.spark.{Accumulator, Logging} import org.apache.spark.api.python.PythonBroadcast import org.apache.spark.broadcast.Broadcast -import org.apache.spark.sql.catalyst.expressions.{Expression, Unevaluable} +import org.apache.spark.sql.catalyst.expressions.{Expression, NonSQLExpression, Unevaluable} import 
org.apache.spark.sql.types.DataType /** @@ -36,9 +36,12 @@ case class PythonUDF( broadcastVars: java.util.List[Broadcast[PythonBroadcast]], accumulator: Accumulator[java.util.List[Array[Byte]]], dataType: DataType, -children: Seq[Expression]) extends Expression with Unevaluable with Logging { +children: Seq[Expression]) + extends Expression with Unevaluable with NonSQLExpression with Logging { - override def toString: String = s"PythonUDF#$name(${children.mkString(",")})" + override def toString: String = s"PythonUDF#$name(${children.mkString(", ")})" override def nullable: Boolean = true + + override def sql: String = s"$name(${children.mkString(", ")})" } http://git-wip-us.apache.org/repos/asf/spark/blob/d9efe63e/sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala -- diff --git a/sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala index f9ba607..498f007 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala @@ -525,7 +525,7 @@ class DatasetSuite extends QueryTest with SharedSQLContext { val e = intercept[AnalysisException] { ds.as[ClassData2] } -assert(e.getMessage.contains("cannot resolve 'c' given input columns: [a, b]"), e.getMessage) +assert(e.getMessage.contains("cannot resolve '`c`' given input columns: [a, b]"), e.getMessage) } test("runtime nullability check") { http://git-wip-us.apache.org/repos/asf/spark/blob/d9efe63e/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetIOSuite.scala -- diff --git a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetIOSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetIOSuite.scala index bd87449..41a9404 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetIOSuite.scala +++ 
b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetIOSuite.scala @@ -455,7 +455,7 @@ class ParquetIOSuite extends QueryTest with ParquetTest with SharedSQLContext { sqlContext.udf.register("div0", (x: Int) => x / 0) withTempPath { dir => intercept[org.apache.spark.SparkException] { - sqlContext.sql("select div0(1)").write.parquet(dir.getCanonicalPath) + sqlContext.sql("select div0(1) as div0").write.parquet(dir.getCanonicalPath) } val path = new Path(dir.getCanonicalPath, "_temporary") val fs =
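The `CatalystSchemaConverter` hunk in this commit only trims the assembled error message; the underlying check rejects attribute names containing characters Parquet cannot accept. The check, restated in plain Java using the same regex from the diff (the class name is illustrative):

```java
public class ParquetFieldNameCheck {
    // Mirrors the regex from CatalystSchemaConverter: names containing any of
    // " ,;{}()\n\t=" are invalid and must be renamed via an alias first.
    public static boolean isValidName(String name) {
        return !name.matches(".*[ ,;{}()\n\t=].*");
    }

    public static void main(String[] args) {
        System.out.println(isValidName("div0"));     // true
        System.out.println(isValidName("div0(1)"));  // false
    }
}
```

This is presumably why the `ParquetIOSuite` hunk above rewrites `select div0(1)` as `select div0(1) as div0` before writing to Parquet: the unaliased column name contains parentheses.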
spark git commit: [SPARK-13205][SQL] SQL Generation Support for Self Join
Repository: spark Updated Branches: refs/heads/master 663cc400f -> 0f09f0226 [SPARK-13205][SQL] SQL Generation Support for Self Join This PR addresses two issues: - Self join does not work in SQL Generation - When creating new instances for `LogicalRelation`, `metastoreTableIdentifier` is lost. liancheng Could you please review the code changes? Thank you! Author: gatorsmile <gatorsm...@gmail.com> Closes #11084 from gatorsmile/selfJoinInSQLGen. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/0f09f022 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/0f09f022 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/0f09f022 Branch: refs/heads/master Commit: 0f09f0226983cdc409ef504dff48395787dc844f Parents: 663cc40 Author: gatorsmile <gatorsm...@gmail.com> Authored: Thu Feb 11 11:08:21 2016 +0800 Committer: Cheng Lian <l...@databricks.com> Committed: Thu Feb 11 11:08:21 2016 +0800 -- .../spark/sql/execution/datasources/LogicalRelation.scala | 6 +- .../main/scala/org/apache/spark/sql/hive/SQLBuilder.scala | 10 +- .../org/apache/spark/sql/hive/LogicalPlanToSQLSuite.scala | 8 3 files changed, 22 insertions(+), 2 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/0f09f022/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/LogicalRelation.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/LogicalRelation.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/LogicalRelation.scala index fa97f3d..0e0748f 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/LogicalRelation.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/LogicalRelation.scala @@ -76,7 +76,11 @@ case class LogicalRelation( /** Used to lookup original attribute capitalization */ val attributeMap: AttributeMap[AttributeReference] = AttributeMap(output.map(o => (o, o))) - def newInstance(): 
this.type = LogicalRelation(relation).asInstanceOf[this.type] + def newInstance(): this.type = +LogicalRelation( + relation, + expectedOutputAttributes, + metastoreTableIdentifier).asInstanceOf[this.type] override def simpleString: String = s"Relation[${output.mkString(",")}] $relation" } http://git-wip-us.apache.org/repos/asf/spark/blob/0f09f022/sql/hive/src/main/scala/org/apache/spark/sql/hive/SQLBuilder.scala -- diff --git a/sql/hive/src/main/scala/org/apache/spark/sql/hive/SQLBuilder.scala b/sql/hive/src/main/scala/org/apache/spark/sql/hive/SQLBuilder.scala index fc5725d..4b75e60 100644 --- a/sql/hive/src/main/scala/org/apache/spark/sql/hive/SQLBuilder.scala +++ b/sql/hive/src/main/scala/org/apache/spark/sql/hive/SQLBuilder.scala @@ -142,7 +142,15 @@ class SQLBuilder(logicalPlan: LogicalPlan, sqlContext: SQLContext) extends Loggi Some(s"`$database`.`$table`") case Subquery(alias, child) => - toSQL(child).map(childSQL => s"($childSQL) AS $alias") + toSQL(child).map( childSQL => +child match { + // Parentheses is not used for persisted data source relations + // e.g., select x.c1 from (t1) as x inner join (t1) as y on x.c1 = y.c1 + case Subquery(_, _: LogicalRelation | _: MetastoreRelation) => +s"$childSQL AS $alias" + case _ => +s"($childSQL) AS $alias" +}) case Join(left, right, joinType, condition) => for { http://git-wip-us.apache.org/repos/asf/spark/blob/0f09f022/sql/hive/src/test/scala/org/apache/spark/sql/hive/LogicalPlanToSQLSuite.scala -- diff --git a/sql/hive/src/test/scala/org/apache/spark/sql/hive/LogicalPlanToSQLSuite.scala b/sql/hive/src/test/scala/org/apache/spark/sql/hive/LogicalPlanToSQLSuite.scala index 129bfe0..80ae312 100644 --- a/sql/hive/src/test/scala/org/apache/spark/sql/hive/LogicalPlanToSQLSuite.scala +++ b/sql/hive/src/test/scala/org/apache/spark/sql/hive/LogicalPlanToSQLSuite.scala @@ -104,6 +104,14 @@ class LogicalPlanToSQLSuite extends SQLBuilderTest with SQLTestUtils { checkHiveQl("SELECT id FROM t0 UNION ALL SELECT CAST(id AS INT) 
AS id FROM t0") } + test("self join") { +checkHiveQl("SELECT x.key FROM t1 x JOIN t1 y ON x.key = y.key") + } + + test("self join with group by") { +checkHiveQl("SELECT x.key, COUNT(*) FROM t1 x JOIN t1 y ON x.key = y.key
spark git commit: [SPARK-12231][SQL] create a combineFilters' projection when we call buildPartitionedTableScan
Repository: spark Updated Branches: refs/heads/branch-1.6 bb01cbe9b -> ddb963304 [SPARK-12231][SQL] create a combineFilters' projection when we call buildPartitionedTableScan Hello Michael & All: We had some issues submitting the new code in the other PR (#10299), so we closed that PR and opened this one with the fix. The reason for the previous failure is that the projection for the scan, when there is a filter that is not pushed down (the "left-over" filter), could differ from the original projection in elements or ordering. With this new code, the approach to solving the problem is: insert a new Project if the "left-over" filter is nonempty and the original projection is non-empty (since the projection for the scan could otherwise differ from it in elements or ordering). We created 3 test cases to cover the previously failing cases. Author: Kevin Yu <q...@us.ibm.com> Closes #10388 from kevinyu98/spark-12231. (cherry picked from commit fd50df413fbb3b7528cdff311cc040a6212340b9) Signed-off-by: Cheng Lian <l...@databricks.com> Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/ddb96330 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/ddb96330 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/ddb96330 Branch: refs/heads/branch-1.6 Commit: ddb9633043e82fb2a34c7e0e29b487f635c3c744 Parents: bb01cbe Author: Kevin Yu <q...@us.ibm.com> Authored: Mon Dec 28 11:58:33 2015 -0800 Committer: Cheng Lian <l...@databricks.com> Committed: Sun Jan 31 16:16:34 2016 -0800 -- .../datasources/DataSourceStrategy.scala| 28 ++--- .../parquet/ParquetFilterSuite.scala| 41 2 files changed, 64 insertions(+), 5 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/ddb96330/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala index 8a15a51..3741a9c 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala @@ -77,7 +77,8 @@ private[sql] object DataSourceStrategy extends Strategy with Logging { val pushedFilters = filters.filter(_.references.intersect(partitionColumns).isEmpty) // Predicates with both partition keys and attributes - val combineFilters = filters.toSet -- partitionFilters.toSet -- pushedFilters.toSet + val partitionAndNormalColumnFilters = +filters.toSet -- partitionFilters.toSet -- pushedFilters.toSet val selectedPartitions = prunePartitions(partitionFilters, t.partitionSpec).toArray @@ -88,16 +89,33 @@ private[sql] object DataSourceStrategy extends Strategy with Logging { s"Selected $selected partitions out of $total, pruned $percentPruned% partitions." } + // need to add projections from "partitionAndNormalColumnAttrs" in if it is not empty + val partitionAndNormalColumnAttrs = AttributeSet(partitionAndNormalColumnFilters) + val partitionAndNormalColumnProjs = if (partitionAndNormalColumnAttrs.isEmpty) { +projects + } else { +(partitionAndNormalColumnAttrs ++ projects).toSeq + } + val scan = buildPartitionedTableScan( l, -projects, +partitionAndNormalColumnProjs, pushedFilters, t.partitionSpec.partitionColumns, selectedPartitions) - combineFilters -.reduceLeftOption(expressions.And) -.map(execution.Filter(_, scan)).getOrElse(scan) :: Nil + // Add a Projection to guarantee the original projection: + // this is because "partitionAndNormalColumnAttrs" may be different + // from the original "projects", in elements or their ordering + + partitionAndNormalColumnFilters.reduceLeftOption(expressions.And).map(cf => +if (projects.isEmpty || projects == partitionAndNormalColumnProjs) { + // if the original projection is empty, no need for the 
additional Project either + execution.Filter(cf, scan) +} else { + execution.Project(projects, execution.Filter(cf, scan)) +} + ).getOrElse(scan) :: Nil // Scanning non-partitioned HadoopFsRelation case PhysicalOperation(projects, filters, l @ LogicalRelation(t: HadoopFsRelation, _)) => http://git-wip-us.apache.org/repos/a
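The strategy above partitions the predicates by which columns they reference: partition-only filters drive partition pruning, data-only filters are pushed into the scan, and mixed predicates become the "left-over" filters that need a post-scan `Filter` (plus a `Project` to restore the original output). That set arithmetic, modeled over plain column-name sets rather than Catalyst expressions (a simplified sketch, not Spark's types):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class FilterSplit {
    // Keeps only predicates referencing both partition and data columns --
    // the "left-over" filters that still need a Filter above the scan.
    public static List<Set<String>> leftOverFilters(
            List<Set<String>> filterRefs, Set<String> partitionCols) {
        List<Set<String>> leftOver = new ArrayList<>();
        for (Set<String> refs : filterRefs) {
            boolean partitionOnly = partitionCols.containsAll(refs);
            boolean dataOnly = Collections.disjoint(refs, partitionCols);
            if (!partitionOnly && !dataOnly) {
                leftOver.add(refs); // neither pruned nor pushed down
            }
        }
        return leftOver;
    }

    public static void main(String[] args) {
        Set<String> parts = new HashSet<>(Arrays.asList("p"));
        List<Set<String>> filters = Arrays.asList(
            new HashSet<>(Arrays.asList("p")),        // handled by partition pruning
            new HashSet<>(Arrays.asList("a")),        // pushed into the scan
            new HashSet<>(Arrays.asList("p", "a")));  // left-over
        System.out.println(leftOverFilters(filters, parts).size()); // 1
    }
}
```

Only the mixed `{p, a}` predicate survives as a left-over filter, mirroring `filters.toSet -- partitionFilters.toSet -- pushedFilters.toSet` in the diff.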
spark git commit: [SPARK-12818] Polishes spark-sketch module
Repository: spark Updated Branches: refs/heads/master 5f686cc8b -> 2b027e9a3 [SPARK-12818] Polishes spark-sketch module Fixes various minor code and Javadoc styling issues. Author: Cheng Lian <l...@databricks.com> Closes #10985 from liancheng/sketch-polishing. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/2b027e9a Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/2b027e9a Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/2b027e9a Branch: refs/heads/master Commit: 2b027e9a386fe4009f61ad03b169335af5a9a5c6 Parents: 5f686cc Author: Cheng Lian <l...@databricks.com> Authored: Fri Jan 29 12:01:13 2016 -0800 Committer: Cheng Lian <l...@databricks.com> Committed: Fri Jan 29 12:01:13 2016 -0800 -- .../org/apache/spark/util/sketch/BitArray.java | 2 +- .../apache/spark/util/sketch/BloomFilter.java | 111 ++- .../spark/util/sketch/BloomFilterImpl.java | 40 +++ .../spark/util/sketch/CountMinSketch.java | 26 +++-- .../spark/util/sketch/CountMinSketchImpl.java | 12 ++ .../org/apache/spark/util/sketch/Utils.java | 2 +- 6 files changed, 110 insertions(+), 83 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/2b027e9a/common/sketch/src/main/java/org/apache/spark/util/sketch/BitArray.java -- diff --git a/common/sketch/src/main/java/org/apache/spark/util/sketch/BitArray.java b/common/sketch/src/main/java/org/apache/spark/util/sketch/BitArray.java index 2a0484e..480a0a7 100644 --- a/common/sketch/src/main/java/org/apache/spark/util/sketch/BitArray.java +++ b/common/sketch/src/main/java/org/apache/spark/util/sketch/BitArray.java @@ -22,7 +22,7 @@ import java.io.DataOutputStream; import java.io.IOException; import java.util.Arrays; -public final class BitArray { +final class BitArray { private final long[] data; private long bitCount; http://git-wip-us.apache.org/repos/asf/spark/blob/2b027e9a/common/sketch/src/main/java/org/apache/spark/util/sketch/BloomFilter.java -- diff --git 
a/common/sketch/src/main/java/org/apache/spark/util/sketch/BloomFilter.java b/common/sketch/src/main/java/org/apache/spark/util/sketch/BloomFilter.java index 81772fc..c0b425e 100644 --- a/common/sketch/src/main/java/org/apache/spark/util/sketch/BloomFilter.java +++ b/common/sketch/src/main/java/org/apache/spark/util/sketch/BloomFilter.java @@ -22,16 +22,10 @@ import java.io.InputStream; import java.io.OutputStream; /** - * A Bloom filter is a space-efficient probabilistic data structure, that is used to test whether - * an element is a member of a set. It returns false when the element is definitely not in the - * set, returns true when the element is probably in the set. - * - * Internally a Bloom filter is initialized with 2 information: how many space to use(number of - * bits) and how many hash values to calculate for each record. To get as lower false positive - * probability as possible, user should call {@link BloomFilter#create} to automatically pick a - * best combination of these 2 parameters. - * - * Currently the following data types are supported: + * A Bloom filter is a space-efficient probabilistic data structure that offers an approximate + * containment test with one-sided error: if it claims that an item is contained in it, this + * might be in error, but if it claims that an item is not contained in it, then this is + * definitely true. Currently supported data types include: * * {@link Byte} * {@link Short} @@ -39,14 +33,17 @@ import java.io.OutputStream; * {@link Long} * {@link String} * + * The false positive probability ({@code FPP}) of a Bloom filter is defined as the probability that + * {@linkplain #mightContain(Object)} will erroneously return {@code true} for an object that has + * not actually been put in the {@code BloomFilter}. * - * The implementation is largely based on the {@code BloomFilter} class from guava. + * The implementation is largely based on the {@code BloomFilter} class from Guava.
*/ public abstract class BloomFilter { public enum Version { /** - * {@code BloomFilter} binary format version 1 (all values written in big-endian order): + * {@code BloomFilter} binary format version 1. All values written in big-endian order: * * Version number, always 1 (32 bit) * Number of hash functions (32 bit) @@ -68,14 +65,13 @@ public abstract class BloomFilter { } /** - * Returns the false positive probability, i.e. the probability that - * {@linkplain #mightContain(Object)} will erroneously return {@code true} for an object that - * has not actually b
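The rewritten Javadoc defines the one-sided error guarantee: `mightContain` may return a false positive but never a false negative. A toy filter demonstrating that guarantee (two ad-hoc hash functions over a `BitSet`; nothing like the tuned spark-sketch implementation):

```java
import java.util.BitSet;

public class ToyBloomFilter {
    private final BitSet bits;
    private final int numBits;

    public ToyBloomFilter(int numBits) {
        this.numBits = numBits;
        this.bits = new BitSet(numBits);
    }

    private int h1(Object item) { return Math.floorMod(item.hashCode(), numBits); }
    private int h2(Object item) { return Math.floorMod(31 * item.hashCode() + 17, numBits); }

    public void put(Object item) {
        bits.set(h1(item));
        bits.set(h2(item));
    }

    // False positives are possible (set bits may collide); false negatives are not.
    public boolean mightContain(Object item) {
        return bits.get(h1(item)) && bits.get(h2(item));
    }

    public static void main(String[] args) {
        ToyBloomFilter bf = new ToyBloomFilter(1024);
        for (int i = 0; i < 100; i++) {
            bf.put("item-" + i);
        }
        boolean anyFalseNegative = false;
        for (int i = 0; i < 100; i++) {
            if (!bf.mightContain("item-" + i)) {
                anyFalseNegative = true;
            }
        }
        System.out.println(anyFalseNegative); // false -- every inserted item is found
    }
}
```

`put` only ever sets bits, so every item inserted is always reported present; the FPP discussed in the Javadoc comes entirely from unrelated items happening to hash onto already-set bits.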
spark git commit: [SPARK-13050][BUILD] Scalatest tags fail build with the addition of the sketch module
Repository: spark Updated Branches: refs/heads/master 721ced28b -> 8d3cc3de7 [SPARK-13050][BUILD] Scalatest tags fail build with the addition of the sketch module A dependency on the spark test tags was left out of the sketch module pom file, causing builds to fail when test tags were used. This dependency is found in the pom file for every other module in Spark. Author: Alex Bozarth <ajboz...@us.ibm.com> Closes #10954 from ajbozarth/spark13050. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/8d3cc3de Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/8d3cc3de Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/8d3cc3de Branch: refs/heads/master Commit: 8d3cc3de7d116190911e7943ef3233fe3b7db1bf Parents: 721ced2 Author: Alex Bozarth <ajboz...@us.ibm.com> Authored: Thu Jan 28 23:34:50 2016 -0800 Committer: Cheng Lian <l...@databricks.com> Committed: Thu Jan 28 23:34:50 2016 -0800 -- common/sketch/pom.xml | 7 +++ 1 file changed, 7 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/8d3cc3de/common/sketch/pom.xml -- diff --git a/common/sketch/pom.xml b/common/sketch/pom.xml index 67723fa..2cafe8c 100644 --- a/common/sketch/pom.xml +++ b/common/sketch/pom.xml @@ -35,6 +35,13 @@ sketch + <dependencies> + <dependency> + <groupId>org.apache.spark</groupId> + <artifactId>spark-test-tags_${scala.binary.version}</artifactId> + </dependency> + </dependencies> + <outputDirectory>target/scala-${scala.binary.version}/classes</outputDirectory> <testOutputDirectory>target/scala-${scala.binary.version}/test-classes</testOutputDirectory> - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-12401][SQL] Add integration tests for postgres enum types
Repository: spark Updated Branches: refs/heads/master df78a934a -> abae889f0 [SPARK-12401][SQL] Add integration tests for postgres enum types We can handle PostgreSQL-specific enum types as strings in JDBC. So, we should just add tests and close the corresponding JIRA ticket. Author: Takeshi YAMAMURO <linguin@gmail.com> Closes #10596 from maropu/AddTestsInIntegration. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/abae889f Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/abae889f Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/abae889f Branch: refs/heads/master Commit: abae889f08eb412cb897e4e63614ec2c93885ffd Parents: df78a93 Author: Takeshi YAMAMURO <linguin@gmail.com> Authored: Thu Jan 28 15:20:16 2016 -0800 Committer: Cheng Lian <l...@databricks.com> Committed: Thu Jan 28 15:20:16 2016 -0800 -- .../spark/sql/jdbc/PostgresIntegrationSuite.scala| 15 +-- 1 file changed, 9 insertions(+), 6 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/abae889f/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/PostgresIntegrationSuite.scala -- diff --git a/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/PostgresIntegrationSuite.scala b/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/PostgresIntegrationSuite.scala index 7d011be..72bda8f 100644 --- a/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/PostgresIntegrationSuite.scala +++ b/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/PostgresIntegrationSuite.scala @@ -21,7 +21,7 @@ import java.sql.Connection import java.util.Properties import org.apache.spark.sql.Column -import org.apache.spark.sql.catalyst.expressions.{If, Literal} +import org.apache.spark.sql.catalyst.expressions.Literal import org.apache.spark.tags.DockerTest @DockerTest @@ -39,12 +39,13 @@ class PostgresIntegrationSuite extends DockerJDBCIntegrationSuite { override def
dataPreparation(conn: Connection): Unit = { conn.prepareStatement("CREATE DATABASE foo").executeUpdate() conn.setCatalog("foo") +conn.prepareStatement("CREATE TYPE enum_type AS ENUM ('d1', 'd2')").executeUpdate() conn.prepareStatement("CREATE TABLE bar (c0 text, c1 integer, c2 double precision, c3 bigint, " + "c4 bit(1), c5 bit(10), c6 bytea, c7 boolean, c8 inet, c9 cidr, " - + "c10 integer[], c11 text[], c12 real[])").executeUpdate() + + "c10 integer[], c11 text[], c12 real[], c13 enum_type)").executeUpdate() conn.prepareStatement("INSERT INTO bar VALUES ('hello', 42, 1.25, 123456789012345, B'0', " + "B'1000100101', E'xDEADBEEF', true, '172.16.0.42', '192.168.0.0/16', " - + """'{1, 2}', '{"a", null, "b"}', '{0.11, 0.22}')""").executeUpdate() + + """'{1, 2}', '{"a", null, "b"}', '{0.11, 0.22}', 'd1')""").executeUpdate() } test("Type mapping for various types") { @@ -52,7 +53,7 @@ class PostgresIntegrationSuite extends DockerJDBCIntegrationSuite { val rows = df.collect() assert(rows.length == 1) val types = rows(0).toSeq.map(x => x.getClass) -assert(types.length == 13) +assert(types.length == 14) assert(classOf[String].isAssignableFrom(types(0))) assert(classOf[java.lang.Integer].isAssignableFrom(types(1))) assert(classOf[java.lang.Double].isAssignableFrom(types(2))) @@ -66,22 +67,24 @@ class PostgresIntegrationSuite extends DockerJDBCIntegrationSuite { assert(classOf[Seq[Int]].isAssignableFrom(types(10))) assert(classOf[Seq[String]].isAssignableFrom(types(11))) assert(classOf[Seq[Double]].isAssignableFrom(types(12))) +assert(classOf[String].isAssignableFrom(types(13))) assert(rows(0).getString(0).equals("hello")) assert(rows(0).getInt(1) == 42) assert(rows(0).getDouble(2) == 1.25) assert(rows(0).getLong(3) == 123456789012345L) -assert(rows(0).getBoolean(4) == false) +assert(!rows(0).getBoolean(4)) // BIT(10)'s come back as ASCII strings of ten ASCII 0's and 1's... 
assert(java.util.Arrays.equals(rows(0).getAs[Array[Byte]](5), Array[Byte](49, 48, 48, 48, 49, 48, 48, 49, 48, 49))) assert(java.util.Arrays.equals(rows(0).getAs[Array[Byte]](6), Array[Byte](0xDE.toByte, 0xAD.toByte, 0xBE.toByte, 0xEF.toByte))) -assert(rows(0).getBoolean(7) == true) +assert(rows(0).getBoolean(7)) assert(rows(0).getString(8) == "172.16.0.42") assert(rows(0).getString(9) == &qu
spark git commit: [SPARK-11955][SQL] Mark optional fields in merging schema for safely pushdowning filters in Parquet
Repository: spark Updated Branches: refs/heads/master 3a40c0e57 -> 4637fc08a [SPARK-11955][SQL] Mark optional fields in merging schema for safely pushdowning filters in Parquet JIRA: https://issues.apache.org/jira/browse/SPARK-11955 Currently we simply skip pushing down filters to Parquet if schema merging is enabled. However, we can actually mark particular fields in the merged schema so that filters on them can be safely pushed down to Parquet. Author: Liang-Chi Hsieh <vii...@appier.com> Author: Liang-Chi Hsieh <vii...@gmail.com> Closes #9940 from viirya/safe-pushdown-parquet-filters. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/4637fc08 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/4637fc08 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/4637fc08 Branch: refs/heads/master Commit: 4637fc08a3733ec313218fb7e4d05064d9a6262d Parents: 3a40c0e Author: Liang-Chi Hsieh <vii...@appier.com> Authored: Thu Jan 28 16:25:21 2016 -0800 Committer: Cheng Lian <l...@databricks.com> Committed: Thu Jan 28 16:25:21 2016 -0800 -- .../org/apache/spark/sql/types/Metadata.scala | 5 +++ .../org/apache/spark/sql/types/StructType.scala | 34 +--- .../apache/spark/sql/types/DataTypeSuite.scala | 14 +-- .../datasources/parquet/ParquetFilters.scala| 37 - .../datasources/parquet/ParquetRelation.scala | 13 +++--- .../parquet/ParquetFilterSuite.scala| 43 +++- 6 files changed, 117 insertions(+), 29 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/4637fc08/sql/catalyst/src/main/scala/org/apache/spark/sql/types/Metadata.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/types/Metadata.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/types/Metadata.scala index 9e0f994..66f1236 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/types/Metadata.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/types/Metadata.scala @@ -273,4 +273,9 @@ class MetadataBuilder { map.put(key,
value) this } + + def remove(key: String): this.type = { +map.remove(key) +this + } } http://git-wip-us.apache.org/repos/asf/spark/blob/4637fc08/sql/catalyst/src/main/scala/org/apache/spark/sql/types/StructType.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/types/StructType.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/types/StructType.scala index 3bd733f..da0c928 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/types/StructType.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/types/StructType.scala @@ -334,6 +334,8 @@ case class StructType(fields: Array[StructField]) extends DataType with Seq[Stru object StructType extends AbstractDataType { + private[sql] val metadataKeyForOptionalField = "_OPTIONAL_" + override private[sql] def defaultConcreteType: DataType = new StructType override private[sql] def acceptsType(other: DataType): Boolean = { @@ -359,6 +361,18 @@ object StructType extends AbstractDataType { protected[sql] def fromAttributes(attributes: Seq[Attribute]): StructType = StructType(attributes.map(a => StructField(a.name, a.dataType, a.nullable, a.metadata))) + def removeMetadata(key: String, dt: DataType): DataType = +dt match { + case StructType(fields) => +val newFields = fields.map { f => + val mb = new MetadataBuilder() + f.copy(dataType = removeMetadata(key, f.dataType), +metadata = mb.withMetadata(f.metadata).remove(key).build()) +} +StructType(newFields) + case _ => dt +} + private[sql] def merge(left: DataType, right: DataType): DataType = (left, right) match { case (ArrayType(leftElementType, leftContainsNull), @@ -376,24 +390,32 @@ object StructType extends AbstractDataType { case (StructType(leftFields), StructType(rightFields)) => val newFields = ArrayBuffer.empty[StructField] +// This metadata will record the fields that only exist in one of two StructTypes +val optionalMeta = new MetadataBuilder() val rightMapped = fieldsMap(rightFields) leftFields.foreach { case leftField @ 
StructField(leftName, leftType, leftNullable, _) => rightMapped.get(leftName) .map { case rightField @ StructField(_, rightType, rightNullable, _) => - leftField.copy( -dataType = merge(leftType, rightType), -nullable = leftNullable || rightNullable) -} -
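`removeMetadata` above strips the `_OPTIONAL_` marker recursively, so nested struct fields lose it too before the schema is exposed. The same recursion pattern, modeled over plain nested maps as a stand-in for `StructType` (not Catalyst code):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class RemoveKeyDeep {
    // Recursively removes `key` from a map and from any nested map values.
    @SuppressWarnings("unchecked")
    public static Map<String, Object> removeDeep(Map<String, Object> node, String key) {
        Map<String, Object> out = new LinkedHashMap<>();
        for (Map.Entry<String, Object> e : node.entrySet()) {
            if (e.getKey().equals(key)) {
                continue; // drop the marker at this level
            }
            Object v = e.getValue();
            out.put(e.getKey(),
                v instanceof Map ? removeDeep((Map<String, Object>) v, key) : v);
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Object> inner = new LinkedHashMap<>();
        inner.put("_OPTIONAL_", true);
        inner.put("type", "int");
        Map<String, Object> schema = new LinkedHashMap<>();
        schema.put("_OPTIONAL_", true);
        schema.put("field", inner);
        System.out.println(removeDeep(schema, "_OPTIONAL_")); // {field={type=int}}
    }
}
```

The key point, matching the diff, is that the marker is purely internal bookkeeping for the merge: it must be scrubbed at every nesting level so it never leaks into the schema seen by readers.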
spark git commit: [SQL] Minor Scaladoc format fix
Repository: spark Updated Branches: refs/heads/master ee74498de -> 83507fea9 [SQL] Minor Scaladoc format fix Otherwise the `^` character is always marked as an error in IntelliJ since it represents an unclosed superscript markup tag. Author: Cheng Lian <l...@databricks.com> Closes #10926 from liancheng/agg-doc-fix. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/83507fea Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/83507fea Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/83507fea Branch: refs/heads/master Commit: 83507fea9f45c336d73dd4795b8cb37bcd63e31d Parents: ee74498 Author: Cheng Lian <l...@databricks.com> Authored: Tue Jan 26 14:29:29 2016 -0800 Committer: Cheng Lian <l...@databricks.com> Committed: Tue Jan 26 14:29:29 2016 -0800 -- .../sql/catalyst/expressions/aggregate/interfaces.scala | 8 1 file changed, 4 insertions(+), 4 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/83507fea/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/interfaces.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/interfaces.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/interfaces.scala index ddd99c5..561fa33 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/interfaces.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/interfaces.scala @@ -200,7 +200,7 @@ abstract class ImperativeAggregate extends AggregateFunction with CodegenFallbac * For example, we have two aggregate functions `avg(x)` and `avg(y)`, which share the same * aggregation buffer.
In this shared buffer, the position of the first buffer value of `avg(x)` * will be 0 and the position of the first buffer value of `avg(y)` will be 2: - * + * {{{ * avg(x) mutableAggBufferOffset = 0 * | * v @@ -210,7 +210,7 @@ abstract class ImperativeAggregate extends AggregateFunction with CodegenFallbac *^ *| * avg(y) mutableAggBufferOffset = 2 - * + * }}} */ protected val mutableAggBufferOffset: Int @@ -233,7 +233,7 @@ abstract class ImperativeAggregate extends AggregateFunction with CodegenFallbac * `avg(x)` and `avg(y)`. In the shared input aggregation buffer, the position of the first * buffer value of `avg(x)` will be 1 and the position of the first buffer value of `avg(y)` * will be 3 (position 0 is used for the value of `key`): - * + * {{{ * avg(x) inputAggBufferOffset = 1 * | * v @@ -243,7 +243,7 @@ abstract class ImperativeAggregate extends AggregateFunction with CodegenFallbac * ^ * | * avg(y) inputAggBufferOffset = 3 - * + * }}} */ protected val inputAggBufferOffset: Int - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
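In Scaladoc, text wrapped in `{{{` ... `}}}` is rendered verbatim, so characters like `^` and `|` in ASCII diagrams are taken literally instead of being parsed as markup (a bare `^` opens a superscript tag, which IntelliJ flags as unclosed). A minimal illustration of the convention this patch applies, adapted from the diagram in the diff:

```scala
/**
 * The shared aggregation buffer layout, wrapped in {{{ ... }}} so the
 * arrow characters are not interpreted as Scaladoc markup:
 *
 * {{{
 *   avg(x) mutableAggBufferOffset = 0
 *          |
 *          v
 *   [ sum_x | count_x | sum_y | count_y ]
 *                       ^
 *                       |
 *   avg(y) mutableAggBufferOffset = 2
 * }}}
 */
protected val mutableAggBufferOffset: Int
```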
spark git commit: [SPARK-12934] use try-with-resources for streams
Repository: spark Updated Branches: refs/heads/master 109061f7a -> fdcc3512f [SPARK-12934] use try-with-resources for streams liancheng please take a look Author: tedyu <yuzhih...@gmail.com> Closes #10906 from tedyu/master. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/fdcc3512 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/fdcc3512 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/fdcc3512 Branch: refs/heads/master Commit: fdcc3512f7b45e5b067fc26cb05146f79c4a5177 Parents: 109061f Author: tedyu <yuzhih...@gmail.com> Authored: Mon Jan 25 18:23:47 2016 -0800 Committer: Cheng Lian <l...@databricks.com> Committed: Mon Jan 25 18:23:47 2016 -0800 -- .../src/main/java/org/apache/spark/util/sketch/CountMinSketch.java | 2 ++ 1 file changed, 2 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/fdcc3512/common/sketch/src/main/java/org/apache/spark/util/sketch/CountMinSketch.java -- diff --git a/common/sketch/src/main/java/org/apache/spark/util/sketch/CountMinSketch.java b/common/sketch/src/main/java/org/apache/spark/util/sketch/CountMinSketch.java index 6793864..9f4ff42 100644 --- a/common/sketch/src/main/java/org/apache/spark/util/sketch/CountMinSketch.java +++ b/common/sketch/src/main/java/org/apache/spark/util/sketch/CountMinSketch.java @@ -128,11 +128,13 @@ abstract public class CountMinSketch { /** * Writes out this {@link CountMinSketch} to an output stream in binary format. + * It is the caller's responsibility to close the stream */ public abstract void writeTo(OutputStream out) throws IOException; /** * Reads in a {@link CountMinSketch} from an input stream. + * It is the caller's responsibility to close the stream */ public static CountMinSketch readFrom(InputStream in) throws IOException { return CountMinSketchImpl.readFrom(in); - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
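The contract documented by this commit — `writeTo`/`readFrom` leave stream lifetime to the caller — pairs naturally with try-with-resources at the call site. A self-contained Java sketch of that pattern (the `writeTo`/`readFrom` bodies below are hypothetical stand-ins, not the real `CountMinSketch` binary format):

```java
import java.io.*;

public class CallerClosesDemo {
    // Stand-in for CountMinSketch.writeTo: serializes to a stream the caller
    // owns. It flushes but deliberately does NOT close the stream.
    static void writeTo(OutputStream out, int value) throws IOException {
        DataOutputStream dos = new DataOutputStream(out);
        dos.writeInt(value);
        dos.flush(); // leave closing to the caller
    }

    // Stand-in for CountMinSketch.readFrom: same contract on the read side.
    static int readFrom(InputStream in) throws IOException {
        return new DataInputStream(in).readInt();
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        // try-with-resources: the caller opens and closes the stream,
        // matching the "caller's responsibility" contract above.
        try (OutputStream out = buf) {
            writeTo(out, 42);
        }
        try (InputStream in = new ByteArrayInputStream(buf.toByteArray())) {
            System.out.println(readFrom(in));
        }
    }
}
```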
spark git commit: [SPARK-12560][SQL] SqlTestUtils.stripSparkFilter needs to copy utf8strings
Repository: spark Updated Branches: refs/heads/master b72e01e82 -> 4dbd31612 [SPARK-12560][SQL] SqlTestUtils.stripSparkFilter needs to copy utf8strings See https://issues.apache.org/jira/browse/SPARK-12560 This isn't causing any problems currently because the tests for string predicate pushdown are currently disabled. I ran into this while trying to turn them back on with a different version of parquet. Figure it was good to fix now in any case. Author: Imran Rashid <iras...@cloudera.com> Closes #10510 from squito/SPARK-12560. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/4dbd3161 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/4dbd3161 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/4dbd3161 Branch: refs/heads/master Commit: 4dbd3161227a32736105cef624f9df21650a359c Parents: b72e01e Author: Imran Rashid <iras...@cloudera.com> Authored: Tue Jan 19 12:24:21 2016 -0800 Committer: Cheng Lian <l...@databricks.com> Committed: Tue Jan 19 12:24:21 2016 -0800 -- .../src/test/scala/org/apache/spark/sql/test/SQLTestUtils.scala| 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/4dbd3161/sql/core/src/test/scala/org/apache/spark/sql/test/SQLTestUtils.scala -- diff --git a/sql/core/src/test/scala/org/apache/spark/sql/test/SQLTestUtils.scala b/sql/core/src/test/scala/org/apache/spark/sql/test/SQLTestUtils.scala index 7df344e..5f73d71 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/test/SQLTestUtils.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/test/SQLTestUtils.scala @@ -189,7 +189,7 @@ private[sql] trait SQLTestUtils .executedPlan.asInstanceOf[org.apache.spark.sql.execution.Filter] .child .execute() - .map(row => Row.fromSeq(row.toSeq(schema))) + .map(row => Row.fromSeq(row.copy().toSeq(schema))) sqlContext.createDataFrame(childRDD, schema) } - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional 
commands, e-mail: commits-h...@spark.apache.org
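The underlying issue in the fix above is that Spark's internal row iterators reuse a single mutable buffer per partition, so keeping references without `row.copy()` leaves every collected row pointing at the last row's bytes. A self-contained Java sketch of the same failure mode (the iterator is a toy stand-in, not Spark's actual unsafe-row machinery):

```java
import java.util.*;

public class RowReuseDemo {
    // Toy iterator mimicking buffer reuse: next() always returns the SAME
    // array, overwritten in place.
    static Iterator<int[]> reusingIterator(int n) {
        int[] shared = new int[1];
        return new Iterator<int[]>() {
            int i = 0;
            public boolean hasNext() { return i < n; }
            public int[] next() { shared[0] = i++; return shared; }
        };
    }

    public static void main(String[] args) {
        // Buggy: keeping references to the shared buffer means every
        // element ends up showing the final value.
        List<int[]> noCopy = new ArrayList<>();
        reusingIterator(3).forEachRemaining(noCopy::add);
        System.out.println(noCopy.get(0)[0] + "," + noCopy.get(1)[0]);

        // Fixed, analogous to row.copy() in stripSparkFilter: copy each
        // element before keeping a reference to it.
        List<int[]> copied = new ArrayList<>();
        reusingIterator(3).forEachRemaining(a -> copied.add(a.clone()));
        System.out.println(copied.get(0)[0] + "," + copied.get(1)[0]);
    }
}
```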
spark git commit: [SPARK-12867][SQL] Nullability of Intersect can be stricter
Repository: spark Updated Branches: refs/heads/master 2388de519 -> b72e01e82 [SPARK-12867][SQL] Nullability of Intersect can be stricter JIRA: https://issues.apache.org/jira/browse/SPARK-12867 When intersecting one nullable column with one non-nullable column, the result will not contain any null. Thus, we can make nullability of `intersect` stricter. liancheng Could you please check if the code changes are appropriate? Also added test cases to verify the results. Thanks! Author: gatorsmile <gatorsm...@gmail.com> Closes #10812 from gatorsmile/nullabilityIntersect. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/b72e01e8 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/b72e01e8 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/b72e01e8 Branch: refs/heads/master Commit: b72e01e82148a908eb19bb3f526f9777bfe27dde Parents: 2388de5 Author: gatorsmile <gatorsm...@gmail.com> Authored: Tue Jan 19 11:35:58 2016 -0800 Committer: Cheng Lian <l...@databricks.com> Committed: Tue Jan 19 11:35:58 2016 -0800 -- .../catalyst/plans/logical/basicOperators.scala | 18 +++-- .../org/apache/spark/sql/DataFrameSuite.scala | 21 2 files changed, 33 insertions(+), 6 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/b72e01e8/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicOperators.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicOperators.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicOperators.scala index 2a1b1b1..f4a3d85 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicOperators.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicOperators.scala @@ -91,11 +91,6 @@ case class Filter(condition: Expression, child: LogicalPlan) extends UnaryNode { } abstract class SetOperation(left: LogicalPlan, right: 
LogicalPlan) extends BinaryNode { - override def output: Seq[Attribute] = -left.output.zip(right.output).map { case (leftAttr, rightAttr) => - leftAttr.withNullability(leftAttr.nullable || rightAttr.nullable) -} - final override lazy val resolved: Boolean = childrenResolved && left.output.length == right.output.length && @@ -108,13 +103,24 @@ private[sql] object SetOperation { case class Union(left: LogicalPlan, right: LogicalPlan) extends SetOperation(left, right) { + override def output: Seq[Attribute] = +left.output.zip(right.output).map { case (leftAttr, rightAttr) => + leftAttr.withNullability(leftAttr.nullable || rightAttr.nullable) +} + override def statistics: Statistics = { val sizeInBytes = left.statistics.sizeInBytes + right.statistics.sizeInBytes Statistics(sizeInBytes = sizeInBytes) } } -case class Intersect(left: LogicalPlan, right: LogicalPlan) extends SetOperation(left, right) +case class Intersect(left: LogicalPlan, right: LogicalPlan) extends SetOperation(left, right) { + + override def output: Seq[Attribute] = +left.output.zip(right.output).map { case (leftAttr, rightAttr) => + leftAttr.withNullability(leftAttr.nullable && rightAttr.nullable) +} +} case class Except(left: LogicalPlan, right: LogicalPlan) extends SetOperation(left, right) { /** We don't use right.output because those rows get excluded from the set. 
*/ http://git-wip-us.apache.org/repos/asf/spark/blob/b72e01e8/sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala -- diff --git a/sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala index afc8df0..bd11a38 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala @@ -337,6 +337,27 @@ class DataFrameSuite extends QueryTest with SharedSQLContext { checkAnswer(lowerCaseData.intersect(upperCaseData), Nil) } + test("intersect - nullability") { +val nonNullableInts = Seq(Tuple1(1), Tuple1(3)).toDF() +assert(nonNullableInts.schema.forall(_.nullable == false)) + +val df1 = nonNullableInts.intersect(nullInts) +checkAnswer(df1, Row(1) :: Row(3) :: Nil) +assert(df1.schema.forall(_.nullable == false)) + +val df2 = nullInts.intersect(nonNullableInts) +checkAnswer(df2, Row(1) :: Row(3) :: Nil) +assert(df2.schema.forall(_.nullable == false)) + +val df3 = nullInts.intersect(nullInts) +checkAnswer(df3,
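The nullability rules in the patch above reduce to a one-character difference: `Union` keeps `||` (a column is nullable if either side is), while `Intersect` now uses `&&` — every surviving row must also appear in the non-nullable side, so it can never be null. A minimal Java sketch of the two rules:

```java
public class SetOpNullabilityDemo {
    // Union: output column nullable if EITHER input column is nullable.
    static boolean unionNullable(boolean left, boolean right) {
        return left || right;
    }

    // Intersect (after this patch): nullable only if BOTH inputs are —
    // a row surviving the intersection exists in the non-nullable side,
    // so its value cannot be null.
    static boolean intersectNullable(boolean left, boolean right) {
        return left && right;
    }

    public static void main(String[] args) {
        // A nullable column combined with a non-nullable one:
        System.out.println(unionNullable(true, false));
        System.out.println(intersectNullable(true, false));
    }
}
```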
spark git commit: [SPARK-12724] SQL generation support for persisted data source tables
Repository: spark Updated Branches: refs/heads/master 0d543b98f -> 8ed5f12d2 [SPARK-12724] SQL generation support for persisted data source tables This PR implements SQL generation support for persisted data source tables. A new field `metastoreTableIdentifier: Option[TableIdentifier]` is added to `LogicalRelation`. When a `LogicalRelation` representing a persisted data source relation is created, this field holds the database name and table name of the relation. Author: Cheng Lian <l...@databricks.com> Closes #10712 from liancheng/spark-12724-datasources-sql-gen. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/8ed5f12d Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/8ed5f12d Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/8ed5f12d Branch: refs/heads/master Commit: 8ed5f12d2bb408bd37e4156b5f1bad9a6b8c3cb5 Parents: 0d543b9 Author: Cheng Lian <l...@databricks.com> Authored: Tue Jan 12 14:19:53 2016 -0800 Committer: Cheng Lian <l...@databricks.com> Committed: Tue Jan 12 14:19:53 2016 -0800 -- .../main/scala/org/apache/spark/sql/DataFrame.scala | 2 +- .../execution/datasources/DataSourceStrategy.scala | 16 .../sql/execution/datasources/LogicalRelation.scala | 8 +--- .../datasources/parquet/ParquetRelation.scala | 10 ++ .../spark/sql/execution/datasources/rules.scala | 16 .../datasources/parquet/ParquetFilterSuite.scala| 2 +- .../parquet/ParquetPartitionDiscoverySuite.scala| 2 +- .../spark/sql/sources/FilteredScanSuite.scala | 2 +- .../spark/sql/hive/HiveMetastoreCatalog.scala | 6 -- .../org/apache/spark/sql/hive/SQLBuilder.scala | 14 +- .../apache/spark/sql/hive/execution/commands.scala | 2 +- .../spark/sql/hive/LogicalPlanToSQLSuite.scala | 10 ++ .../spark/sql/hive/MetastoreDataSourcesSuite.scala | 2 +- .../spark/sql/hive/execution/SQLQuerySuite.scala| 2 +- .../apache/spark/sql/hive/orc/OrcFilterSuite.scala | 2 +- .../org/apache/spark/sql/hive/parquetSuites.scala | 8 
.../spark/sql/sources/hadoopFsRelationSuites.scala | 2 +- 17 files changed, 55 insertions(+), 51 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/8ed5f12d/sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala b/sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala index 60d2f05..91bf2f8 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala @@ -1728,7 +1728,7 @@ class DataFrame private[sql]( */ def inputFiles: Array[String] = { val files: Seq[String] = logicalPlan.collect { - case LogicalRelation(fsBasedRelation: FileRelation, _) => + case LogicalRelation(fsBasedRelation: FileRelation, _, _) => fsBasedRelation.inputFiles case fr: FileRelation => fr.inputFiles http://git-wip-us.apache.org/repos/asf/spark/blob/8ed5f12d/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala index 1d6290e..da9320f 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala @@ -41,7 +41,7 @@ import org.apache.spark.util.{SerializableConfiguration, Utils} */ private[sql] object DataSourceStrategy extends Strategy with Logging { def apply(plan: LogicalPlan): Seq[execution.SparkPlan] = plan match { -case PhysicalOperation(projects, filters, l @ LogicalRelation(t: CatalystScan, _)) => +case PhysicalOperation(projects, filters, l @ LogicalRelation(t: CatalystScan, _, _)) => pruneFilterProjectRaw( l, projects, @@ -49,14 +49,14 @@ private[sql] object DataSourceStrategy extends Strategy with Logging { (requestedColumns, 
allPredicates, _) => toCatalystRDD(l, requestedColumns, t.buildScan(requestedColumns, allPredicates))) :: Nil -case PhysicalOperation(projects, filters, l @ LogicalRelation(t: PrunedFilteredScan, _)) => +case PhysicalOperation(projects, filters, l @ LogicalRelation(t: PrunedFilteredScan, _, _)) => pruneFilterProject( l, projects, filters, (a, f) =&
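The core idea of this PR is that a relation node optionally remembers which metastore table it was loaded from, so SQL generation can reference the table by name instead of failing on a raw file relation. A hypothetical Java sketch of that shape (names and structure are illustrative, not Spark's actual `LogicalRelation`):

```java
import java.util.Optional;

public class SqlGenDemo {
    // Sketch of a relation that may carry the identifier of the persisted
    // metastore table it came from (the new optional field in the PR).
    record Relation(String path, Optional<String> metastoreTableIdentifier) {}

    static String toSQL(Relation r) {
        // Persisted table: reference it by name. Otherwise there is no
        // stable name to generate SQL against.
        return r.metastoreTableIdentifier()
                .map(id -> "SELECT * FROM " + id)
                .orElseThrow(() -> new UnsupportedOperationException(
                        "relation has no metastore table identifier"));
    }

    public static void main(String[] args) {
        Relation persisted = new Relation("/data/t", Optional.of("default.t"));
        System.out.println(toSQL(persisted));
    }
}
```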
spark git commit: [SPARK-12742][SQL] org.apache.spark.sql.hive.LogicalPlanToSQLSuite failure due to Table already exists exception
Repository: spark Updated Branches: refs/heads/master fe9eb0b0c -> 473907adf [SPARK-12742][SQL] org.apache.spark.sql.hive.LogicalPlanToSQLSuite failure due to Table already exists exception ``` [info] Exception encountered when attempting to run a suite with class name: org.apache.spark.sql.hive.LogicalPlanToSQLSuite *** ABORTED *** (325 milliseconds) [info] org.apache.spark.sql.AnalysisException: Table `t1` already exists.; [info] at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:296) [info] at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:285) [info] at org.apache.spark.sql.hive.LogicalPlanToSQLSuite.beforeAll(LogicalPlanToSQLSuite.scala:33) [info] at org.scalatest.BeforeAndAfterAll$class.beforeAll(BeforeAndAfterAll.scala:187) [info] at org.apache.spark.sql.hive.LogicalPlanToSQLSuite.beforeAll(LogicalPlanToSQLSuite.scala:23) [info] at org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:253) [info] at org.apache.spark.sql.hive.LogicalPlanToSQLSuite.run(LogicalPlanToSQLSuite.scala:23) [info] at org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:462) [info] at org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:671) [info] at sbt.ForkMain$Run$2.call(ForkMain.java:296) [info] at sbt.ForkMain$Run$2.call(ForkMain.java:286) [info] at java.util.concurrent.FutureTask.run(FutureTask.java:266) [info] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [info] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [info] at java.lang.Thread.run(Thread.java:745) ``` /cc liancheng Author: wangfei <wangfei_he...@126.com> Closes #10682 from scwf/fix-test. 
Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/473907ad Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/473907ad Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/473907ad Branch: refs/heads/master Commit: 473907adf6e37855ee31d0703b43d7170e26b4b9 Parents: fe9eb0b Author: wangfei <wangfei_he...@126.com> Authored: Mon Jan 11 18:18:44 2016 -0800 Committer: Cheng Lian <l...@databricks.com> Committed: Mon Jan 11 18:18:44 2016 -0800 -- .../scala/org/apache/spark/sql/hive/LogicalPlanToSQLSuite.scala | 3 +++ 1 file changed, 3 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/473907ad/sql/hive/src/test/scala/org/apache/spark/sql/hive/LogicalPlanToSQLSuite.scala -- diff --git a/sql/hive/src/test/scala/org/apache/spark/sql/hive/LogicalPlanToSQLSuite.scala b/sql/hive/src/test/scala/org/apache/spark/sql/hive/LogicalPlanToSQLSuite.scala index 9a8a9c5..2ee8150 100644 --- a/sql/hive/src/test/scala/org/apache/spark/sql/hive/LogicalPlanToSQLSuite.scala +++ b/sql/hive/src/test/scala/org/apache/spark/sql/hive/LogicalPlanToSQLSuite.scala @@ -24,6 +24,9 @@ class LogicalPlanToSQLSuite extends SQLBuilderTest with SQLTestUtils { import testImplicits._ protected override def beforeAll(): Unit = { +sql("DROP TABLE IF EXISTS t0") +sql("DROP TABLE IF EXISTS t1") +sql("DROP TABLE IF EXISTS t2") sqlContext.range(10).write.saveAsTable("t0") sqlContext - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
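The fix above makes the suite's `beforeAll` idempotent: dropping leftover tables first means a re-run cannot collide with tables left behind by an earlier aborted run. The same pattern in a self-contained Java sketch (the in-memory catalog is a toy stand-in for the Hive metastore):

```java
import java.util.*;

public class IdempotentSetupDemo {
    // Toy catalog standing in for the metastore used by the suite.
    static final Set<String> catalog = new HashSet<>();

    static void saveAsTable(String name) {
        if (!catalog.add(name))
            throw new IllegalStateException("Table `" + name + "` already exists.");
    }

    static void dropTableIfExists(String name) { catalog.remove(name); }

    static void beforeAll() {
        // The fix: clear leftovers first, mirroring the added
        // "DROP TABLE IF EXISTS" statements in the patch.
        for (String t : new String[] {"t0", "t1", "t2"}) dropTableIfExists(t);
        saveAsTable("t0");
        saveAsTable("t1");
        saveAsTable("t2");
    }

    public static void main(String[] args) {
        beforeAll();
        beforeAll(); // safe to run twice now — no "already exists" error
        System.out.println(catalog.size());
    }
}
```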