[jira] [Created] (SPARK-18730) Ask the build script to link to Jenkins test report page instead of full console output page when posting to GitHub

2016-12-05 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-18730:
--

 Summary: Ask the build script to link to Jenkins test report page 
instead of full console output page when posting to GitHub
 Key: SPARK-18730
 URL: https://issues.apache.org/jira/browse/SPARK-18730
 Project: Spark
  Issue Type: Bug
  Components: Build
Reporter: Cheng Lian
Assignee: Cheng Lian


Currently, the full console output page of a Spark Jenkins PR build can be as 
large as several megabytes. It takes a relatively long time to load and may 
even freeze the browser for quite a while.

I'd suggest posting a link to the Jenkins test report page to GitHub instead, which 
is far more concise and is usually the first page I check when investigating a 
Jenkins build failure.
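
For reference, the change itself would essentially be a URL rewrite in the PR build 
script. A minimal sketch (hypothetical helper name, assuming the standard Jenkins 
{{consoleFull}} / {{testReport}} URL layout, not the actual build-script code):
{code}
// Hedged sketch only: rewrite the posted Jenkins URL so it points at the test
// report page instead of the full console output page.
def testReportUrl(consoleUrl: String): String =
  consoleUrl.replaceAll("/consoleFull/?$", "/testReport/")

// e.g. ".../job/SparkPullRequestBuilder/<build>/consoleFull"
//   -> ".../job/SparkPullRequestBuilder/<build>/testReport/"
{code}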






[jira] [Commented] (SPARK-18539) Cannot filter by nonexisting column in parquet file

2016-12-05 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15723781#comment-15723781
 ] 

Cheng Lian commented on SPARK-18539:


Please let me know if I missed anything important; otherwise, we can resolve this 
ticket as "Not a Problem".

> Cannot filter by nonexisting column in parquet file
> ---
>
> Key: SPARK-18539
> URL: https://issues.apache.org/jira/browse/SPARK-18539
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.1, 2.0.2
>Reporter: Vitaly Gerasimov
>Priority: Critical
>
> {code}
>   import org.apache.spark.SparkConf
>   import org.apache.spark.sql.SparkSession
>   import org.apache.spark.sql.types.DataTypes._
>   import org.apache.spark.sql.types.{StructField, StructType}
>   val sc = SparkSession.builder().config(new 
> SparkConf().setMaster("local")).getOrCreate()
>   val jsonRDD = sc.sparkContext.parallelize(Seq("""{"a":1}"""))
>   sc.read
> .schema(StructType(Seq(StructField("a", IntegerType))))
> .json(jsonRDD)
> .write
> .parquet("/tmp/test")
>   sc.read
> .schema(StructType(Seq(StructField("a", IntegerType), StructField("b",
> IntegerType, nullable = true))))
> .load("/tmp/test")
> .createOrReplaceTempView("table")
>   sc.sql("select b from table where b is not null").show()
> {code}
> returns:
> {code}
> 16/11/22 17:43:47 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
> java.lang.IllegalArgumentException: Column [b] was not found in schema!
>   at org.apache.parquet.Preconditions.checkArgument(Preconditions.java:55)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.getColumnDescriptor(SchemaCompatibilityValidator.java:190)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:178)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumnFilterPredicate(SchemaCompatibilityValidator.java:160)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:100)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:59)
>   at 
> org.apache.parquet.filter2.predicate.Operators$NotEq.accept(Operators.java:194)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validate(SchemaCompatibilityValidator.java:64)
>   at 
> org.apache.parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:59)
>   at 
> org.apache.parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:40)
>   at 
> org.apache.parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:126)
>   at 
> org.apache.parquet.filter2.compat.RowGroupFilter.filterRowGroups(RowGroupFilter.java:46)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.SpecificParquetRecordReaderBase.initialize(SpecificParquetRecordReaderBase.java:110)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initialize(VectorizedParquetRecordReader.java:109)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReader$1.apply(ParquetFileFormat.scala:367)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReader$1.apply(ParquetFileFormat.scala:341)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:116)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.scan_nextBatch$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at 

[jira] [Comment Edited] (SPARK-18539) Cannot filter by nonexisting column in parquet file

2016-12-05 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15723747#comment-15723747
 ] 

Cheng Lian edited comment on SPARK-18539 at 12/5/16 11:43 PM:
--

[~v-gerasimov], [~smilegator], and [~xwu0226], after some investigation, I no 
longer think this is a bug.

I just tested the master branch using the following test cases:
{code}
  for {
    useVectorizedReader <- Seq(true, false)
    mergeSchema <- Seq(true, false)
  } {
    test(s"foo - mergeSchema: $mergeSchema - vectorized: $useVectorizedReader") {
      withSQLConf(SQLConf.PARQUET_VECTORIZED_READER_ENABLED.key ->
        useVectorizedReader.toString) {
        withTempPath { dir =>
          val path = dir.getCanonicalPath
          val df = spark.range(1).coalesce(1)

          df.selectExpr("id AS a", "id AS b").write.parquet(path)
          df.selectExpr("id AS a").write.mode("append").parquet(path)

          assertResult(0) {
            spark.read
              .option("mergeSchema", mergeSchema.toString)
              .parquet(path)
              .filter("b < 0")
              .count()
          }
        }
      }
    }
  }
{code}
It turns out that this issue only happens when schema merging is turned off. This 
also explains why PR #9940 doesn't prevent PARQUET-389: the trick PR #9940 employs 
happens during the schema-merging phase. On the other hand, you can't expect 
missing columns to be read properly when schema merging is turned off. Therefore, 
I don't think this is a bug.

The fix for the snippet in the ticket description is easy: just add 
{{.option("mergeSchema", "true")}} to enable schema merging.


was (Author: lian cheng):
[~v-gerasimov], [~smilegator], and [~xwu0226], after some investigation, I 
don't think this is a bug now.

Just tested the master branch using the following test cases:
{code}
  for {
    useVectorizedReader <- Seq(true, false)
    mergeSchema <- Seq(true, false)
  } {
    test(s"foo - mergeSchema: $mergeSchema - vectorized: $useVectorizedReader") {
      withSQLConf(SQLConf.PARQUET_VECTORIZED_READER_ENABLED.key ->
        useVectorizedReader.toString) {
        withTempPath { dir =>
          val path = dir.getCanonicalPath
          val df = spark.range(1).coalesce(1)

          df.selectExpr("id AS a", "id AS b").write.parquet(path)
          df.selectExpr("id AS a").write.mode("append").parquet(path)

          assertResult(0) {
            spark.read
              .option("mergeSchema", mergeSchema.toString)
              .parquet(path)
              .filter("b < 0")
              .count()
          }
        }
      }
    }
  }
{code}
It turned out that this issue only happens when schema merging is turned off. 
This also explains why PR #9940 doesn't prevent PARQUET-389: because the trick 
PR #9940 employs happens during schema merging phase. On the other hand, you 
can't expect missing columns can be properly read when schema merging is turned 
off. Therefore, I don't think it's a bug.

The fix for the snippet mentioned in the ticket description is easy, just add 
{{.option("mergeSchema", "true")}} to enable schema merging.

> Cannot filter by nonexisting column in parquet file
> ---
>
> Key: SPARK-18539
> URL: https://issues.apache.org/jira/browse/SPARK-18539
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.1, 2.0.2
>Reporter: Vitaly Gerasimov
>Priority: Critical
>
> {code}
>   import org.apache.spark.SparkConf
>   import org.apache.spark.sql.SparkSession
>   import org.apache.spark.sql.types.DataTypes._
>   import org.apache.spark.sql.types.{StructField, StructType}
>   val sc = SparkSession.builder().config(new 
> SparkConf().setMaster("local")).getOrCreate()
>   val jsonRDD = sc.sparkContext.parallelize(Seq("""{"a":1}"""))
>   sc.read
> .schema(StructType(Seq(StructField("a", IntegerType))))
> .json(jsonRDD)
> .write
> .parquet("/tmp/test")
>   sc.read
> .schema(StructType(Seq(StructField("a", IntegerType), StructField("b",
> IntegerType, nullable = true))))
> .load("/tmp/test")
> .createOrReplaceTempView("table")
>   sc.sql("select b from table where b is not null").show()
> {code}
> returns:
> {code}
> 16/11/22 17:43:47 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
> java.lang.IllegalArgumentException: Column [b] was not found in schema!
>   at org.apache.parquet.Preconditions.checkArgument(Preconditions.java:55)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.getColumnDescriptor(SchemaCompatibilityValidator.java:190)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:178)
>   at 
> 

[jira] [Commented] (SPARK-18539) Cannot filter by nonexisting column in parquet file

2016-12-05 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15723747#comment-15723747
 ] 

Cheng Lian commented on SPARK-18539:


[~v-gerasimov], [~smilegator], and [~xwu0226], after some investigation, I no 
longer think this is a bug.

I just tested the master branch using the following test cases:
{code}
  for {
    useVectorizedReader <- Seq(true, false)
    mergeSchema <- Seq(true, false)
  } {
    test(s"foo - mergeSchema: $mergeSchema - vectorized: $useVectorizedReader") {
      withSQLConf(SQLConf.PARQUET_VECTORIZED_READER_ENABLED.key ->
        useVectorizedReader.toString) {
        withTempPath { dir =>
          val path = dir.getCanonicalPath
          val df = spark.range(1).coalesce(1)

          df.selectExpr("id AS a", "id AS b").write.parquet(path)
          df.selectExpr("id AS a").write.mode("append").parquet(path)

          assertResult(0) {
            spark.read
              .option("mergeSchema", mergeSchema.toString)
              .parquet(path)
              .filter("b < 0")
              .count()
          }
        }
      }
    }
  }
{code}
It turns out that this issue only happens when schema merging is turned off. This 
also explains why PR #9940 doesn't prevent PARQUET-389: the trick PR #9940 employs 
happens during the schema-merging phase. On the other hand, you can't expect 
missing columns to be read properly when schema merging is turned off. Therefore, 
I don't think this is a bug.

The fix for the snippet in the ticket description is easy: just add 
{{.option("mergeSchema", "true")}} to enable schema merging.

> Cannot filter by nonexisting column in parquet file
> ---
>
> Key: SPARK-18539
> URL: https://issues.apache.org/jira/browse/SPARK-18539
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.1, 2.0.2
>Reporter: Vitaly Gerasimov
>Priority: Critical
>
> {code}
>   import org.apache.spark.SparkConf
>   import org.apache.spark.sql.SparkSession
>   import org.apache.spark.sql.types.DataTypes._
>   import org.apache.spark.sql.types.{StructField, StructType}
>   val sc = SparkSession.builder().config(new 
> SparkConf().setMaster("local")).getOrCreate()
>   val jsonRDD = sc.sparkContext.parallelize(Seq("""{"a":1}"""))
>   sc.read
> .schema(StructType(Seq(StructField("a", IntegerType))))
> .json(jsonRDD)
> .write
> .parquet("/tmp/test")
>   sc.read
> .schema(StructType(Seq(StructField("a", IntegerType), StructField("b",
> IntegerType, nullable = true))))
> .load("/tmp/test")
> .createOrReplaceTempView("table")
>   sc.sql("select b from table where b is not null").show()
> {code}
> returns:
> {code}
> 16/11/22 17:43:47 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
> java.lang.IllegalArgumentException: Column [b] was not found in schema!
>   at org.apache.parquet.Preconditions.checkArgument(Preconditions.java:55)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.getColumnDescriptor(SchemaCompatibilityValidator.java:190)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:178)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumnFilterPredicate(SchemaCompatibilityValidator.java:160)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:100)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:59)
>   at 
> org.apache.parquet.filter2.predicate.Operators$NotEq.accept(Operators.java:194)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validate(SchemaCompatibilityValidator.java:64)
>   at 
> org.apache.parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:59)
>   at 
> org.apache.parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:40)
>   at 
> org.apache.parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:126)
>   at 
> org.apache.parquet.filter2.compat.RowGroupFilter.filterRowGroups(RowGroupFilter.java:46)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.SpecificParquetRecordReaderBase.initialize(SpecificParquetRecordReaderBase.java:110)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initialize(VectorizedParquetRecordReader.java:109)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReader$1.apply(ParquetFileFormat.scala:367)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReader$1.apply(ParquetFileFormat.scala:341)
>   at 
> 

[jira] [Commented] (SPARK-18539) Cannot filter by nonexisting column in parquet file

2016-12-05 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15723718#comment-15723718
 ] 

Cheng Lian commented on SPARK-18539:


As commented on GitHub, there are two issues right now:
# This bug also affects the normal Parquet reader code path, where 
{{ParquetRecordReader}} is a third-party class closed for modification, so we 
can't capture the exception there.
# [PR #9940|https://github.com/apache/spark/pull/9940] should have already 
fixed this issue, but somehow it is broken right now.
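
To make the first point concrete, "capturing the exception" would mean something 
like the following in a reader we do control (purely illustrative sketch; 
{{readWithFilter}} / {{readWithoutFilter}} are hypothetical stand-ins, not real 
Spark or Parquet APIs):
{code}
// Illustrative only: catch the missing-column error raised while applying the
// pushed-down filter and retry without pushdown. We can't do this inside
// ParquetRecordReader itself because it's a third-party class.
def initializeReader(readWithFilter: () => Unit, readWithoutFilter: () => Unit): Unit = {
  try {
    readWithFilter()
  } catch {
    case e: IllegalArgumentException if e.getMessage.contains("was not found in schema") =>
      readWithoutFilter()
  }
}
{code}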

> Cannot filter by nonexisting column in parquet file
> ---
>
> Key: SPARK-18539
> URL: https://issues.apache.org/jira/browse/SPARK-18539
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.1, 2.0.2
>Reporter: Vitaly Gerasimov
>Priority: Critical
>
> {code}
>   import org.apache.spark.SparkConf
>   import org.apache.spark.sql.SparkSession
>   import org.apache.spark.sql.types.DataTypes._
>   import org.apache.spark.sql.types.{StructField, StructType}
>   val sc = SparkSession.builder().config(new 
> SparkConf().setMaster("local")).getOrCreate()
>   val jsonRDD = sc.sparkContext.parallelize(Seq("""{"a":1}"""))
>   sc.read
> .schema(StructType(Seq(StructField("a", IntegerType))))
> .json(jsonRDD)
> .write
> .parquet("/tmp/test")
>   sc.read
> .schema(StructType(Seq(StructField("a", IntegerType), StructField("b",
> IntegerType, nullable = true))))
> .load("/tmp/test")
> .createOrReplaceTempView("table")
>   sc.sql("select b from table where b is not null").show()
> {code}
> returns:
> {code}
> 16/11/22 17:43:47 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
> java.lang.IllegalArgumentException: Column [b] was not found in schema!
>   at org.apache.parquet.Preconditions.checkArgument(Preconditions.java:55)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.getColumnDescriptor(SchemaCompatibilityValidator.java:190)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:178)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumnFilterPredicate(SchemaCompatibilityValidator.java:160)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:100)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:59)
>   at 
> org.apache.parquet.filter2.predicate.Operators$NotEq.accept(Operators.java:194)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validate(SchemaCompatibilityValidator.java:64)
>   at 
> org.apache.parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:59)
>   at 
> org.apache.parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:40)
>   at 
> org.apache.parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:126)
>   at 
> org.apache.parquet.filter2.compat.RowGroupFilter.filterRowGroups(RowGroupFilter.java:46)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.SpecificParquetRecordReaderBase.initialize(SpecificParquetRecordReaderBase.java:110)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initialize(VectorizedParquetRecordReader.java:109)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReader$1.apply(ParquetFileFormat.scala:367)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReader$1.apply(ParquetFileFormat.scala:341)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:116)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.scan_nextBatch$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
>   at 
> 

[jira] [Commented] (SPARK-18539) Cannot filter by nonexisting column in parquet file

2016-12-05 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15722891#comment-15722891
 ] 

Cheng Lian commented on SPARK-18539:


I haven't looked deeply into this issue yet, but my hunch is that it is related to 
https://issues.apache.org/jira/browse/PARQUET-389, which was fixed in 
parquet-mr 1.9.0, while Spark still uses 1.8 (in 2.1) and 1.7 (in 2.0).
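
As a quick sanity check of which parquet-mr version a given Spark build actually 
links against (sketch; assumes {{parquet-common}} is on the classpath, which it 
is for a stock Spark distribution):
{code}
// Prints the full version string of the parquet-mr jars on the classpath.
println(org.apache.parquet.Version.FULL_VERSION)
{code}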

> Cannot filter by nonexisting column in parquet file
> ---
>
> Key: SPARK-18539
> URL: https://issues.apache.org/jira/browse/SPARK-18539
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.1, 2.0.2
>Reporter: Vitaly Gerasimov
>Priority: Critical
>
> {code}
>   import org.apache.spark.SparkConf
>   import org.apache.spark.sql.SparkSession
>   import org.apache.spark.sql.types.DataTypes._
>   import org.apache.spark.sql.types.{StructField, StructType}
>   val sc = SparkSession.builder().config(new 
> SparkConf().setMaster("local")).getOrCreate()
>   val jsonRDD = sc.sparkContext.parallelize(Seq("""{"a":1}"""))
>   sc.read
> .schema(StructType(Seq(StructField("a", IntegerType))))
> .json(jsonRDD)
> .write
> .parquet("/tmp/test")
>   sc.read
> .schema(StructType(Seq(StructField("a", IntegerType), StructField("b",
> IntegerType, nullable = true))))
> .load("/tmp/test")
> .createOrReplaceTempView("table")
>   sc.sql("select b from table where b is not null").show()
> {code}
> returns:
> {code}
> 16/11/22 17:43:47 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
> java.lang.IllegalArgumentException: Column [b] was not found in schema!
>   at org.apache.parquet.Preconditions.checkArgument(Preconditions.java:55)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.getColumnDescriptor(SchemaCompatibilityValidator.java:190)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:178)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumnFilterPredicate(SchemaCompatibilityValidator.java:160)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:100)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:59)
>   at 
> org.apache.parquet.filter2.predicate.Operators$NotEq.accept(Operators.java:194)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validate(SchemaCompatibilityValidator.java:64)
>   at 
> org.apache.parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:59)
>   at 
> org.apache.parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:40)
>   at 
> org.apache.parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:126)
>   at 
> org.apache.parquet.filter2.compat.RowGroupFilter.filterRowGroups(RowGroupFilter.java:46)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.SpecificParquetRecordReaderBase.initialize(SpecificParquetRecordReaderBase.java:110)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initialize(VectorizedParquetRecordReader.java:109)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReader$1.apply(ParquetFileFormat.scala:367)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReader$1.apply(ParquetFileFormat.scala:341)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:116)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.scan_nextBatch$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)

[jira] [Assigned] (SPARK-17213) Parquet String Pushdown for Non-Eq Comparisons Broken

2016-12-01 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian reassigned SPARK-17213:
--

Assignee: Cheng Lian

> Parquet String Pushdown for Non-Eq Comparisons Broken
> -
>
> Key: SPARK-17213
> URL: https://issues.apache.org/jira/browse/SPARK-17213
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.1.0
>Reporter: Andrew Duffy
>Assignee: Cheng Lian
>
> Spark defines ordering over strings based on comparison of UTF-8 byte arrays, 
> which compares bytes as unsigned integers. Currently, however, Parquet does not 
> respect this ordering. A fix is in progress in Parquet (JIRA and PR links 
> below), but for now all string filters are broken, and there is an actual 
> correctness issue for {{>}} and {{<}}.
> *Repro:*
> Querying directly from in-memory DataFrame:
> {code}
> > Seq("a", "é").toDF("name").where("name > 'a'").count
> 1
> {code}
> Querying from a parquet dataset:
> {code}
> > Seq("a", "é").toDF("name").write.parquet("/tmp/bad")
> > spark.read.parquet("/tmp/bad").where("name > 'a'").count
> 0
> {code}
> This happens because Spark sorts the rows to be {{[a, é]}}, but Parquet's 
> implementation of comparison of strings is based on signed byte array 
> comparison, so it will actually create 1 row group with statistics 
> {{min=é,max=a}}, and so the row group will be dropped by the query.
> Based on the way Parquet pushes down Eq, it will not be affecting correctness 
> but it will force you to read row groups you should be able to skip.
> Link to PARQUET issue: https://issues.apache.org/jira/browse/PARQUET-686
> Link to PR: https://github.com/apache/parquet-mr/pull/362
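
The byte-ordering difference described above can be seen directly with plain 
Scala, no Spark required (sketch):
{code}
// "é" encodes to 0xC3 0xA9 in UTF-8. As a signed byte, 0xC3 is negative, so a
// signed comparison puts "é" before "a"; as an unsigned byte it is 195 > 97.
val a = "a".getBytes("UTF-8")   // Array(97)
val e = "é".getBytes("UTF-8")   // Array(-61, -87)

println(java.lang.Byte.compare(e(0), a(0)))                    // negative: signed, é < a
println(java.lang.Integer.compare(e(0) & 0xFF, a(0) & 0xFF))   // positive: unsigned, é > a
{code}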






[jira] [Commented] (SPARK-17213) Parquet String Pushdown for Non-Eq Comparisons Broken

2016-12-01 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15712707#comment-15712707
 ] 

Cheng Lian commented on SPARK-17213:


Agreed that we should disable string and binary filter pushdown until 
PARQUET-686 gets fixed.

We turned off Parquet filter pushdown for string and binary columns in 1.6 due 
to PARQUET-251 (see SPARK-11153). In Spark 2.1, we upgraded to Parquet 1.8.1 to 
pick up the PARQUET-251 fix, and then this issue popped up due to PARQUET-686. I 
think this also affects Spark 1.5.1 and earlier versions.
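
Until that lands, the coarse-grained workaround is to disable Parquet filter 
pushdown altogether (sketch, assuming a Spark 2.x session named {{spark}}), at 
the cost of scanning row groups that could otherwise be skipped:
{code}
// Turn off Parquet filter pushdown so row groups are no longer skipped based
// on the (incorrectly ordered) string min/max statistics.
spark.conf.set("spark.sql.parquet.filterPushdown", "false")

spark.read.parquet("/tmp/bad").where("name > 'a'").count()  // should now return 1 instead of 0
{code}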

> Parquet String Pushdown for Non-Eq Comparisons Broken
> -
>
> Key: SPARK-17213
> URL: https://issues.apache.org/jira/browse/SPARK-17213
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.1.0
>Reporter: Andrew Duffy
>
> Spark defines ordering over strings based on comparison of UTF-8 byte arrays, 
> which compares bytes as unsigned integers. Currently, however, Parquet does not 
> respect this ordering. A fix is in progress in Parquet (JIRA and PR links 
> below), but for now all string filters are broken, and there is an actual 
> correctness issue for {{>}} and {{<}}.
> *Repro:*
> Querying directly from in-memory DataFrame:
> {code}
> > Seq("a", "é").toDF("name").where("name > 'a'").count
> 1
> {code}
> Querying from a parquet dataset:
> {code}
> > Seq("a", "é").toDF("name").write.parquet("/tmp/bad")
> > spark.read.parquet("/tmp/bad").where("name > 'a'").count
> 0
> {code}
> This happens because Spark sorts the rows to be {{[a, é]}}, but Parquet's 
> implementation of comparison of strings is based on signed byte array 
> comparison, so it will actually create 1 row group with statistics 
> {{min=é,max=a}}, and so the row group will be dropped by the query.
> Based on the way Parquet pushes down Eq, it will not be affecting correctness 
> but it will force you to read row groups you should be able to skip.
> Link to PARQUET issue: https://issues.apache.org/jira/browse/PARQUET-686
> Link to PR: https://github.com/apache/parquet-mr/pull/362






[jira] [Resolved] (SPARK-9876) Upgrade parquet-mr to 1.8.1

2016-12-01 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian resolved SPARK-9876.
---
   Resolution: Fixed
Fix Version/s: 2.1.0

> Upgrade parquet-mr to 1.8.1
> ---
>
> Key: SPARK-9876
> URL: https://issues.apache.org/jira/browse/SPARK-9876
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Cheng Lian
>Assignee: Ryan Blue
> Fix For: 2.1.0
>
>
> {{parquet-mr}} 1.8.1 fixed several issues that affect Spark. For example 
> PARQUET-201 (SPARK-9407).






[jira] [Commented] (SPARK-18251) DataSet API | RuntimeException: Null value appeared in non-nullable field when holding Option Case Class

2016-11-30 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15709869#comment-15709869
 ] 

Cheng Lian commented on SPARK-18251:


One more comment about why we shouldn't allow an {{Option\[T <: Product\]}} to 
be used as a top-level Dataset type: an intuitive way to think about this is to 
make an analogy to databases. In a database table, you cannot mark a row itself 
as null; you can only mark a field of a row as null.

Instead of using {{Option\[T <: Product\]}}, the user should resort to 
{{Tuple1\[T <: Product\]}}: you then have a row consisting of a single field, 
which can hold either a null or a struct.
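
A minimal sketch of what that looks like in practice (Spark 2.x shell, 
{{spark.implicits._}} in scope, {{DataRow}} as defined in the report; a sketch 
under those assumptions rather than a definitive example):
{code}
import spark.implicits._

case class DataRow(id: Int, value: String)

// The "missing" value is a null field inside a single-field row, not a null row.
val ds = Seq(
  Tuple1(DataRow(1, "a")),
  Tuple1(null.asInstanceOf[DataRow])
).toDS()

ds.show()  // one row with a struct value, one row with a null struct
{code}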

> DataSet API | RuntimeException: Null value appeared in non-nullable field 
> when holding Option Case Class
> 
>
> Key: SPARK-18251
> URL: https://issues.apache.org/jira/browse/SPARK-18251
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.1
> Environment: OS X
>Reporter: Aniket Bhatnagar
>Assignee: Wenchen Fan
> Fix For: 2.2.0
>
>
> I am running into a runtime exception when a DataSet is holding an Empty 
> object instance for an Option type that is holding a non-nullable field. For 
> instance, if we have the following case class:
> case class DataRow(id: Int, value: String)
> Then, DataSet[Option[DataRow]] can only hold Some(DataRow) objects and cannot 
> hold Empty. If it does so, the following exception is thrown:
> {noformat}
> Exception in thread "main" org.apache.spark.SparkException: Job aborted due 
> to stage failure: Task 6 in stage 0.0 failed 1 times, most recent failure: 
> Lost task 6.0 in stage 0.0 (TID 6, localhost): java.lang.RuntimeException: 
> Null value appeared in non-nullable field:
> - field (class: "scala.Int", name: "id")
> - option value class: "DataSetOptBug.DataRow"
> - root class: "scala.Option"
> If the schema is inferred from a Scala tuple/case class, or a Java bean, 
> please try to use scala.Option[_] or other nullable types (e.g. 
> java.lang.Integer instead of int/scala.Int).
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
>   at org.apache.spark.scheduler.Task.run(Task.scala:86)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {noformat}
> The bug can be reproduce by using the program: 
> https://gist.github.com/aniketbhatnagar/2ed74613f70d2defe999c18afaa4816e






[jira] [Updated] (SPARK-18251) DataSet API | RuntimeException: Null value appeared in non-nullable field when holding Option Case Class

2016-11-30 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-18251:
---
Assignee: Wenchen Fan

> DataSet API | RuntimeException: Null value appeared in non-nullable field 
> when holding Option Case Class
> 
>
> Key: SPARK-18251
> URL: https://issues.apache.org/jira/browse/SPARK-18251
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.1
> Environment: OS X
>Reporter: Aniket Bhatnagar
>Assignee: Wenchen Fan
> Fix For: 2.2.0
>
>
> I am running into a runtime exception when a DataSet is holding an Empty 
> object instance for an Option type that is holding a non-nullable field. For 
> instance, if we have the following case class:
> case class DataRow(id: Int, value: String)
> Then, DataSet[Option[DataRow]] can only hold Some(DataRow) objects and cannot 
> hold Empty. If it does so, the following exception is thrown:
> {noformat}
> Exception in thread "main" org.apache.spark.SparkException: Job aborted due 
> to stage failure: Task 6 in stage 0.0 failed 1 times, most recent failure: 
> Lost task 6.0 in stage 0.0 (TID 6, localhost): java.lang.RuntimeException: 
> Null value appeared in non-nullable field:
> - field (class: "scala.Int", name: "id")
> - option value class: "DataSetOptBug.DataRow"
> - root class: "scala.Option"
> If the schema is inferred from a Scala tuple/case class, or a Java bean, 
> please try to use scala.Option[_] or other nullable types (e.g. 
> java.lang.Integer instead of int/scala.Int).
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
>   at org.apache.spark.scheduler.Task.run(Task.scala:86)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {noformat}
> The bug can be reproduce by using the program: 
> https://gist.github.com/aniketbhatnagar/2ed74613f70d2defe999c18afaa4816e






[jira] [Resolved] (SPARK-18251) DataSet API | RuntimeException: Null value appeared in non-nullable field when holding Option Case Class

2016-11-30 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian resolved SPARK-18251.

   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 15979
[https://github.com/apache/spark/pull/15979]

> DataSet API | RuntimeException: Null value appeared in non-nullable field 
> when holding Option Case Class
> 
>
> Key: SPARK-18251
> URL: https://issues.apache.org/jira/browse/SPARK-18251
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.1
> Environment: OS X
>Reporter: Aniket Bhatnagar
> Fix For: 2.2.0
>
>
> I am running into a runtime exception when a DataSet is holding an Empty 
> object instance for an Option type that is holding a non-nullable field. For 
> instance, if we have the following case class:
> case class DataRow(id: Int, value: String)
> Then, DataSet[Option[DataRow]] can only hold Some(DataRow) objects and cannot 
> hold Empty. If it does so, the following exception is thrown:
> {noformat}
> Exception in thread "main" org.apache.spark.SparkException: Job aborted due 
> to stage failure: Task 6 in stage 0.0 failed 1 times, most recent failure: 
> Lost task 6.0 in stage 0.0 (TID 6, localhost): java.lang.RuntimeException: 
> Null value appeared in non-nullable field:
> - field (class: "scala.Int", name: "id")
> - option value class: "DataSetOptBug.DataRow"
> - root class: "scala.Option"
> If the schema is inferred from a Scala tuple/case class, or a Java bean, 
> please try to use scala.Option[_] or other nullable types (e.g. 
> java.lang.Integer instead of int/scala.Int).
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
>   at org.apache.spark.scheduler.Task.run(Task.scala:86)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {noformat}
> The bug can be reproduce by using the program: 
> https://gist.github.com/aniketbhatnagar/2ed74613f70d2defe999c18afaa4816e






[jira] [Comment Edited] (SPARK-18403) ObjectHashAggregateSuite is being flaky (occasional OOM errors)

2016-11-21 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15684659#comment-15684659
 ] 

Cheng Lian edited comment on SPARK-18403 at 11/22/16 6:54 AM:
--

Here is a minimal test case (add it to {{ObjectHashAggregateSuite}}) that can 
be used to reproduce this issue reliably:
{code}
test("oom") {
  withSQLConf(
SQLConf.USE_OBJECT_HASH_AGG.key -> "true",
SQLConf.OBJECT_AGG_SORT_BASED_FALLBACK_THRESHOLD.key -> "1"
  ) {
Seq(Tuple1(Seq.empty[Int]))
  .toDF("c0")
  .groupBy(lit(1))
  .agg(typed_count($"c0"), max($"c0"))
  .show()
  }
}
{code}
What I observed is that the partial aggregation phase produces a malformed 
{{UnsafeRow}} after applying the {{resultProjection}} 
[here|https://github.com/apache/spark/blob/07beb5d21c6803e80733149f1560c71cd3cacc86/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/AggregationIterator.scala#L254].

When printed, the malformed {{UnsafeRow}} is always
{noformat}
[0,0,28,280008,100,5a5a5a5a5a5a5a5a]
{noformat}
The {{5a5a5a5a5a5a5a5a}} is interpreted as the length of an {{ArrayData}}. 
Therefore, the JVM blows up when trying to allocate a huge array to deep copy 
this {{ArrayData}} at a later phase.
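
For scale, decoding that {{5a5a...}} word as an element count asks for an 
impossibly large allocation either way (sketch):
{code}
// The 0x5a5a... pattern decoded as a 4-byte or an 8-byte element count:
println(java.lang.Integer.parseInt("5a5a5a5a", 16))        // 1515870810  (~1.5e9)
println(java.lang.Long.parseLong("5a5a5a5a5a5a5a5a", 16))  // 6510615555426900570 (~6.5e18)
{code}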

[~sameer] and [~davies], would you mind having a look at this issue? Thanks!


was (Author: lian cheng):
Here is a minimal test case (add it to {{ObjectHashAggregateSuite}}) that can 
be used to reproduce this issue steadily:
{code}
test("oom") {
  withSQLConf(
SQLConf.USE_OBJECT_HASH_AGG.key -> "true",
SQLConf.OBJECT_AGG_SORT_BASED_FALLBACK_THRESHOLD.key -> "1"
  ) {
Seq(Tuple1(Seq.empty[Int]))
  .toDF("c0")
  .groupBy(lit(1))
  .agg(typed_count($"c0"), max($"c0"))
  .show()
  }
}
{code}
What I observed is that the partial aggregation phase produces a malformed 
{{UnsafeRow}} after applying the {{resultProjection}} 
[here|https://github.com/apache/spark/blob/07beb5d21c6803e80733149f1560c71cd3cacc86/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/AggregationIterator.scala#L254].

When printed, the malformed {{UnsafeRow}} is always
{noformat}
[0,0,28,280008,100,5a5a5a5a5a5a5a5a]
{noformat}
The {{5a5a5a5a5a5a5a5a}} is interpreted as the length of an {{ArrayData}}. 
Therefore, the JVM blows up when trying to allocate a huge array to deep copy 
of this {{ArrayData}} at a later phase.

[~sameer] and [~davies], would you mind to have a look at this issue? Thanks!

> ObjectHashAggregateSuite is being flaky (occasional OOM errors)
> ---
>
> Key: SPARK-18403
> URL: https://issues.apache.org/jira/browse/SPARK-18403
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
> Fix For: 2.2.0
>
>
> This test suite fails occasionally on Jenkins due to OOM errors. I've already 
> reproduced it locally but haven't figured out the root cause.
> We should probably disable it temporarily before getting it fixed so that it 
> doesn't break the PR build too often.






[jira] [Commented] (SPARK-18403) ObjectHashAggregateSuite is being flaky (occasional OOM errors)

2016-11-21 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15685389#comment-15685389
 ] 

Cheng Lian commented on SPARK-18403:


Figured it out. It's caused by a false-sharing issue inside 
{{ObjectAggregationIterator}}: after an {{UnsafeArrayData}} is set into an 
aggregation buffer (which is a safe row), the underlying buffer of the 
{{UnsafeArrayData}} gets overwritten when the iterator steps forward.

I have to say this issue is pretty hard to debug. The large array allocation 
blows up the JVM right away, and you can't find the large array in the heap dump 
since the allocation itself fails. As a result, all the heap dumps are tiny 
(~70MB) compared to the heap size (3GB for default SBT tests), and nothing 
useful shows up in them.

I'm opening a PR to fix this issue.
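
For readers not familiar with the reuse pattern involved, the failure mode is the 
classic "store a reference to a reused mutable buffer without copying" bug. A 
plain-Scala analogy (sketch, not Spark internals):
{code}
val reused = Array(1, 2, 3)
val savedWithoutCopy = reused          // what the aggregation buffer effectively did
val savedWithCopy    = reused.clone()  // the fix: defensive copy before storing

reused(0) = 99                         // the iterator steps forward and reuses the buffer

println(savedWithoutCopy.mkString(","))  // 99,2,3  -- silently corrupted
println(savedWithCopy.mkString(","))     // 1,2,3   -- intact
{code}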

> ObjectHashAggregateSuite is being flaky (occasional OOM errors)
> ---
>
> Key: SPARK-18403
> URL: https://issues.apache.org/jira/browse/SPARK-18403
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
> Fix For: 2.2.0
>
>
> This test suite fails occasionally on Jenkins due to OOM errors. I've already 
> reproduced it locally but haven't figured out the root cause.
> We should probably disable it temporarily before getting it fixed so that it 
> doesn't break the PR build too often.






[jira] [Commented] (SPARK-18403) ObjectHashAggregateSuite is being flaky (occasional OOM errors)

2016-11-21 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15684659#comment-15684659
 ] 

Cheng Lian commented on SPARK-18403:


Here is a minimal test case (add it to {{ObjectHashAggregateSuite}}) that can 
be used to reproduce this issue reliably:
{code}
test("oom") {
  withSQLConf(
SQLConf.USE_OBJECT_HASH_AGG.key -> "true",
SQLConf.OBJECT_AGG_SORT_BASED_FALLBACK_THRESHOLD.key -> "1"
  ) {
Seq(Tuple1(Seq.empty[Int]))
  .toDF("c0")
  .groupBy(lit(1))
  .agg(typed_count($"c0"), max($"c0"))
  .show()
  }
}
{code}
What I observed is that the partial aggregation phase produces a malformed 
{{UnsafeRow}} after applying the {{resultProjection}} 
[here|https://github.com/apache/spark/blob/07beb5d21c6803e80733149f1560c71cd3cacc86/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/AggregationIterator.scala#L254].

When printed, the malformed {{UnsafeRow}} is always
{noformat}
[0,0,28,280008,100,5a5a5a5a5a5a5a5a]
{noformat}
The {{5a5a5a5a5a5a5a5a}} is interpreted as the length of an {{ArrayData}}. 
Therefore, the JVM blows up when trying to allocate a huge array to deep copy 
this {{ArrayData}} at a later phase.

[~sameer] and [~davies], would you mind having a look at this issue? Thanks!

> ObjectHashAggregateSuite is being flaky (occasional OOM errors)
> ---
>
> Key: SPARK-18403
> URL: https://issues.apache.org/jira/browse/SPARK-18403
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
> Fix For: 2.2.0
>
>
> This test suite fails occasionally on Jenkins due to OOM errors. I've already 
> reproduced it locally but haven't figured out the root cause.
> We should probably disable it temporarily before getting it fixed so that it 
> doesn't break the PR build too often.






[jira] [Commented] (SPARK-11785) When deployed against remote Hive metastore with lower versions, JDBC metadata calls throws exception

2016-11-18 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15677469#comment-15677469
 ] 

Cheng Lian commented on SPARK-11785:


I'm not sure which PR fixed this issue, though.

> When deployed against remote Hive metastore with lower versions, JDBC 
> metadata calls throws exception
> -
>
> Key: SPARK-11785
> URL: https://issues.apache.org/jira/browse/SPARK-11785
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1, 1.6.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>Priority: Critical
>
> To reproduce this issue with 1.7-SNAPSHOT
> # Start Hive 0.13.1 metastore service using {{$HIVE_HOME/bin/hive --service 
> metastore}}
> # Configure the remote Hive metastore in {{conf/hive-site.xml}} by pointing 
> {{hive.metastore.uris}} to metastore endpoint (e.g. 
> {{thrift://localhost:9083}})
> # Set {{spark.sql.hive.metastore.version}} to {{0.13.1}} and 
> {{spark.sql.hive.metastore.jars}} to {{maven}} in {{conf/spark-defaults.conf}}
> # Start Thrift server using {{$SPARK_HOME/sbin/start-thriftserver.sh}}
> # Run the testing JDBC client program attached at the end
> Exception thrown from client side:
> {noformat}
> java.sql.SQLException: Could not create ResultSet: Required field 
> 'operationHandle' is unset! 
> Struct:TGetResultSetMetadataReq(operationHandle:null)
> java.sql.SQLException: Could not create ResultSet: Required field 
> 'operationHandle' is unset! 
> Struct:TGetResultSetMetadataReq(operationHandle:null)
> at 
> org.apache.hive.jdbc.HiveQueryResultSet.retrieveSchema(HiveQueryResultSet.java:273)
> at 
> org.apache.hive.jdbc.HiveQueryResultSet.(HiveQueryResultSet.java:188)
> at 
> org.apache.hive.jdbc.HiveQueryResultSet$Builder.build(HiveQueryResultSet.java:170)
> at 
> org.apache.hive.jdbc.HiveDatabaseMetaData.getColumns(HiveDatabaseMetaData.java:222)
> at JDBCExperiments$.main(JDBCExperiments.scala:28)
> at JDBCExperiments.main(JDBCExperiments.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> Caused by: org.apache.thrift.protocol.TProtocolException: Required field 
> 'operationHandle' is unset! 
> Struct:TGetResultSetMetadataReq(operationHandle:null)
> at 
> org.apache.hive.service.cli.thrift.TGetResultSetMetadataReq.validate(TGetResultSetMetadataReq.java:290)
> at 
> org.apache.hive.service.cli.thrift.TCLIService$GetResultSetMetadata_args.validate(TCLIService.java:12041)
> at 
> org.apache.hive.service.cli.thrift.TCLIService$GetResultSetMetadata_args$GetResultSetMetadata_argsStandardScheme.write(TCLIService.java:12098)
> at 
> org.apache.hive.service.cli.thrift.TCLIService$GetResultSetMetadata_args$GetResultSetMetadata_argsStandardScheme.write(TCLIService.java:12067)
> at 
> org.apache.hive.service.cli.thrift.TCLIService$GetResultSetMetadata_args.write(TCLIService.java:12018)
> at org.apache.thrift.TServiceClient.sendBase(TServiceClient.java:63)
> at 
> org.apache.hive.service.cli.thrift.TCLIService$Client.send_GetResultSetMetadata(TCLIService.java:472)
> at 
> org.apache.hive.service.cli.thrift.TCLIService$Client.GetResultSetMetadata(TCLIService.java:464)
> at 
> org.apache.hive.jdbc.HiveQueryResultSet.retrieveSchema(HiveQueryResultSet.java:242)
> at 
> org.apache.hive.jdbc.HiveQueryResultSet.(HiveQueryResultSet.java:188)
> at 
> org.apache.hive.jdbc.HiveQueryResultSet$Builder.build(HiveQueryResultSet.java:170)
> at 
> org.apache.hive.jdbc.HiveDatabaseMetaData.getColumns(HiveDatabaseMetaData.java:222)
> at JDBCExperiments$.main(JDBCExperiments.scala:28)
> at JDBCExperiments.main(JDBCExperiments.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> {noformat}
> Exception thrown from server side:
> {noformat}
> 15/11/18 02:27:01 WARN RetryingMetaStoreClient: MetaStoreClient lost 
> connection. Attempting to reconnect.
> org.apache.thrift.TApplicationException: Invalid method name: 
> 'get_schema_with_environment_context'
> at 
> org.apache.thrift.TApplicationException.read(TApplicationException.java:111)
> at 
> 

[jira] [Commented] (SPARK-11785) When deployed against remote Hive metastore with lower versions, JDBC metadata calls throws exception

2016-11-18 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15677468#comment-15677468
 ] 

Cheng Lian commented on SPARK-11785:


Confirmed that this is no longer an issue in 2.1.

> When deployed against remote Hive metastore with lower versions, JDBC 
> metadata calls throws exception
> -
>
> Key: SPARK-11785
> URL: https://issues.apache.org/jira/browse/SPARK-11785
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1, 1.6.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>Priority: Critical
>
> To reproduce this issue with 1.7-SNAPSHOT
> # Start Hive 0.13.1 metastore service using {{$HIVE_HOME/bin/hive --service 
> metastore}}
> # Configure the remote Hive metastore in {{conf/hive-site.xml}} by pointing 
> {{hive.metastore.uris}} to metastore endpoint (e.g. 
> {{thrift://localhost:9083}})
> # Set {{spark.sql.hive.metastore.version}} to {{0.13.1}} and 
> {{spark.sql.hive.metastore.jars}} to {{maven}} in {{conf/spark-defaults.conf}}
> # Start Thrift server using {{$SPARK_HOME/sbin/start-thriftserver.sh}}
> # Run the testing JDBC client program attached at the end
> Exception thrown from client side:
> {noformat}
> java.sql.SQLException: Could not create ResultSet: Required field 
> 'operationHandle' is unset! 
> Struct:TGetResultSetMetadataReq(operationHandle:null)
> java.sql.SQLException: Could not create ResultSet: Required field 
> 'operationHandle' is unset! 
> Struct:TGetResultSetMetadataReq(operationHandle:null)
> at 
> org.apache.hive.jdbc.HiveQueryResultSet.retrieveSchema(HiveQueryResultSet.java:273)
> at 
> org.apache.hive.jdbc.HiveQueryResultSet.(HiveQueryResultSet.java:188)
> at 
> org.apache.hive.jdbc.HiveQueryResultSet$Builder.build(HiveQueryResultSet.java:170)
> at 
> org.apache.hive.jdbc.HiveDatabaseMetaData.getColumns(HiveDatabaseMetaData.java:222)
> at JDBCExperiments$.main(JDBCExperiments.scala:28)
> at JDBCExperiments.main(JDBCExperiments.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> Caused by: org.apache.thrift.protocol.TProtocolException: Required field 
> 'operationHandle' is unset! 
> Struct:TGetResultSetMetadataReq(operationHandle:null)
> at 
> org.apache.hive.service.cli.thrift.TGetResultSetMetadataReq.validate(TGetResultSetMetadataReq.java:290)
> at 
> org.apache.hive.service.cli.thrift.TCLIService$GetResultSetMetadata_args.validate(TCLIService.java:12041)
> at 
> org.apache.hive.service.cli.thrift.TCLIService$GetResultSetMetadata_args$GetResultSetMetadata_argsStandardScheme.write(TCLIService.java:12098)
> at 
> org.apache.hive.service.cli.thrift.TCLIService$GetResultSetMetadata_args$GetResultSetMetadata_argsStandardScheme.write(TCLIService.java:12067)
> at 
> org.apache.hive.service.cli.thrift.TCLIService$GetResultSetMetadata_args.write(TCLIService.java:12018)
> at org.apache.thrift.TServiceClient.sendBase(TServiceClient.java:63)
> at 
> org.apache.hive.service.cli.thrift.TCLIService$Client.send_GetResultSetMetadata(TCLIService.java:472)
> at 
> org.apache.hive.service.cli.thrift.TCLIService$Client.GetResultSetMetadata(TCLIService.java:464)
> at 
> org.apache.hive.jdbc.HiveQueryResultSet.retrieveSchema(HiveQueryResultSet.java:242)
> at 
> org.apache.hive.jdbc.HiveQueryResultSet.(HiveQueryResultSet.java:188)
> at 
> org.apache.hive.jdbc.HiveQueryResultSet$Builder.build(HiveQueryResultSet.java:170)
> at 
> org.apache.hive.jdbc.HiveDatabaseMetaData.getColumns(HiveDatabaseMetaData.java:222)
> at JDBCExperiments$.main(JDBCExperiments.scala:28)
> at JDBCExperiments.main(JDBCExperiments.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> {noformat}
> Exception thrown from server side:
> {noformat}
> 15/11/18 02:27:01 WARN RetryingMetaStoreClient: MetaStoreClient lost 
> connection. Attempting to reconnect.
> org.apache.thrift.TApplicationException: Invalid method name: 
> 'get_schema_with_environment_context'
> at 
> org.apache.thrift.TApplicationException.read(TApplicationException.java:111)
> at 
> 

[jira] [Comment Edited] (SPARK-18251) DataSet API | RuntimeException: Null value appeared in non-nullable field when holding Option Case Class

2016-11-18 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15677396#comment-15677396
 ] 

Cheng Lian edited comment on SPARK-18251 at 11/18/16 6:38 PM:
--

I'd prefer option 1 for consistency of the semantics, and I don't think this is 
really a bug, since {{Option\[T\]}} shouldn't be used as a top-level 
{{Dataset}} type anyway.

While doing schema inference, Catalyst always translates {{Option\[T\]}} to the 
nullable version of {{T'}}, where {{T'}} is the inferred data type of {{T}}. 
Take {{case class A(i: Option\[Int\])}} as an example: if we go for option 2, 
what should the inferred schema of {{A}} be? To keep the original 
semantics, it should be
{noformat}
new StructType()
  .add("i", IntegerType, nullable = true)
{noformat}
while option 2 requires
{noformat}
new StructType()
  .add("i", new StructType()
.add("value", IntegerType, nullable = true), nullable = true)
{noformat}
since now {{Option\[T\]}} is treated as a single field struct.

Option 1 keeps the current semantics, which is pretty clear and easy to reason 
about, while option 2 either introduces inconsistency or requires us to further 
special case schema inference for top level {{Dataset}} types.
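For reference, here is a minimal spark-shell sketch (Spark 2.x assumed) of the 
option-1 semantics described above, where {{Option\[Int\]}} is inferred as a 
plain nullable {{IntegerType}} column rather than a single-field struct:
{code}
case class A(i: Option[Int])

// Schema inference treats Option[Int] as a nullable integer column.
val ds = Seq(A(Some(1)), A(None)).toDS()
ds.printSchema()
// root
//  |-- i: integer (nullable = true)
{code}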



was (Author: lian cheng):
I'd prefer option 1 because of consistency of the semantics, and I don't think 
this is really a bug since {{Option\[T\]}} shouldn't be used as top level 
{{Dataset}} types anyway.

While doing schema inference, Catalyst always translates {{Option\[T\]}} to the 
nullable version of {{T'}}, where {{T'}} is the inferred data type of {{T}}. 
Take {{case class A(i: Option\[Int\])}} as an example, if we go for option 2, 
then what should the inferred schema of {{A}} be? To keep the original 
semantics, it should be
{noformat}
new StructType()
  .add("i", IntegerType, nullable = true)
{noformat}
while option 2 requires
{noformat}
new StructType()
  .add("i", new StructType()
.add("value", IntegerType, nullable = true), nullable = true)
{noformat}
since now {{Option\[T\]}} is treated as a single field struct.

Option 1 keeps the current semantics, which is pretty clear and easy to reason 
about, while option 2 either introduce inconsistency or requires further 
special casing schema inference for top level {{Dataset}} types.


> DataSet API | RuntimeException: Null value appeared in non-nullable field 
> when holding Option Case Class
> 
>
> Key: SPARK-18251
> URL: https://issues.apache.org/jira/browse/SPARK-18251
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.1
> Environment: OS X
>Reporter: Aniket Bhatnagar
>
> I am running into a runtime exception when a DataSet is holding an Empty 
> object instance for an Option type that is holding non-nullable field. For 
> instance, if we have the following case class:
> case class DataRow(id: Int, value: String)
> Then, DataSet[Option[DataRow]] can only hold Some(DataRow) objects and cannot 
> hold Empty. If it does so, the following exception is thrown:
> {noformat}
> Exception in thread "main" org.apache.spark.SparkException: Job aborted due 
> to stage failure: Task 6 in stage 0.0 failed 1 times, most recent failure: 
> Lost task 6.0 in stage 0.0 (TID 6, localhost): java.lang.RuntimeException: 
> Null value appeared in non-nullable field:
> - field (class: "scala.Int", name: "id")
> - option value class: "DataSetOptBug.DataRow"
> - root class: "scala.Option"
> If the schema is inferred from a Scala tuple/case class, or a Java bean, 
> please try to use scala.Option[_] or other nullable types (e.g. 
> java.lang.Integer instead of int/scala.Int).
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
>   at org.apache.spark.scheduler.Task.run(Task.scala:86)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> 

[jira] [Comment Edited] (SPARK-18251) DataSet API | RuntimeException: Null value appeared in non-nullable field when holding Option Case Class

2016-11-18 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15677396#comment-15677396
 ] 

Cheng Lian edited comment on SPARK-18251 at 11/18/16 6:37 PM:
--

I'd prefer option 1 because of consistency of the semantics, and I don't think 
this is really a bug since {{Option\[T\]}} shouldn't be used as top level 
{{Dataset}} types anyway.

While doing schema inference, Catalyst always translates {{Option\[T\]}} to the 
nullable version of {{T'}}, where {{T'}} is the inferred data type of {{T}}. 
Take {{case class A(i: Option\[Int\])}} as an example, if we go for option 2, 
then what should the inferred schema of {{A}} be? To keep the original 
semantics, it should be
{noformat}
new StructType()
  .add("i", IntegerType, nullable = true)
{noformat}
while option 2 requires
{noformat}
new StructType()
  .add("i", new StructType()
.add("value", IntegerType, nullable = true), nullable = true)
{noformat}
since now {{Option\[T\]}} is treated as a single field struct.

Option 1 keeps the current semantics, which is pretty clear and easy to reason 
about, while option 2 either introduce inconsistency or requires further 
special casing schema inference for top level {{Dataset}} types.



was (Author: lian cheng):
I'd prefer option 1 because of consistency of the semantics, and I don't think 
this is really a bug since {{Option\[T\]}} shouldn't be used as top level 
{{Dataset}} types anyway.

While doing schema inference, Catalyst always treats {{Option\[T\]}} as the 
nullable version of {{T'}}, where {{T'}} is the inferred data type of {{T}}. 
Take {{case class A(i: Option\[Int\])}} as an example, if we go for option 2, 
then what should the inferred schema of {{A}} be? To keep the original 
semantics, it should be
{noformat}
new StructType()
  .add("i", IntegerType, nullable = true)
{noformat}
while option 2 requires
{noformat}
new StructType()
  .add("i", new StructType()
.add("value", IntegerType, nullable = true), nullable = true)
{noformat}
since now {{Option\[T\]}} is treated as a single field struct.

Option 1 keeps the current semantics, which is pretty clear and easy to reason 
about, while option 2 either introduce inconsistency or requires further 
special casing schema inference for top level {{Dataset}} types.


> DataSet API | RuntimeException: Null value appeared in non-nullable field 
> when holding Option Case Class
> 
>
> Key: SPARK-18251
> URL: https://issues.apache.org/jira/browse/SPARK-18251
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.1
> Environment: OS X
>Reporter: Aniket Bhatnagar
>
> I am running into a runtime exception when a DataSet is holding an Empty 
> object instance for an Option type that is holding non-nullable field. For 
> instance, if we have the following case class:
> case class DataRow(id: Int, value: String)
> Then, DataSet[Option[DataRow]] can only hold Some(DataRow) objects and cannot 
> hold Empty. If it does so, the following exception is thrown:
> {noformat}
> Exception in thread "main" org.apache.spark.SparkException: Job aborted due 
> to stage failure: Task 6 in stage 0.0 failed 1 times, most recent failure: 
> Lost task 6.0 in stage 0.0 (TID 6, localhost): java.lang.RuntimeException: 
> Null value appeared in non-nullable field:
> - field (class: "scala.Int", name: "id")
> - option value class: "DataSetOptBug.DataRow"
> - root class: "scala.Option"
> If the schema is inferred from a Scala tuple/case class, or a Java bean, 
> please try to use scala.Option[_] or other nullable types (e.g. 
> java.lang.Integer instead of int/scala.Int).
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
>   at org.apache.spark.scheduler.Task.run(Task.scala:86)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> 

[jira] [Commented] (SPARK-18251) DataSet API | RuntimeException: Null value appeared in non-nullable field when holding Option Case Class

2016-11-18 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15677396#comment-15677396
 ] 

Cheng Lian commented on SPARK-18251:


I'd prefer option 1 because of consistency of the semantics, and I don't think 
this is really a bug since {{Option\[T\]}} shouldn't be used as top level 
{{Dataset}} types anyway.

While doing schema inference, Catalyst always treats {{Option\[T\]}} as the 
nullable version of {{T'}}, where {{T'}} is the inferred data type of {{T}}. 
Take {{case class A(i: Option\[Int\])}} as an example, if we go for option 2, 
then what should the inferred schema of {{A}} be? To keep the original 
semantics, it should be
{noformat}
new StructType()
  .add("i", IntegerType, nullable = true)
{noformat}
while option 2 requires
{noformat}
new StructType()
  .add("i", new StructType()
.add("value", IntegerType, nullable = true), nullable = true)
{noformat}
since now {{Option\[T\]}} is treated as a single field struct.

Option 1 keeps the current semantics, which is pretty clear and easy to reason 
about, while option 2 either introduce inconsistency or requires further 
special casing schema inference for top level {{Dataset}} types.


> DataSet API | RuntimeException: Null value appeared in non-nullable field 
> when holding Option Case Class
> 
>
> Key: SPARK-18251
> URL: https://issues.apache.org/jira/browse/SPARK-18251
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.1
> Environment: OS X
>Reporter: Aniket Bhatnagar
>
> I am running into a runtime exception when a DataSet is holding an Empty 
> object instance for an Option type that is holding non-nullable field. For 
> instance, if we have the following case class:
> case class DataRow(id: Int, value: String)
> Then, DataSet[Option[DataRow]] can only hold Some(DataRow) objects and cannot 
> hold Empty. If it does so, the following exception is thrown:
> {noformat}
> Exception in thread "main" org.apache.spark.SparkException: Job aborted due 
> to stage failure: Task 6 in stage 0.0 failed 1 times, most recent failure: 
> Lost task 6.0 in stage 0.0 (TID 6, localhost): java.lang.RuntimeException: 
> Null value appeared in non-nullable field:
> - field (class: "scala.Int", name: "id")
> - option value class: "DataSetOptBug.DataRow"
> - root class: "scala.Option"
> If the schema is inferred from a Scala tuple/case class, or a Java bean, 
> please try to use scala.Option[_] or other nullable types (e.g. 
> java.lang.Integer instead of int/scala.Int).
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
>   at org.apache.spark.scheduler.Task.run(Task.scala:86)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {noformat}
> The bug can be reproduced using the program: 
> https://gist.github.com/aniketbhatnagar/2ed74613f70d2defe999c18afaa4816e



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18451) Always set -XX:+HeapDumpOnOutOfMemoryError for Spark tests

2016-11-15 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-18451:
--

 Summary: Always set -XX:+HeapDumpOnOutOfMemoryError for Spark tests
 Key: SPARK-18451
 URL: https://issues.apache.org/jira/browse/SPARK-18451
 Project: Spark
  Issue Type: Bug
  Components: Build, Tests
Reporter: Cheng Lian


It would be nice if we always set {{-XX:+HeapDumpOnOutOfMemoryError}} and 
{{-XX:HeapDumpPath}} for open source Spark tests, so that it would be easier 
to investigate issues like SC-5041.

Note:

- We need to ensure that the heap dumps are stored in a location on Jenkins 
that won't be automatically cleaned up.
- It would be nice to be able to customize the heap dump output paths on a 
per-build basis so that it's easier to find the heap dump file of any given 
build.

The 2nd point is optional since we can probably identify the relevant heap dump 
files by looking at their creation timestamps.
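As a rough sketch of how these flags could be wired in (the SBT fragment and the 
dump directory below are illustrative assumptions, not the actual build 
configuration):
{code}
// SBT build fragment (sketch): fork the test JVMs and pass the heap-dump flags.
// The output directory is a hypothetical Jenkins path.
fork in Test := true

javaOptions in Test ++= Seq(
  "-XX:+HeapDumpOnOutOfMemoryError",
  "-XX:HeapDumpPath=/home/jenkins/heap-dumps"
)
{code}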



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18403) ObjectHashAggregateSuite is being flaky (occasional OOM errors)

2016-11-10 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-18403:
--

 Summary: ObjectHashAggregateSuite is being flaky (occasional OOM 
errors)
 Key: SPARK-18403
 URL: https://issues.apache.org/jira/browse/SPARK-18403
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
Reporter: Cheng Lian
Assignee: Cheng Lian


This test suite fails occasionally on Jenkins due to OOM errors. I've already 
reproduced it locally but haven't figured out the root cause.

We should probably disable it temporarily before getting it fixed so that it 
doesn't break the PR build too often.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18390) Optimized plan tried to use Cartesian join when it is not enabled

2016-11-09 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15652202#comment-15652202
 ] 

Cheng Lian commented on SPARK-18390:


I think this issue has already been fixed by SPARK-17298 and 
https://github.com/apache/spark/pull/14866

> Optimized plan tried to use Cartesian join when it is not enabled
> -
>
> Key: SPARK-18390
> URL: https://issues.apache.org/jira/browse/SPARK-18390
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Xiangrui Meng
>Assignee: Srinath
>
> {code}
> val df2 = spark.range(1e9.toInt).withColumn("one", lit(1))
> val df3 = spark.range(1e9.toInt)
> df3.join(df2, df3("id") === df2("one")).count()
> {code}
> throws
> bq. org.apache.spark.sql.AnalysisException: Cartesian joins could be 
> prohibitively expensive and are disabled by default. To explicitly enable 
> them, please set spark.sql.crossJoin.enabled = true;
> This is probably not the right behavior because it was not the user who 
> suggested using cartesian product. SQL picked it while knowing it is not 
> enabled.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18390) Optimized plan tried to use Cartesian join when it is not enabled

2016-11-09 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-18390:
---
Description: 
{code}
val df2 = spark.range(1e9.toInt).withColumn("one", lit(1))
val df3 = spark.range(1e9.toInt)
df3.join(df2, df3("id") === df2("one")).count()
{code}

throws

bq. org.apache.spark.sql.AnalysisException: Cartesian joins could be 
prohibitively expensive and are disabled by default. To explicitly enable them, 
please set spark.sql.crossJoin.enabled = true;

This is probably not the right behavior because it was not the user who 
suggested using cartesian product. SQL picked it while knowing it is not 
enabled.

  was:
I hit this error when I tried to test skewed joins.

{code}
val df2 = spark.range(1e9.toInt).withColumn("one", lit(1))
val df3 = spark.range(1e9.toInt)
df3.join(df2, df3("id") === df2("one")).count()
{code}

throws

{code}
org.apache.spark.sql.AnalysisException: Cartesian joins could be prohibitively 
expensive and are disabled by default. To explicitly enable them, please set 
spark.sql.crossJoin.enabled = true;
{code}

This is probably not the right behavior because it was not the user who 
suggested using cartesian product. SQL picked it while knowing it is not 
enabled.


> Optimized plan tried to use Cartesian join when it is not enabled
> -
>
> Key: SPARK-18390
> URL: https://issues.apache.org/jira/browse/SPARK-18390
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Xiangrui Meng
>
> {code}
> val df2 = spark.range(1e9.toInt).withColumn("one", lit(1))
> val df3 = spark.range(1e9.toInt)
> df3.join(df2, df3("id") === df2("one")).count()
> {code}
> throws
> bq. org.apache.spark.sql.AnalysisException: Cartesian joins could be 
> prohibitively expensive and are disabled by default. To explicitly enable 
> them, please set spark.sql.crossJoin.enabled = true;
> This is probably not the right behavior because it was not the user who 
> suggested using cartesian product. SQL picked it while knowing it is not 
> enabled.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18338) ObjectHashAggregateSuite fails under Maven builds

2016-11-07 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-18338:
---
Description: 
Test case initialization order under Maven and SBT is different. Maven always 
creates instances of all test cases first and then runs them all together.

This breaks {{ObjectHashAggregateSuite}} because the randomized test cases there 
register a temporary Hive function right before creating each test case, and that 
function can be cleared while other successive test cases are being initialized.

In SBT, this is fine since each created test case is executed immediately after 
the temporary function is created. 

To fix this issue, we should put initialization/destruction code into 
{{beforeAll()}} and {{afterAll()}}.
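A minimal ScalaTest-style sketch of that fix (the suite name and the 
register/unregister helpers are placeholders, not the actual Spark test code):
{code}
import org.scalatest.{BeforeAndAfterAll, FunSuite}

class TempFunctionSuiteSketch extends FunSuite with BeforeAndAfterAll {

  override def beforeAll(): Unit = {
    super.beforeAll()
    // Register the temporary Hive function once, before any test case runs, so
    // that Maven's create-all-then-run-all ordering cannot clear it in between.
    registerTempFunction("my_temp_udaf")
  }

  override def afterAll(): Unit = {
    try {
      // Clean up only after every test case has finished.
      unregisterTempFunction("my_temp_udaf")
    } finally {
      super.afterAll()
    }
  }

  test("uses the temporary function") {
    // A test body relying on my_temp_udaf would go here.
  }

  // Placeholder helpers standing in for the real session catalog calls.
  private def registerTempFunction(name: String): Unit = ()
  private def unregisterTempFunction(name: String): Unit = ()
}
{code}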


  was:
Test case initialization order under Maven and SBT are different. Maven always 
creates instances of all test cases and then run them altogether.

This fails {{ObjectHashAggregateSuite}} because the randomized test cases their 
registers a temporary Hive function right before creating a test case, and can 
be cleared while initializing other successive test cases.

In SBT, this is fine since the created test case is executed immediately after 
creating the temporary function. 

To fix this issue, we should put initialization/destruction code into 
{{beforeAll()}} and {{afterAll()}}.



> ObjectHashAggregateSuite fails under Maven builds
> -
>
> Key: SPARK-18338
> URL: https://issues.apache.org/jira/browse/SPARK-18338
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>  Labels: flaky-test
>
> Test case initialization order under Maven and SBT are different. Maven 
> always creates instances of all test cases and then run them all together.
> This fails {{ObjectHashAggregateSuite}} because the randomized test cases 
> there register a temporary Hive function right before creating a test case, 
> and can be cleared while initializing other successive test cases.
> In SBT, this is fine since the created test case is executed immediately 
> after creating the temporary function. 
> To fix this issue, we should put initialization/destruction code into 
> {{beforeAll()}} and {{afterAll()}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18338) ObjectHashAggregateSuite fails under Maven builds

2016-11-07 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-18338:
--

 Summary: ObjectHashAggregateSuite fails under Maven builds
 Key: SPARK-18338
 URL: https://issues.apache.org/jira/browse/SPARK-18338
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
Reporter: Cheng Lian
Assignee: Cheng Lian


Test case initialization order under Maven and SBT are different. Maven always 
creates instances of all test cases and then run them altogether.

This fails {{ObjectHashAggregateSuite}} because the randomized test cases their 
registers a temporary Hive function right before creating a test case, and can 
be cleared while initializing other successive test cases.

In SBT, this is fine since the created test case is executed immediately after 
creating the temporary function. 

To fix this issue, we should put initialization/destruction code into 
{{beforeAll()}} and {{afterAll()}}.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17972) Query planning slows down dramatically for large query plans even when sub-trees are cached

2016-11-02 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-17972:
---
Description: 
The following Spark shell snippet creates a series of query plans that grow 
exponentially. The {{i}}-th plan is created using 4 *cached* copies of the {{i 
- 1}}-th plan.

{code}
(0 until 6).foldLeft(Seq(1, 2, 3).toDS) { (plan, iteration) =>
  val start = System.currentTimeMillis()
  val result = plan.join(plan, "value").join(plan, "value").join(plan, 
"value").join(plan, "value")
  result.cache()
  System.out.println(s"Iteration $iteration takes time 
${System.currentTimeMillis() - start} ms")
  result.as[Int]
}
{code}

We can see that although all plans are cached, the query planning time still 
grows exponentially and quickly becomes unbearable.

{noformat}
Iteration 0 takes time 9 ms
Iteration 1 takes time 19 ms
Iteration 2 takes time 61 ms
Iteration 3 takes time 219 ms
Iteration 4 takes time 830 ms
Iteration 5 takes time 4080 ms
{noformat}

Similar scenarios can be found in iterative ML code and significantly affect 
usability.

This issue can be fixed by introducing a {{checkpoint()}} method for 
{{Dataset}} that truncates both the query plan and the lineage of the 
underlying RDD.
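As a hedged sketch of that fix, assuming the {{Dataset.checkpoint()}} API and a 
checkpoint directory configured via {{SparkContext.setCheckpointDir}} (the 
directory below is a placeholder), the loop above could truncate each 
iteration's plan like this:
{code}
spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")  // placeholder path

(0 until 6).foldLeft(Seq(1, 2, 3).toDS) { (plan, iteration) =>
  val start = System.currentTimeMillis()
  val result = plan.join(plan, "value").join(plan, "value").join(plan, "value").join(plan, "value")
  // checkpoint() materializes the result and returns a Dataset backed by a
  // truncated plan, so later iterations no longer re-analyze the whole tree.
  val checkpointed = result.checkpoint()
  System.out.println(s"Iteration $iteration takes time ${System.currentTimeMillis() - start} ms")
  checkpointed.as[Int]
}
{code}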

  was:
The following Spark shell snippet creates a series of query plans that grow 
exponentially. The {{i}}-th plan is created using 4 *cached* copies of the {{i 
- 1}}-th plan.

{code}
(0 until 6).foldLeft(Seq(1, 2, 3).toDS) { (plan, iteration) =>
  val start = System.currentTimeMillis()
  val result = plan.join(plan, "value").join(plan, "value").join(plan, 
"value").join(plan, "value")
  result.cache()
  System.out.println(s"Iteration $iteration takes time 
${System.currentTimeMillis() - start} ms")
  result.as[Int]
}
{code}

We can see that although all plans are cached, the query planning time still 
grows exponentially and quickly becomes unbearable.

{noformat}
Iteration 0 takes time 9 ms
Iteration 1 takes time 19 ms
Iteration 2 takes time 61 ms
Iteration 3 takes time 219 ms
Iteration 4 takes time 830 ms
Iteration 5 takes time 4080 ms
{noformat}

Similar scenarios can be found in iterative ML code and significantly affects 
usability.


> Query planning slows down dramatically for large query plans even when 
> sub-trees are cached
> ---
>
> Key: SPARK-17972
> URL: https://issues.apache.org/jira/browse/SPARK-17972
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.2, 2.0.1
>Reporter: Cheng Lian
>Assignee: Cheng Lian
> Fix For: 2.1.0
>
>
> The following Spark shell snippet creates a series of query plans that grow 
> exponentially. The {{i}}-th plan is created using 4 *cached* copies of the 
> {{i - 1}}-th plan.
> {code}
> (0 until 6).foldLeft(Seq(1, 2, 3).toDS) { (plan, iteration) =>
>   val start = System.currentTimeMillis()
>   val result = plan.join(plan, "value").join(plan, "value").join(plan, 
> "value").join(plan, "value")
>   result.cache()
>   System.out.println(s"Iteration $iteration takes time 
> ${System.currentTimeMillis() - start} ms")
>   result.as[Int]
> }
> {code}
> We can see that although all plans are cached, the query planning time still 
> grows exponentially and quickly becomes unbearable.
> {noformat}
> Iteration 0 takes time 9 ms
> Iteration 1 takes time 19 ms
> Iteration 2 takes time 61 ms
> Iteration 3 takes time 219 ms
> Iteration 4 takes time 830 ms
> Iteration 5 takes time 4080 ms
> {noformat}
> Similar scenarios can be found in iterative ML code and significantly affects 
> usability.
> This issue can be fixed by introducing a {{checkpoint()}} method for 
> {{Dataset}} that truncates both the query plan and the lineage of the 
> underlying RDD.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11879) Checkpoint support for DataFrame/Dataset

2016-11-02 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian resolved SPARK-11879.

Resolution: Duplicate

> Checkpoint support for DataFrame/Dataset
> 
>
> Key: SPARK-11879
> URL: https://issues.apache.org/jira/browse/SPARK-11879
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Cristian Opris
>
> Explicit support for checkpointing DataFrames is needed to be able to truncate 
> lineages, prune the query plan (particularly the logical plan), and 
> support transparent failure recovery.
> While for recovery saving to a Parquet file may be sufficient, actually using 
> that as a checkpoint (and truncating the lineage), requires reading the files 
> back.
> This is required to be able to use DataFrames in iterative scenarios like 
> Streaming and ML, as well as for avoiding expensive re-computations in case 
> of executor failure when executing a complex chain of queries on very large 
> datasets. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11879) Checkpoint support for DataFrame/Dataset

2016-11-02 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15630537#comment-15630537
 ] 

Cheng Lian commented on SPARK-11879:


Sorry that I didn't notice this ticket while working on SPARK-17972. 
{{Dataset.checkpoint()}} is included in Spark 2.1.

I'll mark this one as a duplicate.

> Checkpoint support for DataFrame/Dataset
> 
>
> Key: SPARK-11879
> URL: https://issues.apache.org/jira/browse/SPARK-11879
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Cristian Opris
>
> Explicit support for checkpointing DataFrames is needed to be able to truncate 
> lineages, prune the query plan (particularly the logical plan), and 
> support transparent failure recovery.
> While for recovery saving to a Parquet file may be sufficient, actually using 
> that as a checkpoint (and truncating the lineage), requires reading the files 
> back.
> This is required to be able to use DataFrames in iterative scenarios like 
> Streaming and ML, as well as for avoiding expensive re-computations in case 
> of executor failure when executing a complex chain of queries on very large 
> datasets. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18209) More robust view canonicalization without full SQL expansion

2016-11-01 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15626823#comment-15626823
 ] 

Cheng Lian commented on SPARK-18209:


One problem with the proposed approach is that our SQL parser doesn't allow CTEs 
to be used in subqueries, but we can change that.

I'm not quite sure why the Presto parser explicitly prohibits this syntax (our SQL 
parser is derived from the Presto parser).
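For illustration, a spark-shell sketch of the kind of query this refers to, with 
a CTE nested inside a subquery (the query itself is hypothetical); the current 
parser raises a parse error on this shape:
{code}
// Hypothetical example: a CTE inside a subquery, which the current SQL parser rejects.
sql("SELECT * FROM (WITH t AS (SELECT 1 AS id) SELECT * FROM t) x")
{code}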

> More robust view canonicalization without full SQL expansion
> 
>
> Key: SPARK-18209
> URL: https://issues.apache.org/jira/browse/SPARK-18209
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Priority: Critical
>
> Spark SQL currently stores views by analyzing the provided SQL and then 
> generating fully expanded SQL out of the analyzed logical plan. This is 
> actually a very error prone way of doing it, because:
> 1. It is non-trivial to guarantee that the generated SQL is correct without 
> being extremely verbose, given the current set of operators.
> 2. We need extensive testing for all combination of operators.
> 3. Whenever we introduce a new logical plan operator, we need to be super 
> careful because it might break SQL generation. This is the main reason 
> broadcast join hint has taken forever to be merged because it is very 
> difficult to guarantee correctness.
> Given that the two primary reasons to do view canonicalization are to provide 
> the database context as well as star expansion, I think we can achieve this 
> through a simpler approach: take the user-given SQL, analyze it, and 
> just wrap the original SQL with an outer SELECT clause, storing the 
> database as a hint.
> For example, given the following view creation SQL:
> {code}
> USE DATABASE my_db;
> CREATE TABLE my_table (id int, name string);
> CREATE VIEW my_view AS SELECT * FROM my_table WHERE id > 10;
> {code}
> We store the following SQL instead:
> {code}
> SELECT /*+ current_db: `my_db` */ id, name FROM (SELECT * FROM my_table WHERE 
> id > 10);
> {code}
> During parsing, we expand the view using the provided database context.
> (We don't need to follow exactly the same hint, as I'm merely illustrating 
> the high level approach here.)
> Note that there is a chance that the underlying base table(s)' schemas change 
> and the stored schema of the view then differs from the actual SQL schema. In 
> that case, I think we should throw an exception at runtime to warn users. 
> This exception can be controlled by a flag.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18186) Migrate HiveUDAFFunction to TypedImperativeAggregate for partial aggregation support

2016-10-31 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-18186:
--

 Summary: Migrate HiveUDAFFunction to TypedImperativeAggregate for 
partial aggregation support
 Key: SPARK-18186
 URL: https://issues.apache.org/jira/browse/SPARK-18186
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.0.1, 1.6.2
Reporter: Cheng Lian
Assignee: Cheng Lian


Currently, Hive UDAFs in Spark SQL don't support partial aggregation. Any query 
involving any Hive UDAFs has to fall back to {{SortAggregateExec}} without 
partial aggregation.

This issue can be fixed by migrating {{HiveUDAFFunction}} to 
{{TypedImperativeAggregate}}, which already provides partial aggregation 
support for aggregate functions that may use arbitrary Java objects as 
aggregation states.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18053) ARRAY equality is broken in Spark 2.0

2016-10-24 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15602974#comment-15602974
 ] 

Cheng Lian commented on SPARK-18053:


Yea, reproduced using 2.0.

> ARRAY equality is broken in Spark 2.0
> -
>
> Key: SPARK-18053
> URL: https://issues.apache.org/jira/browse/SPARK-18053
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Cheng Lian
>Assignee: Wenchen Fan
>  Labels: correctness
>
> The following Spark shell reproduces this issue:
> {code}
> case class Test(a: Seq[Int])
> Seq(Test(Seq(1))).toDF().createOrReplaceTempView("t")
> sql("SELECT a FROM t WHERE a = array(1)").show()
> // +---+
> // |  a|
> // +---+
> // +---+
> sql("SELECT a FROM (SELECT array(1) AS a) x WHERE x.a = array(1)").show()
> // +---+
> // |  a|
> // +---+
> // |[1]|
> // +---+
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18053) ARRAY equality is broken in Spark 2.0

2016-10-24 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15602969#comment-15602969
 ] 

Cheng Lian commented on SPARK-18053:


Hm, the user mailing list thread said that it fails under 2.0: 
https://lists.apache.org/thread.html/%3c1476953644701-27926.p...@n3.nabble.com%3E

I haven't verified it under 2.0 yet.

> ARRAY equality is broken in Spark 2.0
> -
>
> Key: SPARK-18053
> URL: https://issues.apache.org/jira/browse/SPARK-18053
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Cheng Lian
>Assignee: Wenchen Fan
>  Labels: correctness
>
> The following Spark shell reproduces this issue:
> {code}
> case class Test(a: Seq[Int])
> Seq(Test(Seq(1))).toDF().createOrReplaceTempView("t")
> sql("SELECT a FROM t WHERE a = array(1)").show()
> // +---+
> // |  a|
> // +---+
> // +---+
> sql("SELECT a FROM (SELECT array(1) AS a) x WHERE x.a = array(1)").show()
> // +---+
> // |  a|
> // +---+
> // |[1]|
> // +---+
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18058) AnalysisException may be thrown when union two DFs whose struct fields have different nullability

2016-10-21 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-18058:
--

 Summary: AnalysisException may be thrown when union two DFs whose 
struct fields have different nullability
 Key: SPARK-18058
 URL: https://issues.apache.org/jira/browse/SPARK-18058
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.1, 1.6.2
Reporter: Cheng Lian


The following Spark shell snippet reproduces this issue:
{code}
spark.range(10).createOrReplaceTempView("t1")
spark.range(10).map(i => i: 
java.lang.Long).toDF("id").createOrReplaceTempView("t2")
sql("SELECT struct(id) FROM t1 UNION ALL SELECT struct(id) FROM t2")
{code}

{noformat}
org.apache.spark.sql.AnalysisException: Union can only be performed on tables 
with the compatible column types. StructType(StructField(id,LongType,true)) <> 
StructType(StructField(id,LongType,false)) at the first column of the second 
table;
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:40)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:57)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$11$$anonfun$apply$12.apply(CheckAnalysis.scala:291)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$11$$anonfun$apply$12.apply(CheckAnalysis.scala:289)
  at scala.collection.immutable.List.foreach(List.scala:381)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$11.apply(CheckAnalysis.scala:289)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$11.apply(CheckAnalysis.scala:278)
  at scala.collection.immutable.List.foreach(List.scala:381)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:278)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:67)
  at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:132)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:67)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:57)
  at 
org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:49)
  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:61)
  at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:573)
  ... 50 elided
{noformat}

The reason is that we treat two {{StructType}}s as incompatible even if they only 
differ from each other in field nullability.
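As a minimal follow-up sketch (spark-shell, continuing the snippet above), the 
difference can be made visible by comparing the inferred schemas of the two 
sides; the nested {{id}} field is non-nullable on the {{t1}} side and nullable 
on the {{t2}} side:
{code}
// Compare the schemas of the two sides of the union; the nested "id" field
// differs only in its nullable flag.
sql("SELECT struct(id) AS s FROM t1").printSchema()
sql("SELECT struct(id) AS s FROM t2").printSchema()
{code}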



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17949) Introduce a JVM object based aggregate operator

2016-10-21 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-17949:
---
Description: 
The new Tungsten execution engine has very robust memory management and speed 
for simple data types. It does, however, suffer from the following:

# For user-defined aggregates (Hive UDAFs, Dataset typed operators), it is 
fairly expensive to fit into the Tungsten internal format.
# For aggregate functions that require complex intermediate data structures, 
Unsafe (on raw bytes) is not a good programming abstraction due to the lack of 
structs.

The idea here is to introduce an JVM object based hash aggregate operator that 
can support the aforementioned use cases. This operator, however, should limit 
its memory usage to avoid putting too much pressure on GC, e.g. falling back to 
sort-based aggregate as soon the number of objects exceed a very low threshold.

Internally at Databricks we prototyped a version of this for a customer POC and 
have observed substantial speed-ups over existing Spark.



  was:
The new Tungsten execution engine has very robust memory management and speed 
for simple data types. It does, however, suffer from the following:

1. For user defined aggregates (Hive UDAFs, Dataset typed operators), it is 
fairly expensive to fit into the Tungsten internal format.

2. For aggregate functions that require complex intermediate data structures, 
Unsafe (on raw bytes) is not a good programming abstraction due to the lack of 
structs.

The idea here is to introduce an JVM object based hash aggregate operator that 
can support the aforementioned use cases. This operator, however, should limit 
its memory usage to avoid putting too much pressure on GC, e.g. falling back to 
sort-based aggregate as soon the number of objects exceed a very low threshold.

Internally at Databricks we prototyped a version of this for a customer POC and 
have observed substantial speed-ups over existing Spark.




> Introduce a JVM object based aggregate operator
> ---
>
> Key: SPARK-17949
> URL: https://issues.apache.org/jira/browse/SPARK-17949
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Cheng Lian
> Attachments: [Design Doc] Support for Arbitrary Aggregation States.pdf
>
>
> The new Tungsten execution engine has very robust memory management and speed 
> for simple data types. It does, however, suffer from the following:
> # For user-defined aggregates (Hive UDAFs, Dataset typed operators), it is 
> fairly expensive to fit into the Tungsten internal format.
> # For aggregate functions that require complex intermediate data structures, 
> Unsafe (on raw bytes) is not a good programming abstraction due to the lack 
> of structs.
> The idea here is to introduce an JVM object based hash aggregate operator 
> that can support the aforementioned use cases. This operator, however, should 
> limit its memory usage to avoid putting too much pressure on GC, e.g. falling 
> back to sort-based aggregate as soon the number of objects exceed a very low 
> threshold.
> Internally at Databricks we prototyped a version of this for a customer POC 
> and have observed substantial speed-ups over existing Spark.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17949) Introduce a JVM object based aggregate operator

2016-10-21 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-17949:
---
Description: 
The new Tungsten execution engine has very robust memory management and speed 
for simple data types. It does, however, suffer from the following:

# For user-defined aggregates (Hive UDAFs, Dataset typed operators), it is 
fairly expensive to fit into the Tungsten internal format.
# For aggregate functions that require complex intermediate data structures, 
Unsafe (on raw bytes) is not a good programming abstraction due to the lack of 
structs.

The idea here is to introduce a JVM object based hash aggregate operator that 
can support the aforementioned use cases. This operator, however, should limit 
its memory usage to avoid putting too much pressure on GC, e.g. falling back to 
sort-based aggregate as soon as the number of objects exceeds a very low threshold.

Internally at Databricks we prototyped a version of this for a customer POC and 
have observed substantial speed-ups over existing Spark.



  was:
The new Tungsten execution engine has very robust memory management and speed 
for simple data types. It does, however, suffer from the following:

# For user-defined aggregates (Hive UDAFs, Dataset typed operators), it is 
fairly expensive to fit into the Tungsten internal format.
# For aggregate functions that require complex intermediate data structures, 
Unsafe (on raw bytes) is not a good programming abstraction due to the lack of 
structs.

The idea here is to introduce an JVM object based hash aggregate operator that 
can support the aforementioned use cases. This operator, however, should limit 
its memory usage to avoid putting too much pressure on GC, e.g. falling back to 
sort-based aggregate as soon the number of objects exceed a very low threshold.

Internally at Databricks we prototyped a version of this for a customer POC and 
have observed substantial speed-ups over existing Spark.




> Introduce a JVM object based aggregate operator
> ---
>
> Key: SPARK-17949
> URL: https://issues.apache.org/jira/browse/SPARK-17949
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Cheng Lian
> Attachments: [Design Doc] Support for Arbitrary Aggregation States.pdf
>
>
> The new Tungsten execution engine has very robust memory management and speed 
> for simple data types. It does, however, suffer from the following:
> # For user-defined aggregates (Hive UDAFs, Dataset typed operators), it is 
> fairly expensive to fit into the Tungsten internal format.
> # For aggregate functions that require complex intermediate data structures, 
> Unsafe (on raw bytes) is not a good programming abstraction due to the lack 
> of structs.
> The idea here is to introduce a JVM object based hash aggregate operator that 
> can support the aforementioned use cases. This operator, however, should 
> limit its memory usage to avoid putting too much pressure on GC, e.g. falling 
> back to sort-based aggregate as soon as the number of objects exceeds a very low 
> threshold.
> Internally at Databricks we prototyped a version of this for a customer POC 
> and have observed substantial speed-ups over existing Spark.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18053) ARRAY equality is broken in Spark 2.0

2016-10-21 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-18053:
---
Labels: correctness  (was: )

> ARRAY equality is broken in Spark 2.0
> -
>
> Key: SPARK-18053
> URL: https://issues.apache.org/jira/browse/SPARK-18053
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Cheng Lian
>  Labels: correctness
>
> The following Spark shell reproduces this issue:
> {code}
> case class Test(a: Seq[Int])
> Seq(Test(Seq(1))).toDF().createOrReplaceTempView("t")
> sql("SELECT a FROM t WHERE a = array(1)").show()
> // +---+
> // |  a|
> // +---+
> // +---+
> sql("SELECT a FROM (SELECT array(1) AS a) x WHERE x.a = array(1)").show()
> // +---+
> // |  a|
> // +---+
> // |[1]|
> // +---+
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18053) ARRAY equality is broken in Spark 2.0

2016-10-21 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-18053:
---
Description: 
The following Spark shell reproduces this issue:
{code}
case class Test(a: Seq[Int])
Seq(Test(Seq(1))).toDF().createOrReplaceTempView("t")

sql("SELECT a FROM t WHERE a = array(1)").show()
// +---+
// |  a|
// +---+
// +---+

sql("SELECT a FROM (SELECT array(1) AS a) x WHERE x.a = array(1)").show()
// +---+
// |  a|
// +---+
// |[1]|
// +---+
{code}

> ARRAY equality is broken in Spark 2.0
> -
>
> Key: SPARK-18053
> URL: https://issues.apache.org/jira/browse/SPARK-18053
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Cheng Lian
>  Labels: correctness
>
> The following Spark shell reproduces this issue:
> {code}
> case class Test(a: Seq[Int])
> Seq(Test(Seq(1))).toDF().createOrReplaceTempView("t")
> sql("SELECT a FROM t WHERE a = array(1)").show()
> // +---+
> // |  a|
> // +---+
> // +---+
> sql("SELECT a FROM (SELECT array(1) AS a) x WHERE x.a = array(1)").show()
> // +---+
> // |  a|
> // +---+
> // |[1]|
> // +---+
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18053) ARRAY equality is broken in Spark 2.0

2016-10-21 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-18053:
--

 Summary: ARRAY equality is broken in Spark 2.0
 Key: SPARK-18053
 URL: https://issues.apache.org/jira/browse/SPARK-18053
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.1, 2.0.0
Reporter: Cheng Lian






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18012) Simplify WriterContainer code

2016-10-19 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian resolved SPARK-18012.

   Resolution: Fixed
Fix Version/s: 2.1.0

Issue resolved by pull request 15551
[https://github.com/apache/spark/pull/15551]

> Simplify WriterContainer code
> -
>
> Key: SPARK-18012
> URL: https://issues.apache.org/jira/browse/SPARK-18012
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 2.1.0
>
>
> The existing WriterContainer code was developed in ad-hoc ways, has become 
> pretty confusing, and doesn't have great control flow.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17949) Introduce a JVM object based aggregate operator

2016-10-19 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-17949:
---
Attachment: [Design Doc] Support for Arbitrary Aggregation States.pdf

> Introduce a JVM object based aggregate operator
> ---
>
> Key: SPARK-17949
> URL: https://issues.apache.org/jira/browse/SPARK-17949
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Cheng Lian
> Attachments: [Design Doc] Support for Arbitrary Aggregation States.pdf
>
>
> The new Tungsten execution engine has very robust memory management and speed 
> for simple data types. It does, however, suffer from the following:
> 1. For user defined aggregates (Hive UDAFs, Dataset typed operators), it is 
> fairly expensive to fit into the Tungsten internal format.
> 2. For aggregate functions that require complex intermediate data structures, 
> Unsafe (on raw bytes) is not a good programming abstraction due to the lack 
> of structs.
> The idea here is to introduce an JVM object based hash aggregate operator 
> that can support the aforementioned use cases. This operator, however, should 
> limit its memory usage to avoid putting too much pressure on GC, e.g. falling 
> back to sort-based aggregate as soon the number of objects exceed a very low 
> threshold.
> Internally at Databricks we prototyped a version of this for a customer POC 
> and have observed substantial speed-ups over existing Spark.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17949) Introduce a JVM object based aggregate operator

2016-10-19 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-17949:
---
Attachment: (was: [Design Doc] Support for Arbitrary Aggregation 
States.pdf)

> Introduce a JVM object based aggregate operator
> ---
>
> Key: SPARK-17949
> URL: https://issues.apache.org/jira/browse/SPARK-17949
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Cheng Lian
> Attachments: [Design Doc] Support for Arbitrary Aggregation States.pdf
>
>
> The new Tungsten execution engine has very robust memory management and speed 
> for simple data types. It does, however, suffer from the following:
> 1. For user defined aggregates (Hive UDAFs, Dataset typed operators), it is 
> fairly expensive to fit into the Tungsten internal format.
> 2. For aggregate functions that require complex intermediate data structures, 
> Unsafe (on raw bytes) is not a good programming abstraction due to the lack 
> of structs.
> The idea here is to introduce an JVM object based hash aggregate operator 
> that can support the aforementioned use cases. This operator, however, should 
> limit its memory usage to avoid putting too much pressure on GC, e.g. falling 
> back to sort-based aggregate as soon the number of objects exceed a very low 
> threshold.
> Internally at Databricks we prototyped a version of this for a customer POC 
> and have observed substantial speed-ups over existing Spark.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17949) Introduce a JVM object based aggregate operator

2016-10-19 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-17949:
---
Attachment: [Design Doc] Support for Arbitrary Aggregation States.pdf

> Introduce a JVM object based aggregate operator
> ---
>
> Key: SPARK-17949
> URL: https://issues.apache.org/jira/browse/SPARK-17949
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Cheng Lian
> Attachments: [Design Doc] Support for Arbitrary Aggregation States.pdf
>
>
> The new Tungsten execution engine has very robust memory management and speed 
> for simple data types. It does, however, suffer from the following:
> 1. For user defined aggregates (Hive UDAFs, Dataset typed operators), it is 
> fairly expensive to fit into the Tungsten internal format.
> 2. For aggregate functions that require complex intermediate data structures, 
> Unsafe (on raw bytes) is not a good programming abstraction due to the lack 
> of structs.
> The idea here is to introduce an JVM object based hash aggregate operator 
> that can support the aforementioned use cases. This operator, however, should 
> limit its memory usage to avoid putting too much pressure on GC, e.g. falling 
> back to sort-based aggregate as soon the number of objects exceed a very low 
> threshold.
> Internally at Databricks we prototyped a version of this for a customer POC 
> and have observed substantial speed-ups over existing Spark.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (PARQUET-754) Deprecate the "strict" argument in MessageType.union()

2016-10-17 Thread Cheng Lian (JIRA)
Cheng Lian created PARQUET-754:
--

 Summary: Deprecate the "strict" argument in MessageType.union()
 Key: PARQUET-754
 URL: https://issues.apache.org/jira/browse/PARQUET-754
 Project: Parquet
  Issue Type: Bug
  Components: parquet-mr
Affects Versions: 1.8.1
Reporter: Cheng Lian
Priority: Minor


As discussed in PARQUET-379, non-strict schema merging doesn't really make any 
sense, and we always set the argument to true throughout the code base. We should 
probably deprecate it and make sure no internal code ever uses non-strict schema 
merging.
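For illustration, a small Scala sketch of the intended call site; the schemas are made up, and only {{MessageType.union}} and {{MessageTypeParser}} from parquet-mr are assumed.

{code}
import org.apache.parquet.schema.MessageTypeParser

// Two hypothetical, compatible file schemas.
val left = MessageTypeParser.parseMessageType(
  "message spark_schema { required int64 id; }")
val right = MessageTypeParser.parseMessageType(
  "message spark_schema { required int64 id; optional binary name (UTF8); }")

// Strict merging is the only mode that makes sense, so callers should rely on
// the one-argument overload...
val merged = left.union(right)
// ...instead of passing the "strict" flag explicitly:
// left.union(right, false)
{code}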



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PARQUET-753) GroupType.union() doesn't merge the original type

2016-10-17 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15583942#comment-15583942
 ] 

Cheng Lian commented on PARQUET-753:


PARQUET-379 resolves the {{union}} issue related to primitive types, but 
doesn't handle group types.

> GroupType.union() doesn't merge the original type
> -
>
> Key: PARQUET-753
> URL: https://issues.apache.org/jira/browse/PARQUET-753
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.8.1
>Reporter: Deneche A. Hakim
>
> When merging two GroupTypes, the union() method doesn't merge their original 
> type, which is lost after the union.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (SPARK-17972) Query planning slows down dramatically for large query plans even when sub-trees are cached

2016-10-17 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-17972:
--

 Summary: Query planning slows down dramatically for large query 
plans even when sub-trees are cached
 Key: SPARK-17972
 URL: https://issues.apache.org/jira/browse/SPARK-17972
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.1, 1.6.2
Reporter: Cheng Lian
Assignee: Cheng Lian


The following Spark shell snippet creates a series of query plans that grow 
exponentially. The {{i}}-th plan is created using 4 *cached* copies of the {{i 
- 1}}-th plan.

{code}
(0 until 6).foldLeft(Seq(1, 2, 3).toDS) { (plan, iteration) =>
  val start = System.currentTimeMillis()
  val result = plan.join(plan, "value").join(plan, "value").join(plan, 
"value").join(plan, "value")
  result.cache()
  System.out.println(s"Iteration $iteration takes time 
${System.currentTimeMillis() - start} ms")
  result.as[Int]
}
{code}

We can see that although all plans are cached, the query planning time still 
grows exponentially and quickly becomes unbearable.

{noformat}
Iteration 0 takes time 9 ms
Iteration 1 takes time 19 ms
Iteration 2 takes time 61 ms
Iteration 3 takes time 219 ms
Iteration 4 takes time 830 ms
Iteration 5 takes time 4080 ms
{noformat}

Similar scenarios can be found in iterative ML code, and this significantly 
affects usability.
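Until this is fixed, a commonly used mitigation is to cut the logical plan lineage by rebuilding the DataFrame from its RDD and schema, so that each iteration starts from a flat plan. This is a hedged sketch only (not part of any fix), assuming the Spark 2.x API:

{code}
import org.apache.spark.sql.DataFrame

// Rebuilds the DataFrame from its RDD and schema, discarding the accumulated
// logical plan. Trades some extra conversion work for bounded planning time.
def truncatePlan(df: DataFrame): DataFrame =
  df.sparkSession.createDataFrame(df.rdd, df.schema)
{code}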



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17949) Introduce a JVM object based aggregate operator

2016-10-14 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian reassigned SPARK-17949:
--

Assignee: Cheng Lian

> Introduce a JVM object based aggregate operator
> ---
>
> Key: SPARK-17949
> URL: https://issues.apache.org/jira/browse/SPARK-17949
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Cheng Lian
>
> The new Tungsten execution engine has very robust memory management and speed 
> for simple data types. It does, however, suffer from the following:
> 1. For user defined aggregates (Hive UDAFs, Dataset typed operators), it is 
> fairly expensive to fit them into the Tungsten internal format.
> 2. For aggregate functions that require complex intermediate data structures, 
> Unsafe (on raw bytes) is not a good programming abstraction due to the lack 
> of structs.
> The idea here is to introduce a JVM object based hash aggregate operator 
> that can support the aforementioned use cases. This operator, however, should 
> limit its memory usage to avoid putting too much pressure on GC, e.g. by falling 
> back to sort-based aggregation as soon as the number of objects exceeds a very 
> low threshold.
> Internally at Databricks we prototyped a version of this for a customer POC 
> and have observed substantial speed-ups over existing Spark.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10954) Parquet version in the "created_by" metadata field of Parquet files written by Spark 1.5 and 1.6 is wrong

2016-10-14 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15576623#comment-15576623
 ] 

Cheng Lian commented on SPARK-10954:


[~hyukjin.kwon], yes, confirmed. Thanks!

> Parquet version in the "created_by" metadata field of Parquet files written 
> by Spark 1.5 and 1.6 is wrong
> -
>
> Key: SPARK-10954
> URL: https://issues.apache.org/jira/browse/SPARK-10954
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0, 1.5.1, 1.6.0
>Reporter: Cheng Lian
>Assignee: Gayathri Murali
>Priority: Minor
>
> We've upgraded to parquet-mr 1.7.0 in Spark 1.5, but the {{created_by}} field 
> still says 1.6.0. This issue can be reproduced by generating any Parquet file 
> with Spark 1.5 and then checking the metadata with the {{parquet-meta}} CLI tool:
> {noformat}
> $ parquet-meta /tmp/parquet/dec
> file:
> file:/tmp/parquet/dec/part-r-0-f210e968-1be5-40bc-bcbc-007f935e6dc7.gz.parquet
> creator: parquet-mr version 1.6.0
> extra:   org.apache.spark.sql.parquet.row.metadata = 
> {"type":"struct","fields":[{"name":"dec","type":"decimal(20,2)","nullable":true,"metadata":{}}]}
> file schema: spark_schema
> -
> dec: OPTIONAL FIXED_LEN_BYTE_ARRAY O:DECIMAL R:0 D:1
> row group 1: RC:10 TS:140 OFFSET:4
> -
> dec:  FIXED_LEN_BYTE_ARRAY GZIP DO:0 FPO:4 SZ:99/140/1.41 VC:10 
> ENC:PLAIN,BIT_PACKED,RLE
> {noformat}
> Note that this field is written by parquet-mr rather than Spark. However, 
> writing Parquet files using parquet-mr 1.7.0 directly without Spark 1.5 only 
> shows {{parquet-mr}} without any version number. Files written by parquet-mr 
> 1.8.1 without Spark look fine though.
> Currently this isn't a big issue, but parquet-mr 1.8 checks this field to 
> work around PARQUET-251.
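The same check can also be done programmatically. A small Scala sketch using parquet-mr's footer reader (the file path is just a placeholder):

{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.ParquetFileReader

// Reads a single Parquet file's footer and prints the created_by field that
// parquet-mr records in the file metadata.
val footer = ParquetFileReader.readFooter(
  new Configuration(), new Path("/tmp/parquet/dec/part-00000.parquet"))
println(footer.getFileMetaData.getCreatedBy)
{code}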



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-9783) Use SqlNewHadoopRDD in JSONRelation to eliminate extra refresh() call

2016-10-14 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian closed SPARK-9783.
-
Resolution: Not A Problem

This issue is no longer a problem since we re-implemented the JSON data source 
using the new {{FileFormat}} facilities in Spark 2.0.0.

> Use SqlNewHadoopRDD in JSONRelation to eliminate extra refresh() call
> -
>
> Key: SPARK-9783
> URL: https://issues.apache.org/jira/browse/SPARK-9783
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>
> PR #8035 made a quick fix for SPARK-9743 by introducing an extra 
> {{refresh()}} call in {{JSONRelation.buildScan}}. Obviously it hurts 
> performance. To overcome this, we can use {{SqlNewHadoopRDD}} there and 
> override {{listStatus()}} to inject cached {{FileStatus}} instances, similar 
> to what we did in {{ParquetRelation}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9783) Use SqlNewHadoopRDD in JSONRelation to eliminate extra refresh() call

2016-10-14 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15576523#comment-15576523
 ] 

Cheng Lian commented on SPARK-9783:
---

Yes, I'm closing this. Thanks!

> Use SqlNewHadoopRDD in JSONRelation to eliminate extra refresh() call
> -
>
> Key: SPARK-9783
> URL: https://issues.apache.org/jira/browse/SPARK-9783
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>
> PR #8035 made a quick fix for SPARK-9743 by introducing an extra 
> {{refresh()}} call in {{JSONRelation.buildScan}}. Obviously it hurts 
> performance. To overcome this, we can use {{SqlNewHadoopRDD}} there and 
> override {{listStatus()}} to inject cached {{FileStatus}} instances, similar 
> to what we did in {{ParquetRelation}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17636) Parquet filter push down doesn't handle struct fields

2016-10-14 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15576513#comment-15576513
 ] 

Cheng Lian commented on SPARK-17636:


[~MasterDDT], yes, just as [~hyukjin.kwon] explained previously, it's not 
implemented, and this is the expected behavior.

> Parquet filter push down doesn't handle struct fields
> -
>
> Key: SPARK-17636
> URL: https://issues.apache.org/jira/browse/SPARK-17636
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 1.6.2, 1.6.3
>Reporter: Mitesh
>Priority: Minor
>
> There's a *PushedFilters* entry for a simple numeric field, but not for a 
> numeric field inside a struct. Not sure whether this is a limitation inherited 
> from Parquet or a Spark-only limitation.
> {noformat}
> scala> hc.read.parquet("s3a://some/parquet/file").select("day_timestamp", 
> "sale_id")
> res5: org.apache.spark.sql.DataFrame = [day_timestamp: 
> struct, sale_id: bigint]
> scala> res5.filter("sale_id > 4").queryExecution.executedPlan
> res9: org.apache.spark.sql.execution.SparkPlan =
> Filter[23814] [args=(sale_id#86324L > 
> 4)][outPart=UnknownPartitioning(0)][outOrder=List()]
> +- Scan ParquetRelation[day_timestamp#86302,sale_id#86324L] InputPaths: 
> s3a://some/parquet/file, PushedFilters: [GreaterThan(sale_id,4)]
> scala> res5.filter("day_timestamp.timestamp > 4").queryExecution.executedPlan
> res10: org.apache.spark.sql.execution.SparkPlan =
> Filter[23815] [args=(day_timestamp#86302.timestamp > 
> 4)][outPart=UnknownPartitioning(0)][outOrder=List()]
> +- Scan ParquetRelation[day_timestamp#86302,sale_id#86324L] InputPaths: 
> s3a://some/parquet/file
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17636) Parquet filter push down doesn't handle struct fields

2016-10-14 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-17636:
---
Description: 
There's a *PushedFilters* entry for a simple numeric field, but not for a 
numeric field inside a struct. Not sure whether this is a limitation inherited 
from Parquet or a Spark-only limitation.

{noformat}
scala> hc.read.parquet("s3a://some/parquet/file").select("day_timestamp", 
"sale_id")

res5: org.apache.spark.sql.DataFrame = [day_timestamp: 
struct, sale_id: bigint]

scala> res5.filter("sale_id > 4").queryExecution.executedPlan

res9: org.apache.spark.sql.execution.SparkPlan =
Filter[23814] [args=(sale_id#86324L > 
4)][outPart=UnknownPartitioning(0)][outOrder=List()]
+- Scan ParquetRelation[day_timestamp#86302,sale_id#86324L] InputPaths: 
s3a://some/parquet/file, PushedFilters: [GreaterThan(sale_id,4)]

scala> res5.filter("day_timestamp.timestamp > 4").queryExecution.executedPlan

res10: org.apache.spark.sql.execution.SparkPlan =
Filter[23815] [args=(day_timestamp#86302.timestamp > 
4)][outPart=UnknownPartitioning(0)][outOrder=List()]
+- Scan ParquetRelation[day_timestamp#86302,sale_id#86324L] InputPaths: 
s3a://some/parquet/file
{noformat}

  was:
Theres a *PushedFilters* for a simple numeric field, but not for a numeric 
field inside a struct. Not sure if this is a Spark limitation because of 
Parquet, or only a Spark limitation.

{quote} 
scala> hc.read.parquet("s3a://some/parquet/file").select("day_timestamp", 
"sale_id")

res5: org.apache.spark.sql.DataFrame = [day_timestamp: 
struct, sale_id: bigint]

scala> res5.filter("sale_id > 4").queryExecution.executedPlan

res9: org.apache.spark.sql.execution.SparkPlan =
Filter[23814] [args=(sale_id#86324L > 
4)][outPart=UnknownPartitioning(0)][outOrder=List()]
+- Scan ParquetRelation[day_timestamp#86302,sale_id#86324L] InputPaths: 
s3a://some/parquet/file, PushedFilters: [GreaterThan(sale_id,4)]

scala> res5.filter("day_timestamp.timestamp > 4").queryExecution.executedPlan

res10: org.apache.spark.sql.execution.SparkPlan =
Filter[23815] [args=(day_timestamp#86302.timestamp > 
4)][outPart=UnknownPartitioning(0)][outOrder=List()]
+- Scan ParquetRelation[day_timestamp#86302,sale_id#86324L] InputPaths: 
s3a://some/parquet/file
{quote} 



> Parquet filter push down doesn't handle struct fields
> -
>
> Key: SPARK-17636
> URL: https://issues.apache.org/jira/browse/SPARK-17636
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 1.6.2, 1.6.3
>Reporter: Mitesh
>Priority: Minor
>
> There's a *PushedFilters* entry for a simple numeric field, but not for a 
> numeric field inside a struct. Not sure whether this is a limitation inherited 
> from Parquet or a Spark-only limitation.
> {noformat}
> scala> hc.read.parquet("s3a://some/parquet/file").select("day_timestamp", 
> "sale_id")
> res5: org.apache.spark.sql.DataFrame = [day_timestamp: 
> struct, sale_id: bigint]
> scala> res5.filter("sale_id > 4").queryExecution.executedPlan
> res9: org.apache.spark.sql.execution.SparkPlan =
> Filter[23814] [args=(sale_id#86324L > 
> 4)][outPart=UnknownPartitioning(0)][outOrder=List()]
> +- Scan ParquetRelation[day_timestamp#86302,sale_id#86324L] InputPaths: 
> s3a://some/parquet/file, PushedFilters: [GreaterThan(sale_id,4)]
> scala> res5.filter("day_timestamp.timestamp > 4").queryExecution.executedPlan
> res10: org.apache.spark.sql.execution.SparkPlan =
> Filter[23815] [args=(day_timestamp#86302.timestamp > 
> 4)][outPart=UnknownPartitioning(0)][outOrder=List()]
> +- Scan ParquetRelation[day_timestamp#86302,sale_id#86324L] InputPaths: 
> s3a://some/parquet/file
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17845) Improve window function frame boundary API in DataFrame

2016-10-10 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15561376#comment-15561376
 ] 

Cheng Lian edited comment on SPARK-17845 at 10/10/16 6:43 AM:
--

One thing is that ANSI SQL also allows using arbitrary integral expressions to 
specify frame boundaries. For example, it's legal to do something like this:

{code:sql}
SELECT max(id) OVER (ROWS BETWEEN id + 1 PRECEDING AND id % 3 FOLLOWING) FROM t
{code}

where {{id}} is a column of type {{INT}}.

Although many other databases, including PostgreSQL, don't support this syntax, 
Presto does support it:

{noformat}
presto> select sum(id) over (rows between id + 1 preceding and id % 3 
following) from (values 1, 2, 3) as t(id);
 _col0
---
 3
 6
 6
(3 rows)
{noformat}

If we want to have another API for specifying window frame boundaries, it would 
be nice to take this scenario into consideration, just in case we want to 
support it in the future.

With this in mind, I'd propose a more verbose but more flexible and readable 
API, as follows:

{code}
trait FrameBoundary extends Expression

case object CurrentRow extends FrameBoundary
case object UnboundedPreceding extends FrameBoundary
case object UnboundedFollowing extends FrameBoundary

case class Preceding(offset: Expression) extends FrameBoundary
case class Following(offset: Expression) extends FrameBoundary

Window.rowsBetween(Preceding($"id" + 1), Following($"id" % 3))
Window.rangeBetween(CurrentRow, UnboundedFollowing)
{code}

We can have shortcut constructors like {{Following(1)}}, which is equivalent to 
{{Following(lit(1))}}, for convenience. At the current stage, we can restrict 
the offset expressions to be constant integral expressions only.

The above version is just a skeleton and hasn't taken Java interoperability 
into account yet, though.
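For completeness, a rough sketch of the shortcut constructors mentioned above, built on the same hypothetical skeleton ({{lit(offset).expr}} simply wraps the literal into an expression):

{code}
import org.apache.spark.sql.functions.lit

object Preceding {
  // Preceding(1) is sugar for Preceding(lit(1).expr).
  def apply(offset: Int): Preceding = Preceding(lit(offset).expr)
}

object Following {
  def apply(offset: Int): Following = Following(lit(offset).expr)
}
{code}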



was (Author: lian cheng):
One thing is that ANSI SQL also allows using arbitrary integral expressions to 
specify frame boundaries. For example, it's legal to do something like this:

{code:sql}
SELECT max(id) OVER (ROWS BETWEEN id +1 PRECEDING AND id % 3 FOLLOWING) FROM t
{code}

where id is a column of {{INT}}.

Although many other databases, including PostgreSQL, don't support this syntax, 
Presto does support it:

{noformat}
presto> select sum(id) over (rows between id + 1 preceding and id % 3 
following) from (values 1, 2, 3) as t(id);
 _col0
---
 3
 6
 6
(3 rows)
{noformat}

If we want to have another API for specifying window frame boundaries, it would 
be nice to take this scenario into consideration, just in case we want to 
support it in the future.

With this in mind, I'd propose a more verbose but more flexible and readable 
API as following:

{code}
trait FrameBoundary extends Expression

case object CurrentRow extends FrameBoundary
case object UnboundedPreceding extends FrameBoundary
case object UnboundedFollowing extends FrameBoundary

case class Preceding(offset: Expression) extends FrameBoundary
case class Following(offset: Expression) extends FrameBoundary

Window.rowsBetween(Preceding($"id" + 1), Following($"id" % 3))
Window.rangeBetween(CurrentRow, UnboundedFollowing)
{code}

We can have shortcut constructors like {{Following(1)}}, which is equivalent to 
{{Following(lit(1))}}, for convenience. At the current stage, we can restrict 
the offset expressions to be constant integral expressions only.

The above version is just a skeleton and hasn't taken Java interoperability 
into account yet, though.


> Improve window function frame boundary API in DataFrame
> ---
>
> Key: SPARK-17845
> URL: https://issues.apache.org/jira/browse/SPARK-17845
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> ANSI SQL uses the following to specify the frame boundaries for window 
> functions:
> {code}
> ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING
> ROWS BETWEEN UNBOUNDED PRECEDING AND 3 PRECEDING
> ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
> ROWS BETWEEN CURRENT ROW AND UNBOUNDED PRECEDING
> ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
> {code}
> In Spark's DataFrame API, we use integer values to indicate relative position:
> - 0 means "CURRENT ROW"
> - -1 means "1 PRECEDING"
> - Long.MinValue means "UNBOUNDED PRECEDING"
> - Long.MaxValue to indicate "UNBOUNDED FOLLOWING"
> {code}
> // ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING
> Window.rowsBetween(-3, +3)
> // ROWS BETWEEN UNBOUNDED PRECEDING AND 3 PRECEDING
> Window.rowsBetween(Long.MinValue, -3)
> // ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
> Window.rowsBetween(Long.MinValue, 0)
> // ROWS BETWEEN CURRENT ROW AND UNBOUNDED PRECEDING
> Window.rowsBetween(0, Long.MaxValue)
> // ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
> 

[jira] [Comment Edited] (SPARK-17845) Improve window function frame boundary API in DataFrame

2016-10-10 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15561376#comment-15561376
 ] 

Cheng Lian edited comment on SPARK-17845 at 10/10/16 6:00 AM:
--

One thing is that ANSI SQL also allows using arbitrary integral expressions to 
specify frame boundaries. For example, it's legal to do something like this:

{code:sql}
SELECT max(id) OVER (ROWS BETWEEN id +1 PRECEDING AND id % 3 FOLLOWING) FROM t
{code}

where id is a column of {{INT}}.

Although many other databases, including PostgreSQL, don't support this syntax, 
Presto does support it:

{noformat}
presto> select sum(id) over (rows between id + 1 preceding and id % 3 
following) from (values 1, 2, 3) as t(id);
 _col0
---
 3
 6
 6
(3 rows)
{noformat}

If we want to have another API for specifying window frame boundaries, it would 
be nice to take this scenario into consideration, just in case we want to 
support it in the future.

With this in mind, I'd propose a more verbose but more flexible and readable 
API as following:

{code}
trait FrameBoundary extends Expression

case object CurrentRow extends FrameBoundary
case object UnboundedPreceding extends FrameBoundary
case object UnboundedFollowing extends FrameBoundary

case class Preceding(offset: Expression) extends FrameBoundary
case class Following(offset: Expression) extends FrameBoundary

Window.rowsBetween(Preceding($"id" + 1), Following($"id" % 3))
Window.rangeBetween(CurrentRow, UnboundedFollowing)
{code}

We can have shortcut constructors like {{Following(1)}}, which is equivalent to 
{{Following(lit(1))}}, for convenience. At the current stage, we can restrict 
the offset expressions to be constant integral expressions only.

The above version is just a skeleton and hasn't taken Java interoperability 
into account yet, though.



was (Author: lian cheng):
One thing is that ANSI SQL also allows using arbitrary integral expressions to 
specify frame boundaries. For example, it's legal to do something like this:

{code:sql}
SELECT max(id) OVER (ROWS BETWEEN id +1 PRECEDING AND id % 3 FOLLOWING) FROM t
{code}

where id is a column of {{INT}}.

Although many other databases, including PostgreSQL, don't support this syntax, 
Presto does support it:

{noformat}
presto> select sum(id) over (rows between id + 1 preceding and id % 3 
following) from (values 1, 2, 3) as t(id);
 _col0
---
 3
 6
 6
(3 rows)
{noformat}

If we want to have another API for specifying window frame boundaries, it would 
be nice to take this scenario into consideration, just in case we want to 
support it in the future.

With this in mind, I'd propose a more verbose but more flexible and readable 
API as following:

{code}
trait FrameBoundary extends Expression

case object CurrentRow extends FrameBoundary
case object UnboundedPreceding extends FrameBoundary
case object UnboundedFollowing extends FrameBoundary

case class Preceding(offset: Expression) extends FrameBoundary
case class Following(offset: Expression) extends FrameBoundary

Window.rowsBetween(Preceding($"id" + 1), Following($"id" % 3))
Window.rangeBetween(CurrentRow, UnboundedFollowing)
{code}

We can have shortcut constructors like `Following(1)`, which is equivalent to 
`Following(lit(1))`, for convenience. At the current stage, we can restrict the 
offset expressions to be constant integral expressions only.

The above version is just a skeleton and hasn't taken Java interoperability 
into account yet, though.


> Improve window function frame boundary API in DataFrame
> ---
>
> Key: SPARK-17845
> URL: https://issues.apache.org/jira/browse/SPARK-17845
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> ANSI SQL uses the following to specify the frame boundaries for window 
> functions:
> {code}
> ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING
> ROWS BETWEEN UNBOUNDED PRECEDING AND 3 PRECEDING
> ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
> ROWS BETWEEN CURRENT ROW AND UNBOUNDED PRECEDING
> ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
> {code}
> In Spark's DataFrame API, we use integer values to indicate relative position:
> - 0 means "CURRENT ROW"
> - -1 means "1 PRECEDING"
> - Long.MinValue means "UNBOUNDED PRECEDING"
> - Long.MaxValue to indicate "UNBOUNDED FOLLOWING"
> {code}
> // ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING
> Window.rowsBetween(-3, +3)
> // ROWS BETWEEN UNBOUNDED PRECEDING AND 3 PRECEDING
> Window.rowsBetween(Long.MinValue, -3)
> // ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
> Window.rowsBetween(Long.MinValue, 0)
> // ROWS BETWEEN CURRENT ROW AND UNBOUNDED PRECEDING
> Window.rowsBetween(0, Long.MaxValue)
> // ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
> 

[jira] [Commented] (SPARK-17845) Improve window function frame boundary API in DataFrame

2016-10-09 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15561376#comment-15561376
 ] 

Cheng Lian commented on SPARK-17845:


One thing is that ANSI SQL also allows using arbitrary integral expressions to 
specify frame boundaries. For example, it's legal to do something like this:

{code:sql}
SELECT max(id) OVER (ROWS BETWEEN id +1 PRECEDING AND id % 3 FOLLOWING) FROM t
{code}

where id is a column of {{INT}}.

Although many other databases, including PostgreSQL, don't support this syntax, 
Presto does support it:

{noformat}
presto> select sum(id) over (rows between id + 1 preceding and id % 3 
following) from (values 1, 2, 3) as t(id);
 _col0
---
 3
 6
 6
(3 rows)
{noformat}

If we want to have another API for specifying window frame boundaries, it would 
be nice to take this scenario into consideration, just in case we want to 
support it in the future.

With this in mind, I'd propose a more verbose but more flexible and readable 
API as following:

{code}
trait FrameBoundary extends Expression

case object CurrentRow extends FrameBoundary
case object UnboundedPreceding extends FrameBoundary
case object UnboundedFollowing extends FrameBoundary

case class Preceding(offset: Expression) extends FrameBoundary
case class Following(offset: Expression) extends FrameBoundary

Window.rowsBetween(Preceding($"id" + 1), Following($"id" % 3))
Window.rangeBetween(CurrentRow, UnboundedFollowing)
{code}

We can have shortcut constructors like `Following(1)`, which is equivalent to 
`Following(lit(1))`, for convenience. At the current stage, we can restrict the 
offset expressions to be constant integral expressions only.

The above version is just a skeleton and hasn't taken Java interoperability 
into account yet, though.


> Improve window function frame boundary API in DataFrame
> ---
>
> Key: SPARK-17845
> URL: https://issues.apache.org/jira/browse/SPARK-17845
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> ANSI SQL uses the following to specify the frame boundaries for window 
> functions:
> {code}
> ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING
> ROWS BETWEEN UNBOUNDED PRECEDING AND 3 PRECEDING
> ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
> ROWS BETWEEN CURRENT ROW AND UNBOUNDED PRECEDING
> ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
> {code}
> In Spark's DataFrame API, we use integer values to indicate relative position:
> - 0 means "CURRENT ROW"
> - -1 means "1 PRECEDING"
> - Long.MinValue means "UNBOUNDED PRECEDING"
> - Long.MaxValue to indicate "UNBOUNDED FOLLOWING"
> {code}
> // ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING
> Window.rowsBetween(-3, +3)
> // ROWS BETWEEN UNBOUNDED PRECEDING AND 3 PRECEDING
> Window.rowsBetween(Long.MinValue, -3)
> // ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
> Window.rowsBetween(Long.MinValue, 0)
> // ROWS BETWEEN CURRENT ROW AND UNBOUNDED PRECEDING
> Window.rowsBetween(0, Long.MaxValue)
> // ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
> Window.rowsBetween(Long.MinValue, Long.MaxValue)
> {code}
> I think using numeric values to indicate relative positions is actually a 
> good idea, but the reliance on Long.MinValue and Long.MaxValue to indicate 
> unbounded ends is pretty confusing:
> 1. The API is not self-evident. There is no way for a new user to figure out 
> how to indicate an unbounded frame by looking at just the API. The user has 
> to read the doc to figure this out.
> 2. It is weird that Long.MinValue or Long.MaxValue has some special meaning.
> 3. Different languages have different min/max values, e.g. in Python we use 
> -sys.maxsize and +sys.maxsize.
> To make this API less confusing, we have a few options:
> Option 1. Add the following (additional) methods:
> {code}
> // ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING
> Window.rowsBetween(-3, +3)  // this one exists already
> // ROWS BETWEEN UNBOUNDED PRECEDING AND 3 PRECEDING
> Window.rowsBetweenUnboundedPrecedingAnd(-3)
> // ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
> Window.rowsBetweenUnboundedPrecedingAndCurrentRow()
> // ROWS BETWEEN CURRENT ROW AND UNBOUNDED PRECEDING
> Window.rowsBetweenCurrentRowAndUnboundedFollowing()
> // ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
> Window.rowsBetweenUnboundedPrecedingAndUnboundedFollowing()
> {code}
> This is obviously very verbose, but is very similar to how these functions 
> are done in SQL, and is perhaps the most obvious to end users, especially if 
> they come from SQL background.
> Option 2. Decouple the specification for frame begin and frame end into two 
> functions. Assume the boundary is unlimited unless specified.
> {code}
> // ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING
> 

[jira] [Commented] (SPARK-17725) Spark should not write out parquet files with schema containing non-nullable fields

2016-09-29 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15533109#comment-15533109
 ] 

Cheng Lian commented on SPARK-17725:


This issue can be reproduced by writing a Parquet file using:

{code}
spark.range(10).write.parquet("file:///tmp/nullability.parquet")
{code}

and then inspecting the schema of the written Parquet file with parquet-tools:

{code}
$ parquet-schema /tmp/nullability.parquet
message spark_schema {
  required int64 id;
}
{code}

In Spark 1.6, the inspected schema is:

{code}
message spark_schema {
  optional int64 id;
}
{code}
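For reference, here's a minimal sketch (not Spark's internal code) of the pre-2.0 behaviour described in this ticket, i.e. relaxing every top-level field to nullable before writing; nested fields are left untouched here for brevity:

{code}
import org.apache.spark.sql.types.StructType

// Returns a copy of the schema with every top-level field marked nullable.
def relaxNullability(schema: StructType): StructType =
  StructType(schema.fields.map(_.copy(nullable = true)))
{code}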

> Spark should not write out parquet files with schema containing non-nullable 
> fields
> ---
>
> Key: SPARK-17725
> URL: https://issues.apache.org/jira/browse/SPARK-17725
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Wenchen Fan
>
> Since Spark 1.3, after PR https://github.com/apache/spark/pull/4826 , Spark 
> SQL will always set all schema fields to nullable before writing out parquet 
> files, to make the data pipeline more robust.
> However, this behaviour was accidentally changed in 2.0 by PR 
> https://github.com/apache/spark/pull/11509



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16516) Support for pushing down filters for decimal and timestamp types in ORC

2016-09-27 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian resolved SPARK-16516.

Resolution: Fixed

Issue resolved by pull request 14172
[https://github.com/apache/spark/pull/14172]

> Support for pushing down filters for decimal and timestamp types in ORC
> ---
>
> Key: SPARK-16516
> URL: https://issues.apache.org/jira/browse/SPARK-16516
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
> Fix For: 2.1.0
>
>
> Currently filters for {{TimestampType}} and {{DecimalType}} are not being 
> pushed down in ORC data source although ORC filters support both.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16777) Parquet schema converter depends on deprecated APIs

2016-09-27 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-16777:
---
Fix Version/s: (was: 2.2.0)
   2.1.0

> Parquet schema converter depends on deprecated APIs
> ---
>
> Key: SPARK-16777
> URL: https://issues.apache.org/jira/browse/SPARK-16777
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: holdenk
>Assignee: Hyukjin Kwon
> Fix For: 2.1.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16516) Support for pushing down filters for decimal and timestamp types in ORC

2016-09-27 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-16516:
---
Fix Version/s: 2.1.0

> Support for pushing down filters for decimal and timestamp types in ORC
> ---
>
> Key: SPARK-16516
> URL: https://issues.apache.org/jira/browse/SPARK-16516
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
> Fix For: 2.1.0
>
>
> Currently filters for {{TimestampType}} and {{DecimalType}} are not being 
> pushed down in ORC data source although ORC filters support both.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16516) Support for pushing down filters for decimal and timestamp types in ORC

2016-09-27 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-16516:
---
Assignee: Hyukjin Kwon

> Support for pushing down filters for decimal and timestamp types in ORC
> ---
>
> Key: SPARK-16516
> URL: https://issues.apache.org/jira/browse/SPARK-16516
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
> Fix For: 2.1.0
>
>
> Currently filters for {{TimestampType}} and {{DecimalType}} are not being 
> pushed down in ORC data source although ORC filters support both.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16777) Parquet schema converter depends on deprecated APIs

2016-09-27 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-16777:
---
Fix Version/s: 2.1.0
   2.0.2

> Parquet schema converter depends on deprecated APIs
> ---
>
> Key: SPARK-16777
> URL: https://issues.apache.org/jira/browse/SPARK-16777
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: holdenk
>Assignee: Hyukjin Kwon
> Fix For: 2.2.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16777) Parquet schema converter depends on deprecated APIs

2016-09-27 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian resolved SPARK-16777.

   Resolution: Fixed
Fix Version/s: (was: 2.0.2)
   (was: 2.1.0)
   2.2.0

Issue resolved by pull request 14399
[https://github.com/apache/spark/pull/14399]

> Parquet schema converter depends on deprecated APIs
> ---
>
> Key: SPARK-16777
> URL: https://issues.apache.org/jira/browse/SPARK-16777
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: holdenk
>Assignee: Hyukjin Kwon
> Fix For: 2.2.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16777) Parquet schema converter depends on deprecated APIs

2016-09-27 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-16777:
---
Assignee: Hyukjin Kwon

> Parquet schema converter depends on deprecated APIs
> ---
>
> Key: SPARK-16777
> URL: https://issues.apache.org/jira/browse/SPARK-16777
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: holdenk
>Assignee: Hyukjin Kwon
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-8824) Support Parquet time related logical types

2016-09-27 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15525313#comment-15525313
 ] 

Cheng Lian edited comment on SPARK-8824 at 9/27/16 7:09 AM:


Since we've already upgraded parquet-mr in Spark master to 1.8.1, it's possible 
to add support for {{INTERVAL}} and {{TIMESTAMP_MILLIS}} now. 
{{TIMESTAMP_MICROS}} is still unavailable since even parquet-mr 1.8.1 hasn't 
included it as a valid logical type.

Contributors are welcome!
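As a starting point, a tiny Scala sketch (field name made up) showing that parquet-mr 1.8.1 can already express a {{TIMESTAMP_MILLIS}}-annotated INT64, which is what a Catalyst {{TimestampType}} column could be mapped to:

{code}
import org.apache.parquet.schema.{OriginalType, Types}
import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName.INT64

// An optional INT64 field annotated with the TIMESTAMP_MILLIS logical type.
val tsField = Types.optional(INT64).as(OriginalType.TIMESTAMP_MILLIS).named("ts")
println(tsField)
{code}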


was (Author: lian cheng):
Since we've already upgraded parquet-mr in Spark master to 1.8.1, it's possible 
to add support for {{INTERVAL}} and {{TIMESTAMP_MILLIS}} now. 
{{TIMESTAMP_MICROS}} is still unavailable since even parquet-mr 1.8.1 hasn't 
included it as a valide logical type.

> Support Parquet time related logical types
> --
>
> Key: SPARK-8824
> URL: https://issues.apache.org/jira/browse/SPARK-8824
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Cheng Lian
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8824) Support Parquet time related logical types

2016-09-27 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15525313#comment-15525313
 ] 

Cheng Lian commented on SPARK-8824:
---

Since we've already upgraded parquet-mr in Spark master to 1.8.1, it's possible 
to add support for {{INTERVAL}} and {{TIMESTAMP_MILLIS}} now. 
{{TIMESTAMP_MICROS}} is still unavailable since even parquet-mr 1.8.1 hasn't 
included it as a valid logical type.

> Support Parquet time related logical types
> --
>
> Key: SPARK-8824
> URL: https://issues.apache.org/jira/browse/SPARK-8824
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Cheng Lian
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17572) Write.df is failing on spark cluster

2016-09-20 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15506051#comment-15506051
 ] 

Cheng Lian commented on SPARK-17572:


Yea, I know you are not using HDFS. But Spark always uses Hadoop facilities to 
read file systems, including NFS. (You may see Hadoop library method calls in 
the stack trace you provided.) So it's still meaningful to provide the Hadoop 
version. In your case, it's the Hadoop version you used to compile your Spark 
distribution. For example, the default Hadoop version Spark 2.0.0 uses is 2.2.0.
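A quick way to check this from spark-shell on one of the nodes (the class below ships with Hadoop itself):

{code}
// Prints the Hadoop version the running Spark distribution was compiled against.
println(org.apache.hadoop.util.VersionInfo.getVersion)
{code}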

> Write.df is failing on spark cluster
> 
>
> Key: SPARK-17572
> URL: https://issues.apache.org/jira/browse/SPARK-17572
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.0.0
>Reporter: Sankar Mittapally
>
> Hi,
>  We have a Spark cluster with four nodes; all four nodes share an NFS 
> partition (there is no HDFS), and we have the same uid on all servers. When 
> we try to write data we get the following exceptions. I am not sure whether 
> this is an error, and not sure whether I will lose data in the output.
> The command I am using to save the data:
> {code}
> saveDF(banking_l1_1,"banking_l1_v2.csv",source="csv",mode="append",schema="true")
> {code}
> {noformat}
> 16/09/17 08:03:28 ERROR InsertIntoHadoopFsRelationCommand: Aborting job.
> java.io.IOException: Failed to rename 
> DeprecatedRawLocalFileStatus{path=file:/nfspartition/sankar/banking_l1_v2.csv/_temporary/0/task_201609170802_0013_m_00/part-r-0-46a7f178-2490-444e-9110-510978eaaecb.csv;
>  isDirectory=false; length=436486316; replication=1; blocksize=33554432; 
> modification_time=147409940; access_time=0; owner=; group=; 
> permission=rw-rw-rw-; isSymlink=false} to 
> file:/nfspartition/sankar/banking_l1_v2.csv/part-r-0-46a7f178-2490-444e-9110-510978eaaecb.csv
> at 
> org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:371)
> at 
> org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:384)
> at 
> org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJob(FileOutputCommitter.java:326)
> at 
> org.apache.spark.sql.execution.datasources.BaseWriterContainer.commitJob(WriterContainer.scala:222)
> at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoopFsRelationCommand.scala:144)
> at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1.apply(InsertIntoHadoopFsRelationCommand.scala:115)
> at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1.apply(InsertIntoHadoopFsRelationCommand.scala:115)
> at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
> at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:115)
> at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:60)
> at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:58)
> at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
> at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
> at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:86)
> at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:86)
> at 
> org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:487)
> at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:211)
> at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:194)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> org.apache.spark.api.r.RBackendHandler.handleMethodCall(RBackendHandler.scala:141)
> at 
> 

[jira] [Commented] (SPARK-17572) Write.df is failing on spark cluster

2016-09-20 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15505921#comment-15505921
 ] 

Cheng Lian commented on SPARK-17572:


Which version of Hadoop are you using? Does it work when you pick a random node 
of the NFS cluster and run a single node Spark in local mode and just write to 
the local filesystem? Just to make sure the basic Spark setup is correct.

> Write.df is failing on spark cluster
> 
>
> Key: SPARK-17572
> URL: https://issues.apache.org/jira/browse/SPARK-17572
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.0.0
>Reporter: Sankar Mittapally
>
> Hi,
>  We have a Spark cluster with four nodes; all four nodes share an NFS 
> partition (there is no HDFS), and we have the same uid on all servers. When 
> we try to write data we get the following exceptions. I am not sure whether 
> this is an error, and not sure whether I will lose data in the output.
> The command I am using to save the data:
> {code}
> saveDF(banking_l1_1,"banking_l1_v2.csv",source="csv",mode="append",schema="true")
> {code}
> {noformat}
> 16/09/17 08:03:28 ERROR InsertIntoHadoopFsRelationCommand: Aborting job.
> java.io.IOException: Failed to rename 
> DeprecatedRawLocalFileStatus{path=file:/nfspartition/sankar/banking_l1_v2.csv/_temporary/0/task_201609170802_0013_m_00/part-r-0-46a7f178-2490-444e-9110-510978eaaecb.csv;
>  isDirectory=false; length=436486316; replication=1; blocksize=33554432; 
> modification_time=147409940; access_time=0; owner=; group=; 
> permission=rw-rw-rw-; isSymlink=false} to 
> file:/nfspartition/sankar/banking_l1_v2.csv/part-r-0-46a7f178-2490-444e-9110-510978eaaecb.csv
> at 
> org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:371)
> at 
> org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:384)
> at 
> org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJob(FileOutputCommitter.java:326)
> at 
> org.apache.spark.sql.execution.datasources.BaseWriterContainer.commitJob(WriterContainer.scala:222)
> at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoopFsRelationCommand.scala:144)
> at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1.apply(InsertIntoHadoopFsRelationCommand.scala:115)
> at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1.apply(InsertIntoHadoopFsRelationCommand.scala:115)
> at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
> at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:115)
> at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:60)
> at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:58)
> at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
> at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
> at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:86)
> at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:86)
> at 
> org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:487)
> at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:211)
> at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:194)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> org.apache.spark.api.r.RBackendHandler.handleMethodCall(RBackendHandler.scala:141)
> at 
> org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:86)
> at 
> org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:38)
> at 
> 

[jira] [Updated] (SPARK-17572) Write.df is failing on spark cluster

2016-09-20 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-17572:
---
Description: 
Hi,

 We have a Spark cluster with four nodes; all four nodes share an NFS partition 
(there is no HDFS), and we have the same uid on all servers. When we try to 
write data we get the following exceptions. I am not sure whether this is an 
error, and not sure whether I will lose data in the output.

The command I am using to save the data:

{code}
saveDF(banking_l1_1,"banking_l1_v2.csv",source="csv",mode="append",schema="true")
{code}

{noformat}
16/09/17 08:03:28 ERROR InsertIntoHadoopFsRelationCommand: Aborting job.
java.io.IOException: Failed to rename 
DeprecatedRawLocalFileStatus{path=file:/nfspartition/sankar/banking_l1_v2.csv/_temporary/0/task_201609170802_0013_m_00/part-r-0-46a7f178-2490-444e-9110-510978eaaecb.csv;
 isDirectory=false; length=436486316; replication=1; blocksize=33554432; 
modification_time=147409940; access_time=0; owner=; group=; 
permission=rw-rw-rw-; isSymlink=false} to 
file:/nfspartition/sankar/banking_l1_v2.csv/part-r-0-46a7f178-2490-444e-9110-510978eaaecb.csv
at 
org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:371)
at 
org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:384)
at 
org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJob(FileOutputCommitter.java:326)
at 
org.apache.spark.sql.execution.datasources.BaseWriterContainer.commitJob(WriterContainer.scala:222)
at 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoopFsRelationCommand.scala:144)
at 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1.apply(InsertIntoHadoopFsRelationCommand.scala:115)
at 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1.apply(InsertIntoHadoopFsRelationCommand.scala:115)
at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
at 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:115)
at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:60)
at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:58)
at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at 
org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
at 
org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:86)
at 
org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:86)
at 
org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:487)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:211)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:194)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.apache.spark.api.r.RBackendHandler.handleMethodCall(RBackendHandler.scala:141)
at 
org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:86)
at 
org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:38)
at 
io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
at 
io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
at 
io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:244)
at 

[jira] [Resolved] (SPARK-17289) Sort based partial aggregation breaks due to SPARK-12978

2016-08-30 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian resolved SPARK-17289.

   Resolution: Fixed
Fix Version/s: 2.1.0

Issue resolved by pull request 14865
[https://github.com/apache/spark/pull/14865]

> Sort based partial aggregation breaks due to SPARK-12978
> 
>
> Key: SPARK-17289
> URL: https://issues.apache.org/jira/browse/SPARK-17289
> Project: Spark
>  Issue Type: Bug
>Reporter: Sean Zhong
>Assignee: Takeshi Yamamuro
>Priority: Blocker
> Fix For: 2.1.0
>
>
> For the following query:
> {code}
> val df2 = (0 to 1000).map(x => (x % 2, x.toString)).toDF("a", 
> "b").createOrReplaceTempView("t2")
> spark.sql("select max(b) from t2 group by a").explain(true)
> {code}
> Now, the SortAggregator won't insert a Sort operator before partial 
> aggregation, which breaks sort-based partial aggregation.
> {code}
> == Physical Plan ==
> SortAggregate(key=[a#5], functions=[max(b#6)], output=[max(b)#17])
> +- *Sort [a#5 ASC], false, 0
>+- Exchange hashpartitioning(a#5, 200)
>   +- SortAggregate(key=[a#5], functions=[partial_max(b#6)], output=[a#5, 
> max#19])
>  +- LocalTableScan [a#5, b#6]
> {code}
> In Spark 2.0 branch, the plan is:
> {code}
> == Physical Plan ==
> SortAggregate(key=[a#5], functions=[max(b#6)], output=[max(b)#17])
> +- *Sort [a#5 ASC], false, 0
>+- Exchange hashpartitioning(a#5, 200)
>   +- SortAggregate(key=[a#5], functions=[partial_max(b#6)], output=[a#5, 
> max#19])
>  +- *Sort [a#5 ASC], false, 0
> +- LocalTableScan [a#5, b#6]
> {code}
> This is related to SPARK-12978



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17289) Sort based partial aggregation breaks due to SPARK-12978

2016-08-30 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-17289:
---
Assignee: Takeshi Yamamuro

> Sort based partial aggregation breaks due to SPARK-12978
> 
>
> Key: SPARK-17289
> URL: https://issues.apache.org/jira/browse/SPARK-17289
> Project: Spark
>  Issue Type: Bug
>Reporter: Sean Zhong
>Assignee: Takeshi Yamamuro
>Priority: Blocker
> Fix For: 2.1.0
>
>
> For the following query:
> {code}
> val df2 = (0 to 1000).map(x => (x % 2, x.toString)).toDF("a", 
> "b").createOrReplaceTempView("t2")
> spark.sql("select max(b) from t2 group by a").explain(true)
> {code}
> Now, the SortAggregator won't insert a Sort operator before partial 
> aggregation, which breaks sort-based partial aggregation.
> {code}
> == Physical Plan ==
> SortAggregate(key=[a#5], functions=[max(b#6)], output=[max(b)#17])
> +- *Sort [a#5 ASC], false, 0
>+- Exchange hashpartitioning(a#5, 200)
>   +- SortAggregate(key=[a#5], functions=[partial_max(b#6)], output=[a#5, 
> max#19])
>  +- LocalTableScan [a#5, b#6]
> {code}
> In Spark 2.0 branch, the plan is:
> {code}
> == Physical Plan ==
> SortAggregate(key=[a#5], functions=[max(b#6)], output=[max(b)#17])
> +- *Sort [a#5 ASC], false, 0
>+- Exchange hashpartitioning(a#5, 200)
>   +- SortAggregate(key=[a#5], functions=[partial_max(b#6)], output=[a#5, 
> max#19])
>  +- *Sort [a#5 ASC], false, 0
> +- LocalTableScan [a#5, b#6]
> {code}
> This is related to SPARK-12978



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16283) Implement percentile_approx SQL function

2016-08-25 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-16283:
---
Assignee: (was: Sean Zhong)

> Implement percentile_approx SQL function
> 
>
> Key: SPARK-16283
> URL: https://issues.apache.org/jira/browse/SPARK-16283
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16283) Implement percentile_approx SQL function

2016-08-25 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-16283:
---
Assignee: Sean Zhong

> Implement percentile_approx SQL function
> 
>
> Key: SPARK-16283
> URL: https://issues.apache.org/jira/browse/SPARK-16283
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Sean Zhong
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17182) CollectList and CollectSet should be marked as non-deterministic

2016-08-22 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-17182:
--

 Summary: CollectList and CollectSet should be marked as 
non-deterministic
 Key: SPARK-17182
 URL: https://issues.apache.org/jira/browse/SPARK-17182
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.0
Reporter: Cheng Lian
Assignee: Cheng Lian


{{CollectList}} and {{CollectSet}} should be marked as non-deterministic since 
their results depend on the actual order of input rows.
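
Not part of the original report, but as a minimal spark-shell style illustration 
(with made-up column names), the element order observed in a {{collect_list}} 
result can change with the physical row order of the input:

{code}
// Hedged sketch: the array built by collect_list follows whatever row order the
// (partial) aggregates happen to see, so the result is not deterministic.
import org.apache.spark.sql.functions.collect_list

val df = Seq((1, "a"), (1, "b"), (1, "c")).toDF("k", "v")

// Without an explicit ordering, the element order inside the array is not guaranteed.
df.groupBy("k").agg(collect_list("v")).show()

// Any shuffle (e.g. a repartition) may change the order in which rows reach the
// aggregate, and therefore the order of elements in the resulting array.
df.repartition(4).groupBy("k").agg(collect_list("v")).show()
{code}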



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16975) Spark-2.0.0 unable to infer schema for parquet data written by Spark-1.6.2

2016-08-12 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian resolved SPARK-16975.

   Resolution: Fixed
Fix Version/s: 2.1.0
   2.0.1

Issue resolved by pull request 14585
[https://github.com/apache/spark/pull/14585]

> Spark-2.0.0 unable to infer schema for parquet data written by Spark-1.6.2
> --
>
> Key: SPARK-16975
> URL: https://issues.apache.org/jira/browse/SPARK-16975
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.0.0
> Environment: Ubuntu Linux 14.04
>Reporter: immerrr again
>Assignee: Dongjoon Hyun
>  Labels: parquet
> Fix For: 2.0.1, 2.1.0
>
>
> Spark-2.0.0 seems to have some problems reading a parquet dataset generated 
> by 1.6.2. 
> {code}
> In [80]: spark.read.parquet('/path/to/data')
> ...
> AnalysisException: u'Unable to infer schema for ParquetFormat at 
> /path/to/data. It must be specified manually;'
> {code}
> The dataset is ~150G and partitioned by _locality_code column. None of the 
> partitions are empty. I have narrowed the failing dataset to the first 32 
> partitions of the data:
> {code}
> In [82]: spark.read.parquet(*subdirs[:32])
> ...
> AnalysisException: u'Unable to infer schema for ParquetFormat at 
> /path/to/data/_locality_code=AQ,/path/to/data/_locality_code=AI. It must be 
> specified manually;'
> {code}
> Interestingly, it works OK if you remove any of the partitions from the list:
> {code}
> In [83]: for i in range(32): spark.read.parquet(*(subdirs[:i] + 
> subdirs[i+1:32]))
> {code}
> Another strange thing is that the schemas for the first and the last 31 
> partitions of the subset are identical:
> {code}
> In [84]: spark.read.parquet(*subdirs[:31]).schema.fields == 
> spark.read.parquet(*subdirs[1:32]).schema.fields
> Out[84]: True
> {code}
> That got me interested, so I tried this:
> {code}
> In [87]: spark.read.parquet(*([subdirs[0]] * 32))
> ...
> AnalysisException: u'Unable to infer schema for ParquetFormat at 
> /path/to/data/_locality_code=AQ,/path/to/data/_locality_code=AQ. It must be 
> specified manually;'
> In [88]: spark.read.parquet(*([subdirs[15]] * 32))
> ...
> AnalysisException: u'Unable to infer schema for ParquetFormat at 
> /path/to/data/_locality_code=AX,/path/to/data/_locality_code=AX. It must be 
> specified manually;'
> In [89]: spark.read.parquet(*([subdirs[31]] * 32))
> ...
> AnalysisException: u'Unable to infer schema for ParquetFormat at 
> /path/to/data/_locality_code=BE,/path/to/data/_locality_code=BE. It must be 
> specified manually;'
> {code}
> If I read the first partition, save it in 2.0 and try to read in the same 
> manner, everything is fine:
> {code}
> In [100]: spark.read.parquet(subdirs[0]).write.parquet('spark-2.0-test')
> 16/08/09 11:03:37 WARN ParquetRecordReader: Can not initialize counter due to 
> context is not a instance of TaskInputOutputContext, but is 
> org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
> In [101]: df = spark.read.parquet(*(['spark-2.0-test'] * 32))
> {code}
> I originally posted this to the user mailing list, but given the later 
> discoveries it clearly seems like a bug.
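
As a hedged workaround sketch (not from the original report; the non-partition 
field name below is made up), the inference failure can be sidestepped by 
supplying the schema explicitly, as the error message suggests:

{code}
// Sketch only: read the dataset with an explicit schema instead of relying on
// schema inference over the affected partition directories.
import org.apache.spark.sql.types._

val schema = StructType(Seq(
  StructField("_locality_code", StringType),  // partition column from the report
  StructField("value", LongType)))            // hypothetical data column

val df = spark.read.schema(schema).parquet("/path/to/data")
{code}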



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16975) Spark-2.0.0 unable to infer schema for parquet data written by Spark-1.6.2

2016-08-12 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-16975:
---
Assignee: Dongjoon Hyun

> Spark-2.0.0 unable to infer schema for parquet data written by Spark-1.6.2
> --
>
> Key: SPARK-16975
> URL: https://issues.apache.org/jira/browse/SPARK-16975
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.0.0
> Environment: Ubuntu Linux 14.04
>Reporter: immerrr again
>Assignee: Dongjoon Hyun
>  Labels: parquet
>
> Spark-2.0.0 seems to have some problems reading a parquet dataset generated 
> by 1.6.2. 
> {code}
> In [80]: spark.read.parquet('/path/to/data')
> ...
> AnalysisException: u'Unable to infer schema for ParquetFormat at 
> /path/to/data. It must be specified manually;'
> {code}
> The dataset is ~150G and partitioned by _locality_code column. None of the 
> partitions are empty. I have narrowed the failing dataset to the first 32 
> partitions of the data:
> {code}
> In [82]: spark.read.parquet(*subdirs[:32])
> ...
> AnalysisException: u'Unable to infer schema for ParquetFormat at 
> /path/to/data/_locality_code=AQ,/path/to/data/_locality_code=AI. It must be 
> specified manually;'
> {code}
> Interestingly, it works OK if you remove any of the partitions from the list:
> {code}
> In [83]: for i in range(32): spark.read.parquet(*(subdirs[:i] + 
> subdirs[i+1:32]))
> {code}
> Another strange thing is that the schemas for the first and the last 31 
> partitions of the subset are identical:
> {code}
> In [84]: spark.read.parquet(*subdirs[:31]).schema.fields == 
> spark.read.parquet(*subdirs[1:32]).schema.fields
> Out[84]: True
> {code}
> That got me interested, so I tried this:
> {code}
> In [87]: spark.read.parquet(*([subdirs[0]] * 32))
> ...
> AnalysisException: u'Unable to infer schema for ParquetFormat at 
> /path/to/data/_locality_code=AQ,/path/to/data/_locality_code=AQ. It must be 
> specified manually;'
> In [88]: spark.read.parquet(*([subdirs[15]] * 32))
> ...
> AnalysisException: u'Unable to infer schema for ParquetFormat at 
> /path/to/data/_locality_code=AX,/path/to/data/_locality_code=AX. It must be 
> specified manually;'
> In [89]: spark.read.parquet(*([subdirs[31]] * 32))
> ...
> AnalysisException: u'Unable to infer schema for ParquetFormat at 
> /path/to/data/_locality_code=BE,/path/to/data/_locality_code=BE. It must be 
> specified manually;'
> {code}
> If I read the first partition, save it in 2.0 and try to read in the same 
> manner, everything is fine:
> {code}
> In [100]: spark.read.parquet(subdirs[0]).write.parquet('spark-2.0-test')
> 16/08/09 11:03:37 WARN ParquetRecordReader: Can not initialize counter due to 
> context is not a instance of TaskInputOutputContext, but is 
> org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
> In [101]: df = spark.read.parquet(*(['spark-2.0-test'] * 32))
> {code}
> I originally posted this to the user mailing list, but given the later 
> discoveries it clearly seems like a bug.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16867) createTable and alterTable in ExternalCatalog should not take db

2016-08-04 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian resolved SPARK-16867.

   Resolution: Fixed
Fix Version/s: 2.1.0

Issue resolved by pull request 14476
[https://github.com/apache/spark/pull/14476]

> createTable and alterTable in ExternalCatalog should not take db
> 
>
> Key: SPARK-16867
> URL: https://issues.apache.org/jira/browse/SPARK-16867
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 2.1.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16842) Concern about disallowing user-given schema for Parquet and ORC

2016-08-02 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15403567#comment-15403567
 ] 

Cheng Lian commented on SPARK-16842:


First of all, the cost of schema discovery can be heavy when schema merging is 
turned on. Thus, one practical use case is to run schema merging once offline 
and reuse the merged schema in all subsequent queries over the same dataset. 
This is particularly useful for Parquet datasets converted from log files in 
JSON format: files generated at different times may have different but 
compatible schemas, which usually requires the user to run schema merging to 
discover the global schema.

Also, when reading a Parquet file, you may add extra columns to, or remove 
columns you don't care about from, the user-specified schema to tailor the 
shape of the resulting DataFrame.

That said, I'd argue that a user-specified schema is still pretty useful for 
Parquet and ORC.
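
A minimal sketch of the "merge once, reuse later" pattern described above 
(the paths and the dropped column name are hypothetical, not from this ticket):

{code}
import org.apache.spark.sql.types.StructType

// Offline: pay the schema-merging cost once.
val mergedSchema: StructType = spark.read
  .option("mergeSchema", "true")
  .parquet("/logs/parquet")
  .schema

// Later queries: reuse the merged schema and skip schema discovery entirely.
val df = spark.read.schema(mergedSchema).parquet("/logs/parquet")

// Tailoring the shape of the result: drop a column we don't care about from
// the user-specified schema (the column name is made up).
val trimmed = StructType(mergedSchema.filterNot(_.name == "debug_payload"))
val slim = spark.read.schema(trimmed).parquet("/logs/parquet")
{code}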

> Concern about disallowing user-given schema for Parquet and ORC
> ---
>
> Key: SPARK-16842
> URL: https://issues.apache.org/jira/browse/SPARK-16842
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>
> If my understanding is correct, when the user-given schema differs from the 
> inferred schema, it is handled differently for each data source.
> - For JSON and CSV
>   they are generally permissive (for example, allowing compatibility among 
> numeric types).
> - For ORC and Parquet
>   they are generally strict about types, so they don't allow such 
> compatibility (except for very few cases, e.g. for Parquet, 
> https://github.com/apache/spark/pull/14272 and 
> https://github.com/apache/spark/pull/14278)
> - For Text
>   it only supports {{StringType}}.
> - For JDBC
>   it does not take a user-given schema since it does not implement 
> {{SchemaRelationProvider}}.
> By allowing a user-given schema, we can use types such as {{DateType}} 
> and {{TimestampType}} for JSON and CSV. CSV and JSON arguably allow a 
> permissive schema.
> To cut this short, JSON and CSV do not have the complete schema information 
> written in the data whereas Orc and Parquet do. 
> So, we might have to just disallow user-given schemas for Parquet and 
> Orc. Actually, we can almost never give a different schema for Orc and 
> Parquet, if my understanding is correct. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16621) Generate stable SQLs in SQLBuilder

2016-07-26 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-16621:
---
Assignee: Dongjoon Hyun

> Generate stable SQLs in SQLBuilder
> --
>
> Key: SPARK-16621
> URL: https://issues.apache.org/jira/browse/SPARK-16621
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
> Fix For: 2.1.0
>
>
> Currently, the generated SQL has unstable IDs for generated attributes.
> Stable generated SQL is more useful for understanding and testing the 
> queries. This issue provides stable SQL generation through the following.
> * Provide unique IDs for generated subqueries, `gen_subquery_xxx`.
> * Provide unique and stable IDs for generated attributes, `gen_attr_xxx`.
> **Before**
> {code}
> scala> new org.apache.spark.sql.catalyst.SQLBuilder(sql("select 1")).toSQL
> res0: String = SELECT `gen_attr_0` AS `1` FROM (SELECT 1 AS `gen_attr_0`) AS 
> gen_subquery_0
> scala> new org.apache.spark.sql.catalyst.SQLBuilder(sql("select 1")).toSQL
> res1: String = SELECT `gen_attr_4` AS `1` FROM (SELECT 1 AS `gen_attr_4`) AS 
> gen_subquery_0
> {code}
> **After**
> {code}
> scala> new org.apache.spark.sql.catalyst.SQLBuilder(sql("select 1")).toSQL
> res1: String = SELECT `gen_attr_0` AS `1` FROM (SELECT 1 AS `gen_attr_0`) AS 
> gen_subquery_0
> scala> new org.apache.spark.sql.catalyst.SQLBuilder(sql("select 1")).toSQL
> res2: String = SELECT `gen_attr_0` AS `1` FROM (SELECT 1 AS `gen_attr_0`) AS 
> gen_subquery_0
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16621) Generate stable SQLs in SQLBuilder

2016-07-26 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian resolved SPARK-16621.

   Resolution: Fixed
Fix Version/s: 2.1.0

Issue resolved by pull request 14257
[https://github.com/apache/spark/pull/14257]

> Generate stable SQLs in SQLBuilder
> --
>
> Key: SPARK-16621
> URL: https://issues.apache.org/jira/browse/SPARK-16621
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Dongjoon Hyun
> Fix For: 2.1.0
>
>
> Currently, the generated SQL has unstable IDs for generated attributes.
> Stable generated SQL is more useful for understanding and testing the 
> queries. This issue provides stable SQL generation through the following.
> * Provide unique IDs for generated subqueries, `gen_subquery_xxx`.
> * Provide unique and stable IDs for generated attributes, `gen_attr_xxx`.
> **Before**
> {code}
> scala> new org.apache.spark.sql.catalyst.SQLBuilder(sql("select 1")).toSQL
> res0: String = SELECT `gen_attr_0` AS `1` FROM (SELECT 1 AS `gen_attr_0`) AS 
> gen_subquery_0
> scala> new org.apache.spark.sql.catalyst.SQLBuilder(sql("select 1")).toSQL
> res1: String = SELECT `gen_attr_4` AS `1` FROM (SELECT 1 AS `gen_attr_4`) AS 
> gen_subquery_0
> {code}
> **After**
> {code}
> scala> new org.apache.spark.sql.catalyst.SQLBuilder(sql("select 1")).toSQL
> res1: String = SELECT `gen_attr_0` AS `1` FROM (SELECT 1 AS `gen_attr_0`) AS 
> gen_subquery_0
> scala> new org.apache.spark.sql.catalyst.SQLBuilder(sql("select 1")).toSQL
> res2: String = SELECT `gen_attr_0` AS `1` FROM (SELECT 1 AS `gen_attr_0`) AS 
> gen_subquery_0
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16666) Kryo encoder for custom complex classes

2016-07-26 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-1:
---
Description: 
I'm trying to create a dataset with some geo data using Spark and Esri. If 
`Foo` only has a `Point` field, it works, but if I add some other fields besides 
the `Point`, I get an ArrayIndexOutOfBoundsException.
{code:scala}
import com.esri.core.geometry.Point
import org.apache.spark.sql.{Encoder, Encoders, SQLContext}
import org.apache.spark.{SparkConf, SparkContext}

object Main {

  case class Foo(position: Point, name: String)

  object MyEncoders {
implicit def PointEncoder: Encoder[Point] = Encoders.kryo[Point]

implicit def FooEncoder: Encoder[Foo] = Encoders.kryo[Foo]
  }

  def main(args: Array[String]): Unit = {
val sc = new SparkContext(new 
SparkConf().setAppName("app").setMaster("local"))
val sqlContext = new SQLContext(sc)
import MyEncoders.{FooEncoder, PointEncoder}
import sqlContext.implicits._
Seq(new Foo(new Point(0, 0), "bar")).toDS.show
  }
}
{code}
{noformat}
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 1
at 
org.apache.spark.sql.execution.Queryable$$anonfun$formatString$1$$anonfun$apply$2.apply(Queryable.scala:71)
at 
org.apache.spark.sql.execution.Queryable$$anonfun$formatString$1$$anonfun$apply$2.apply(Queryable.scala:70)
 
at 
scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
 
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) 
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) 
at 
scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771) 
at 
org.apache.spark.sql.execution.Queryable$$anonfun$formatString$1.apply(Queryable.scala:70)
 
at 
org.apache.spark.sql.execution.Queryable$$anonfun$formatString$1.apply(Queryable.scala:69)
 
at scala.collection.mutable.ArraySeq.foreach(ArraySeq.scala:73) 
at 
org.apache.spark.sql.execution.Queryable$class.formatString(Queryable.scala:69) 
at org.apache.spark.sql.Dataset.formatString(Dataset.scala:65) 
at org.apache.spark.sql.Dataset.showString(Dataset.scala:263) 
at org.apache.spark.sql.Dataset.show(Dataset.scala:230) 
at org.apache.spark.sql.Dataset.show(Dataset.scala:193) 
at org.apache.spark.sql.Dataset.show(Dataset.scala:201) 
at Main$.main(Main.scala:24) 
at Main.main(Main.scala)
{noformat}

  was:
I'm trying to create a dataset with some geo data using Spark and Esri. If 
`Foo` only has a `Point` field, it works, but if I add some other fields besides 
the `Point`, I get an ArrayIndexOutOfBoundsException.
{code:scala}
import com.esri.core.geometry.Point
import org.apache.spark.sql.{Encoder, Encoders, SQLContext}
import org.apache.spark.{SparkConf, SparkContext}

object Main {

  case class Foo(position: Point, name: String)

  object MyEncoders {
implicit def PointEncoder: Encoder[Point] = Encoders.kryo[Point]

implicit def FooEncoder: Encoder[Foo] = Encoders.kryo[Foo]
  }

  def main(args: Array[String]): Unit = {
val sc = new SparkContext(new 
SparkConf().setAppName("app").setMaster("local"))
val sqlContext = new SQLContext(sc)
import MyEncoders.{FooEncoder, PointEncoder}
import sqlContext.implicits._
Seq(new Foo(new Point(0, 0), "bar")).toDS.show
  }
}
{code}
{noformat}
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 1  at 
org.apache.spark.sql.execution.Queryable$$anonfun$formatString$1$$anonfun$apply$2.apply(Queryable.scala:71)
  at 
org.apache.spark.sql.execution.Queryable$$anonfun$formatString$1$$anonfun$apply$2.apply(Queryable.scala:70)
  at 
scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
  at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)  
 at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)   at 
scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)  
 at 
org.apache.spark.sql.execution.Queryable$$anonfun$formatString$1.apply(Queryable.scala:70)
   at 
org.apache.spark.sql.execution.Queryable$$anonfun$formatString$1.apply(Queryable.scala:69)
   at scala.collection.mutable.ArraySeq.foreach(ArraySeq.scala:73) at 
org.apache.spark.sql.execution.Queryable$class.formatString(Queryable.scala:69) 
 at org.apache.spark.sql.Dataset.formatString(Dataset.scala:65)  at 
org.apache.spark.sql.Dataset.showString(Dataset.scala:263)   at 
org.apache.spark.sql.Dataset.show(Dataset.scala:230) at 
org.apache.spark.sql.Dataset.show(Dataset.scala:193) at 
org.apache.spark.sql.Dataset.show(Dataset.scala:201) at 

[jira] [Updated] (SPARK-16666) Kryo encoder for custom complex classes

2016-07-26 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-1:
---
Description: 
I'm trying to create a dataset with some geo data using Spark and Esri. If 
`Foo` only has a `Point` field, it works, but if I add some other fields besides 
the `Point`, I get an ArrayIndexOutOfBoundsException.
{code:scala}
import com.esri.core.geometry.Point
import org.apache.spark.sql.{Encoder, Encoders, SQLContext}
import org.apache.spark.{SparkConf, SparkContext}

object Main {

  case class Foo(position: Point, name: String)

  object MyEncoders {
implicit def PointEncoder: Encoder[Point] = Encoders.kryo[Point]

implicit def FooEncoder: Encoder[Foo] = Encoders.kryo[Foo]
  }

  def main(args: Array[String]): Unit = {
val sc = new SparkContext(new 
SparkConf().setAppName("app").setMaster("local"))
val sqlContext = new SQLContext(sc)
import MyEncoders.{FooEncoder, PointEncoder}
import sqlContext.implicits._
Seq(new Foo(new Point(0, 0), "bar")).toDS.show
  }
}
{code}
{noformat}
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 1  at 
org.apache.spark.sql.execution.Queryable$$anonfun$formatString$1$$anonfun$apply$2.apply(Queryable.scala:71)
  at 
org.apache.spark.sql.execution.Queryable$$anonfun$formatString$1$$anonfun$apply$2.apply(Queryable.scala:70)
  at 
scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
  at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)  
 at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)   at 
scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)  
 at 
org.apache.spark.sql.execution.Queryable$$anonfun$formatString$1.apply(Queryable.scala:70)
   at 
org.apache.spark.sql.execution.Queryable$$anonfun$formatString$1.apply(Queryable.scala:69)
   at scala.collection.mutable.ArraySeq.foreach(ArraySeq.scala:73) at 
org.apache.spark.sql.execution.Queryable$class.formatString(Queryable.scala:69) 
 at org.apache.spark.sql.Dataset.formatString(Dataset.scala:65)  at 
org.apache.spark.sql.Dataset.showString(Dataset.scala:263)   at 
org.apache.spark.sql.Dataset.show(Dataset.scala:230) at 
org.apache.spark.sql.Dataset.show(Dataset.scala:193) at 
org.apache.spark.sql.Dataset.show(Dataset.scala:201) at 
Main$.main(Main.scala:24)at Main.main(Main.scala)
{noformat}

  was:
I'm trying to create a dataset with some geo data using Spark and Esri. If 
`Foo` only has a `Point` field, it works, but if I add some other fields besides 
the `Point`, I get an ArrayIndexOutOfBoundsException.
{code:scala}
import com.esri.core.geometry.Point
import org.apache.spark.sql.{Encoder, Encoders, SQLContext}
import org.apache.spark.{SparkConf, SparkContext}

object Main {

  case class Foo(position: Point, name: String)

  object MyEncoders {
implicit def PointEncoder: Encoder[Point] = Encoders.kryo[Point]

implicit def FooEncoder: Encoder[Foo] = Encoders.kryo[Foo]
  }

  def main(args: Array[String]): Unit = {
val sc = new SparkContext(new 
SparkConf().setAppName("app").setMaster("local"))
val sqlContext = new SQLContext(sc)
import MyEncoders.{FooEncoder, PointEncoder}
import sqlContext.implicits._
Seq(new Foo(new Point(0, 0), "bar")).toDS.show
  }
}
{code}
{quote}
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 1  at 
org.apache.spark.sql.execution.Queryable$$anonfun$formatString$1$$anonfun$apply$2.apply(Queryable.scala:71)
  at 
org.apache.spark.sql.execution.Queryable$$anonfun$formatString$1$$anonfun$apply$2.apply(Queryable.scala:70)
  at 
scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
  at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)  
 at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)   at 
scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)  
 at 
org.apache.spark.sql.execution.Queryable$$anonfun$formatString$1.apply(Queryable.scala:70)
   at 
org.apache.spark.sql.execution.Queryable$$anonfun$formatString$1.apply(Queryable.scala:69)
   at scala.collection.mutable.ArraySeq.foreach(ArraySeq.scala:73) at 
org.apache.spark.sql.execution.Queryable$class.formatString(Queryable.scala:69) 
 at org.apache.spark.sql.Dataset.formatString(Dataset.scala:65)  at 
org.apache.spark.sql.Dataset.showString(Dataset.scala:263)   at 
org.apache.spark.sql.Dataset.show(Dataset.scala:230) at 
org.apache.spark.sql.Dataset.show(Dataset.scala:193) at 
org.apache.spark.sql.Dataset.show(Dataset.scala:201) at 

[jira] [Updated] (SPARK-16734) Make sure examples in all language bindings are consistent

2016-07-26 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-16734:
---
Priority: Minor  (was: Major)

> Make sure examples in all language bindings are consistent
> --
>
> Key: SPARK-16734
> URL: https://issues.apache.org/jira/browse/SPARK-16734
> Project: Spark
>  Issue Type: Sub-task
>  Components: Examples, SQL
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16663) desc table should be consistent between data source and hive serde tables

2016-07-26 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian resolved SPARK-16663.

   Resolution: Fixed
Fix Version/s: 2.1.0

Issue resolved by pull request 14302
[https://github.com/apache/spark/pull/14302]

> desc table should be consistent between data source and hive serde tables
> -
>
> Key: SPARK-16663
> URL: https://issues.apache.org/jira/browse/SPARK-16663
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 2.1.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16734) Make sure examples in all language bindings are consistent

2016-07-26 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-16734:
--

 Summary: Make sure examples in all language bindings are consistent
 Key: SPARK-16734
 URL: https://issues.apache.org/jira/browse/SPARK-16734
 Project: Spark
  Issue Type: Sub-task
  Components: Examples, SQL
Affects Versions: 2.0.0
Reporter: Cheng Lian
Assignee: Cheng Lian






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16706) support java map in encoder

2016-07-26 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian resolved SPARK-16706.

   Resolution: Fixed
Fix Version/s: 2.1.0

Issue resolved by pull request 14344
[https://github.com/apache/spark/pull/14344]

> support java map in encoder
> ---
>
> Key: SPARK-16706
> URL: https://issues.apache.org/jira/browse/SPARK-16706
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 2.1.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16698) json parsing regression - "." in keys

2016-07-25 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-16698:
---
Assignee: Hyukjin Kwon

> json parsing regression - "." in keys
> -
>
> Key: SPARK-16698
> URL: https://issues.apache.org/jira/browse/SPARK-16698
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: TobiasP
>Assignee: Hyukjin Kwon
> Fix For: 2.0.1, 2.1.0
>
>
> The commit 83775bc78e183791f75a99cdfbcd68a67ca0d472 "\[SPARK-14158]\[SQL] 
> implement buildReader for json data source" breaks parsing of json files with 
> "." in keys.
> E.g. the test input for spark-solr 
> https://github.com/lucidworks/spark-solr/blob/master/src/test/resources/test-data/events.json
> {noformat}
> scala> 
> sqlContext.read.json("src/test/resources/test-data/events.json").collectAsList
> org.apache.spark.sql.AnalysisException: Unable to resolve params.title_s 
> given [_version_, count_l, doc_id_s, flag_s, id, params.title_s, 
> params.url_s, session_id_s, timestamp_tdt, type_s, tz_timestamp_txt, 
> user_id_s];
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1$$anonfun$apply$5.apply(LogicalPlan.scala:131)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1$$anonfun$apply$5.apply(LogicalPlan.scala:131)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:130)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:126)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:742)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1194)
>   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>   at org.apache.spark.sql.types.StructType.foreach(StructType.scala:94)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:245)
>   at org.apache.spark.sql.types.StructType.map(StructType.scala:94)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:126)
>   at 
> org.apache.spark.sql.execution.datasources.FileSourceStrategy$.apply(FileSourceStrategy.scala:80)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
>   at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:396)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:59)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54)
>   at 
> org.apache.spark.sql.execution.SparkStrategies$SpecialLimits$.apply(SparkStrategies.scala:53)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
>   at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:396)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:59)
>   at 
> org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:52)
>   at 
> org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:50)
>   at 
> org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:57)
>   at 
> org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:57)
>   at org.apache.spark.sql.Dataset.withCallback(Dataset.scala:2321)
>   at org.apache.spark.sql.Dataset.collectAsList(Dataset.scala:2040)
>   ... 49 elided
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16698) json parsing regression - "." in keys

2016-07-25 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian resolved SPARK-16698.

   Resolution: Fixed
Fix Version/s: 2.1.0
   2.0.1

Issue resolved by pull request 14339
[https://github.com/apache/spark/pull/14339]

> json parsing regression - "." in keys
> -
>
> Key: SPARK-16698
> URL: https://issues.apache.org/jira/browse/SPARK-16698
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: TobiasP
> Fix For: 2.0.1, 2.1.0
>
>
> The commit 83775bc78e183791f75a99cdfbcd68a67ca0d472 "\[SPARK-14158]\[SQL] 
> implement buildReader for json data source" breaks parsing of json files with 
> "." in keys.
> E.g. the test input for spark-solr 
> https://github.com/lucidworks/spark-solr/blob/master/src/test/resources/test-data/events.json
> {noformat}
> scala> 
> sqlContext.read.json("src/test/resources/test-data/events.json").collectAsList
> org.apache.spark.sql.AnalysisException: Unable to resolve params.title_s 
> given [_version_, count_l, doc_id_s, flag_s, id, params.title_s, 
> params.url_s, session_id_s, timestamp_tdt, type_s, tz_timestamp_txt, 
> user_id_s];
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1$$anonfun$apply$5.apply(LogicalPlan.scala:131)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1$$anonfun$apply$5.apply(LogicalPlan.scala:131)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:130)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:126)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:742)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1194)
>   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>   at org.apache.spark.sql.types.StructType.foreach(StructType.scala:94)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:245)
>   at org.apache.spark.sql.types.StructType.map(StructType.scala:94)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:126)
>   at 
> org.apache.spark.sql.execution.datasources.FileSourceStrategy$.apply(FileSourceStrategy.scala:80)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
>   at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:396)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:59)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54)
>   at 
> org.apache.spark.sql.execution.SparkStrategies$SpecialLimits$.apply(SparkStrategies.scala:53)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
>   at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:396)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:59)
>   at 
> org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:52)
>   at 
> org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:50)
>   at 
> org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:57)
>   at 
> org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:57)
>   at org.apache.spark.sql.Dataset.withCallback(Dataset.scala:2321)
>   at org.apache.spark.sql.Dataset.collectAsList(Dataset.scala:2040)
>   ... 49 elided
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16668) Test parquet reader for row groups containing both dictionary and plain encoded pages

2016-07-25 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-16668:
---
Assignee: Sameer Agarwal

> Test parquet reader for row groups containing both dictionary and plain 
> encoded pages
> -
>
> Key: SPARK-16668
> URL: https://issues.apache.org/jira/browse/SPARK-16668
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Reporter: Sameer Agarwal
>Assignee: Sameer Agarwal
> Fix For: 2.1.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16668) Test parquet reader for row groups containing both dictionary and plain encoded pages

2016-07-25 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian resolved SPARK-16668.

   Resolution: Fixed
Fix Version/s: 2.1.0

Issue resolved by pull request 14304
[https://github.com/apache/spark/pull/14304]

> Test parquet reader for row groups containing both dictionary and plain 
> encoded pages
> -
>
> Key: SPARK-16668
> URL: https://issues.apache.org/jira/browse/SPARK-16668
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Reporter: Sameer Agarwal
>Assignee: Sameer Agarwal
> Fix For: 2.1.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16691) move BucketSpec to catalyst module and use it in CatalogTable

2016-07-25 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian resolved SPARK-16691.

   Resolution: Fixed
Fix Version/s: 2.1.0

Issue resolved by pull request 14331
[https://github.com/apache/spark/pull/14331]

> move BucketSpec to catalyst module and use it in CatalogTable
> -
>
> Key: SPARK-16691
> URL: https://issues.apache.org/jira/browse/SPARK-16691
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 2.1.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16660) CreateViewCommand should not take CatalogTable

2016-07-25 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian resolved SPARK-16660.

   Resolution: Fixed
Fix Version/s: 2.1.0

Issue resolved by pull request 14297
[https://github.com/apache/spark/pull/14297]

> CreateViewCommand should not take CatalogTable
> --
>
> Key: SPARK-16660
> URL: https://issues.apache.org/jira/browse/SPARK-16660
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 2.1.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16703) Extra space in WindowSpecDefinition SQL representation

2016-07-24 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-16703:
---
Description: 
For a {{WindowSpecDefinition}} whose {{partitionSpec}} is empty, there's an 
extra space in its SQL representation:

{code:sql}

{code}

  was:
For a {{WindowSpecDefinition}} whose {{partitionSpec}} is empty, there's an 
extra space in its SQL representation:

{code:sql}
{code}


> Extra space in WindowSpecDefinition SQL representation
> --
>
> Key: SPARK-16703
> URL: https://issues.apache.org/jira/browse/SPARK-16703
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>Priority: Minor
>
> For a {{WindowSpecDefinition}} whose {{partitionSpec}} is empty, there's an 
> extra space in its SQL representation:
> {code:sql}
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16703) Extra space in WindowSpecDefinition SQL representation

2016-07-24 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-16703:
---
Description: 
For a {{WindowSpecDefinition}} whose {{partitionSpec}} is empty, there's an 
extra space in its SQL representation:

{code:sql}
( ORDER BY `a` ASC ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
{code}

  was:
For a {{WindowSpecDefinition}} whose {{partitionSpec}} is empty, there's an 
extra space in its SQL representation:

{code:sql}

{code}


> Extra space in WindowSpecDefinition SQL representation
> --
>
> Key: SPARK-16703
> URL: https://issues.apache.org/jira/browse/SPARK-16703
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>Priority: Minor
>
> For a {{WindowSpecDefinition}} whose {{partitionSpec}} is empty, there's an 
> extra space in its SQL representation:
> {code:sql}
> ( ORDER BY `a` ASC ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
> {code}
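
Not from the original report, but as a hedged illustration of where the 
empty-partitionSpec case arises (reusing the column name `a` from the 
representation above), a window with an ORDER BY but no PARTITION BY produces 
such a {{WindowSpecDefinition}}:

{code}
// Sketch only: a window spec with ordering but no partitioning.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.last

// Long.MinValue / 0 correspond to UNBOUNDED PRECEDING and CURRENT ROW.
val w = Window.orderBy("a").rowsBetween(Long.MinValue, 0)
val df = Seq(1, 2, 3).toDF("a").select(last("a").over(w))
df.explain(true)
{code}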



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16703) Extra space in WindowSpecDefinition SQL representation

2016-07-24 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-16703:
--

 Summary: Extra space in WindowSpecDefinition SQL representation
 Key: SPARK-16703
 URL: https://issues.apache.org/jira/browse/SPARK-16703
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.0
Reporter: Cheng Lian
Assignee: Cheng Lian
Priority: Minor


For a {{WindowSpecDefinition}} whose {{partitionSpec}} is empty, there's an 
extra space in its SQL representation:

{code:sql}
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16646) LEAST doesn't accept numeric arguments with different data types

2016-07-22 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15389141#comment-15389141
 ] 

Cheng Lian commented on SPARK-16646:


Could you please help check Hive's behavior here, especially cases where we 
have to lose precision? Thanks!

> LEAST doesn't accept numeric arguments with different data types
> 
>
> Key: SPARK-16646
> URL: https://issues.apache.org/jira/browse/SPARK-16646
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>Assignee: Hyukjin Kwon
>
> {code:sql}
> SELECT LEAST(1, 1.5);
> {code}
> {noformat}
> Error: org.apache.spark.sql.AnalysisException: cannot resolve 'least(1, 
> CAST(2.1 AS DECIMAL(2,1)))' due to data type mismatch: The expressions should 
> all have the same type, got LEAST (ArrayBuffer(IntegerType, 
> DecimalType(2,1))).; line 1 pos 7 (state=,code=0)
> {noformat}
> This query works for 1.6.
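
Not from the original report, but as a hedged workaround sketch while the 
analysis rule is being fixed, casting the arguments to a single common type 
lets the expression resolve (the literal values are just examples):

{code}
// Sketch only: give LEAST arguments of one explicit type.
spark.sql("SELECT LEAST(CAST(1 AS DECIMAL(2,1)), CAST(1.5 AS DECIMAL(2,1)))").show()
{code}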



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16632) Vectorized parquet reader fails to read certain fields from Hive tables

2016-07-21 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15387791#comment-15387791
 ] 

Cheng Lian commented on SPARK-16632:


Oh, I see, thanks for the explanation.

> Vectorized parquet reader fails to read certain fields from Hive tables
> ---
>
> Key: SPARK-16632
> URL: https://issues.apache.org/jira/browse/SPARK-16632
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
> Environment: Hive 1.1 (CDH)
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
> Fix For: 2.1.0
>
>
> The vectorized parquet reader fails to read certain tables created by Hive. 
> When the tables have columns of type "tinyint" or "smallint", Catalyst 
> converts those to "ByteType" and "ShortType" respectively. But when Hive 
> writes those tables in 
> parquet format, the parquet schema in the files contains "int32" fields.
> To reproduce, run these commands in the hive shell (or beeline):
> {code}
> create table abyte (value tinyint) stored as parquet;
> create table ashort (value smallint) stored as parquet;
> insert into abyte values (1);
> insert into ashort values (1);
> {code}
> Then query them with Spark 2.0:
> {code}
> spark.sql("select * from abyte").show();
> spark.sql("select * from ashort").show();
> {code}
> You'll see this exception (for the byte case):
> {noformat}
> 16/07/13 12:24:23 ERROR datasources.InsertIntoHadoopFsRelationCommand: 
> Aborting job.
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 
> (TID 3, scm-centos71-iqalat-2.gce.cloudera.com): 
> org.apache.spark.SparkException: Task failed while writing rows
>   at 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:261)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.getByte(OnHeapColumnVector.java:159)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply$mcV$sp(WriterContainer.scala:253)
>   at 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252)
>   at 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252)
>   at 
> org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1325)
>   at 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:258)
>   ... 8 more
> {noformat}
> This works when you point Spark directly at the files (instead of using the 
> metastore data), or when you disable the vectorized parquet reader.
> The root cause seems to be that Hive creates these tables with a 
> not-so-complete schema:
> {noformat}
> $ parquet-tools schema /tmp/byte.parquet 
> message hive_schema {
>   optional int32 value;
> }
> {noformat}
> There's no indication that the field is a 32-bit field used to store 8-bit 
> values. When the ParquetReadSupport code tries to consolidate both schemas, 
> it just chooses whatever is in the parquet file for primitive types (see 
> ParquetReadSupport.clipParquetType); the vectorized reader uses the catalyst 
> schema, which comes from the Hive metastore, and says it's a byte field, so 
> when it tries to read the data, the byte data stored in "OnHeapColumnVector" 
> is null.
> I have tested a small change to {{ParquetReadSupport.clipParquetType}} that 
> fixes this particular issue, 

[jira] [Commented] (SPARK-16646) LEAST doesn't accept numeric arguments with different data types

2016-07-20 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15387172#comment-15387172
 ] 

Cheng Lian commented on SPARK-16646:


Thanks for the help! I'm not working on this.

> LEAST doesn't accept numeric arguments with different data types
> 
>
> Key: SPARK-16646
> URL: https://issues.apache.org/jira/browse/SPARK-16646
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>Assignee: Hyukjin Kwon
>
> {code:sql}
> SELECT LEAST(1, 1.5);
> {code}
> {noformat}
> Error: org.apache.spark.sql.AnalysisException: cannot resolve 'least(1, 
> CAST(2.1 AS DECIMAL(2,1)))' due to data type mismatch: The expressions should 
> all have the same type, got LEAST (ArrayBuffer(IntegerType, 
> DecimalType(2,1))).; line 1 pos 7 (state=,code=0)
> {noformat}
> This query works for 1.6.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16646) LEAST doesn't accept numeric arguments with different data types

2016-07-20 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-16646:
---
Reporter: Cheng Lian  (was: liancheng)

> LEAST doesn't accept numeric arguments with different data types
> 
>
> Key: SPARK-16646
> URL: https://issues.apache.org/jira/browse/SPARK-16646
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>
> {code:sql}
> SELECT LEAST(1, 1.5);
> {code}
> {noformat}
> Error: org.apache.spark.sql.AnalysisException: cannot resolve 'least(1, 
> CAST(2.1 AS DECIMAL(2,1)))' due to data type mismatch: The expressions should 
> all have the same type, got LEAST (ArrayBuffer(IntegerType, 
> DecimalType(2,1))).; line 1 pos 7 (state=,code=0)
> {noformat}
> This query works for 1.6.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16646) LEAST doesn't accept numeric arguments with different data types

2016-07-20 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-16646:
---
Assignee: Hyukjin Kwon

> LEAST doesn't accept numeric arguments with different data types
> 
>
> Key: SPARK-16646
> URL: https://issues.apache.org/jira/browse/SPARK-16646
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>Assignee: Hyukjin Kwon
>
> {code:sql}
> SELECT LEAST(1, 1.5);
> {code}
> {noformat}
> Error: org.apache.spark.sql.AnalysisException: cannot resolve 'least(1, 
> CAST(2.1 AS DECIMAL(2,1)))' due to data type mismatch: The expressions should 
> all have the same type, got LEAST (ArrayBuffer(IntegerType, 
> DecimalType(2,1))).; line 1 pos 7 (state=,code=0)
> {noformat}
> This query works for 1.6.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16648) LAST_VALUE(FALSE) OVER () throws IndexOutOfBoundsException

2016-07-20 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-16648:
---
Reporter: Cheng Lian  (was: liancheng)

> LAST_VALUE(FALSE) OVER () throws IndexOutOfBoundsException
> --
>
> Key: SPARK-16648
> URL: https://issues.apache.org/jira/browse/SPARK-16648
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>
> The following simple SQL query reproduces this issue:
> {code:sql}
> SELECT LAST_VALUE(FALSE) OVER ();
> {code}
> Exception thrown:
> {noformat}
> java.lang.IndexOutOfBoundsException: 0
>   at 
> scala.collection.mutable.ResizableArray$class.apply(ResizableArray.scala:43)
>   at scala.collection.mutable.ArrayBuffer.apply(ArrayBuffer.scala:48)
>   at scala.collection.mutable.ArrayBuffer.remove(ArrayBuffer.scala:169)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:244)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.withNewChildren(TreeNode.scala:214)
>   at 
> org.apache.spark.sql.catalyst.analysis.TypeCoercion$ImplicitTypeCasts$$anonfun$apply$12.applyOrElse(TypeCoercion.scala:637)
>   at 
> org.apache.spark.sql.catalyst.analysis.TypeCoercion$ImplicitTypeCasts$$anonfun$apply$12.applyOrElse(TypeCoercion.scala:615)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:278)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:284)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:284)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:284)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionDown$1(QueryPlan.scala:156)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:166)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1$1.apply(QueryPlan.scala:170)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.immutable.List.map(List.scala:285)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:170)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$4.apply(QueryPlan.scala:175)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsDown(QueryPlan.scala:175)
>   at 
> 
